About Us:
Arcee.ai is a cutting-edge AI company that empowers enterprises to own their GenAI strategy. We're a team of passionate and innovative engineers, researchers, and industry experts dedicated to pushing the boundaries of AI technology. We're looking for an exceptional Machine Learning Infrastructure Engineer to join our team and help design, develop, and deploy AI-powered solutions that meet the highest standards of quality, reliability, and performance.
Job Summary:
As a Machine Learning Infrastructure Engineer, you will be responsible for designing, developing, and maintaining the infrastructure that powers our machine learning models. You will work closely with data scientists, engineers, and researchers to ensure seamless integration of machine learning models into our production environment. Your expertise will enable us to scale our machine learning capabilities, improve model performance, and reduce time-to-market.
Key Responsibilities:
Design and Implementation:
- Design and implement scalable, efficient, and reliable machine learning infrastructure (e.g., containerization, orchestration, and cloud services).
- Develop and maintain infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation, or Google Cloud Deployment Manager.
Model Serving and Deployment:
- Design and implement model serving platforms (e.g., TensorFlow Serving, AWS SageMaker, or Azure Machine Learning) for efficient model deployment and management.
- Develop and maintain automated model deployment pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI.
Data Engineering:
- Collaborate with data engineers to design and implement data pipelines that feed machine learning models.
- Ensure data quality, integrity, and security throughout the data lifecycle.
Monitoring and Optimization:
- Develop and implement monitoring and logging solutions (e.g., Prometheus, Grafana, or ELK Stack) to track model performance, latency, and system health.
- Optimize infrastructure resources and model performance using techniques like hyperparameter tuning, model pruning, and knowledge distillation.
Collaboration and Communication:
- Work closely with data scientists, engineers, and researchers to identify infrastructure needs and develop solutions.
- Communicate technical information effectively to both technical and non-technical stakeholders.
Staying Up-to-Date:
- Stay current with industry trends, emerging technologies, and best practices in machine learning infrastructure.
- Participate in conferences, meetups, and online forums to expand knowledge and network with peers.
Ideal Candidate:
Cloud Computing and Infrastructure:
- Experience with major cloud platforms (AWS, Azure, GCP)
- Kubernetes expertise for container orchestration
- Infrastructure-as-Code (IaC) skills (e.g., Terraform, CloudFormation)
Machine Learning Operations (MLOps):
- Familiarity with ML model lifecycle management
- Experience with ML model serving frameworks (e.g., vLLM, TorchServe, SGLang)
- Knowledge of model versioning and experiment tracking tools
Deep Learning and NLP:
- Strong understanding of transformer architectures and LLMs
- Experience with popular deep learning frameworks (PyTorch)
- Familiarity with NLP concepts and techniques
API Development and Management:
- RESTful API design and implementation
- API gateway management and security
- Experience with OpenAPI/Swagger specifications
Performance Optimization:
- Proficiency in GPU acceleration techniques
- Experience with model quantization and pruning
- Knowledge of distributed inference and parallel computing
Programming Languages:
- Strong Python skills
- Familiarity with C++ for potential low-level optimizations
- Shell scripting for automation
Requirements:
Education:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Experience:
- 3+ years of experience in machine learning infrastructure, DevOps, or a related field.
- Experience with cloud providers (e.g., AWS, GCP, or Azure) and containerization (e.g., Docker).
Technical Skills:
- Proficiency in programming languages like Python, Java, or C++.
- Experience with machine learning frameworks like TensorFlow, PyTorch, or Scikit-learn.
- Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
- Knowledge of container orchestration tools like Kubernetes or Docker Swarm.
Soft Skills:
- Excellent communication, collaboration, and problem-solving skills.
- Ability to work in a fast-paced environment and prioritize tasks effectively.
Nice to Have:
Certifications:
- Cloud provider certifications (e.g., AWS Certified DevOps Engineer or GCP Professional Cloud Developer).
- Machine learning certifications (e.g., TensorFlow Developer Certificate or an equivalent deep learning credential).
Experience with:
- Model serving platforms like TensorFlow Serving or AWS SageMaker.
- Automated model deployment pipelines using tools like Jenkins or GitLab CI/CD.
- Monitoring and logging solutions like Prometheus or ELK Stack.
Knowledge of:
- Model explainability and interpretability techniques.
- Data privacy and security best practices.
What We Offer:
- Competitive Salary: Compensation commensurate with experience and industry standards.
- Stock Options: Equity in Arcee AI to give you a stake in our success.
- Comprehensive Benefits: Health, dental, and vision insurance, as well as a 401(k) plan.
- Professional Development: Opportunities for growth, training, and conference attendance.
- Collaborative Environment: A dynamic, diverse team that values innovation and open communication.
What We Do
Arcee AI delivers purpose-built AI agents, powered by industry-leading small language models (SLMs), for enterprise applications. Our offering, Arcee Orchestra, is an end-to-end agentic AI solution that enables businesses to create AI agents for complex tasks. It makes it easy to build custom AI workflows that automatically route tasks to specialized SLMs, delivering detailed, trustworthy responses fast.