About Us:
Arcee.ai is a cutting-edge AI company that empowers enterprises to own their GenAI strategy. We're a team of passionate and innovative engineers, researchers, and industry experts dedicated to pushing the boundaries of AI technology. We're looking for an exceptional Machine Learning Infrastructure Engineer to join our team and help design, develop, and deploy AI-powered solutions that meet the highest standards of quality, reliability, and performance.
Job Summary:
As a Machine Learning Infrastructure Engineer, you will be responsible for designing, developing, and maintaining the infrastructure that powers our machine learning models. You will work closely with data scientists, engineers, and researchers to ensure seamless integration of machine learning models into our production environment. Your expertise will enable us to scale our machine learning capabilities, improve model performance, and reduce time-to-market.
Key Responsibilities:
Design and Implementation:
- Design and implement scalable, efficient, and reliable machine learning infrastructure (e.g., containerization, orchestration, and cloud services).
- Develop and maintain infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation, or Google Cloud Deployment Manager.
Model Serving and Deployment:
- Design and implement model serving platforms (e.g., TensorFlow Serving, AWS SageMaker, or Azure Machine Learning) for efficient model deployment and management.
- Develop and maintain automated model deployment pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI.
Data Engineering:
- Collaborate with data engineers to design and implement data pipelines that feed machine learning models.
- Ensure data quality, integrity, and security throughout the data lifecycle.
Monitoring and Optimization:
- Develop and implement monitoring and logging solutions (e.g., Prometheus, Grafana, or ELK Stack) to track model performance, latency, and system health.
- Optimize infrastructure resources and model performance using techniques like hyperparameter tuning, model pruning, and knowledge distillation.
Collaboration and Communication:
- Work closely with data scientists, engineers, and researchers to identify infrastructure needs and develop solutions.
- Communicate technical information effectively to both technical and non-technical stakeholders.
Staying Up-to-Date:
- Stay current with industry trends, emerging technologies, and best practices in machine learning infrastructure.
- Participate in conferences, meetups, and online forums to expand knowledge and network with peers.
Ideal Candidate:
Cloud Computing and Infrastructure:
- Experience with major cloud platforms (AWS, Azure, GCP)
- Kubernetes expertise for container orchestration
- Infrastructure-as-Code (IaC) skills (e.g., Terraform, CloudFormation)
Machine Learning Operations (MLOps):
- Familiarity with ML model lifecycle management
- Experience with ML model serving frameworks (e.g., vLLM, TorchServe, SGLang)
- Knowledge of model versioning and experiment tracking tools
Deep Learning and NLP:
- Strong understanding of transformer architectures and LLMs
- Experience with popular deep learning frameworks (PyTorch)
- Familiarity with NLP concepts and techniques
API Development and Management:
- RESTful API design and implementation
- API gateway management and security
- Experience with OpenAPI/Swagger specifications
Performance Optimization:
- Proficiency in GPU acceleration techniques
- Experience with model quantization and pruning
- Knowledge of distributed inference and parallel computing
Programming Languages:
- Strong Python skills
- Familiarity with C++ for potential low-level optimizations
- Shell scripting for automation
Requirements:
Education:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Experience:
- 3+ years of experience in machine learning infrastructure, DevOps, or a related field.
- Experience with cloud providers (e.g., AWS, GCP, or Azure) and containerization (e.g., Docker).
Technical Skills:
- Proficiency in programming languages like Python, Java, or C++.
- Experience with machine learning frameworks like TensorFlow, PyTorch, or Scikit-learn.
- Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
- Knowledge of container orchestration tools like Kubernetes or Docker Swarm.
Soft Skills:
- Excellent communication, collaboration, and problem-solving skills.
- Ability to work in a fast-paced environment and prioritize tasks effectively.
Nice to Have:
Certifications:
- Cloud provider certifications (e.g., AWS Certified DevOps Engineer or GCP Professional Cloud Developer).
- Machine learning certifications (e.g., TensorFlow Developer Certificate or an equivalent deep learning credential).
Experience with:
- Model serving platforms like TensorFlow Serving or AWS SageMaker.
- Automated model deployment pipelines using tools like Jenkins or GitLab CI/CD.
- Monitoring and logging solutions like Prometheus or ELK Stack.
Knowledge of:
- Model explainability and interpretability techniques.
- Data privacy and security best practices.
What We Offer:
- Competitive Salary: Compensation commensurate with experience and industry standards.
- Stock Options: Equity in Arcee AI to give you a stake in our success.
- Comprehensive Benefits: Health, dental, and vision insurance, as well as a 401(k) plan.
- Professional Development: Opportunities for growth, training, and conference attendance.
- Collaborative Environment: A dynamic, diverse team that values innovation and open communication.
What We Do
Arcee AI delivers purpose-built AI agents, powered by industry-leading small language models (SLMs), for enterprise applications. Our offering, Arcee Orchestra, is an end-to-end agentic AI solution that enables businesses to create AI agents for complex tasks. It makes it easy to build custom AI workflows that automatically route tasks to specialized SLMs, delivering detailed, trustworthy responses fast.