About Us:
Arcee.ai is a cutting-edge AI company that empowers enterprises to own their GenAI strategy. We're a team of passionate and innovative engineers, researchers, and industry experts dedicated to pushing the boundaries of AI technology. We're looking for an exceptional Platform Infrastructure Engineer to join our team and help design, develop, and deploy AI-powered solutions that meet the highest standards of quality, reliability, and performance.
About the role:
We’re looking for a Platform Infrastructure Engineer with a deep focus on Kubernetes and AWS EKS to build and scale the multi-tenant, multi-cluster infrastructure that hosts our SaaS products, enterprise products, and AI models. In this role, you’ll collaborate closely with a small, agile team to automate infrastructure provisioning, streamline deployment pipelines, and ensure the reliability and scalability of our platform. You’ll use tools such as ArgoCD, Atlantis, Terraform, Terragrunt, and the Grafana observability stack, and you’ll deploy and orchestrate GPUs, all in service of a GitOps-first approach and operational excellence.
What you’ll do:
- Architect, deploy, and maintain Kubernetes clusters on AWS EKS in a multi-tenant, multi-cluster environment that is portable to other cloud providers and VPCs
- Own our Infrastructure as Code practices using Terraform and Terragrunt, ensuring consistency and repeatability
- Implement and manage GitOps workflows with ArgoCD to enhance delivery pipelines
- Set up, configure, and maintain Atlantis for automated Terraform workflow management
- Collaborate with developers, DevOps, and product teams to improve deployment speeds and system reliability
- Take part in writing and reviewing technical documentation, providing best practices and guidance for the broader engineering team
- Troubleshoot and resolve issues across infrastructure and networking
- Help deploy, orchestrate, and monitor our GPUs
What we’re seeking:
- Experience deploying and orchestrating a Grafana observability stack (Alloy, Mimir, Loki, Tempo, Grafana) or a similar monitoring solution
- Experience deploying and orchestrating GPUs
- Proven experience running Kubernetes in production, and readiness to tackle multi-cloud environments
- Hands-on expertise with Terraform and Terragrunt for Infrastructure as Code
- Familiarity with GitOps methodologies and ArgoCD for continuous deployment
- Experience managing multi-tenant, multi-cluster environments at scale
- Strong scripting and automation skills (e.g., Python, Bash, Go)
- Solid understanding of networking concepts and cloud infrastructure (AWS preferred, other cloud providers acceptable)
- Clear communication, problem-solving mindset, and the ability to work effectively in a small, fast-moving team
Equal Opportunity
We are an Equal Opportunity Employer, offering equal opportunity to all regardless of race, religion, gender identity, sexual orientation, age, citizenship, marital status, disability, and more. We would like to remind candidates that the listed qualifications for each role are not hard requirements, and we encourage them to apply if they feel they would be a good fit.
Compensation
We offer competitive salaries, equity, and benefits. We base salaries on location, role, and level, as well as the candidate’s experience and overall qualifications.
What We Do
Arcee AI delivers purpose-built AI agents, powered by industry-leading small language models (SLMs), for enterprise applications. Its offering, Arcee Orchestra, is an end-to-end agentic AI solution that enables businesses to create AI agents for complex tasks. The solution makes it easy to build custom AI workflows that automatically route tasks to specialized SLMs to deliver detailed, trustworthy responses, fast.