Platform Infrastructure Engineer

Posted 23 Hours Ago
Hiring Remotely in USA
Senior level
Artificial Intelligence • Machine Learning • Software
The Role
The Platform Infrastructure Engineer will build, scale, and maintain multi-tenant, multi-cluster infrastructure on AWS EKS, automate infrastructure provisioning, and streamline deployment pipelines. They will leverage tools like ArgoCD and Terraform, manage GitOps workflows, and monitor GPUs, collaborating closely with developers and DevOps teams to enhance system reliability and performance.

About Us:
Arcee.ai is a cutting-edge AI company that empowers enterprises to own their GenAI strategy. We're a team of passionate and innovative engineers, researchers, and industry experts dedicated to pushing the boundaries of AI technology. We're looking for an exceptional Platform Infrastructure Engineer to join our team and help design, develop, and deploy AI-powered solutions that meet the highest standards of quality, reliability, and performance.


About the role:

We’re looking for a Platform Infrastructure Engineer with a deep focus on Kubernetes and AWS EKS to build and scale the multi-tenant, multi-cluster infrastructure that hosts our SaaS products, enterprise products, and AI models. In this role, you’ll collaborate closely with a small, agile team to automate infrastructure provisioning, streamline deployment pipelines, and ensure the reliability and scalability of our platform. You’ll use tools like ArgoCD, Atlantis, Terraform, Terragrunt, and the Grafana observability stack, and you’ll help deploy and orchestrate GPUs, to drive a GitOps-first approach and cultivate operational excellence.

What you’ll do:

  • Architect, deploy, and maintain Kubernetes clusters on AWS EKS in a multi-tenant, multi-cluster environment that is portable to other cloud providers and VPCs
  • Own our Infrastructure as Code practices using Terraform and Terragrunt, ensuring consistency and repeatability
  • Implement and manage GitOps workflows with ArgoCD to enhance delivery pipelines
  • Set up, configure, and maintain Atlantis for automated Terraform workflow management
  • Collaborate with developers, DevOps, and product teams to improve deployment speeds and system reliability
  • Take part in writing and reviewing technical documentation, providing best practices and guidance for the broader engineering team
  • Troubleshoot and resolve issues across infrastructure and networking
  • Help deploy, orchestrate, and monitor our GPUs


What we’re seeking:

  • Experience deploying and orchestrating a Grafana Observability Stack (Alloy, Mimir, Loki, Tempo, Grafana) or similar monitoring solution
  • Experience deploying and orchestrating GPUs
  • Proven experience with Kubernetes in production, with readiness to tackle multi-cloud
  • Hands-on expertise with Terraform and Terragrunt for Infrastructure as Code
  • Familiarity with GitOps methodologies and ArgoCD for continuous deployment
  • Experience managing multi-tenant, multi-cluster environments at scale
  • Strong scripting and automation skills (e.g., Python, Bash, Go)
  • Solid understanding of networking concepts and cloud infrastructure (AWS preferred, other cloud providers acceptable)
  • Clear communication, problem-solving mindset, and the ability to work effectively in a small, fast-moving team 

Equal Opportunity

We are an Equal Opportunity Employer, offering equal opportunity to all regardless of race, religion, gender identity, sexual orientation, age, citizenship, marital status, disability, and more. We would like to remind candidates that the listed qualifications for each role are not hard requirements, and we encourage them to apply if they feel they would be a good fit.

Compensation

We offer competitive salaries, equity, and benefits. We base our salaries on location, role, and level as well as consideration of the candidate’s experience and overall qualifications.

Top Skills

Bash
Go
Kubernetes
Python
The Company
HQ: San Francisco, California
48 Employees
On-site Workplace
Year Founded: 2023

What We Do

Arcee AI delivers purpose-built AI agents, powered by industry-leading small language models (SLMs) for enterprise applications. Their offering, Arcee Orchestra, is an end-to-end agentic AI solution that enables businesses to create AI agents for complex tasks. The solution makes it easy to build custom AI workflows that automatically route tasks to specialized SLMs to deliver detailed, trustworthy responses, fast.
