Head of ML Infrastructure

Posted 18 Days Ago
Palo Alto, CA
Senior level
Artificial Intelligence • Healthtech
The Role
Lead the design and operation of an orchestration platform for Large Language Models, optimize infrastructure, manage multi-cloud strategy, and lead a team of engineers.

About Us:

Hippocratic AI has developed a safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health. 

Why Join Our Team:

  • Innovative Mission: We are developing a safe, healthcare-focused large language model (LLM) designed to revolutionize health outcomes on a global scale.

  • Visionary Leadership: Hippocratic AI was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from leading institutions, including El Camino Health, Johns Hopkins, Stanford, Microsoft, Google, and NVIDIA.

  • Strategic Investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA’s NVentures, Premji Invest, SV Angel, and six health systems.

  • World-Class Team: Our team is composed of leading experts in healthcare and artificial intelligence, ensuring our technology is safe, effective, and capable of delivering meaningful improvements to healthcare delivery and outcomes.

Position Overview:


We are seeking a highly skilled and innovative Head of ML Infrastructure to lead the design, development, and operation of our orchestration platform for a heterogeneous constellation of Large Language Models (LLMs). The ideal candidate will have deep expertise in infrastructure orchestration, multi-cloud environments, and tools such as Kubernetes and Terraform. This role is critical to ensuring that our AI systems are scalable, reliable, and seamlessly integrated into our broader technology ecosystem.

Key Responsibilities:

Orchestration Platform Development:


• Architect and implement an advanced orchestration platform to manage a diverse set of LLMs efficiently.
• Design solutions to optimize performance, scalability, and availability across various deployment environments.


Infrastructure Management:


• Use Kubernetes, Terraform, and other Infrastructure as Code (IaC) tools to automate and manage ML infrastructure.
• Collaborate with DevOps and cloud engineering teams to ensure seamless integration with CI/CD pipelines.
• Establish robust monitoring, logging, and alerting systems for ML infrastructure.


Multi-Cloud Strategy:

• Design and execute strategies to leverage multiple cloud providers for cost optimization, redundancy, and compliance.
• Manage cloud-native services to support model deployment and orchestration at scale.

Performance Optimization:


• Work closely with ML engineers to fine-tune model deployment strategies, focusing on latency, throughput, and fault tolerance.
• Conduct capacity planning and develop tools for model lifecycle management.

Leadership & Collaboration:

• Lead a team of infrastructure engineers, fostering a culture of innovation, collaboration, and excellence.
• Act as a bridge between ML research, engineering, and operations teams to align infrastructure capabilities with business needs.
• Stay abreast of emerging technologies and methodologies in ML infrastructure and orchestration.

Qualifications:

Technical Skills:

• Proven experience in building and managing ML infrastructure platforms, particularly for LLMs or other advanced AI systems.
• Expertise in Kubernetes, Terraform, and other IaC tools.
• Deep understanding of multi-cloud architectures (e.g., AWS, Azure, Google Cloud) and hybrid cloud solutions.
• Strong programming skills in Python, Go, or a similar language, with experience in building automation and orchestration tools.
• Familiarity with modern ML frameworks and tools (e.g., TensorFlow, PyTorch, Hugging Face).

Leadership & Communication:

  • Demonstrated success in leading infrastructure teams and managing large-scale projects.

  • Excellent problem-solving and decision-making skills.

  • Strong communication skills, with the ability to convey complex technical ideas to non-technical stakeholders.

Education & Experience:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent work experience).

  • 8+ years of experience in infrastructure engineering, with at least 3 years in a leadership role.

Top Skills

AWS
Azure
Go
GCP
Hugging Face
Kubernetes
Python
PyTorch
TensorFlow
Terraform

The Company
HQ: Palo Alto, California
97 Employees
On-site Workplace
Year Founded: 2023

What We Do

Hippocratic AI’s mission is to develop the first safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health.
The company was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from El Camino Health, Johns Hopkins, Washington University in St. Louis, Stanford, Google, Microsoft, Meta and NVIDIA. Hippocratic AI has received a total of $137 million in funding and is backed by leading investors, including General Catalyst, Andreessen Horowitz, Premji Invest, SV Angel, NVentures (Nvidia Venture Capital), and Greycroft. For more information on Hippocratic AI: www.HippocraticAI.com.
