Software Engineer, Training Infrastructure

Mountain View, CA
189K-350K
Senior level
Artificial Intelligence

Snapshot

Artificial Intelligence could be one of humanity’s most useful inventions. At Google DeepMind, we’re a team of scientists, engineers, machine learning experts and more, working together to advance the state of the art in artificial intelligence. We use our technologies for widespread public benefit and scientific discovery, and collaborate with others on critical challenges, ensuring safety and ethics are the highest priority.

You will join a research engineering team of about 10 people working on Gemini. The team embeds in high-priority, strategic research efforts and accelerates experimental iteration by improving the quality and capability of the tools and technology used to build large-scale training systems. The team's core expertise is in reinforcement learning infrastructure and methods, distributed systems, and accelerators.

The team is distributed across the US, Canada, France and the UK, and works very collaboratively, supporting efforts across Google DeepMind.

The Role

This is an individual contributor position in which you will collaborate with the other engineers on the team to unblock progress on critical research challenges: from scoping technical roadmaps and designing and implementing new infrastructure, to designing, executing and analyzing experiments.

As an experienced Software Engineer, you will naturally gravitate toward infrastructure-heavy tasks, take responsibility for improving the performance and efficiency of mid- and post-training workloads, and be a role model for more junior team members.

You will align with Gemini priorities, ramping up flexibly on new research problem spaces and working effectively with a broad range of collaborators across the organization. You will work with the team's tech lead (TL) to steer the team's direction and select new efforts to engage with.

Key Responsibilities

  • Translate research requirements into technical roadmaps in collaboration with the other team members
  • Lead and execute the implementation and documentation of research infrastructure
  • Learn the research problem space the team works in, upskill, and contribute to the research agenda of the efforts the team supports
  • Support the growth of more junior team members
  • Contribute to the team culture and be a role model for sustainability and excellence

About You

  • Bachelor's degree or equivalent practical experience.
  • 8 years of experience in software development and with data structures/algorithms.
  • The ideal candidate will have 5 years of experience building, testing, and supporting software in a research setting.
  • Proven track record of building large-scale infrastructure for deep learning research, with a deep understanding of:
    • Accelerators (e.g. the JAX and XLA stack), and performance profiling and optimization
    • Analysis and debugging of training behavior
    • Distributed systems, resilience and performance

  • Experience with reinforcement learning is a plus.

  • You communicate clearly both verbally and in writing, and are comfortable working in a team distributed across time zones:
    • You are a good technical writer, and produce clear and succinct design docs
    • You contribute constructively to an asynchronous design process
  • You can produce impactful work quickly: you are equally at ease producing library-quality code and whipping up prototypes to unblock quick iteration on research ideas
  • You are comfortable moving between projects, supporting team members as required, quickly ramping up on new problems, and working with a broad and diverse set of collaborators across engineering and research

Application Deadline: May 31st, 2025

The US base salary range for this full-time position is $189,000-$350,000, plus bonus, equity and benefits. Your recruiter can share more about the specific salary range for your targeted location during the hiring process.

At Google DeepMind, we value diversity of experience, knowledge, backgrounds and perspectives and harness these qualities to create extraordinary impact. We are committed to equal employment opportunities regardless of sex, race, religion or belief, ethnic or national origin, disability, age, citizenship, marital, domestic or civil partnership status, sexual orientation, gender identity, pregnancy, or related condition (including breastfeeding) or any other basis as protected by applicable law. If you have a disability or additional need that requires accommodation, please do not hesitate to let us know.

  

Top Skills

Accelerators
Deep Learning
Distributed Systems
JAX
Reinforcement Learning
XLA Stack

The Company
1,218 Employees
On-site Workplace
Year Founded: 2010

What We Do

We’re a team of scientists, engineers, machine learning experts and more, working together to advance the state of the art in artificial intelligence. We use our technologies for widespread public benefit and scientific discovery, and collaborate with others on critical challenges, ensuring safety and ethics are the highest priority.

Our long term aim is to solve intelligence, developing more general and capable problem-solving systems, known as artificial general intelligence (AGI).

Guided by safety and ethics, this invention could help society find answers to some of the world’s most pressing and fundamental scientific challenges.

We have a track record of breakthroughs in fundamental AI research, published in journals like Nature, Science, and more. Our programs have learned to diagnose eye diseases as effectively as the world's top doctors, to save 30% of the energy used to keep data centres cool, and to predict the complex 3D shapes of proteins, which could one day transform how drugs are invented.

