Lead Cluster Operations Support Engineer (m/f/d)

Posted 2 Days Ago
Madrid, Comunidad de Madrid
Senior level
Software
Why does Thoughtworks exist? To create an extraordinary impact on the world through our culture & technology excellence.
The Role
The Lead Cluster Operations Support Engineer will provide 24x7 support for GPU cluster training, collaborating with Machine Learning and Infrastructure Engineers. Responsibilities include improving tooling, assessing model training readiness, and addressing issues related to infrastructure and data science during training workloads.
Summary Generated by Built In

This team will provide 24x7 white-glove support to people using large blocks of GPUs (6,000+ contiguous GPUs) for a short period of time (e.g., 6 weeks or 12 weeks) to perform Managed Post Training (MPT). This covers help with preparation, 24x7 support during training to ensure full utilization of the GPU clusters, and off-boarding. The team is spread across three time zones (US, Europe and India), with hand-off protocols to enable 24x7 support. While you can be a specialist in infrastructure and cluster operations, you also need a working knowledge of ML.

Job responsibilities

  • You will help shape and iterate this new white-glove model training support service on large GPU clusters.
  • You will work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers.
  • You will contribute to accelerator development: identify gaps in the tooling, missing automation, or recurring patterns for which we could build accelerators that make the next engagement faster and more efficient. For example, we may need to improve observability, automate user onboarding, or bring in a new tool that everyone wants to use. This will probably involve a combination of Terraform/Pulumi, Helm Charts, Python and shell scripts.
  • You will help assess the model training readiness and data preparation.
  • You will provide model training support in rotating daytime weekend shifts, carrying a pager, for any issues users may encounter. These can range from infrastructure issues to data science issues or anything in between, e.g., GCP changing a GKE configuration in a way that affects the training.
  • You will facilitate collaborative problem solving within the team by actively listening, communicating effectively and mentoring other engineers.
  • You will proactively identify and address challenges related to the white-glove service for continued pre-training, proposing solutions and implementing improvements.

Job qualifications

Technical Skills
  • Deep expertise in Kubernetes administration and debugging at scale.
  • Deep knowledge of managing large Kubernetes clusters with thousands of nodes.
  • Knowledge of running training workloads on 1000s of GPUs.
  • Knowledge of working with the Lustre filesystem is a plus.
  • Knowledge of working with NVIDIA NeMo Framework (Docker image for model training).
  • Knowledge of working with NVIDIA NeMo NIMs (Docker images for inference).
  • Underlying Cloud: GCP, AWS, Azure.
  • Terraform / Pulumi, Helm Charts, Linux, other Infrastructure-as-code tools.
  • Nice to have: Run:ai, TrueFoundry, Hugging Face platform, etc. (we can provide training).
  • Knowledge of working with HPC technologies such as Slurm is a bonus.

Professional Skills

  • You will be part of a high-value, client-facing white-glove service, where a high level of professionalism is required.
  • You understand the importance of stakeholder management and can easily liaise between clients and other key stakeholders throughout projects, ensuring buy-in and gaining trust along the way.
  • You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives.
  • You don’t shy away from risks or conflicts; instead, you take them on and skillfully manage them.
  • You are eager to coach, mentor and motivate others and you aspire to influence teammates to take positive action and accountability for their work.
  • You enjoy influencing others and always advocate for technical excellence while being open to change when needed.
  • You have an insatiable curiosity and a drive to learn new things.

Other things to know

Learning & Development

There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

About Thoughtworks

Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.

Top Skills

Kubernetes
Python
Shell
The Company
HQ: Chicago, IL
7,674 Employees
Hybrid Workplace
Year Founded: 1993

What We Do

We are a leading global technology consultancy that integrates strategy, design and software engineering to enable enterprises and technology disruptors across the globe to thrive as modern digital businesses.

Why Work With Us

As technologists, we have a unique role to play in how technology should benefit all of society, pursuing a more equitable future. Part of that role is to continuously educate ourselves on the issues that matter to the causes we believe in. We recognize our privilege and strive to see the world from the perspective of the most vulnerable.
