Operations Engineer - GPU Cloud & HPC Support (m/f/d)

Posted 6 Days Ago
London, Greater London, England
Mid level
Information Technology • Business Intelligence • Consulting
The Role
The Operations Engineer will ensure the reliability and performance of GPU-accelerated HPC infrastructure by monitoring systems, troubleshooting issues, managing incidents, and improving operational procedures. Responsibilities include on-call support, documentation, and collaboration with the Infrastructure team to enhance system functionality.

Job Description

We are a leading cloud company specializing in GPU-accelerated high-performance computing (HPC) clusters, leveraging NVIDIA GPUs and advanced networking technologies. Our solutions empower industries like artificial intelligence (AI), machine learning, and scientific research to tackle complex computational challenges. As a company committed to innovation, performance, and service, we are expanding our Operations team and looking for motivated engineers to support our GPU-powered infrastructure and maintain our standards of excellence. 
We are seeking a highly skilled Operations Engineer to join our Support & Operations team, focused on ensuring the reliability, performance, and availability of our GPU-accelerated HPC infrastructure. This role involves continuous monitoring, troubleshooting, and system optimization in collaboration with our Infrastructure teams. The ideal candidate will have experience in infrastructure operations, Linux administration, Ansible automation, version control, NVIDIA GPU technology, and high-performance systems. 
 

YOUR RESPONSIBILITIES: 

  • System Monitoring and Health: Proactively monitor the health, performance, and availability of GPU-accelerated HPC clusters, utilizing various monitoring tools to ensure consistent system functionality. 

  • Issue Identification and Resolution: Quickly identify potential issues, respond to alerts, and troubleshoot to resolve system outages and performance bottlenecks, minimizing downtime. 

  • On-Call Support Rotation: Participate in a 24/7 on-call rotation, providing critical support for infrastructure issues and customer escalations.

  • Incident Management and RCA: Manage incidents, perform root cause analysis, and work to implement long-term solutions for infrastructure stability. 

  • Documentation and SOP Creation: Develop and maintain clear documentation for operational procedures, troubleshooting guides, and best practices in HPC cluster management. 

  • Operational SOPs: Create and regularly update standard operating procedures for tasks such as upgrades, system patching, and hardware replacements. 

  • Collaborative System Enhancements: Work with the Infrastructure team on deployments, upgrades, and system enhancements, aligning with operational best practices to ensure optimal performance. 

  • Support for Customer Technical Issues: Provide Customer Support with assistance in resolving complex technical issues related to GPU-accelerated infrastructure and HPC systems, maintaining high customer satisfaction. 

  

YOUR QUALIFICATIONS:  

  • Experience: Minimum 3 years of experience in infrastructure operations, system administration, or technical support, ideally within HPC or GPU-accelerated environments. 

  • Linux Administration: Advanced Linux system administration skills, with experience in managing, optimizing, and troubleshooting Linux-based environments. 

  • Ansible Proficiency: At least 1 year of experience with Ansible, with proven capability in both reading and writing playbooks, and a demonstrated ability to create new automations within Ansible. 

  • Networking Skills: Strong troubleshooting skills with high-performance networking technologies, such as InfiniBand, RDMA, or similar protocols. 

  • Monitoring and Management Tools: Hands-on experience with monitoring tools such as Grafana and Prometheus, and with system management for large-scale infrastructure.

  • Incident Management: Demonstrated experience with incident management and handling escalations.  

  • Version Control: Proficiency in version control tools like GitHub for code management, collaboration, and tracking changes across team members.

  

 NICE TO HAVES: 

  • GPU Technology and HPC Expertise: Familiarity with NVIDIA GPU technology, HPC architectures, storage solutions, and high-performance file systems. 

  • Scripting and Automation: Scripting (Python, Bash) for task automation and operational efficiency. 

  • Python Knowledge: Python coding skills are beneficial, as the majority of our codebase is Python-based, enabling seamless integration within our team. 

  • Performance Tuning: Understanding of best practices in performance tuning for HPC and GPU-accelerated systems. 

  • CI/CD Pipelines: Experience with continuous integration and continuous deployment (CI/CD) practices and tools to support infrastructure automation. 

WHAT WE OFFER

With us, you will work towards the future of HPC: From new, sustainable building methods for data centers to cooling concepts to software solutions for accelerated compute. 

Your approaches count: In official exchange formats or spontaneously at the coffee machine. At Northern Data, it's the best idea that counts, not the hierarchy. We look forward to your input!

You make the difference in the company: Unlike in established corporations, at Northern Data you will really help shape things, from building new departments to optimizing processes and culture.

Best-in-class partners: The best work with Northern Data. This means a knowledge and time advantage from which your career and our customers benefit equally.

Green by heart: Sustainability is at the core of Northern Data. With us, you actively work on the carbon neutrality of data centers worldwide. Beginning with our infrastructure and continuing with the solutions for our clients, we work towards a green future.

Home Office facts: Work flexibly from home with our international, virtual team. And of course, your hardware wishes will be fulfilled to make your ideas for next-level HPC come true.

Your wellness matters: At Northern Data we have regular wellbeing initiatives that are designed to promote wellness, diversity, inclusion, and much more, ensuring a supportive and enriching environment for our global team.

Top Skills

Ansible
Linux
The Company
Hessen
124 Employees
On-site Workplace

What We Do

At Northern Data Group, we believe unlimited High Performance Computing (HPC) will unlock unprecedented opportunities for research and development, business, and ultimately human progress.

We power innovation through market-leading HPC infrastructure, operating across our three business divisions: Taiga Cloud, Ardent Data Centers and Peak Mining.

Our global organization is rapidly becoming a world leader for GPU-based solutions by designing and operating ultra-efficient green HPC infrastructure.

We uniquely combine intelligent and sustainable data centers, cutting-edge hardware and self-developed software for various HPC applications including Generative AI, Machine Learning and Bitcoin Mining.

We operate from large-scale custom data centers and proprietary containerized data centers for ultimate site selection flexibility.
