High Performance Computing (HPC) Engineer

Posted 8 Days Ago
Palo Alto, CA
Mid level
Artificial Intelligence • Software
The Role
The High Performance Computing Engineer manages GPU clusters, implements distributed training, optimizes performance, and provides technical support for deep learning initiatives.

Headquartered in Silicon Valley, we are a newly established start-up, where a collective of visionary scientists, engineers, and entrepreneurs are dedicated to transforming the landscape of biology and medicine through the power of Generative AI. Our team comprises leading minds and innovators in AI and Biological Science, pushing the boundaries of what is possible. We are dreamers who reimagine a new paradigm for biology and medicine.


We are committed to decoding biology holistically and enabling the next generation of life-transforming solutions. As the first mover in pan-modal Large Biological Models (LBM), we are pioneering a new era of biomedicine, with our LBM training leading to ground-breaking advancements and a transformative approach to healthcare. Our exceptionally strong R&D team and leadership in LLM and generative AI position us at the forefront of this revolutionary field. With headquarters in Silicon Valley, California, and a branch office in Paris, we are poised to make a global impact. Join us as we embark on this journey to redefine the future of biology and medicine through the transformative power of Generative AI.

Job Description

  • GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
  • Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times.
  • Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
  • Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into GenBio AI’s model development and deployment frameworks. 
  • Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements. 
  • Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.
  • Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.

Job Requirements:

  • Master’s or Ph.D. degree in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
  • 2+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
  • Strong expertise in distributed deep learning and parallel training techniques.
  • Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
  • Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
  • Knowledge of performance profiling and optimization tools for HPC and deep learning.
  • Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
  • Strong background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).

Join us as we embark on this journey to redefine the future of biology and medicine.

We are an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Top Skills

AWS
CUDA
cuDNN
DeepSpeed
Docker
GCP
GPU
Kubernetes
Megatron-LM
Python
PyTorch
SLURM

The Company
HQ: Palo Alto, CA
29 Employees
On-site Workplace
Year Founded: 2024

What We Do

GenBio.AI, Inc. (GenBio AI) is an innovative global startup dedicated to developing the world's first AI-driven Digital Organism, an integrated system of multiscale foundation models for predicting, simulating, and programming biology at all levels.

Our goal is to achieve comprehensive, actionable empirical understandings of the mechanisms underlying all organismal physiologies and diseases. This will pave the way for a new paradigm in drug design, bio-engineering, personalized medicine, and fundamental biomedical research, all powered by Generative Biology.

Our founding team consists of world-renowned scientists and researchers in AI and Biology from prestigious institutions such as CMU, MBZUAI, and WIS, alongside prominent financial investors.

GenBio AI, a true global effort from day one, is establishing offices in Palo Alto, Paris, and Abu Dhabi.
