High Performance Computing Engineer

Posted 5 Days Ago
Toronto, ON
Hybrid
150K-250K Annually
Senior level
Artificial Intelligence • Machine Learning
The Role
The Senior High Performance Computing Engineer will manage and operate high-end GPU clusters, including hardware deployment, operations, troubleshooting, and configuring network switches. They will work with a variety of infrastructure technologies and help maintain a high-performing datacenter environment.
Summary Generated by Built In

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML, and Statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.


About The Role


We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.


You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20PB of storage, Terabit networking and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

  • Manage large, private, high-end GPU clusters
  • Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm and Kubernetes
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain the network, e.g. Layer 3 networking
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in high performance computing
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on Supermicro systems

The ability to solve problems and to learn new techniques is key.

Top Skills

Python

The Company
Santa Clara, CA
21 Employees
On-site Workplace
Year Founded: 2023

What We Do

We are transforming how stories are told, knowledge is learned, and insights are gathered


