Senior RAS Architect - Datacenter CPU and SOC

Posted 23 Hours Ago
Santa Clara, CA
Senior level
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Role
The Senior RAS Architect will lead the design and testing of resilience features for data-center CPUs, focusing on RAS requirements, benchmarking, and risk assessment. The role involves collaboration with architecture and engineering teams, analyzing logs, developing testing methodologies, and assessing new hardware features.
Summary Generated by Built In

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research. Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain and interact. There has never been a more exciting time to join us and take visual computing and AI into the next chapter.

We are looking for an Architect to drive key aspects of RAS/Resilience features for our next-generation products for AI applications. We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of CPU, DRAM, and AI cluster systems.

What you’ll be doing:

  • Drive memory RAS requirements for data-center CPUs.

  • Own the system RAS/Resilience models, benchmarking, and risk assessment.

  • Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.

  • Work closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.

  • Assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.

  • Develop a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

  • BS or higher in EE, CE, CS, Mathematics, or equivalent experience.

  • 12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resilience features for large compute, AI, or HPC systems.

  • Proficient in compute system RAS/Resilience model theory and methodology.

  • Proficient in HPC or AI system architecture and cluster interconnect technologies.

  • Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS/Resilience features.

  • Strong problem-solving and troubleshooting expertise, with experience institutionalizing root-cause analysis.

  • Initiative, strong interpersonal skills, and the flexibility to adapt to new technologies.

  • Solid knowledge of or experience in HPC or MLPerf benchmarking is a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 224,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

AI
CPU
DRAM
HPC
Linux
RAS
Resilience

The Company
HQ: Santa Clara, CA
21,960 Employees
On-site Workplace
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
