Senior SRE Software Engineer, Storage and Data

Taipei City
Senior level
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Role
The Senior SRE Software Engineer will ensure reliability of storage infrastructures, develop automation tools, and collaborate with cross-functional teams to optimize performance and availability of NVIDIA's DGX Cloud platform.

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructures for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.

  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.

  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.

  • Implement monitoring and alerting systems to proactively identify and address issues.

  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.

  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.

  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), with 5+ years of equivalent practical experience.

  • Proven experience in storage system administration and site reliability engineering.

  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.

  • Experience in one or more of the following languages: Ansible, Bash, Python, Go, YAML, Java

  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.

  • Experience in using observability and tracing-related tools like InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

  • Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.

  • Strong Linux and network troubleshooting skills using standard command-line tools.

  • Demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the most desirable employers in the world. We have some of the most brilliant and talented people in the world working for us. If you are creative, autonomous and love a challenge, we want to hear from you. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Top Skills

Ansible
AWS S3
Bash
Chef
Elastic
Git
Go
Grafana
InfluxDB
Java
Linux
Prometheus
Puppet
Python
RESTful API
Terraform
YAML

The Company
HQ: Santa Clara, CA
21,960 Employees
On-site Workplace
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
