Site Reliability Engineer - Machine Learning Systems

Posted 2 Days Ago
Singapore
Mid level
Artificial Intelligence • Cloud • Fintech • Healthtech • Biotech
The Role
The Site Reliability Engineer for Machine Learning Systems will ensure the efficient operation of ML systems, improve resource management, oversee disaster recovery, and enhance service stability. Responsibilities include building tools for monitoring the ML infrastructure and providing global system support.

Description

Responsibilities

  • Responsible for ensuring our ML systems run efficiently for large-model deployment, training, evaluation, and inference.
  • Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
  • Responsible for resource management and planning, including cost and budget, across computing and storage resources.
  • Responsible for global system disaster recovery, cluster machine governance, and the stability of business services, as well as improving resource utilisation and operational efficiency.
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently.
  • Be part of the global team roster that ensures system and business on-call support.
Requirements

Qualifications:

  • Bachelor's degree or above, majoring in Computer Science, Computer Engineering or related fields;
  • Strong proficiency in at least one programming language such as Go/Python/Shell in a Linux environment;
  • Strong hands-on experience with Kubernetes and containers, with at least 3 years of relevant operations and maintenance experience.

Preferred Qualifications

  • Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly; a strong sense of responsibility, good learning and communication skills, self-motivation and good team spirit;
  • Good documentation habits, with the ability to write and update workflow and technical documentation on time as required;
  • Experience in the operation and maintenance of large-scale distributed ML systems;
  • Experience in operation and maintenance of GPU servers.

Top Skills

Go
Python
Shell
The Company
HQ: San Jose, CA
55 Employees
Hybrid Workplace
Year Founded: 2015

What We Do

We are HireIO, the Workforce Solutions Provider that tomorrow’s tech giants count on to connect with today’s tech geniuses. We make an impact on the tech community by partnering with teams and professionals who specialize in FinTech, Cloud/SaaS, healthcare, biotech, A.I., and other emerging technologies, helping them grow from new opportunities while supporting equal opportunity.
