Site Reliability Engineer

Posted 5 Days Ago
Taiwan
Mid level
Artificial Intelligence • Gaming
The Role
The Site Reliability Engineer will monitor and optimize the production system, respond to faults, collaborate with the business and R&D teams to resolve issues, and maintain thorough documentation. The role emphasizes continuous improvement of operations and maintenance processes within a cloud computing environment.
Summary Generated by Built In

Description

Aethir is the only enterprise-grade, AI-focused GPU-as-a-service provider in the market. Its decentralized cloud computing infrastructure allows GPU providers (containers) to meet enterprise clients who need powerful GPU chips for professional AI/ML tasks. Thanks to a constantly growing network of over 40,000 top-shelf GPUs, including 3,000 NVIDIA H100s, Aethir is able to provide enterprise-grade GPU computing wherever it’s needed, at scale.

Backed by leading Web3 investors like Framework Ventures, Merit Circle, Hashkey, Animoca Brands, Sanctor Capital, Infinity Ventures Crypto (IVC), and others, with over $130M in funds raised for the ecosystem, Aethir is paving the way for the future of decentralized computing.

We are looking for an operations and maintenance development engineer (SRE) to join our new headquarters in Kuala Lumpur, Malaysia. You will play a critical role in monitoring, troubleshooting, and optimizing our production system to ensure the highest levels of performance and stability for our AI and gaming customers worldwide.

Responsibilities

  • Monitor, Review, and Respond to Faults: Monitor and review the production system, respond to and troubleshoot faults, resolve them, and optimize the system afterward.
  • System Architecture and Performance: Continuously monitor and review the system architecture, process logic, performance, stability, and other technical indicators to ensure they remain sound.
  • Coordination with Business Team: Work with the business team to drive the resolution of any issues related to operations and maintenance.
  • Production Failure Response: Respond promptly to production failures, acting as the overall coordinator for resolution.
  • Collaborative Problem-Solving: Organize relevant R&D, operations and maintenance, and product teams to collaboratively investigate and resolve problems.
  • Failure Response Time: Own failure response and resolution times, ensuring that issues are resolved promptly.
  • Case Studies and Optimization: Conduct case studies on production issues and follow up with optimizations to improve system performance and stability.
  • Documentation: Maintain comprehensive documentation of system architecture, processes, and troubleshooting procedures.
  • Continuous Improvement: Identify areas for improvement in the operations and maintenance processes and implement necessary changes.
Requirements
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Experience in operations and maintenance development, preferably in a cloud computing or AI-focused environment.
  • Strong understanding of system architecture, performance monitoring, and troubleshooting methodologies.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, startup environment.
  • Proficiency in Kubernetes (K8S), CI/CD, and Docker.
  • Expertise in either AWS (VPC, S3, EC2, etc.) or Python.
  • Ability to build the operations and maintenance infrastructure platform and handle core business operations.
  • Management experience is a plus, but not required.
  • Prior experience working in structured environments such as Huawei, ZTE, or banking institutions is preferred.
Benefits
  • Hypergrowth Startup Environment
  • Fantastic Career Progression Opportunities
  • Work within a Global and Local Team
  • Collaborative and innovative work environment with opportunities to contribute to cutting-edge projects.

Top Skills

Kubernetes
Python

The Company
13 Employees
Remote Workplace
Year Founded: 2021

What We Do

Aethir builds Decentralized Cloud Infrastructure (DCI) for Gaming and AI companies.

