Staff Site Reliability Engineer

Posted 9 Days Ago
Mountain View, CA
Mid level
Artificial Intelligence • Information Technology • Machine Learning • Natural Language Processing • Software
The AI copilot that takes the friction out of work.
The Role
As a Staff Site Reliability Engineer, you will ensure the health, performance, and capacity of the Moveworks AI infrastructure and services, design solutions to enhance operational efficiency, and collaborate with various teams to build scalable systems. You will develop tools for deployment and monitoring while advocating best practices in system reliability.
Summary Generated by Built In
Who We Are 

Moveworks is the universal AI copilot for search and automation across all your business applications. We give employees one place to go to find information and get support while reducing costs for your business. The Moveworks Copilot is powered by an industry-leading Reasoning Engine that uses a combination of public and proprietary language models to understand employee queries, then build and execute multi-step plans that resolve them. It does this by linking into systems (like ITSM, HRIS, ERP, identity management, and more) with native and custom-built integrations that turn natural language into powerful automations for employees.

The world’s most innovative brands like Databricks, Broadcom, Hearst, and Palo Alto Networks trust Moveworks to eliminate repetitive support issues, deliver instant knowledge, and empower employees to work faster across applications.

Founded in 2016, Moveworks has raised $315 million in funding, at a valuation of $2.1 billion, thanks to our award-winning product and team. In 2023, we were included in the Forbes Cloud 100 list as well as the Forbes AI 50 for the fifth consecutive year. We were also recognized by the 2023 Edison Awards for AI Optimized Productivity, and were included on Fast Company's Most Innovative Companies list for 2024! 

Moveworks has over 500 employees in six offices around the world, and is backed by some of the world's most prominent investors, including Kleiner Perkins, Lightspeed, Bain Capital Ventures, Sapphire Ventures, Iconiq, and more.

Come join one of the most innovative teams on the planet!

What You Will Do

As a site reliability engineer, you will own the overall health, performance, and capacity of the Moveworks AI infrastructure and services. In addition to helping engineering teams resolve operational issues, you will design and implement solutions, tools, and practices that help us improve operational efficiency and product SLAs. This role is a blend of SRE, infrastructure, and software development.

We’re building a team that indexes on moving fast, solving challenging product/engineering problems, and providing value to our customers. To be successful, you'll partner with and enable machine learning, search, product, data, and full-stack teams to design and build fault-tolerant and scalable infrastructure, services, and features. This is an opportunity to play an integral role at the fastest-growing AI startup in its space.

  • Design, develop, and evolve site reliability and chaos engineering for Moveworks infrastructure and services.
  • Work closely with machine learning, search, product, infrastructure, data, and frontend teams to understand their infrastructure and operational needs and build solutions that are optimal, fault-tolerant, and scalable.
  • Author and advocate for reliability through established distributed-system design patterns (error handling, retries, rate limiting, circuit breaking, etc.). Participate in design discussions and ensure operational readiness of infrastructure, services, and features.
  • Design and build tools, libraries, and frameworks that allow engineering teams to rapidly deploy and scale Moveworks infrastructure and applications.
  • Review and participate in application performance analysis / tuning and capacity planning.
  • Set up and maintain monitoring, metrics, and reporting systems for observability and actionable alerting.
  • Define internal and customer-facing key SLA metrics, and implement solutions and practices with different teams to improve those metrics.
  • Own the engineering on-call process and setup. Drive discussions for outages, root cause analysis, and action items.
  • Participate in the on-call rotation for second-tier escalation (at Moveworks, each engineer participates in the team-specific first-tier on-call rotation). Help diagnose and resolve complex operational issues.

What You Bring To The Table

  • 7+ years of experience in authoring and operating complex distributed infrastructure and applications
  • Strong experience with container orchestration platforms like Kubernetes and cloud infrastructure like AWS / GCP / Azure
  • Very high proficiency with Unix/Linux, TCP/IP, DNS, load balancers, autoscaling, file systems, and different types of data stores
  • Software development proficiency with Python, Golang, Java, or C++
  • Experience working across teams and implementing solutions, tools, and practices to improve observability, reliability, and scalability
  • Desire to work at a startup pace in a small company with a high degree of ownership 
  • Strong motivation, gumption, and an appetite for continuous, incremental changes and completing challenging projects fast
  • High level of curiosity about engineering outside of your immediate discipline and an incessant desire to learn
  • BS+ in computer science or a related field

Compensation Range: $227,000 - $290,000

*Our compensation package includes a market competitive salary, equity for all full time roles, exceptional benefits, and, for applicable roles, commissions or bonus plans. 
Ultimately, in determining pay, final offers may vary from the amount listed based on geography, the role’s scope and complexity, the candidate’s experience and expertise, and other factors.

Moveworks Is An Equal Opportunity Employer
*Moveworks is proud to be an equal opportunity employer. We provide employment opportunities without regard to age, race, color, ancestry, national origin, religion, disability, sex, gender identity or expression, sexual orientation, veteran status, or any other characteristics protected by law.

Top Skills

Site Reliability Engineering
The Company
HQ: Mountain View, CA
485 Employees
Hybrid Workplace
Year Founded: 2016

What We Do

The Moveworks Copilot unifies every business system, giving employees one place to go to find information and automate tasks, increasing employee productivity by simplifying work. Powered by a genAI infrastructure that leverages the world’s most advanced LLMs and our proprietary MoveLM models, the Moveworks Copilot understands employee requests, devises intelligent plans, then executes actions to get work done across application boundaries.

The world’s most recognizable brands like Databricks, Broadcom, Hearst, and Palo Alto Networks trust Moveworks to automate repetitive support issues, provide a universal search interface, and handle common use cases across different applications. Learn more at Moveworks.com.

Why Work With Us

Company culture is difficult to distill. Sure, we could tell you about our 5-star rating on Glassdoor, or about how we were named to the Forbes AI 50 list. But the truth is that no talking point can capture what it’s like to work alongside this team every day, as we continue to push the boundaries of what’s possible with enterprise AI.
