Lead Site Reliability Engineer

Posted 9 Hours Ago
Hiring Remotely in USA
Remote
Senior level
Automotive • Software
The Role
Lead and mentor SRE teams to enhance platform reliability, optimize software delivery, and manage Kubernetes infrastructure while resolving incidents and driving operational excellence.

Roadie, a UPS company, is a leading logistics and delivery platform that helps businesses tackle the complexities of modern retail with unmatched delivery coverage, flexibility and visibility. Reaching 97% of U.S. households across more than 30,000 zip codes — from urban hubs to rural communities — Roadie provides seamless, scalable solutions that meet a variety of delivery needs. 

With a network of more than 310,000 independent drivers nationwide, Roadie offers flexible delivery solutions that make complex logistics challenges easy, including solutions for local same-day delivery, delivery of big and bulky items, ship-from-store and DC-to-door. 

Roadie is seeking a Lead Site Reliability Engineer to join our growing Technical Operations Team. We are looking for a leader with a proven track record of managing high-performing SRE teams in high-availability, mission-critical environments. The ideal candidate is a strategic problem solver with deep expertise in site reliability best practices, DevOps principles, AWS and GCP, Kubernetes, and automation. You will play a key role in driving reliability, scalability, and operational excellence across our platform.

What You'll Do

  • Lead and mentor teams focused on enhancing platform reliability, optimizing uptime, and improving software delivery, observability, and infrastructure operations
  • Architect, maintain, and optimize production and non-production Kubernetes clusters (EKS), as well as Elasticsearch (ES), MSK, RDS, and ElastiCache (Redis) clusters
  • Design, deploy, and manage monitoring and logging solutions using Prometheus, Loki, Thanos, Grafana, OpenTelemetry, and New Relic
  • Strategize and collaborate with cross-functional teams to proactively identify bottlenecks, optimize resource utilization, and prevent system failures
  • Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to drive reliability improvements
  • Automate and streamline operational tasks, reducing toil and increasing efficiency across engineering teams
  • Plan and forecast service capacity and demand, optimize costs, and fine-tune system performance
  • Lead troubleshooting initiatives and post-mortems, and resolve production and non-production incidents to ensure high availability and performance
  • Participate in and manage a 24/7 on-call rotation, responding to incidents and driving post-mortem improvements
  • Occasionally work non-standard hours to facilitate production upgrades or deployments

Technology We're Using Now

  • Python, Ruby on Rails, Golang
  • React/Redux, Objective-C and Swift, Android
  • Postgres, Redshift, Redis, Kafka
  • AWS/GCP
  • Docker/Kubernetes
  • OpenTelemetry/Prometheus/Thanos/Loki/Grafana/New Relic/Sentry
  • Git/CircleCI
  • ArgoCD

What You Bring

  • 6+ years in various SRE roles
  • 6+ years in various DevOps/Systems Engineering roles
  • 3+ years leading and managing SRE teams
  • 6+ years of experience building and managing production Kubernetes infrastructure
  • 7+ years of experience with popular scripting languages (Python, Ruby, Bash, etc.)
  • Experience with infrastructure as code, such as Terraform or Crossplane
  • Experience with CI/CD tools (CircleCI, etc.)
  • Experience with GitOps tools (ArgoCD)
  • Experience using a broad range of AWS technologies (RDS, Elasticsearch, VPC, EKS, S3, CloudFront, MSK, ElastiCache, CloudWatch, etc.)
  • Experience developing and maintaining YAML templating systems (Helm charts, Kustomize, etc.)
  • Must be able to work independently, be self-motivated and handle multiple priorities
  • Comfortable working in a fast-paced agile environment

Finally, a willingness to admit what you don’t know, and learn what you need to learn quickly.

Why Roadie? 

  • Competitive compensation packages 
  • 100% covered health insurance premiums for yourself
  • 401k with company match
  • Tuition and student loan repayment assistance (that’s right - Roadie will contribute directly to your existing student loans!) 
  • Flexible work schedule with unlimited PTO 
  • Monthly 3-day weekends
  • Monthly WFH stipend 
  • Paid sabbatical leave: tenured team members are given time to rest, relax, and explore
  • The technology you need to get the job done

This role is not eligible for visa sponsorship. Applicants must be authorized to work for any employer in the U.S.

Top Skills

Android
ArgoCD
AWS
CircleCI
Docker
GCP
Git
Go
Grafana
Kafka
Kubernetes
Loki
New Relic
Objective-C
OpenTelemetry
Postgres
Prometheus
Python
React/Redux
Redis
Redshift
Ruby on Rails
Sentry
Swift
Terraform
Thanos

The Company
Atlanta, GA
260 Employees
On-site Workplace
Year Founded: 2014

What We Do

Roadie is the nation’s first “on the way” crowdsourced delivery platform. Roadie works with consumers, small businesses and big global brands across virtually every industry to provide a faster, cheaper, more scalable solution for scheduled, same-day and urgent delivery. With more than 200,000 active drivers nationwide, Roadie reaches more than 11,000 cities and 20,000 zip codes – the largest local same-day delivery footprint in the nation.
