Senior Site Reliability Engineer - GCP Focused

Posted 8 Days Ago
3 Locations
Remote
116K-198K Annually
Senior level
Cloud • Information Technology • Software
The Role
The role involves administering large-scale cloud infrastructure, supporting ML platforms, automating processes, and ensuring system reliability with a focus on GCP.

About the Role


We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our dynamic team. The ideal candidate will have a strong background in managing large-scale, data-intensive production-grade systems and infrastructure, with deep experience in cloud observability, automation, and reliability engineering at scale. A solid understanding of public cloud services—especially Google Cloud Platform (GCP)—is essential.


At the core of this role is the administration and maintenance of cloud infrastructure, including on-call support, monitoring, automation, deployment, the establishment of CI/CD pipelines, and the formulation of reusable cloud infrastructure templates via infrastructure as code (IaC) methodologies.


You will apply these SRE principles to design and implement scalable, automated infrastructure supporting ML model training, real-time inference APIs, and analytics workloads across platforms like Vertex AI, BigQuery, and Dataproc. You’ll work closely with ML and data teams to ensure production systems are observable, performant, and fault-tolerant — embedding reliability into every stage of the pipeline.


This role involves working in a remote environment, requiring excellent communication skills and the ability to solve complex problems independently and creatively.


Work Location: US-Remote, Canada-Remote

Key Responsibilities:

  • Administer and optimize cloud-native databases and storage platforms, including Google Cloud Storage (GCS), Cloud SQL, Spanner, and Firestore.
  • Support and maintain machine learning and analytics platforms, including Vertex AI, Generative AI, BigQuery, Looker, and Dataproc, ensuring scalable and reliable infrastructure for data pipelines and model workflows.
  • Implement and manage cloud observability using OpenTelemetry and native GCP tools to enable real-time monitoring, distributed tracing, and incident resolution.
  • Support and maintain large-scale applications, computer systems, and networks in production environments.
  • Administer and troubleshoot Linux-based systems and core networking protocols such as TCP/IP, HTTP, DNS, and mail protocols, and manage components such as content delivery networks (CDNs) and load balancers.
  • Manage and operate GCP services, including Kubernetes Engine (GKE), Compute Engine (GCE), networking, security, CI/CD pipelines, and other common cloud technologies.
  • Build and maintain cloud infrastructure using Infrastructure as Code (IaC) tools such as Terraform, Ansible, and Helm charts.
  • Develop and deploy services using Python, Golang, or Java, and implement CI/CD pipelines to ensure consistent, reliable delivery of applications and infrastructure components.

Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, including hands-on operational support and participation in on-call rotations.
  • Proven track record of managing large-scale applications, distributed systems, and networked services in production.

Must Have (Important):

  • 5+ years of hands-on experience in cloud environments
  • Deep understanding of Google Cloud Platform (GCP), especially GKE, GCE, networking, and security
  • Strong troubleshooting and debugging skills across systems and networks
  • Cloud-native databases and storage, including Google Cloud Storage (GCS), Cloud SQL, Spanner, and Firestore
  • Machine learning and AI platforms such as Vertex AI, Generative AI tools, BigQuery, Looker, and Dataproc
  • Cloud observability and monitoring: hands-on experience with OpenTelemetry, tracing, metrics, and distributed logging systems

The following information is required by pay transparency legislation in the following states: CA, CO, HI, NY, and WA. This information applies only to individuals working in these states.

 

  • The anticipated starting pay range for Colorado is: $116,100 - $170,280.
  • The anticipated starting pay range for the states of Hawaii and New York (not including NYC) is: $123,600 - $181,280.
  • The anticipated starting pay range for California, New York City and Washington is: $135,300 - $198,440.


Unless already included in the posted pay range and based on eligibility, the role may include variable compensation in the form of bonus, commissions, or other discretionary payments. These discretionary payments are based on company and/or individual performance and may change at any time. Actual compensation is influenced by a wide array of factors including but not limited to skill set, level of experience, licenses and certifications, and specific work location. Information on benefits offered is here.




About Rackspace Technology

We are the multicloud solutions experts. We combine our expertise with the world’s leading technologies (across applications, data and security) to deliver end-to-end solutions. We have a proven record of advising customers based on their business challenges, designing solutions that scale, building and managing those solutions, and optimizing returns into the future. Named a best place to work year after year by Fortune, Forbes and Glassdoor, we attract and develop world-class talent. Join us on our mission to embrace technology, empower customers and deliver the future.

 

 

More on Rackspace Technology

Though we’re all different, Rackers thrive through our connection to a central goal: to be a valued member of a winning team on an inspiring mission. We bring our whole selves to work every day. And we embrace the notion that unique perspectives fuel innovation and enable us to best serve our customers and communities around the globe. We welcome you to apply today and want you to know that we are committed to offering equal employment opportunity without regard to age, color, disability, gender reassignment or identity or expression, genetic information, marital or civil partner status, pregnancy or maternity status, military or veteran status, nationality, ethnic or national origin, race, religion or belief, sexual orientation, or any legally protected characteristic. If you have a disability or special need that requires accommodation, please let us know.

 

 


Top Skills

Ansible
Compute Engine (GCE)
Go
Google Cloud Platform (GCP)
Helm
Java
Kubernetes Engine (GKE)
OpenTelemetry
Python
Terraform

The Company
HQ: San Antonio, TX
7,509 Employees
On-site Workplace
Year Founded: 1998

What We Do

At Rackspace Technology, we accelerate the value of the cloud during every phase of digital transformation. By managing apps, data, security and multiple clouds, we are the best choice to help customers get to the cloud, innovate with new technologies and maximize their IT investments. As a recognized Gartner Magic Quadrant leader, we are uniquely positioned to close the gap between the complex reality of today and the promise of tomorrow. Passionate about customer success, we provide unbiased expertise, based on proven results, across all the leading technologies. And across every interaction worldwide, we deliver Fanatical Experience™, the best customer service experience in the industry. Rackspace has been honored by Fortune, Forbes, Glassdoor and others as one of the best places to work.
