Staff Site Reliability Engineer

Posted Yesterday
Sunnyvale, CA
Senior level
Software • Cybersecurity
The Role
Seeking a proactive Staff Site Reliability Engineer for production support, performance monitoring, automation, and cross-functional collaboration in a complex SaaS environment.

Location: Onsite, Sunnyvale, California (5 days a week in the office)

Onwards Together!

Illumio, the pioneer and market leader of Zero Trust segmentation, prevents breaches from becoming cyber disasters. Illumio protects critical applications and valuable digital assets with proven segmentation technology purpose-built for the Zero Trust security model. Illumio ransomware mitigation and segmentation solutions see risk, isolate attacks, and secure data across cloud-native apps, hybrid and multi-clouds, data centers, and endpoints, enabling the world’s leading organizations to strengthen their cyber resiliency and reduce risk.  Illuminate the future with Illumio and join a team that’s passionate about developing cutting-edge security solutions that protect the world's most critical assets. 

Our Team's Vision:

Our Engineering team is driven by a culture that thrives on visionary leadership, autonomy, and ownership, creating a dynamic synergy that drives us forward in the ever-evolving landscape of cybersecurity. 

When you join our team, you become part of the leader in Zero Trust Segmentation. You'll work with a cutting-edge technology stack that spans operating systems, distributed applications, and immersive UI/visualization tools.  

We're shaping the future of cybersecurity. And together, we will continue to build world-class products—led by people with different perspectives, backgrounds, and a commitment to innovation in a time when the world faces its greatest cybersecurity threats in history. 

Your Impact: 

We are seeking a skilled and proactive Product SRE (Site Reliability Engineer) to join our team and take ownership of debugging, troubleshooting, and resolving production escalations in a complex SaaS environment. The ideal candidate will have a deep understanding of AWS and Azure cloud platforms, application performance, and operational excellence, with a passion for automation and continuous improvement.

  1. Production Support:

    • Investigate and resolve production incidents and escalations to ensure minimal downtime and impact to customers.

    • Work closely with engineering and support teams to troubleshoot application and infrastructure issues.

  2. Performance Monitoring and Optimization:

    • Proactively monitor application health, performance, and reliability using modern observability tools.

    • Analyze trends in system behavior and suggest performance improvements.

  3. Automation and Tooling:

    • Develop and maintain automation scripts and tools to improve operational efficiency and incident resolution.

    • Create and enhance runbooks to streamline troubleshooting and reduce mean time to resolution (MTTR).

  4. Root Cause Analysis (RCA):

    • Conduct thorough post-incident reviews to identify root causes and implement preventive measures.

    • Drive a culture of continuous improvement by documenting lessons learned and improving system designs.

  5. Cross-Functional Collaboration:

    • Partner with software engineers, QA, and product teams to improve application stability and user experience.

    • Act as a bridge between development and operations, ensuring smooth and reliable service delivery.

Your Toolkit:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.

  • 8+ years of relevant SRE experience.

  • Cloud Expertise:
    • Strong hands-on experience with AWS and Azure.
    • Familiarity with Kubernetes and containerized environments.
    • Knowledge of networking concepts, such as DNS, load balancing, and firewalls.
  • Troubleshooting Skills:
    • Proficient in diagnosing and resolving complex issues in SaaS environments, including performance bottlenecks and application errors.
  • Programming and Scripting:
    • Proficiency in at least one programming language (e.g., Python, Go, Java) and scripting languages (e.g., Bash, PowerShell).
  • Monitoring and Observability:
    • Experience with tools like Datadog, New Relic, Prometheus, Grafana, ELK, or Azure Monitor.
  • Automation and Configuration Management:
    • Familiarity with tools like Ansible, Terraform, or CloudFormation.
  • Database Experience:
    • Knowledge of debugging and optimizing relational databases (e.g., PostgreSQL, MySQL) and caching systems (e.g., Redis, Memcached).
  • Incident Management:
    • Experience with incident management tools and processes, including conducting RCAs and improving on-call processes.

Compensation:

$192,000 - $230,000 USD

The pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include the responsibilities of the job, education, location, experience, knowledge, skills, abilities, internal equity, alignment with market data, and applicable laws.

At Illumio, we offer a wide range of benefits to our eligible team members. Our benefit programs vary by location and can include: Medical, Dental, and Vision Coverage; Health and Dependent Savings Accounts; Life and Disability Programs; Paid Parental Leave; Voluntary Benefit Programs; Company Sponsored Wellness Program; Wellness Reimbursement Program; Retirement Savings; Equity Opportunities; Paid Time Off and Paid Holidays; Employee Incentive Program.

Our Commitment: 

Illumio believes that an environment of unique backgrounds, experiences, viewpoints, and individual contributions drives our success and makes us stronger together. We are dedicated to creating and maintaining a diverse culture and emphasizing inclusion and belonging.   

All official job offers from our company are extended directly by our recruitment team and will be sent through an official DocuSign document for your review and signature. Please be aware that we do not ask for any personal information in the process of extending offers of employment, such as financial details or social security numbers. Upon acceptance of any offer, we will request such information as part of the onboarding process prior to or on your first day of employment, and only after completing a background check through an authorized third-party vendor. If you receive any communication asking for personal details outside of these processes, please contact us immediately to verify the authenticity of the request. Your security is important to us, and we are committed to a safe and transparent hiring experience. 

Top Skills

Ansible
AWS
Azure
Azure Monitor
Bash
CloudFormation
Datadog
ELK
Go
Grafana
Java
Kubernetes
Memcached
MySQL
New Relic
Postgres
PowerShell
Prometheus
Python
Redis
Terraform

The Company
Sunnyvale, CA
552 Employees
On-site Workplace
Year Founded: 2013

What We Do

Illumio, the Zero Trust Segmentation company, prevents breaches from spreading and turning into cyber disasters. Illumio protects critical applications and valuable digital assets with proven segmentation technology purpose-built for the Zero Trust security model. Illumio ransomware mitigation and segmentation solutions see risk, isolate attacks, and secure data across cloud-native apps, hybrid and multi-clouds, data centers, and endpoints, enabling the world’s leading organizations to strengthen their cyber resiliency and reduce risk.  

