Staff Engineer (Site Reliability)

Posted 10 Days Ago
India
Senior level
Information Technology
The Role
The Staff Engineer (Site Reliability) will provide strategic leadership for SRE practices, architect complex systems for scalability, lead major incident response, and foster cross-functional collaboration. The role involves capacity planning, mentorship for SRE teams, and influencing organizational best practices. The ideal candidate will have extensive experience with AWS environments, observability tools, and programming, and will apply it to ensure system reliability and performance.
Summary Generated by Built In

Responsibilities:

  • Strategic Leadership: Define and drive the strategic direction for SRE practices and reliability engineering within the organization, influencing both technical and operational strategies.
  • Advanced System Architecture: Architect and implement complex systems and solutions, addressing high-impact and cross-team challenges with a focus on scalability, reliability, and performance.
  • High-Level Incident Management: Lead major incident response efforts and postmortem analyses, ensuring thorough investigations and comprehensive resolution strategies to improve overall system resilience.
  • Cross-Functional Collaboration: Partner with engineering, operations, and product teams to embed reliability and performance best practices into all aspects of system design and development.
  • Innovation and Improvement: Drive innovation in reliability engineering practices, introducing new tools, technologies, and methodologies to enhance system performance and operational efficiency.
  • Strategic Capacity Planning: Oversee long-term capacity planning and forecasting, aligning resource allocation with business goals and scaling needs to ensure continuous service reliability.
  • Mentorship and Leadership: Provide guidance and mentorship to senior and junior SREs, fostering a culture of learning and professional development within the SRE team.
  • Organizational Impact: Contribute to and influence organizational policies, procedures, and best practices related to system reliability, ensuring alignment with broader business objectives and industry standards.

Requirements:

  • 8+ years of experience as an SRE in AWS environments within medium to large-scale organizations.
  • 6+ years of hands-on experience with observability tools, including Prometheus, New Relic, Grafana, or similar.
  • Exceptional proficiency in programming, with expertise in Python, Groovy, and Bash.
  • Extensive experience managing database technologies, both SQL and NoSQL.
  • 5+ years of experience in designing and building infrastructure deployment pipelines using Git, Terraform, Helm, Jenkins/Jenkins X/ArgoCD, or similar tools.
  • Advanced expertise in designing and managing production environments in AWS, including services such as VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connect, Route 53, RDS, ALBs, Autoscaling, and more.
  • Deep knowledge of Linux systems and a range of protocols and technologies, including HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, Load Balancing, SQL/NoSQL, Message Brokers, Nginx, Vault, ELK, and others.
  • Hands-on experience with Kubernetes and a variety of container and cloud-native technologies.
  • Proven ability to manage 24/7 on-call rotations, develop runbooks, establish support procedures, and proactively monitor systems across multiple geographic locations.
  • Ability to excel under pressure in complex, high-stakes environments.

Innovation Lives Here


You go all in no matter what you do, and so do we. At Lytx, we’re powered by cutting-edge technology and Happy People. You want your work to make a positive impact in the world, and that’s what we do. Join our diverse team of hungry, humble and capable people united to make a difference.

Together, we help save lives on our roadways.

Find out how good it feels to be a part of an inclusive, collaborative team. We’re committed to delivering an environment where everyone feels valued, included and supported to do their best work and share their voices.

Lytx, Inc. is proud to be an equal opportunity/affirmative action employer and maintains a drug-free workplace. We’re committed to attracting, retaining and maximizing the performance of a diverse and inclusive workforce. EOE/M/F/Disabled/Vet.

Top Skills

Bash
Groovy
Python

The Company
Framingham, MA
790 Employees
On-site Workplace
Year Founded: 1998

What We Do

Learn how Lytx video telematics can help you improve safety, efficiency, and DOT compliance in your fleet. Start improving your fleet operations today.
