Stacklok

Site Reliability Engineer II (SRE)

Posted 9 Days Ago

London, Greater London, England

Mid level

Security

The Role

The Site Reliability Engineer II at Stacklok will focus on automation, monitoring, and incident response, contributing to feature shipping and best practices for reliability and performance within a product team. This role also emphasizes collaboration within the SRE guild to ensure consistent practices and a robust SaaS platform.

Summary Generated by Built In

Stacklok is an innovative software supply chain security startup founded by Kubernetes co-founder Craig McLuckie and Sigstore founder Luke Hinds. Our mission is to make it easier to securely develop software. With our deep expertise in open source technologies and commitment to enhancing software security, we are seeking highly skilled and motivated individuals to join our team. This is a rare opportunity to join a startup at an early stage, and to be part of a team that is committed to building something truly innovative and impactful. Learn more about Stacklok’s mission, virtues, and leadership, HERE.

Location

This is a hybrid role that requires on-site work at our London office three (3) days a week. Our office is conveniently located in WeWork at 1 Mark Square, London, EC2A 4EG.

Elevator Pitch

Stacklok Cloud is a comprehensive security platform that combines open source package intelligence with a policy platform built on the open source project, Minder, allowing developers to securely consume open source software while enabling security teams to effectively manage and maintain a robust security posture across the entire software supply chain.

As Stacklok Cloud is delivered to major companies across the world, ensuring its scalability, security, performance, and reliability is essential. We’re seeking a Site Reliability Engineer II to contribute to initiatives within a product team, focusing on automation, monitoring, configuration management, continuous delivery, and incident response. This role involves applying both SRE and software engineering expertise to ship new features and serve as a resource for best practices in reliability, performance metrics, and system resilience. Additionally, participation in Stacklok's SRE guild will be integral, collaborating with peers to drive consistent practices in automation, observability, and reliability across all products, fostering a seamless and high-performing SaaS platform.

Join our team of exceptionally talented engineers and become part of a groundbreaking field that tackles critical challenges for developers and the OSS community. Contribute to an open source strategy that focuses on building and expanding an ecosystem for diverse OSS tools, and help shape the future of open source development with innovative and impactful work.

Success In The Role: 6-12 Months Expectations

Acclimatize to the Team: Familiarize yourself with our engineering processes. Build connections with team members, immerse yourself in our company culture, understand our virtues, and learn the way we work and collaborate.
Understand Our Products and Services: Develop a strong grasp of Stacklok’s products and services, our platform vision and goals, both immediate and future, to enable alignment between your contributions and our objectives.
Deep Dive Into Stacklok Cloud Architecture: Become comfortable with the current infrastructure-as-code environment using Terraform to deploy SaaS software to Kubernetes on AWS. Apply tools like Argo CD for continuous delivery, Helm for managing Kubernetes packages, and GitHub Actions for workflow automation.
Proficiency in Go and Python: Develop proficiency in Go, our primary programming language, focusing on best practices, idiomatic design patterns, effective error handling, and unit testing. Demonstrate intermediate knowledge of Python, specifically in leveraging its capabilities for automation, scripting, and building internal tools.
Hybrid Contribution: As part of the SRE team, you’ll primarily balance contributions between product reliability improvements and company-wide infrastructure enhancements, advancing Stacklok’s platform underpinnings. Additionally, you'll make direct contributions to feature development, further enhancing the capabilities of our products and services.

Technical Guidance and Documentation: Support production infrastructure by contributing to and maintaining comprehensive documentation, including playbooks and architectural diagrams, to ensure team alignment.
On-Call Rotation Responsibilities: You will be on call every 5-6 weeks, for a 2-week rotation each time. During each rotation, you will alternate between primary and secondary roles with the other on-call engineer. The primary role involves leading incident resolution and communication, while the secondary role provides support with troubleshooting and monitoring.

In This Role You Will Have The Opportunity To:

Shape The Future of Stacklok Cloud: As a site reliability engineer, you’ll play a key role in supporting and enhancing our platform’s reliability and performance. Your focus will include regular platform upgrades and the instrumentation and monitoring of production systems. You’ll help advance our platform and shape strategies for the future of software supply chain security.
Embrace an Automate Everything Mindset: Contribute to a culture of automation by streamlining operational tasks and enhancing efficiency across the environment. You’ll support automation initiatives for incident management tooling, application autoscaling, and recovery processes to ensure resilient systems that adapt to changing demands. Collaborating with a skilled team, you’ll help automate playbooks, continuous delivery pipelines, and GitHub Terraform processes, driving improvements in service delivery and incident response.
Monitor and Improve Service Performance: Support end-to-end monitoring of service KPIs to drive improvements and maintain optimal performance. You’ll regularly review logs and performance metrics, using shared tools and incident response automations to enhance system reliability. With an analytical mindset, you’ll contribute to identifying areas for KPI improvement, helping us consistently meet and exceed our performance goals.
Learn and Grow with Mentorship Opportunities: Work alongside experienced engineers who will support your professional growth and skill development. By collaborating in a culture of empathy, curiosity, and psychological safety, you’ll deepen your understanding of infrastructure and site reliability best practices. Engaging in code reviews and team discussions will allow you to refine your skills, share insights, and contribute to a strong, capable team. This role offers a clear path for growth, helping you build toward new responsibilities and technical expertise.

We understand that not everyone will meet every requirement listed, and that’s perfectly okay! We encourage you to apply regardless of your self-assessment. We value a diverse range of skills and experiences and believe that your unique attributes can make a significant impact. We want to hear from you!

Desired Skills & Experience

Experience in Site Reliability Engineering supporting an enterprise SaaS service with evidence of maintaining high availability and performance in production environments.
Proficient in programming languages, particularly Go or Python, demonstrating the ability to write clean, efficient, and maintainable code.
Familiarity with Infrastructure as Code (IaC) principles, with proficiency in automation tools like Terraform for environment provisioning and configuration management.
Experience with a major cloud provider (AWS, Azure, Google), preferably AWS.
Understanding of cloud-native application deployment and management using technologies like Docker and Kubernetes, with exposure to scaling and recovery strategies.
Experience in automating incident response processes using platforms such as PagerDuty to improve response times and incident management efficiency.
Proficient in log aggregation and analysis tools such as AWS Athena and CloudWatch, enabling thorough performance reviews and proactive issue identification.
Exposure to defining and implementing Service Level Objectives (SLOs) and key performance indicators (KPIs) to drive service quality and operational excellence.
Knowledge of security best practices in site reliability, with an emphasis on operational security measures and maintaining a secure software supply chain.
Impact-Driven and Collaborative: Track record of delivering solutions that drive business outcomes; excellent written and verbal communication skills for engaging diverse stakeholders. Committed to fostering growth and continuous improvement within teams.
Versatile and Self-Starting: Adaptable in dynamic, startup environments, comfortable in varied roles—from individual contributor to conference presenter—and skilled at making technical topics accessible to broad audiences.

Why Join Us?

At Stacklok, you will be a part of a culture that values open communication, collaboration, and innovation. We offer a competitive salary package and flexible work hours. If you’re a self-motivated and results-driven individual with a passion for designing and building secure, scalable, distributed systems, and you want to be part of the most exciting startup in the secure supply chain space, come and join us!

Stacklok Inc. is proud to be an equal opportunity employer. We are committed to providing equal employment opportunities for all people and place great value on both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.