Site Reliability Engineer - New York, NY

Posted 2 Days Ago
New York, NY
Mid level
Cloud • Software • Database
Our mission is to enable every developer to build world-changing applications.
The Role
As a Site Reliability Engineer, you will manage and scale CockroachCloud services, overseeing production systems and ensuring operational efficiency. Your responsibilities include developing tools for reliability, automating processes, troubleshooting systems, and participating in disaster recovery tests while collaborating with various teams.
Summary Generated by Built In

Databases are the beating heart of every business in the world.

Cockroach Labs is the creator of CockroachDB, the most highly evolved cloud-native, distributed SQL database on the planet that scales fast, survives anything, and thrives anywhere. We created CockroachDB to unshackle teams from the constraints of their database. Join us on our mission to simplify how businesses build and operate world-changing applications!

About the Role

CockroachDB provides the backbone for storing data on a global scale. As a Site Reliability Engineer, you’ll help manage and scale our CockroachCloud service, a fully managed offering of CockroachDB. You will oversee our production system, ensuring that we can provide stable and scalable infrastructure as we deliver CockroachDB to our customers. CockroachCloud is a global service spanning multiple cloud providers. Roughly half of your time will be spent on greenfield development work, with an emphasis on developing tooling and driving automation. In this role, you will work across multiple teams within CockroachCloud as well as with the development and product teams working on CockroachDB.

You Will

  • Manage the infrastructure for cloud services, including running internal production systems and hosting CockroachDB for our external customers.
  • Design, write and deliver software and systems to increase product reliability and operational efficiency.
  • Develop custom tools as necessary.
  • Keep a complex system running and solve problems relating to mission-critical services.
  • Design, implement, operate, and troubleshoot the automation and monitoring of production clusters to maximize performance and availability.
  • Drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
  • Participate in an on-call rotation for our production systems and hosted services.

The Expectations

In your first 30 days, you will onboard and be exposed to our current internal and customer-facing production systems. Working with our existing SRE and engineering teams, you will pair on production operations and build out runbooks for the operation of different systems. We believe that it's essential for you to take this first month to become familiar with our technology and our company.

After 3 months, you'll be fully integrated into the team. You will develop and own tooling for reliability, automation, and other issues related to CockroachCloud’s stability and scalability. You will identify new opportunities for automating processes, streamlining delivery, deploying new core functionality, and building great tools. You will help make CockroachCloud the best platform to host CockroachDB on by bringing your expertise to our database.

You Have

  • Expertise in analyzing, monitoring, and troubleshooting large-scale distributed systems.
  • Experience in software development using one or more of the following: Go, C, C++, Python, Java.
  • Proficiency working with algorithms, data structures, and production troubleshooting.
  • Expertise in working with major cloud providers (AWS, Azure, GCP, etc.) and Cloud APIs.
  • Experience debugging and optimizing code and automating routine tasks.
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • Previous on-call experience, with a sense of urgency.
  • Experience building collaborative relationships with your colleagues. You enjoy being part of the code review process and partnering with your teammates on challenging problems.

The Team 

Our core mission on the SRE team is to operate a secure and reliable Cockroach Cloud product at scale. We are a group of software engineers first and foremost. We use software engineering as a means to achieve our mission; this is the SRE way. The SRE team is currently distributed across North America (6) and India (3).

Reporting to Tom Schmidt - Sr. Manager, Engineering (Site Reliability Engineering)

Tom recently joined Cockroach Labs as manager of Site Reliability Engineering and has taken responsibility for Cockroach Cloud’s production operations. Tom joined Cockroach Labs after 15 years at IBM, where he initially contributed in a wide variety of technical leadership roles, generally focusing on quality and automation across compiler development, test frameworks, CI/CD, and more. Over the past 7 years, Tom has become an enthusiastic advocate of the Site Reliability Engineering discipline, presenting on the topic at conferences, developing certification curriculum, and securing multiple patents. Tom was also a primary contributor to the establishment of IBM’s formal SRE profession and was recognized as one of the first three SRE Thought Leaders within the company. Most recently, Tom transitioned into a management role where he introduced Site Reliability Engineering to the IBM Business Analytics organization, building an SRE team from the ground up, eventually managing over 20 individuals across 3 project areas while establishing practices that now guide over 80 engineers internationally. Cockroach Labs presented a new and unique opportunity to gain experience in a fast-paced startup environment, laying the foundation for scalable reliability as we prepare for the rapid growth of our Cockroach Cloud offering. Beyond the business, Tom is blessed to call himself the proud father of a 4-year-old boy, and otherwise enjoys finding balance between spending time in nature (hiking, camping, exploring) and testing his mettle in competitive gaming.

Jordan Lewis - Senior Director of Engineering

Jordan is the Head of Engineering for CockroachDB Cloud. He’s responsible for the teams that build and maintain CockroachDB Cloud and keep it reliably serving the needs of Cockroach Labs’ most demanding customer base. He joined Cockroach Labs as a database engineer in 2016, when it was just 25 people, before moving into engineering leadership and most recently moving to lead the Cloud organization. Jordan lives in his hometown of Brooklyn, NY, with his wife. Outside of work, he enjoys folk music and riding his electric scooter around town.

Isaac Wong - EVP of Engineering

Isaac is responsible for the health of the engineering organization at Cockroach Labs. He partners closely with teams to ensure we have a balanced culture that promotes quality and innovation in pursuit of our goals. Before joining Cockroach Labs, Isaac was in life sciences for 16 years with Medidata Solutions, where he had a front-row seat on the exciting ride from a 30-person startup to more than 2,000 people worldwide. But the lure of distributed, resilient, and consistent SQL databases, along with the amazing technology and culture at Cockroach Labs, proved too much. When not working, he likes to draw, play the piano, and search NYC for cannoli with his wife and kids.

Our Benefits

  • Competitive Health Insurance Coverage (for you & your dependents!)
  • Paid parental leave (with baby bucks)
  • Flex Fridays
  • Flexible time off & flexible hours
  • Education reimbursement
  • Relocation support or home office allowance

Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at [email protected].

The annual anticipated base salary range for U.S. candidates for this role is USD $165,000 to $225,000, plus commission for sales roles. We set standard ranges for all U.S.-based roles based on function, level, and geographic location, benchmarked against similar-stage growth companies. Actual salaries may vary and fall outside of this range depending on factors such as a candidate’s qualifications, geographic location, skills, experience, and competencies. In addition, we are often open to a wide variety of profiles, and recognize that the person we hire may be less experienced (or more senior) than this job description as posted. Salary is one component of the Cockroach Labs total rewards package, which includes stock options, health insurance, life and disability insurance, funds towards professional development resources, flexible PTO, paid holidays, and parental leave, to name a few! Salaries for candidates outside the U.S. will vary based on local compensation structures.

Top Skills

C
C++
Go
Java
Python
The Company
HQ: New York, NY
473 Employees
Hybrid Workplace
Year Founded: 2015

What We Do

Named after resilience and continuity, Cockroach Labs is the creator of CockroachDB, the planet's most highly evolved cloud-native, distributed SQL database. The goal is simple: to enable companies of all sizes across the world to build mission-critical apps and scale fast, survive anything, and thrive anywhere. Currently, CockroachDB is deployed at some of the world's top enterprises including Bose, Comcast, Netflix, and some of the largest names in banking, retail, and media.

Cockroach Labs was founded by a dedicated team of engineers and is backed by seasoned investors including Altimeter, Benchmark, GV, Firstmark, Index Ventures, Redpoint Ventures, Sequoia Capital, Tiger Capital, and Workbench.

Why Work With Us

Maintaining a human-centered culture has been a top priority at Cockroach Labs since day one. Even as we grow, we remain focused on building diverse, inclusive spaces that inspire innovation and creating opportunities to connect while encouraging employees to find their own unique balance of personal & professional commitments. Flexibility is key.
