Site Reliability Engineer

Posted 10 Days Ago
Mountain View, CA
Senior level
Information Technology • Consulting
The Role
The Site Reliability Engineer (SRE) will manage cloud infrastructure and applications for data processing and governance, collaborating with engineering teams to optimize performance and reliability. Responsibilities include developing core services on AWS, automating system configurations, improving incident response, and utilizing data analytics technologies at scale.
Summary Generated by Built In

Job Brief:
As a Data Platform Site Reliability Engineer, you will manage infrastructure and applications on cloud computing platforms to deliver data processing, governance, and storage. Our platform teams work with exabytes of data, terabytes of memory, and hundreds of thousands of jobs to enable predictable and performant data analytics.
As an SRE, you’ll solve problems as they arise using empirical data, teamwork, and your own expertise. The Data Platform SRE will work directly with our data platform and engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers. We run a mix of open-source, vendor-licensed, and internally developed tools, which you will use and have opportunities to improve. The cross-functional team collaborates to ensure we apply a consistent incident management process across all data platform services and provide user-journey-based SLOs derived from exhaustive observability metrics, high-availability architecture, and automation for deployments. We think critically and strive to balance the best solution with the need to get things done for each engineering challenge we face.
VentureDive Overview:
Founded in 2012 by veteran technology entrepreneurs from MIT and Stanford, VentureDive is the fastest-growing technology company in the region. It develops and invests in products and solutions that simplify and improve the lives of people worldwide. We aspire to create a technology organization and an entrepreneurial ecosystem in the region that are recognized as second to none in the world.
Key Responsibilities:

  • Make an impact from the design phase through development and operation of the Data Platform on Kubernetes clusters and their ecosystem on AWS.
  • Build core services and tooling, and create technical processes that simplify and enable the work of engineers across multiple services.
  • Identify, automate, and scale system configurations without compromising on security and reliability.
  • Participate in on-call rotations and help improve incident response.


Qualifications and Experience:

  • BS/MS in Computer Science or equivalent (4+ years of software development or production operations experience in a large-scale environment).
  • Strong sense of ownership and integrity demonstrated through clear communication and collaboration.
  • Experience in architecting, developing, operating, and troubleshooting Kubernetes clusters and/or other highly available systems at scale.
  • Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics technologies, especially Apache Spark, Trino, and related software in a large-scale environment.
  • The ability to design, author, and release code in languages like Go, Python, or Java.
  • Acute drive to automate manual operations and to improve them through repeated iteration.
  • Understanding of the Linux operating system, standard networking protocols, and components.
  • Experience with cloud-native services on AWS/GCP.
  • Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Terraform, CloudFormation, ArgoCD, and Flux).
  • Experience with deploying, supporting, and monitoring new and existing services, platforms, and application stacks.
  • Excellent troubleshooting and problem-solving skills.
  • Experience with scale testing, disaster recovery, and capacity planning.
  • Effective communication and collaboration skills, with the ability to drive and promote technical partnerships across teams.
  • Incident response and/or incident management experience.


In order to thrive at VentureDive, you
…are intellectually smart and curious
…have passion for and take pride in your work
…deeply believe in VentureDive’s mission, vision, and values
…have a no-frills attitude
…are a collaborative team player
…are ethical and honest
Are you ready to put your ideas into products and solutions that will be used by millions?
You will find VentureDive to be a fast-paced, high-standards, fun, and rewarding place to work. Not only will your work reach millions of users worldwide, you will also be rewarded with a competitive salary and benefits. If you think you have what it takes to be a VenDian, come join us ... we're having a ball!
#LI-Hybrid

Top Skills

Go
Java
Python
The Company
HQ: Mountain View, California
379 Employees
On-site Workplace
Year Founded: 2012

What We Do

VentureDive is an award-winning digital development company that builds cutting-edge technology solutions to improve lives globally. Since its inception in 2012, the firm has enabled two tech unicorns and successfully driven digital transformation initiatives for large enterprises. Led by co-founders Atif Azim and Shehzaad Nakhoda, VentureDive has a presence in Silicon Valley, London, Portugal, Dubai, and Pakistan. To learn more, visit https://www.venturedive.com.

