SRE Leader

Posted 24 Days Ago
New York, NY
Hybrid
Expert/Leader
Cloud • Healthtech • Internet of Things • Machine Learning • Software
The Role
The SRE Leader will oversee the reliability and performance of the cloud platform, ensuring 99.99% uptime, enhancing observability, and leading SRE initiatives while managing a high-performance team in a healthcare technology context.
Summary Generated by Built In

Kontakt.io is building the platform that care operations run on.


We reduce waste, cut costs, and improve revenue by improving throughput, asset utilization, and staff productivity. Our platform uses AI, RTLS, and EHR data to enable self-learning agents to automate workflows, adapt in real time, and orchestrate all care delivery operations.


Easy to deploy and scale, it gives a clear picture of spaces, equipment, and people, eliminating inefficiencies and enhancing the patient experience. With measurable 10X ROI and more than 20 use cases, Kontakt.io is the go-to platform for better and faster care delivery operations.


We’re looking for an SRE Leader to own the reliability, performance, and automation of our cloud-based, real-time platform. This role will focus on keeping our platform running smoothly 24/7 by minimizing downtime and improving observability, incident response, and self-healing automation. You will lead and scale the SRE team to ensure our infrastructure stays ahead of demand, operates efficiently, and meets the needs of our growing healthcare customers.

Responsibilities

  • Ensure 99.99% uptime across our cloud platform, meeting strict SLAs for healthcare customers.
  • Design and implement self-healing, fault-tolerant systems to prevent failures before they happen.
  • Define SLIs, SLOs, and SLAs, ensuring proactive performance monitoring and incident resolution.
  • Architect and manage scalable cloud infrastructure (AWS) for massive real-time data processing.
  • Optimize containerized environments (Kubernetes, Docker) to support multi-region deployments.
  • Lead the adoption of infrastructure as code (Terraform) to fully automate infrastructure management.
  • Build and refine a world-class monitoring, alerting, and logging system using Prometheus, Grafana, OpenTelemetry, and Datadog.
  • Lead incident response and on-call operations, reducing mean time to detection (MTTD) and mean time to resolution (MTTR).
  • Conduct blameless postmortems and continuously improve system resilience.
  • Reduce manual intervention through automated deployment, scaling, and failover mechanisms.
  • Partner with Security & Compliance teams to ensure infrastructure meets HIPAA and SOC 2 standards.
  • Lead disaster recovery and business continuity planning to ensure critical healthcare services are always available.
  • Drive technical strategy and roadmap for scalability, monitoring, and reliability engineering.
  • Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities.

What You Bring

  • 10+ years of experience in Site Reliability Engineering or Cloud Infrastructure.
  • Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare.
  • Deep expertise in cloud platforms (AWS), Kubernetes, and distributed systems.
  • Strong background in monitoring, logging, and observability with Prometheus, OpenTelemetry, or similar tools.
  • Hands-on experience with incident management, postmortems, and building resilient systems.
  • Deep knowledge of CI/CD automation, GitOps, and infrastructure as code (Terraform, etc.).
  • A mature leadership approach, with the ability to drive technical strategy while growing and mentoring a high-performance SRE team.
  • Strong understanding of network security, access management, and compliance frameworks (HIPAA, SOC 2).

Bonus Points If You Have

  • Experience with healthcare IT, including EHR data, FHIR, and HL7 interoperability.
  • Expertise in real-time distributed systems, event-driven architectures, or large-scale data pipelines.
  • Prior experience leading on-call rotations and major incident management processes.

Why You'll Love It Here

  • Own Mission-Critical Reliability – Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform.
  • Scale AI-Powered Infrastructure – Work on real-time automation and self-healing cloud systems that orchestrate care delivery.
  • Drive Big Impact in Healthcare – Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI.
  • Automation-First Culture – Minimize manual ops with cutting-edge automation, observability, and incident response strategies.
  • Join a High-Performing Team – Work with top engineers, AI experts, and healthcare innovators solving real-world challenges.

Ready to Build the Future of Healthcare?

Apply now and help scale the platform that care operations run on. 🚀

Top Skills

AWS
CI/CD
Datadog
Docker
Grafana
Kubernetes
OpenTelemetry
Prometheus
Terraform
The Company
HQ: New York, NY
100 Employees
Hybrid Workplace
Year Founded: 2013

What We Do

As the leader in Indoor Journey Analytics, Kontakt.io optimizes processes and resources by revealing how customers move through your business. Using AI, IoT, and RTLS, Kontakt.io helps businesses uncover waste, streamline capacity, improve workflows, and help customers and staff feel seen and valued.
