The Baldwin Group is an award-winning entrepreneur-led and inspired insurance brokerage firm delivering expertly crafted Commercial Insurance and Risk Management, Private Insurance and Risk Management, Employee Benefits and Benefit Administration, Asset and Income Protection, and Risk Mitigation strategies to clients wherever their passions and businesses take them throughout the U.S. and abroad. The Baldwin Group has award-winning industry expertise, colleagues, competencies, insurers, and most importantly, a highly differentiated culture that our clients consider an invaluable expansion of their business. The Baldwin Group (NASDAQ: BWIN) takes a holistic and tailored approach to insurance and risk management.
We’re looking for a highly motivated, practical and responsible Observability/Site Reliability Engineer who is excited to play a critical role in our rapidly growing Platform team. The Observability Engineer role will make significant contributions to our Observability, APM, Monitoring and Logging strategy, be integral to our day-to-day operations, and be an advocate for designing and implementing Site Reliability Engineering principles within the company.
The successful candidate will have experience with CI/CD, Observability, APM, Monitoring, Logging, Infrastructure-as-Code, and On-Call Support. An understanding of Cloud (AWS/Azure), SRE practices, version control, configuration management, and automation is also required.
Principal Responsibilities:
- Develop and maintain comprehensive observability solutions for infrastructure, applications, and services, and implement APM tools and frameworks to monitor application performance, user experience, and system health.
- Implement and maintain tools and systems that provide insights into the health and performance of applications and infrastructure, including metrics, logs, and traces to monitor system behavior.
- Proactively analyze performance metrics and logs to identify bottlenecks, failures, and areas for improvement, ensuring systems are consistently reliable, highly available, and optimally performing by addressing potential issues before they impact users.
- Strategically assess system capacity requirements and plan for future growth to ensure seamless scalability, working closely with development and operations teams to implement robust and effective scaling strategies.
- Create automated solutions for monitoring, deployment, scaling, and recovery operations, and develop custom tools and scripts to enhance observability and monitoring capabilities.
- Collaborate closely with software engineers, QA teams, and operations staff to integrate observability and reliability best practices into the development lifecycle, providing expert guidance and support for instrumenting code and services with comprehensive monitoring and logging solutions.
- Develop and maintain incident response plans, including alerting, escalation, and communication protocols, and lead efforts to resolve production incidents, minimizing downtime and ensuring thorough root cause analysis and post-mortem reviews.
Education, Experience, Skills and Abilities Requirements:
- 3+ years of experience in an Observability or Site Reliability Engineer role.
- Experience with cloud infrastructure platforms such as AWS or Azure.
- Proven experience administering observability and monitoring tools (Datadog or similar).
- Experience with containerized and serverless compute technology (Docker, ECS, Kubernetes, Lambda, etc.).
- Experience with DevOps and CI/CD processes and tools (GitHub, Terraform, Ansible, etc.).
- Experience integrating DevOps, SRE, and testing tools to generate DORA metrics and reports and to create dashboards.
- Understanding of SRE principles, including SLOs, SLIs, KPIs, metrics, logging, tracing, etc.
- Proficient in writing scripts (Bash, PowerShell) and programming in one or more languages (Python, JavaScript, Go, Java, or similar).
- Experience in capacity planning and scaling resource requirements based on traffic patterns and performance metrics.
- Experience in preparing, executing, and improving incident response plans.
- Strong understanding of on-call rotation practices and incident escalation processes.
- Knowledge of security best practices and compliance standards relevant to observability and monitoring (e.g., GDPR, HIPAA).
- Datadog or relevant Certifications preferred.
- Highly self-motivated, highly available, and driven to exceed colleague expectations.
- Ability to think critically and logically under pressure.
- Strong technical experience with a proven history of troubleshooting complex, cross-segment, cross-office, and cross-team problems.
- Demonstrates the organization’s core values, exuding behavior that is aligned with the firm’s culture.
The Baldwin Group will not accept unsolicited resumes from any source other than directly from a candidate who applies on our career site. Any unsolicited resumes sent to The Baldwin Group, including unsolicited resumes sent via any source from an Agency, will not be considered and are not subject to any fees for any placement resulting from the receipt of an unsolicited resume.
What We Do
BRP is now The Baldwin Group! We’ve updated our name to reflect our unified group of talented teams across the country.
The Baldwin Group is a cohesive group of experts in business insurance, employee benefits, retirement planning, and all areas of private and personal insurance. Since our founding in 2011, we’ve evolved from a local business into a national firm with a vast network of specializations and industry practices for the benefit of our more than two million clients across the country.
In addition, we have built excellent relationships with a wide range of insurance company partners. These relationships, coupled with our entrepreneurial and family-oriented culture, and deep expertise enable us to seamlessly deliver a breadth of innovative solutions to clients.
At The Baldwin Group, we help provide the solutions our clients need to have confidence and gain peace of mind as they pursue what’s possible for themselves, their families, and their businesses. Whether they are renting their first apartment or buying a larger home, opening a small business or taking their company public, we offer solutions to support them on every step of their journey. This has been our story since the beginning—we provide the indispensable expertise and quality insights that give our clients peace of mind to pursue their purpose, passion, and dreams. And that’s what The Baldwin Group will continue to do for years to come: we Protect the Possible℠.