Company Description
Run:AI is bridging the gap between data science and computing infrastructure by creating a high-performance compute virtualization layer for deep learning, speeding the training of neural network models and enabling the development of large AI models. By abstracting workloads from underlying infrastructure, Run:AI creates a shared pool of resources that can be dynamically provisioned for full utilization of expensive GPU compute.
Job Description
The Run:AI product is a mixture of SaaS and on-prem deployments on top of Kubernetes. The DevOps Engineer will be responsible for the design, build and health of these deployments.
You will work with technologies and tools such as Kubernetes custom operators and controllers, admission controllers and webhooks, Helm, GitHub, GitHub Actions, ArgoCD and GitOps.
In your day-to-day work you will interact with the Engineering teams, Customer Success, Professional Services and Pre-Sales, as well as the IT departments of our enterprise customers.
Responsibilities:
- Full end-to-end ownership of our entire cloud infrastructure, including individual development environments, build/CI servers, and production systems across various cloud environments.
- Design, build, and shape the architecture of deployments of Run:AI cloud-native products over a wide range of complex customer environments (on-premises, cloud, edge), constraints (e.g., air-gapped installations), and K8s flavors (vanilla, cloud-managed, OpenShift, Rancher, Tanzu, and more).
- Troubleshoot production issues and tackle performance challenges.
- Collaborate with stakeholders to offer input on product direction and design.
- Continually evaluate tools and technologies to improve the overall release and product deployment processes.
Qualifications
- 3+ years of work experience as a DevOps engineer.
- Hands-on technical leadership in a large-scale software development environment.
- Key qualification: Kubernetes expertise, with 2+ years of hands-on experience with vanilla Kubernetes.
- Proficiency in Linux, Networking, Storage and Security.
- Extensive experience managing production environments, including monitoring and logging solutions.
- Excellent Bash/shell scripting skills, as well as scripting experience in Python and Go.
- Strong software engineering skills in backend systems and databases.
What We Do
Run:ai helps companies execute on their AI initiatives quickly while keeping budgets under control, by virtualizing expensive hardware resources so that they can be pooled, shared and allocated efficiently.