Technical Operations Manager

Allen, TX
105K-130K Annually
Senior level
Artificial Intelligence • Cloud • Hardware • Machine Learning • Other • Software • Infrastructure as a Service (IaaS)
We build infrastructure for machine learning
The Role
As a Technical Operations Manager at Voltage Park, you will oversee the data center's infrastructure, ensure its operational integrity and efficiency, manage a technical team, lead incident management, and coordinate projects, all while meeting compliance requirements and performance targets.

Voltage Park is on a mission to make machine learning infrastructure accessible to all, from large enterprises and research universities to seed-stage startups and nonprofits. Seamless access to compute, with transparent pricing and inventory, is the future of GPU access. We are the only cloud provider offering a platform that shows all available GPUs with transparent, market-based pricing, in addition to long-term reserve contracts for our customers.

We’re in search of a Technical Operations Manager in our data center organization to oversee the operational integrity, maintenance, and efficiency of the data center's infrastructure and technical teams. This role focuses on ensuring that the data center's physical infrastructure runs smoothly and meets performance and availability standards, while aligning with the organization’s broader business objectives.

This role is based onsite at our Allen, TX data center. We are unable to provide visa sponsorship for this position.

What you’ll do:

  • Infrastructure Management: Ensure the data center’s power, cooling, and physical infrastructure (including servers, racks, and networking equipment) are properly maintained and optimized to maximize uptime.

  • Team Leadership: Oversee and develop a team of technical staff responsible for day-to-day operations, including an onsite asset manager, fostering a culture of accountability, collaboration, and continuous improvement.

  • Ticketing System Oversight: Monitor and manage break-fix tickets through the organization’s ticketing system, ensuring issues are prioritized, assigned, and resolved in a timely manner by appropriate team members.

  • Response and Resolution Coordination: Coordinate responses to tickets that involve hardware repairs, component replacements, or network/server troubleshooting. Ensure timely dispatch and effective resolution by qualified personnel.

  • Tracking and Reporting: Track ticket progress to ensure issues are resolved within agreed Service Level Agreements (SLAs), and provide regular performance reports to senior management, covering metrics such as ticket resolution time and uptime.

  • Incident and Problem Management: Lead troubleshooting and incident management efforts for technical issues, including power failures, equipment malfunctions, or connectivity problems, aiming for swift resolution and minimal downtime.

  • Vendor and Asset Management: Manage relationships with external vendors for hardware, software, and facility services; oversee data center assets, from procurement to installation and lifecycle management.

  • Capacity and Performance Planning: Monitor infrastructure performance against current and projected demand, plan necessary upgrades or expansions, and ensure resources are allocated efficiently.

  • Compliance and Security: Ensure data center compliance with industry standards and regulations (e.g., ISO, SOC, HIPAA) and oversee the implementation of security protocols to protect data and systems.

  • Project Management: Manage and deliver data center projects related to expansions, migrations, and upgrades, coordinating cross-functional teams to meet project goals within schedule and budget.

Qualifications:

  • Minimum of 5 years of experience in data center operations, with a proven track record in team management, optimizing operations, and meeting uptime and SLA targets.

  • Strong knowledge of data center infrastructure, including power distribution, HVAC, cabling, networking, and server environments.

  • Experience with capacity planning, resource allocation, and budget management for efficient, cost-effective operations.

  • Proven leadership abilities in hiring, training, and developing technical teams, with a focus on fostering accountability and continuous improvement.

  • Excellent problem-solving and decision-making skills, with the ability to handle critical incidents under pressure to ensure timely resolution.

  • Strong communication and collaboration skills, with the ability to work effectively across cross-functional teams, stakeholders, and vendors.

  • Project management experience, particularly in coordinating deployments, decommissioning, and infrastructure upgrades, with a focus on adhering to schedules and budgets.

  • Metrics and KPIs: Proven experience in managing and achieving operational metrics, including uptime percentage, ticket resolution time, and overall customer satisfaction.

  • Preferred Certifications: Certifications such as PMP, Data Center Certified Associate (DCCA), or ITIL are a plus, reflecting advanced expertise in data center management practices.

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the application process, please notify your recruiter.

Compensation Range: $105K - $130K

The Company
HQ: San Francisco, CA
51 Employees
Remote Workplace
Year Founded: 2023

What We Do

The market for cutting-edge ML compute is broken. Startups, researchers and even big AI labs are scrambling to buy or rent access to the latest chips for ML training. But demand far outstrips supply, and what’s available is only accessible to the well-resourced, placing an artificial damper on innovation.

To solve this challenge, we've launched Voltage Park, and we’re on a mission to make machine learning infrastructure accessible to all, from large enterprises and research universities to seed-stage startups and nonprofits.

With around 24,000 NVIDIA H100 GPUs, the Voltage Park cloud is one of the most powerful collections of cutting-edge ML compute in the world. Our clusters consist of 80GB H100 SXM5 GPUs fully interconnected with 3.2 Tb/s InfiniBand.

Why Work With Us

You’ll play a pivotal role as a member of the founding team that will change the face of machine learning infrastructure. As an early hire, you’ll have outsize influence in defining the company’s culture and ensuring mission success.

Voltage Park Offices

Remote Workspace

Voltage Park is a 100% remote company; employees work remotely, with no typical time on-site.