Summary
We are seeking talented functional test software engineers with embedded systems experience to join our Hyderabad, India team, which focuses on functionally verifying our ML-optimized SW/HW solutions. In this role, you will write test plans and develop software in our automation framework to validate high-speed I/O subsystems, and perform system-level testing of our solutions with ML workloads. A background in ML hardware technologies, RDMA, the Linux kernel and server I/O is highly desired.
Roles and Responsibilities:
- Write comprehensive test plans that functionally verify components of our solution based on HW and SW architectural specifications
- Develop software to exercise all test cases for each component
- Write verification libraries for fabric communication services, network interfaces, GPU, storage, and other server-based I/O components
- Write applications, libraries and kernel modules that stress I/O technology capabilities, including RDMA NICs, NCCL, CUDA and NVLink GPU technology
- Develop test libraries in Python, C and C++
- Develop software that integrates with Bazel-based build and test environments
- Develop low-level SW applications to test I/O performance of next-gen compute systems
- Debug complex system issues in customer use cases
- Assist other team members with developing test plans and writing verification software
Desired Knowledge and Skill Set:
- Strong coding skills in multiple languages such as Python, C and C++
- Good knowledge of TCP/IP and RoCE and other networking protocols
- Knowledge of general packet flow pipelines in silicon
- Hands-on experience with ML collective communication and CUDA programming
- Hands-on experience with ML frameworks such as PyTorch and TensorFlow
- Background in Linux device drivers, memory management, network communications libraries and low-level I/O performance
- Detailed understanding of server components and applicable drivers for CPUs, memory, GPUs, networking devices and storage
- Experience building out test framework infrastructure such as equipment provisioning, Linux system config, traffic generators, statistic monitoring, reporting and data capture
- Knowledge of configuration and monitoring techniques such as gRPC, gNMI, SNMP, REST, SSH, Prometheus and Grafana
- Background in highly optimized CI/CD environments
- Proficiency in Git and Docker
- Linux systems knowledge
- 5+ years of software development / QA experience working closely with hardware
About Us
Enfabrica is on a mission to revolutionize AI compute systems and infrastructure at scale through the development of superior-scaling networking silicon and software called the Accelerated Compute Fabric. Founded and led by an executive team assembled from first-class semiconductor and distributed systems/software companies throughout the industry, Enfabrica sets itself apart from other startups with a very strong engineering pedigree, a proven track record of delivering, deploying and scaling products in data center production environments, and significant investor support for its ambitious journey. With its differentiated approach to solving the I/O bottlenecks in distributed AI and accelerated compute clusters, Enfabrica is unleashing the revolution in next-gen computing fabrics.
What We Do
We develop groundbreaking hardware, software, and system technologies that solve the critical bottlenecks in next-generation computing workloads, at any scale, across hyperscale cloud, edge, enterprise, 5G/6G, and automotive infrastructure.