Software Engineer, LLM Training Frameworks Engineer

Posted 4 Hours Ago
Be an Early Applicant
San Francisco, CA
Senior level
Artificial Intelligence • Information Technology
The Role
The Software Engineer will focus on developing infrastructure solutions for AI/ML applications. Responsibilities include optimizing performance, collaborating with teams to enhance infrastructure management, and researching machine learning systems. The role requires expertise in distributed systems and proficiency in multiple programming languages.
Summary Generated by Built In

Job Responsibilities

  1. Infrastructure Development: Identify and resolve infrastructure gaps to ensure reliable, efficient, and scalable solutions.
  2. AI/ML Solutions: Develop advanced AI/ML infrastructure solutions to enhance the efficiency of our ML teams.
  3. System Design: Design and implement solutions for distributed storage systems, scheduling systems, high availability, and core reliability issues within large-scale GPU clusters.
  4. Performance Optimization: Monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization.
  5. Automation Tools: Develop and deploy automation tools, monitoring solutions, and operational strategies to streamline infrastructure management and reduce manual tasks.
  6. Collaboration: Work with various teams, including ML developers, data engineers, and DevOps professionals, to create a cohesive and integrated AI/ML infrastructure ecosystem.
  7. Parallel Training: Optimize large-scale parallel training for state-of-the-art deep learning algorithms, including large language models, multi-modality models, diffusion, and reinforcement learning.
  8. Research & Development: Research and develop our machine learning systems, including accelerated computing architecture, management, and monitoring.
  9. Deployment: Deploy machine learning systems for distributed training and inference.
  10. Cross-layer Optimization: Manage cross-layer optimization of system and AI algorithms and hardware for machine learning (GPU, ASIC).

Minimum Qualifications

  1. Bachelor's degree in Computer Science, Engineering, or a related technical field.
  2. 5-8+ years of experience in software engineering, with a strong background in developing and managing large-scale distributed systems, ideally within the AI/ML infrastructure domain.
  3. Proficiency in programming languages such as Python, Go, or C++, with knowledge of cloud computing platforms like AWS, Azure, etc.
  4. Familiarity with machine learning algorithms, platforms, and frameworks such as PyTorch and Jax. Basic understanding of GPU and/or ASIC functionality.
  5. Expertise in at least one or two programming languages in a Linux environment: C/C++, CUDA, Python.
  6. Familiar with open-source distributed scheduling/orchestration/storage frameworks, such as Kubernetes (K8S), Yarn (Flink, MapReduce), HDFS, Redis, S3, etc., with practical experience in machine learning system development.
  7. Mastery of distributed systems principles and participation in the design, development, and maintenance of large-scale distributed systems.
  8. Strong communication and collaboration abilities, effective in working with diverse teams and individuals.

Preferred Qualifications

  1. In-depth understanding of AI/ML workflows, including model training, data processing, and inference pipelines.
  2. Practical experience with containerization technologies (Docker, Kubernetes), automation tools, and monitoring solutions (Prometheus, Grafana).
  3. Exceptional problem-solving skills, capable of analyzing complex systems, identifying bottlenecks, and implementing scalable solutions.
  4. A passion for continuous learning and staying abreast of new technologies and best practices in the AI/ML infrastructure space.
  5. Experience with GPU-based high-performance computing, RDMA high-performance networks (MPI, NCCL, ibverbs).
  6. Familiarity with distributed training framework optimizations (e.g., DeepSpeed, FSDP, Megatron, GSPMD).
  7. Knowledge of AI compiler stacks (torch FX, XLA, MLIR).
  8. Experience with large-scale data processing and parallel computing.
  9. In-depth CUDA programming and performance tuning experience (cutlass, triton).


About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers in our journey in building the next generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy  


Top Skills

C++
Go
Python
The Company
San Francisco, California
84 Employees
On-site Workplace
Year Founded: 2022

What We Do

Together AI is a research-driven artificial intelligence company. We contribute leading open-source research, models, and datasets to advance the frontier of AI. Our decentralized cloud services empower developers and researchers at organizations of all sizes to train, fine-tune, and deploy generative AI models. We believe open and transparent AI systems will drive innovation and create the best outcomes for society

Jobs at Similar Companies

InCommodities Logo InCommodities

Head of People & Culture - NA

Information Technology • Machine Learning • Analytics • Energy • Automation • Renewable Energy
Hybrid
Austin, TX, USA
234 Employees

Silverfort Logo Silverfort

Commercial Sales Manager- East

Information Technology • Sales • Security • Cybersecurity • Automation
Remote
8 Locations
357 Employees

Jobba Trade Technologies, Inc. Logo Jobba Trade Technologies, Inc.

Senior Back End Developer

Cloud • Information Technology • Productivity • Professional Services • Software
Remote
Hybrid
Chicago, IL, USA
45 Employees

Similar Companies Hiring

Silverfort Thumbnail
Security • Sales • Information Technology • Cybersecurity • Automation
GB
357 Employees
Jobba Trade Technologies, Inc. Thumbnail
Software • Professional Services • Productivity • Information Technology • Cloud
Chicago, IL
45 Employees
InCommodities Thumbnail
Renewable Energy • Machine Learning • Information Technology • Energy • Automation • Analytics
Austin, TX
234 Employees

Sign up now Access later

Create Free Account

Please log in or sign up to report this job.

Create Free Account