
Together AI

Together AI is the AI Native Cloud — inference, fine-tuning, pre-training, and GPU clusters with research-grade infrastructure.

Key takeaways

  • Research-driven platform from the FlashAttention creators, with contributions like RedPajama and DeepCoder
  • Running on cutting-edge NVIDIA GB200/GB300 NVL72 hardware — among the first to deploy next-gen GPUs
  • Full-stack AI cloud covering inference, fine-tuning, pre-training, and dedicated GPU clusters
  • Claims 3.5x faster inference, 2.3x faster training, and 20% lower cost versus alternatives

FAQ

What is Together AI?

An AI cloud platform providing inference, fine-tuning, pre-training, and GPU clusters with research-grade infrastructure.

Who founded Together AI?

Founded by AI researchers who created FlashAttention and contributed to RedPajama, Mixture of Agents, and DeepCoder.

What hardware does Together AI use?

NVIDIA GB200 and GB300 NVL72 racks — cutting-edge GPU hardware.

How many models does Together AI support?

200+ models available via OpenAI-compatible APIs.

Company Overview

Together AI positions itself as "The AI Native Cloud" — a full-stack platform for AI inference, fine-tuning, pre-training, and GPU cluster management.[1] What distinguishes Together AI from pure inference providers is its research DNA: the team created FlashAttention (now standard in virtually all transformer training) and has contributed the RedPajama datasets, Mixture of Agents, and DeepCoder to the open-source ecosystem.[2]

Backed by NVIDIA, Salesforce Ventures, General Catalyst, and Kleiner Perkins, Together AI has positioned itself at the intersection of research innovation and production infrastructure.

What It Does

  • Inference — 200+ models via OpenAI-compatible APIs, including Llama, Mistral, Qwen, and DBRX[3]
  • Fine-tuning — Custom model training on Together's infrastructure
  • Pre-training — Full model training from scratch on dedicated clusters
  • GPU clusters — Dedicated NVIDIA GB200/GB300 NVL72 hardware for large-scale workloads
  • Serverless & dedicated — Choose between shared endpoints or reserved capacity
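Because the APIs are OpenAI-compatible, calling a hosted model looks like any OpenAI-style chat completion request. The sketch below uses only the Python standard library; the base URL and model name are illustrative assumptions, not details taken from this article — check Together's API docs for current values.

```python
# Minimal sketch of an OpenAI-compatible chat completion request.
# The base URL below is an assumption for illustration.
import json
import urllib.request

TOGETHER_BASE_URL = "https://api.together.xyz/v1"  # assumed endpoint


def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{TOGETHER_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (requires a real API key; the model name is hypothetical):
# req = build_chat_request("meta-llama/Llama-3-8b-chat-hf", "Hello", "YOUR_API_KEY")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request shape matches OpenAI's, existing OpenAI client code can typically be pointed at a Together endpoint by swapping the base URL and API key rather than rewriting call sites.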

How It Works

Together AI runs on cutting-edge NVIDIA hardware, including GB200 and GB300 NVL72 racks. Its inference stack is optimized with custom kernels derived from FlashAttention research, alongside integrations with TensorRT-LLM and vLLM.

For inference, you call OpenAI-compatible API endpoints. For fine-tuning and pre-training, you configure jobs through the API or dashboard. GPU clusters provide dedicated multi-node hardware for organizations needing guaranteed capacity.

Pricing

  • Per-token pricing for inference (varies by model)
  • Per-GPU-hour for fine-tuning and training
  • Dedicated clusters — custom pricing for reserved hardware
  • Free credits available for new accounts

Together AI claims 20% lower cost than alternatives, though this varies by model and workload.[1]

Strengths

  • Research pedigree — FlashAttention creators bring genuine optimization expertise
  • Latest hardware — GB200/GB300 NVL72 before most competitors
  • Full stack — Inference + fine-tuning + pre-training + clusters under one roof
  • Open-source contributions — RedPajama and DeepCoder build community trust
  • 200+ models — Broad selection with OpenAI-compatible APIs
  • Performance claims — 3.5x faster inference, 2.3x faster training

Weaknesses / Risks

  • Premium positioning — Cutting-edge hardware comes at a price
  • Thinner compliance coverage — SOC 2 certified but no HIPAA support, unlike Baseten and Fireworks
  • Crowded market — Competing with well-funded Baseten, Fireworks, and custom silicon players
  • Risk of diffuse focus — balancing open research contributions with product execution
  • CodeSandbox acquisition — Expanding scope into developer tools may dilute inference focus

Competitive Landscape

vs. Baseten: Baseten leads on compliance (HIPAA) and GPU breadth. Together AI leads on hardware generation and research.

vs. Fireworks AI: Fireworks focuses on pure inference speed and reliability. Together AI offers a broader platform including pre-training.

vs. Groq/Cerebras: Custom silicon is faster for supported models. Together AI offers more model variety and fine-tuning.

vs. Modal: Modal is general-purpose serverless compute. Together AI is purpose-built for AI workloads with optimized serving.

Ideal User

  • AI teams wanting cutting-edge hardware without managing infrastructure
  • Organizations needing the full ML lifecycle (train → fine-tune → serve)
  • Research groups that value open-source contributions and community alignment
  • Companies running large-scale inference across many model types

Bottom Line

Together AI is the research-meets-production play in AI inference. The FlashAttention pedigree and latest NVIDIA hardware create genuine technical differentiation. Best for teams that want a single platform for the entire ML lifecycle with research-grade optimization. The risk is whether they can maintain focus as they expand into adjacent areas.