
Tongyi DeepResearch

Tongyi DeepResearch — Alibaba's SOTA deep research agent. RL-trained Qwen3-30B-A3B model leads benchmarks on BrowseComp, GAIA, and Humanity's Last Exam. 18.4k stars, Apache-2.0, Python.

Key takeaways

  • SOTA deep research agent — RL-trained Qwen3-30B-A3B (30.5B total params, 3.3B activated per token) leads benchmarks on BrowseComp, GAIA, HLE, WebWalkerQA, FRAMES, and SimpleQA
  • Fine-tuned approach vs prompt-based: fully automated synthetic data pipeline powers agentic pre-training, supervised fine-tuning, and end-to-end reinforcement learning (GRPO with token-level policy gradients)
  • 18.4k stars, Apache-2.0 license. Available on HuggingFace and ModelScope with online demos. Built by Alibaba's Tongyi Lab
  • Demonstrates that smaller, specialized models can outperform frontier models on agentic search tasks, a different bet from the prompt-engineering approach

FAQ

What is Tongyi DeepResearch?

An RL-trained deep research agent from Alibaba's Tongyi Lab. Uses a specialized Qwen3-30B-A3B model (30.5B params, 3.3B active) trained via GRPO reinforcement learning on synthetic agentic data. Leads multiple deep research benchmarks.

How does it compare to GPT Researcher or dzhng/deep-research?

Those are prompt-based orchestrators that can use any frontier LLM. Tongyi instead trains a specialized model for deep research tasks, achieving higher benchmark scores but requiring more setup and compute to run locally.

Overview

Tongyi DeepResearch is Alibaba's state-of-the-art deep research agent — a fine-tuned Qwen3-30B-A3B model (30.5 billion total parameters, only 3.3 billion activated per token) trained specifically for long-horizon, deep information-seeking tasks. It leads benchmarks across BrowseComp, GAIA, Humanity's Last Exam, WebWalkerQA, FRAMES, and SimpleQA.
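As a quick sanity check on the mixture-of-experts efficiency claim, the activated fraction works out to roughly 11% of the total parameter count:

```python
# Back-of-the-envelope check of the MoE activation ratio quoted above.
total_params = 30.5e9   # total parameters
active_params = 3.3e9   # parameters activated per token
fraction = active_params / total_params
print(f"~{fraction:.1%} of parameters are active per token")  # ~10.8%
```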

Unlike prompt-based approaches (GPT Researcher, dzhng/deep-research), Tongyi takes the fine-tuned route: a fully automated synthetic data pipeline powers agentic pre-training, supervised fine-tuning, and end-to-end reinforcement learning using Group Relative Policy Optimization (GRPO) with token-level policy gradients.
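The leave-one-out advantage estimation used in the RL stage can be sketched in a few lines: for each prompt, a group of rollouts is scored, and each rollout's reward is baselined against the mean reward of the other rollouts in its group. This is an illustrative sketch of the idea, not the project's actual training code:

```python
def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: each rollout is compared against the
    mean reward of its group peers (leave-one-out baseline)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Four rollouts of the same research task, scored by task success:
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = leave_one_out_advantages(rewards)
# Rollouts that beat their peers get positive advantage, and vice versa;
# group advantages always sum to zero.
```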

Key stats: 18,432 stars, Apache-2.0 license, Python. Available on HuggingFace and ModelScope.


Technical Innovation

Three key technical contributions:

  1. Fully automated synthetic data pipeline — Scalable data synthesis for agentic pre-training, SFT, and RL. No human annotation required
  2. Large-scale continual pre-training — Extends model capabilities on diverse agentic interaction data while keeping knowledge fresh and preserving reasoning ability
  3. End-to-end RL — Strictly on-policy GRPO with token-level gradients, leave-one-out advantage estimation, and selective negative sample filtering
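The token-level objective with selective negative filtering from point 3 can be sketched as follows. The boolean `keep` flag and the example filtering criterion are assumptions for illustration; the write-up does not spell out the actual filter:

```python
def token_level_pg_loss(rollouts):
    """Sketch of a token-level policy-gradient loss.

    rollouts: list of (token_logprobs, advantage, keep) tuples, where
    `keep` marks negative-advantage samples that survive the selective
    filter (e.g. well-formed but unsuccessful trajectories). The exact
    filtering criteria are an assumption here, not the project's.
    """
    loss = 0.0
    for token_logprobs, advantage, keep in rollouts:
        if advantage < 0 and not keep:
            continue  # selectively drop this negative sample
        # Token-level gradients: every token in the rollout shares the
        # sequence-level advantage.
        loss += sum(-advantage * lp for lp in token_logprobs)
    return loss

rollouts = [
    ([-0.5, -0.7], +1.0, True),   # successful rollout: reinforced
    ([-0.9, -0.4], -1.0, True),   # kept negative sample: pushed down
    ([-2.0, -2.0], -1.0, False),  # filtered out (e.g. truncated)
]
loss = token_level_pg_loss(rollouts)
```

Minimizing this loss raises log-probabilities on positive-advantage rollouts and lowers them on the kept negative ones, while filtered samples contribute no gradient at all.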

The key insight: a smaller, specialized model trained end-to-end for agentic search outperforms much larger frontier models on research benchmarks.


Competitive Position

Strengths: SOTA benchmark results. Strong evidence that fine-tuning can beat prompting on research tasks. Efficient inference (3.3B active params per token). Open weights on HuggingFace and ModelScope.

Weaknesses: Requires significant compute to run locally. Less flexible than prompt-based approaches (locked to the Qwen architecture). Community and documentation skew toward the Chinese-language ecosystem.


Research by Ry Walker Research