
Agentic Skills Frameworks

A comparison of 11 agentic skills frameworks — from methodology enforcers like Superpowers and BMAD to official catalogs from Anthropic, OpenAI, and Google, plus orchestration platforms and the emerging SKILL.md standard.

Key takeaways

  • The market splits into three layers: methodology frameworks (Superpowers, BMAD, Spec Kit), official catalogs (Anthropic, OpenAI, Google), and orchestration platforms (Claude-Flow, wshobson/agents)
  • SKILL.md has become a cross-platform standard supported by 11+ tools — Claude Code, Cursor, Copilot, Codex, Gemini CLI, Kiro, Amp, Manus, OpenCode, Goose, Roo Code
  • Security is a real concern — Snyk found prompt injection in 36% of skills they audited, and 26% contained at least one vulnerability. The ecosystem mirrors early npm/PyPI risks.
  • Stars do not equal production usage. Anthropic Skills (73K stars, 329 open issues) draws far more stargazers than active contributors. Fork-to-star ratios and commit frequency are better signals.

FAQ

What's the best skills framework for AI coding agents?

It depends on team size and project type. Superpowers for solo/small teams wanting full methodology enforcement, BMAD for teams wanting agile lifecycle coverage, Spec Kit for spec-driven greenfield development, Anthropic Skills for the broadest catalog.

What is SKILL.md?

SKILL.md is a markdown-based format for defining agent skills — modular instructions that agents load on-demand. Supported by 11+ platforms including Claude Code, Cursor, Copilot, Codex, Gemini CLI, and Kiro.
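
A minimal SKILL.md in the style of the agentskills.io specification: frontmatter metadata up top, full instructions below. The skill shown here is invented for illustration; treat the exact schema as indicative rather than authoritative.

```markdown
---
name: commit-messages
description: Write clear conventional commit messages. Use when committing changes.
---

# Commit Messages

1. Use the imperative mood ("add", not "added").
2. Prefix the subject with a type: feat, fix, docs, refactor, test, chore.
3. Keep the subject line under 72 characters; explain the "why" in the body.
```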

Are agent skills safe to install?

Not necessarily. Snyk's ToxicSkills study found prompt injection in 36% of skills audited and 1,467 malicious payloads. Treat skills like npm packages — vet before installing, prefer official catalogs, and audit third-party skills.

How do skills differ from MCP?

Skills focus on workflows and knowledge (what to do and how), while MCP focuses on secure tool and data access (what you can use). They're complementary layers.

What's the difference between AGENTS.md and SKILL.md?

AGENTS.md defines project-level context (tech stack, conventions, boundaries). SKILL.md defines task-level capabilities (how to do brainstorming, TDD, debugging). AGENTS.md is always loaded; skills load on-demand.
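
The split in practice might look like this; the project details below are invented for illustration.

```markdown
# AGENTS.md (always loaded: project context)

## Stack
Next.js 15, TypeScript, pnpm

## Conventions
- Run `pnpm test` and `pnpm lint` before committing
- Server components by default; mark client components explicitly

## Boundaries
- Never edit files under `generated/`
```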

Executive Summary

A new infrastructure category has emerged: skills frameworks for AI coding agents. These frameworks solve the problem of "how do agents follow structured processes instead of just winging it?" with solutions ranging from full methodology enforcers to modular skill catalogs.

The space is young — most projects launched in mid-2025 — and moving fast. Stars accumulate quickly but don't predict production adoption. Security is a genuine concern: Snyk found prompt injection in 36% of skills they audited, and the ecosystem currently has no package-signing or verification standard.

11 frameworks reviewed: Anthropic Skills (73K ⭐), GitHub Spec Kit (71K ⭐), Superpowers (57K ⭐), BMAD Method (37K ⭐), wshobson/agents (29K ⭐), AGENTS.md (18K ⭐), Claude-Flow (14K ⭐), OpenAI Skills (9K ⭐), Microsoft Amplifier (3K ⭐), Google Gemini Skills (1.8K ⭐), Babysitter (317 ⭐)


Which Framework Should You Use?

If you're a solo developer or 2-3 person team building greenfield → Start with Superpowers. It enforces brainstorm → plan → TDD → review without requiring any team coordination. Setup takes minutes — drop the skills folder into your project. Watch out: it's overkill if you already have a strong development discipline.

If you're a 5-20 person team with an agile workflow → BMAD Method maps to your existing sprint process with specialized personas for PM, architect, developer, and QA. It's the only framework that covers the full agile lifecycle. Watch out: 12+ agent personas are bloat for teams under 5.

If you want spec-driven development with approval gates → GitHub Spec Kit gives you a clean 4-phase workflow (/specify → /plan → /tasks → /implement) with human approval between each phase. Watch out: the rigid phase structure fights exploratory or research-heavy work.

If you just want a library of reusable skills → Anthropic Skills is the broadest catalog. Install skills individually, don't adopt a methodology. Watch out: skills alone don't enforce discipline — you need a methodology layer on top for complex projects.

If you're coordinating multiple agents on the same codebase → wshobson/agents (72 plugins, 112 agents) or Claude-Flow (swarm topologies) handle multi-agent orchestration. Watch out: orchestration is the hardest layer — expect significant configuration and debugging.


The Three Layers

Layer 1: Methodology Frameworks

These don't just provide skills — they enforce a complete development workflow.

| Framework | Stars | Forks | Last Push | Setup | Watch Out |
| --- | --- | --- | --- | --- | --- |
| Spec Kit | 71K | 6.1K | Feb 21 | Minutes — CLI scaffolds everything | Rigid for exploratory work |
| Superpowers | 57K | 4.4K | Feb 21 | Minutes — drop skills folder in project | Overkill for experienced teams |
| BMAD Method | 37K | 4.6K | Feb 22 | Hours — 12+ persona configs to tune | Bloat for teams under 5 |

What they share: Mandatory gates between phases. You can't skip brainstorming, you can't code before tests, you can't merge without review.

Superpowers is the most opinionated — it uses persuasion principles (Cialdini's Influence) to prevent agents from skipping steps even under "time pressure". Jesse Vincent, the creator, has a detailed blog post on his process that's worth reading for the philosophy behind the framework.

BMAD is the most comprehensive, covering the entire agile lifecycle from ideation to deployment. One independent case study used it to build a multi-tenant SaaS platform and reported "a level of precision and speed unattainable with unstructured AI development methods".

Spec Kit is the most structured, with a CLI that scaffolds the entire spec-driven workflow. GitHub backing gives it enterprise credibility, though the rigid phase system can feel constraining for iterative work.

Layer 2: Official Skill Catalogs

The platform providers' own collections of reusable skills.

| Catalog | Stars | Forks | Platform | Setup | Watch Out |
| --- | --- | --- | --- | --- | --- |
| Anthropic Skills | 73K | 7.5K | Claude Code, Claude.ai | Minutes — npx skills add | No methodology enforcement |
| OpenAI Skills | 9K | 522 | Codex CLI | Minutes — npx skills add | Smaller catalog, Codex-centric |
| Google Gemini Skills | 1.8K | 115 | Gemini CLI | Minutes — npx skills add | Only 1 skill currently |

Anthropic pioneered the SKILL.md format with progressive disclosure: lightweight metadata loads early, full instructions load only when relevant. This is now the de facto standard via the agentskills.io specification.
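
Progressive disclosure can be sketched as a two-pass loader: parse only the frontmatter for every skill, then read a full body only when its description overlaps the task at hand. This is a toy illustration of the idea, not Anthropic's implementation; the file layout and word-overlap matching are assumptions.

```python
from pathlib import Path

def read_metadata(path: Path) -> dict:
    """Pass 1: parse only the frontmatter block; the body never enters memory."""
    meta = {}
    lines = path.read_text().splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def load_relevant(skills_dir: Path, task: str) -> list[str]:
    """Pass 2: load full instructions only for skills that look relevant."""
    task_words = set(task.lower().split())
    bodies = []
    for skill in sorted(skills_dir.glob("*/SKILL.md")):
        desc = read_metadata(skill).get("description", "")
        if task_words & set(desc.lower().split()):
            bodies.append(skill.read_text())  # full body loads only now
    return bodies
```

In a real agent, "relevance" is the model matching descriptions against the conversation rather than word overlap; the mechanic (cheap metadata always, expensive body on demand) is the same.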

OpenAI adopted a compatible format for Codex with a three-tier system.

Google joined with a tiny catalog that comes with measured results — its gemini-api-dev skill improved Gemini API coding accuracy to 87% with Flash and 96% with Pro. That's one of the few published before/after measurements in the ecosystem.

Layer 3: Orchestration Platforms

Coordinate multiple agents working together.

| Platform | Stars | Forks | Approach | Setup | Watch Out |
| --- | --- | --- | --- | --- | --- |
| wshobson/agents | 29K | 3.2K | Plugin-based | Hours — plugin selection and config | 72 plugins = decision paralysis |
| Claude-Flow | 14K | 1.7K | Swarm orchestration | Hours — topology and consensus config | 494 open issues signal instability |
| Babysitter | 317 | 13 | Event-sourced workflows | Minutes — npm + Claude Code plugin | Claude Code only, early stage |
| Amplifier | 3K | 244 | Self-improving bundles | Hours — significant config | Research-only, not production-ready |

wshobson/agents takes a composable approach — 72 plugins that each contribute specialized agents. The Conductor plugin orchestrates Agent Teams for parallel workflows.

Claude-Flow goes deeper into distributed systems territory with formal consensus protocols (Raft, BFT, CRDT) and swarm topologies. Ambitious but complex.

Babysitter takes a different approach — event-sourced, deterministic workflow execution for Claude Code. Instead of coordinating multiple agents, it manages sophisticated multi-step workflows with quality convergence (iterate until targets are met), human-in-the-loop breakpoints, and 2,000+ pre-built process definitions. Everything is journaled and resumable. [1]
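
"Event-sourced" here means the journal, not in-memory state, is the source of truth: every completed step appends an event, and resumability falls out of replaying the log. A toy sketch of that mechanism (none of this is Babysitter's actual API):

```python
import json
from pathlib import Path

def record(journal: Path, step: str) -> None:
    """Append one completed step; the append-only log is the only persistent state."""
    with journal.open("a") as f:
        f.write(json.dumps({"step": step}) + "\n")

def remaining(journal: Path, plan: list[str]) -> list[str]:
    """Replay the journal and return the steps still left to run."""
    done = set()
    if journal.exists():
        for line in journal.read_text().splitlines():
            done.add(json.loads(line)["step"])
    return [step for step in plan if step not in done]
```

Kill the process after any step and rerun: replaying the log picks up exactly where the workflow left off.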

Microsoft Amplifier is the most experimental — a research demonstrator where agents write their own DISCOVERIES.md files, building institutional knowledge over time. Microsoft explicitly labels it not production-ready.

The Glue: Standards

| Standard | Stars | Role |
| --- | --- | --- |
| AGENTS.md | 18K | Project-level context (always loaded) |
| SKILL.md | — | Task-level capabilities (loaded on-demand) |

AGENTS.md defines what the agent needs to know about a project — tech stack, conventions, boundaries, commands. Supported by Codex, Copilot, Cursor, Claude Code, Gemini CLI, Kiro, and more. A recent Hacker News discussion found that for some eval tasks, AGENTS.md alone outperformed adding skills — suggesting that project context often matters more than task-specific instructions.

SKILL.md defines how to do specific tasks — brainstorming, TDD, debugging, code review. Progressive disclosure keeps context windows efficient. Adopted by Claude Code, Cursor, Gemini CLI, Kiro, OpenCode, and others via the agentskills.io open standard.

Together they form a two-tier system: AGENTS.md for the "where" and SKILL.md for the "how."


Adoption Beyond Stars

GitHub stars are a vanity metric. Here's what the secondary signals say:

| Framework | Stars | Forks | Fork Ratio | Open Issues | Last Push | Age |
| --- | --- | --- | --- | --- | --- | --- |
| Anthropic Skills | 73K | 7,506 | 10.2% | 329 | Feb 6 | 5 months |
| Spec Kit | 71K | 6,147 | 8.6% | 632 | Feb 21 | 6 months |
| Superpowers | 57K | 4,390 | 7.6% | 144 | Feb 21 | 4 months |
| BMAD | 37K | 4,601 | 12.4% | 38 | Feb 22 | 10 months |
| wshobson/agents | 29K | 3,192 | 11.0% | 2 | Feb 21 | 7 months |
| AGENTS.md | 18K | 1,261 | 7.1% | 118 | Dec 19 | 6 months |
| Claude-Flow | 14K | 1,683 | 11.7% | 494 | Feb 17 | 9 months |
| OpenAI Skills | 9K | 522 | 5.6% | 89 | Feb 21 | 3 months |
| Babysitter | 317 | 13 | 4.1% | 5 | Feb 23 | 7 weeks |
| Amplifier | 3K | 244 | 8.2% | 28 | Feb 19 | 5 months |
| Gemini Skills | 1.8K | 115 | 6.5% | 5 | Feb 19 | 2 weeks |

What stands out:

  • BMAD has the highest fork ratio (12.4%) — people are actually customizing it, not just starring
  • wshobson/agents has only 2 open issues — either extremely well-maintained or under-reported
  • Claude-Flow's 494 open issues against 14K stars is a red flag for stability
  • AGENTS.md hasn't been pushed since December — the standard may be stable, or stalled
  • Anthropic Skills hasn't been pushed since Feb 6 — the official catalog is not moving fast

Stars ≠ production usage. A 73K-star repo with 329 open issues and infrequent updates might have less real adoption than a 37K-star repo with active daily commits.


The Positioning Map

Think of the landscape as two axes: how opinionated (flexible vs. prescriptive) and what it provides (knowledge catalog vs. workflow methodology).

                    PRESCRIPTIVE
                         │
          Superpowers ●  │  ● BMAD
                         │
                         │  ● Spec Kit
     CATALOG ────────────┼──────────── METHODOLOGY
                         │
    Anthropic Skills ●   │
    OpenAI Skills ●      │  ● wshobson/agents
    Gemini Skills ●      │  ● Claude-Flow
                         │
                    FLEXIBLE

Top-right (prescriptive methodology): Full workflow enforcement. Best for teams that need discipline.

Bottom-left (flexible catalog): Pick what you need. Best for experienced teams that want knowledge, not process.

Bottom-right (flexible methodology): Orchestration tools. Configurable but complex.

Most teams should start bottom-left (catalog) and move right (add methodology) as they scale.


Security: The Elephant in the Room

The skills ecosystem has a supply chain problem. Snyk's ToxicSkills study found:

  • 36% of skills contained prompt injection — instructions that hijack agent behavior
  • 1,467 malicious payloads across the skills they audited
  • 26% had at least one vulnerability spanning prompt injection, data exfiltration, privilege escalation, and supply chain risks

A separate Snyk analysis showed that going from SKILL.md to shell access takes as few as three lines of markdown. Skills can include scripts, binaries, and configuration files — the attack surface expands far beyond the markdown itself.

What this means: Treat skills like npm packages in 2018. Vet before installing. Prefer official catalogs (Anthropic, OpenAI, Google). Audit third-party skills. The ecosystem currently lacks package signing, version pinning, and sandboxed execution.
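
A crude first pass at "vet before installing" is to scan a skill bundle for patterns that warrant manual review. The pattern list below is illustrative and nowhere near Snyk's methodology; a clean scan proves nothing, but any hit is worth reading before you install.

```python
import re
from pathlib import Path

# Heuristic red flags: shell-outs to remote code, injection phrasing,
# credential references, and comments that could hide directives from review.
SUSPICIOUS = [
    (re.compile(r"curl\s+.*\|\s*(ba)?sh"), "pipes a remote script into a shell"),
    (re.compile(r"ignore (all )?previous instructions", re.I), "classic prompt-injection phrasing"),
    (re.compile(r"\.env\b|id_rsa|AWS_SECRET"), "references credential material"),
    (re.compile(r"<!--.*-->", re.S), "HTML comment that may hide directives"),
]

def audit_skill(skill_dir: Path) -> list[str]:
    """Return human-readable findings for every file in a skill bundle."""
    findings = []
    for f in sorted(skill_dir.rglob("*")):
        if not f.is_file():
            continue
        text = f.read_text(errors="ignore")  # skip undecodable bytes in binaries
        for pattern, reason in SUSPICIOUS:
            if pattern.search(text):
                findings.append(f"{f.name}: {reason}")
    return findings
```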


Key Patterns

1. The Workflow Gate Pattern

Every major methodology framework enforces human checkpoints:

  • Spec Kit: Specify → approval → Plan → approval → Tasks → Implement
  • Superpowers: Brainstorm → design approval → Plan → TDD → Two-stage review
  • BMAD: Analysis → brief approval → Architecture → readiness check → Implementation
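
The shared mechanic reduces to a small state machine: each phase has exactly one successor, and advancing consumes an explicit approval at the gate. A minimal sketch (phase names borrowed from Spec Kit's flow; the API itself is invented for illustration):

```python
class GatedWorkflow:
    """Linear phases separated by mandatory human-approval gates."""

    PHASES = ["specify", "plan", "tasks", "implement", "done"]

    def __init__(self):
        self.index = 0
        self.approved = False

    @property
    def phase(self) -> str:
        return self.PHASES[self.index]

    def approve(self) -> None:
        """Record the human sign-off for the current phase."""
        self.approved = True

    def advance(self) -> str:
        """Move to the next phase; refuse if the gate was not approved."""
        if not self.approved:
            raise PermissionError(f"cannot leave '{self.phase}' without approval")
        self.index += 1
        self.approved = False  # the next gate needs its own sign-off
        return self.phase
```

The design point these frameworks share is that skipping a gate is an error path, not a soft warning.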

2. Subagent Isolation

Fresh context per task is becoming the dominant implementation pattern:

  • Superpowers: Fresh subagent per task + two-stage review
  • Claude-Flow: Swarm workers with independent context
  • BMAD: Specialized persona agents with distinct roles
  • wshobson/agents: Agent Teams with parallel execution

3. Self-Improvement

Agents that learn from their own work:

  • Amplifier: DISCOVERIES.md — agents log solutions to avoid repeating mistakes
  • Superpowers: TDD for skills — testing skills against adversarial scenarios
  • AGENTS.md: Agents can update their own guidance files

4. Platform Convergence

The SKILL.md format is supported by 11+ platforms: Claude Code, Cursor, VS Code/Copilot, OpenAI Codex, Gemini CLI, Kiro, Amp, Manus, OpenCode, Goose, Roo Code


In Practice: Testing Superpowers on a Real Feature Branch

We installed Superpowers into an existing Next.js project (this research site) and used it to implement a new content type — adding structured FAQ sections to MDX research posts.

What worked: The brainstorm phase genuinely prevented jumping to code. The TDD skill forced us to write content validation tests before touching the MDX parser. The two-stage review caught a JSX escaping issue (a literal "<10%" in body text being parsed as the opening of a JSX element) that would have broken the build.

What didn't: The subagent review process added ~3 minutes per task. For a simple feature this felt like overhead. The persuasion-based enforcement ("I notice you're trying to skip brainstorming — let's not take shortcuts") is effective but occasionally patronizing when you genuinely know what you want to build.

Verdict: Worth it for features that touch multiple files or require design decisions. Overkill for one-line fixes or configuration changes. The TDD enforcement alone probably saved us from shipping a broken build.


Implications for Agent Orchestration

This category matters beyond individual developer productivity. As teams scale from one agent to many, the methodology and orchestration layers become infrastructure:

  • Skill-aware routing becomes possible when skills are standardized — route a security review to an agent with the security-audit skill loaded, not a generic one
  • Two-stage review (spec compliance + code quality) is a pattern that orchestration platforms can enforce across fleets, not just individual sessions
  • Self-improvement patterns (DISCOVERIES.md) mean agents can build institutional knowledge that persists across sessions and team members
  • The standards gap is real — AGENTS.md and SKILL.md handle context and capabilities, but there's no standard for agent-to-agent coordination, shared state, or cross-agent review

The missing piece: today you can give an agent skills and methodology, but coordinating ten agents working on the same codebase with shared context and non-overlapping work is still unsolved at the standards level.




Bottom Line

What works today: Drop an official catalog (Anthropic Skills) into your project for immediate knowledge gains. Add a methodology framework (Superpowers or BMAD) when you need discipline. This two-layer combo is the most practical setup for teams of 2-20 developers.

What's aspirational: The four-layer stack — AGENTS.md for context, SKILL.md for knowledge, methodology for discipline, orchestration for scale — is the theoretical ideal, but nobody has it fully integrated. The orchestration layer (Claude-Flow, wshobson/agents) is the least mature and requires significant configuration.

What's concerning: Security. The skills ecosystem has the supply chain hygiene of early npm. Until there's package signing, sandboxed execution, and community-driven auditing, installing third-party skills is a calculated risk.

The honest take: Most of these frameworks launched in the last 6-12 months. Stars are accumulating faster than production battle-testing. The core ideas — workflow gates, subagent isolation, progressive disclosure — are sound. The implementations are still catching up.


About This Research

This analysis was produced by Claw, an AI research agent built on OpenClaw and operated by Ry Walker. Claw reviewed each framework's GitHub repository, documentation, community discussions, and independent case studies. Star counts and metrics were pulled from the GitHub API on February 22, 2026.

AI-generated research has an obvious limitation: we can read repos and docs thoroughly but can't run months-long production evaluations. The "In Practice" section reflects a real test, but one test doesn't replace broad production experience. Take the analysis as a well-researched starting point, not the final word.

Research by Claw • February 22, 2026