Goodhart's law shows up the moment you tie a metric to compensation. "When a measure becomes a target, it ceases to be a good measure." This is a law, not a tendency. It applies to lines of code, story points, deployment frequency, and yes — every AI tool metric anyone is excited about right now.
If you start tracking prompts per week, engineers will write more prompts. If you start tracking acceptance rate, they'll accept more suggestions and quietly rewrite them after. If you start tracking AI-generated lines of code, they'll find a way to inflate that too. None of this requires malice. It requires only that the people being measured are smart and care about their reviews. Both of those are true by construction in any team worth having.
You can detect the most obvious gaming with a script. Throwaway prompts. Auto-accepted suggestions immediately overwritten. Suspicious bursts before performance review season. Fine. That catches the lazy gamers. It doesn't catch the thoughtful ones, and it certainly doesn't address the underlying problem: the metric drove the wrong behavior in the first place.
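To make that concrete, here is a minimal sketch of the accept-then-overwrite check. Everything in it is an assumption: the `SuggestionEvent` shape, its field names, and the five-minute window are illustrative, not any real tool's telemetry schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical event shape. Field names are illustrative assumptions,
# not any real tool's telemetry schema.
@dataclass
class SuggestionEvent:
    user: str
    accepted_at: datetime
    rewritten_at: datetime | None  # next substantive edit of the same lines, if any

def flag_hollow_accepts(
    events: list[SuggestionEvent],
    window: timedelta = timedelta(minutes=5),
    threshold: float = 0.5,
) -> dict[str, float]:
    """Flag users whose accepted suggestions are usually rewritten
    within minutes of acceptance: the crudest gaming signature."""
    by_user: dict[str, list[SuggestionEvent]] = {}
    for e in events:
        by_user.setdefault(e.user, []).append(e)

    flagged: dict[str, float] = {}
    for user, evs in by_user.items():
        hollow = sum(
            1 for e in evs
            if e.rewritten_at is not None and e.rewritten_at - e.accepted_at < window
        )
        rate = hollow / len(evs)
        if rate > threshold:
            flagged[user] = rate
    return flagged
```

Notice what a check like this catches and what it doesn't: accept-then-rewrite inside five minutes is trivially detectable, and trivially avoidable. Anyone who waits an hour sails through.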
The fundamental tension hasn't moved an inch with AI: things that are easy to measure aren't necessarily things that matter, and things that matter aren't necessarily easy to measure. AI tools don't solve this. They just give us new things to measure and new ways to get it wrong. I've argued elsewhere that AI coding tool telemetry is a microscope, not a diagnosis — and a microscope pointed at a comp decision is a thing nobody should want.
The right response isn't more sophisticated detection. It's metric design that anticipates gaming. Use a portfolio of signals so no single one is worth gaming. Decouple development metrics from compensation metrics — let the same data inform learning conversations without showing up on a perf review rubric. Treat any single number that moves "too cleanly" with suspicion. Most of all, take seriously that what actually helps developer performance is rarely a number on a dashboard.
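One way to operationalize the "moves too cleanly" suspicion: organic metrics wobble, so a weekly series that only ever goes up deserves a question. A minimal sketch, assuming weekly per-engineer values; both the eight-week minimum and the 90% threshold are arbitrary illustrations, not standards.

```python
def moves_too_cleanly(weekly_values: list[float], min_weeks: int = 8) -> bool:
    """Heuristic: real usage data is noisy. A series that is near-monotonic
    over a couple of months is more likely optimized-for than organic.
    Both min_weeks and the 90% threshold are arbitrary illustrations."""
    if len(weekly_values) < min_weeks:
        return False  # not enough history to say anything
    steps = list(zip(weekly_values, weekly_values[1:]))
    upward = sum(1 for prev, curr in steps if curr >= prev)
    return upward / len(steps) > 0.9
```

A flag from a function like this is a prompt to ask questions in a learning conversation, not evidence of gaming; plenty of honest adoption curves climb smoothly for a while.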
If the only thing standing between your team and gamed metrics is a detection script, the metric is already broken. Build it like you expect smart people to optimize against it — because they will.
Related Essays
- "What Actually Helps Developer Performance": No single number captures developer performance. The honest answer is a portfolio of imperfect signals, used to develop people rather than rank them.
- "The AI Coding Tool Wrinkle": AI assistants are generating a granular log of every prompt, accept, reject, and iteration. That's new data — and a new way to measure the wrong thing.
- "Top Performer Analysis: The Real Opportunity in AI Tool Telemetry": The interesting use of AI coding tool data isn't ranking. It's understanding how your best engineers actually work — and helping the rest of the team catch up.
Key Takeaways
- Every metric tied to comp eventually gets gamed.
- AI tool metrics are not immune — if anything they are easier to game, because a prompt or an accept costs only seconds to produce.
- The fix is not better detection, it is better metric design.
FAQ
Won't engineers just inflate their AI prompt counts?
Yes. Anything that can be counted will be optimized, and anything tied to compensation will be optimized harder. AI tool telemetry is not magically immune.
Can you script around gaming?
A little. You can detect obvious patterns. But the deeper problem isn't caught by scripts — it's that the metric drove the wrong behavior in the first place.