3 min read · By Ry Walker

Measuring Developer Performance (And Why AI Might Make It Worse)


We've been arguing about how to measure developer productivity for decades. Lines of code. Story points. Commits. Each metric proposed, debated, and ultimately found wanting. The SPACE framework is one of the more useful attempts to broaden the conversation beyond raw output.[1] Industry benchmarks like DORA push teams toward outcomes and flow rather than vanity activity metrics.[2] And now AI coding assistants are about to blow this whole conversation wide open by generating, for the first time, a granular log of the actual coding process. I don't have a clean answer here. Nobody does. But the questions are worth taking seriously, because the dashboards are coming whether we want them or not.

I broke this argument into five atomic posts; read them in any order.

The thread underneath all of these: developer performance is genuinely hard to measure, AI tools don't fix that, and the companies that handle this with care will compound advantages over the ones that ship adoption leaderboards. I don't love thinking about measurement. Most engineers don't. We got into this work to build things, not to be quantified. But the question isn't going away, and pretending it will is how engineering leaders cede the ground to people with worse instincts.

The honest posture: measure to develop, not to rank. Use a portfolio so no single signal owns the decision. Decouple development data from compensation data. Treat AI tool telemetry as a microscope, not a scoreboard. Watch what happens when a metric "moves cleanly" — that's almost always Goodhart in action. And take seriously that the best engineers have options. Whatever measurement system you build, they're the ones who'll vote with their feet when it goes wrong.

Would I object to a dashboard ranking my dev team for AI adoption? Yes — and so should you. But the underlying instinct (we should understand how our team is performing) is right. The work is in the gap between that instinct and the lazy dashboard that pretends to satisfy it.

— Ry

Key takeaways

  • Engineering metrics differ from GTM metrics.
  • AI tools add signals but increase gaming risk.
  • The goal is learning, not punishment.

FAQ

Why is dev performance hard to measure?

Because the work is creative and context-dependent: outputs vary widely between tasks, and quality is hard to quantify with any single number.

How could AI make it worse?

AI tooling introduces a flood of noisy micro-metrics (suggestions accepted, completions generated) that are easy to game. Optimizing for them can distort behavior without improving actual outcomes.