AI Agent · 2026-05-11 · 39 min read

How to Measure AI ROI — What to Track After Deployment

Jake Hwang · Founder · 5years+

A CFO I work with pulled me aside after a quarterly review last month and asked, almost apologetically, "Can you tell me what the AI thing is actually doing for us?" The deployment had gone live six months earlier. The team was using it daily. And yet, sitting across from him with his glasses pushed up on his forehead, I realized he genuinely could not answer his board's question. Neither, honestly, could I — not in the language he needed.

That conversation has stayed with me. Because it captures the part of AI adoption nobody warns you about: the gap between "it works" and "it's working for the business." You've already walked through the budgeting and scoping decisions we covered in the previous post on realistic AI cost ranges by scale. You've invested. The system is live. Now the harder question arrives — what proves it worked, and to whom?

[Image: An analytics dashboard on a monitor beside a notebook with handwritten metrics in a softly lit office]

The measurement gap is bigger than people admit

The numbers are not flattering. A 2025 Gartner survey of 782 infrastructure and operations leaders found that only 28% of AI use cases fully succeed and meet their ROI expectations, while 20% fail outright. McKinsey's late-2025 Global AI Survey reported that 88% of companies now use AI in at least one function — but only 39% see EBIT impact from it. IBM and Larridin research has suggested that roughly 72% of AI investments destroy value through waste, and only about 29% of executives say they can measure ROI with confidence.

I cite these not to alarm anyone but because they reframe what we're doing. When most projects can't demonstrate value, the bottleneck is rarely the model. It's the measurement layer underneath. And in my experience working with SMBs, that layer almost never exists at the start of a project. People build the AI; they forget to build the instrument that watches it.

Start with a baseline, or you've already lost the argument

If I could go back and add one thing to most projects I've seen, it would be a two-week baseline measurement period before deployment. Nothing fancy. Just an honest snapshot of how long the work takes today, what it costs, where the errors happen, and how the people doing it feel about it.

This sounds simple, and it mostly is, except that almost no one does it. The pressure to launch is too high, and the team assumes "we'll figure out the metrics later." Later turns into never, and three months in someone asks how much time the AI is saving, and the answer is a shrug dressed up in a slide.

A baseline does two things. First, it gives you a defensible before-and-after comparison your CFO will accept on a P&L. Second — and this surprised me the first few times — it often reveals that the original problem wasn't quite what the team thought it was. I've watched teams discover during baselining that the bottleneck wasn't the task they were automating; it was the handoff two steps upstream.
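
For teams that want something slightly more structured than a notebook, a baseline capture can be very small. The sketch below, in Python, shows one possible shape for it; the field names and the summary statistics are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from statistics import mean

# One row per task handled during the two-week baseline window.
# The fields mirror what the post suggests capturing: time, cost,
# errors, and how the person doing the work feels about it.
@dataclass
class BaselineRecord:
    task_id: str
    minutes_spent: float         # wall-clock time a person spent on the task
    loaded_cost_per_hour: float  # salary plus overhead for the person doing it
    had_error: bool              # did the output need rework?
    owner_sentiment: int         # 1-5 quick rating from the person doing the work

    @property
    def cost(self) -> float:
        return self.minutes_spent / 60 * self.loaded_cost_per_hour

def summarize(records: list[BaselineRecord]) -> dict:
    """Collapse the baseline window into the numbers you will compare against later."""
    return {
        "tasks": len(records),
        "avg_minutes_per_task": mean(r.minutes_spent for r in records),
        "avg_cost_per_task": mean(r.cost for r in records),
        "error_rate": sum(r.had_error for r in records) / len(records),
        "avg_sentiment": mean(r.owner_sentiment for r in records),
    }

# Invented example rows, just to show the output shape.
records = [
    BaselineRecord("INV-1041", 52, 38.0, False, 3),
    BaselineRecord("INV-1042", 75, 38.0, True, 2),
]
print(summarize(records))
```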

Three lenses: productivity, accuracy, value-realization speed

The framework I keep coming back to has three lenses. Not five, not seven. Three is enough to argue about clearly.

Productivity is the obvious one — hours reclaimed, tickets closed, throughput per FTE. It's the lens executives reach for first because it converts cleanly to currency. The trap, which I'll come back to, is that hours saved are not automatically dollars earned.

Accuracy is where AI projects either build trust or quietly lose it. Error rate against a human-reviewed gold set, hallucination frequency, false positives in classification, deviation from the SOP. In customer-facing work, accuracy is the upstream cause of every CSAT score you'll measure three months later. I've seen teams chase a 5% productivity bump while accuracy quietly degraded by 8% — a deal the business would never have agreed to if it had been put in those terms.

Value-realization speed is the one most teams underweight. How long from deployment to the first measurable business outcome? An AI that reaches breakeven in four months is a categorically different asset from one that takes fourteen, even if the steady-state savings end up similar. Speed compounds, and it also protects the project politically — long gaps between go-live and visible value are where AI initiatives get quietly killed.
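
To show how the three lenses interact, here is a back-of-the-envelope calculation. Every number in it is invented, and the accuracy penalty is a deliberately crude stand-in for rework cost; treat it as a sketch of the arithmetic, not a pricing model.

```python
# Illustrative numbers only: hours reclaimed per month, a loaded hourly rate,
# the monthly run cost of the system, and a crude rework penalty for accuracy regressions.
hours_reclaimed_per_month = 120
loaded_hourly_rate = 45.0          # salary plus overhead
monthly_run_cost = 1_800.0         # API usage, hosting, maintenance
accuracy_delta = -0.03             # 3 percentage points worse than the human baseline
rework_cost_per_point = 400.0      # estimated monthly cost of 1 pp more errors

gross_productivity_value = hours_reclaimed_per_month * loaded_hourly_rate
accuracy_penalty = abs(min(accuracy_delta, 0)) * 100 * rework_cost_per_point
net_monthly_value = gross_productivity_value - monthly_run_cost - accuracy_penalty

one_time_build_cost = 30_000.0
months_to_breakeven = (
    one_time_build_cost / net_monthly_value if net_monthly_value > 0 else float("inf")
)

print(f"Net monthly value: {net_monthly_value:,.0f}")
print(f"Months to breakeven: {months_to_breakeven:.1f}")
```

The third lens is the last line: the same net monthly value reads very differently depending on how many months it takes to reach it.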

Leading vs lagging indicators — and why you need both

This is where I see the most preventable mistakes. Teams report on lagging indicators — cost savings, CSAT, agent productivity, revenue per customer — because those are what the board wants. Lagging indicators are real, but they tell you what already happened. By the time CSAT drops, you've already lost the customers.

Leading indicators sit closer to the system: intent classification accuracy on the live traffic, response quality scores from a sample reviewer, integration uptime, model drift against a held-out validation set, the proportion of queries escalated to a human. None of these will impress a board on their own. All of them give you two to six weeks of warning before a lagging metric moves.

A practical rule I use: every lagging KPI in your dashboard should have at least one leading indicator wired to it, and the leading indicator should be reviewed weekly while the lagging one is reviewed monthly. If you can't name the leading indicator for a given lagging metric, you don't have a measurement system. You have a report.
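
One way to keep that rule honest is to write the wiring down explicitly. The sketch below assumes a handful of placeholder KPI names; the useful part is the check at the end, which flags any lagging metric with nothing leading attached to it.

```python
# Each lagging KPI carries the leading indicators wired to it and a review cadence.
# KPI and indicator names are placeholders for whatever your dashboard actually tracks.
kpi_map = {
    "csat": {
        "cadence": "monthly",
        "leading": [
            ("intent_classification_accuracy", "weekly"),
            ("escalation_rate_to_human", "weekly"),
        ],
    },
    "cost_savings": {
        "cadence": "monthly",
        "leading": [
            ("integration_uptime", "weekly"),
            ("response_quality_sample_score", "weekly"),
        ],
    },
    "revenue_per_customer": {
        "cadence": "monthly",
        "leading": [],  # this one would fail the check below
    },
}

missing = [kpi for kpi, cfg in kpi_map.items() if not cfg["leading"]]
if missing:
    print("Lagging KPIs with no leading indicator wired to them:", ", ".join(missing))
```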

The time-saved illusion, and other vanity metrics

"AI saves our team 30 minutes per task." I hear some version of this in almost every readout, and to be blunt, it's the most overstated number in the industry.

Thirty minutes saved per task only matters if that thirty minutes converts to productive output somewhere else. If the analyst who used to spend an hour on a report now spends thirty minutes on it and thirty minutes refreshing email, the company has saved nothing. The cost line on the P&L is unchanged. The hours did not become dollars; they became slack.

This isn't a reason to dismiss time-saved metrics. It's a reason to pair them with a downstream conversion metric — output volume, revenue per analyst, capacity absorbed without new hires. The honest question is not "how much time did we save?" but "what did we do with the time we saved, and can the controller see it?"
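
A quick worked example of the gap, with invented numbers. The conversion rate is the share of freed hours that demonstrably became output, and estimating it honestly is the whole exercise.

```python
# Invented numbers. The point is the ratio between the two results, not the values.
tasks_per_month = 400
minutes_saved_per_task = 30
loaded_hourly_rate = 45.0

nominal_hours_saved = tasks_per_month * minutes_saved_per_task / 60   # 200 hours
nominal_savings = nominal_hours_saved * loaded_hourly_rate            # what the slide claims

# Share of the freed hours that demonstrably became output:
# extra tickets closed, reports shipped, hires deferred.
conversion_rate = 0.4
realized_savings = nominal_savings * conversion_rate                  # what the controller can see

print(f"Claimed:  {nominal_savings:,.0f}")
print(f"Realized: {realized_savings:,.0f}")
```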

The other vanity metrics worth naming: usage counts (logins, queries, sessions) without an outcome attached, satisfaction scores from the team that built the system, and any metric whose denominator changes month to month without explanation. If a number can only go up, it's not measuring anything.

Intangibles count — but you have to name them upfront

Some of the most durable value from AI deployments doesn't show up in a productivity report. Customer trust, employee retention in roles that used to burn people out, the institutional capacity to take on the next AI project faster because the first one taught you how. Gartner's recent work has noted that 57% of leaders who reported AI project failure said they "expected too much, too fast" — and a lot of that is a failure to name intangible value as part of the original ROI thesis.

The discipline I'd recommend is simple. At the start of the project, write down two or three intangible outcomes you genuinely expect, and define a rough proxy for each one. Trust might be measured by repeat-purchase rate in a specific segment. Innovation capacity might be measured by the time it takes to ship the next AI use case. Retention might just be a quarterly conversation with managers in the affected team. None of these are perfect. All of them are better than pretending the intangibles don't exist and then arguing about them after the fact.
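
In practice this can be as small as a short structured note agreed before go-live. The sketch below simply restates the examples above in a form you can keep next to the rest of the measurement plan.

```python
# Named upfront, with a rough proxy each. The proxies are deliberately imperfect;
# the point is to agree on them before go-live, not after the argument starts.
intangible_outcomes = [
    {"outcome": "customer trust",      "proxy": "repeat-purchase rate in the affected segment"},
    {"outcome": "innovation capacity", "proxy": "weeks from idea to go-live for the next AI use case"},
    {"outcome": "team retention",      "proxy": "quarterly check-in with managers in the affected team"},
]

for item in intangible_outcomes:
    print(f"{item['outcome']}: tracked via {item['proxy']}")
```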

The instrumentation layer most teams skip

One pattern I see again and again: the AI itself is built well, the dashboards on top of it look professional, and yet nobody can answer a simple question like "how did accuracy trend over the last six weeks for our top three customer segments?" The data isn't being captured at the right grain, or it is, but it lives in three different systems that nobody has joined.

This is the part of AI ROI work that nobody finds exciting and that quietly determines whether the project is defensible a year in. Event-level logging of every model decision. A reviewer workflow that lets a human grade a sampled slice each week. Joined data between the AI's outputs and the downstream business outcomes — orders shipped, tickets resolved, deals closed. This is the data analytics layer that turns an AI deployment from a feature into a measurable asset, and it's the thing most SMBs ask us to retrofit after they've already deployed.

It is much cheaper to build it in from day one.
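
To make "event-level logging of every model decision" less abstract, here is a minimal sketch of what one logged event might contain. The field names, the model label, and the print-to-stdout stand-in for a real log sink are all assumptions for illustration; in practice the record would land in whatever store your team already queries.

```python
import json
import uuid
from datetime import datetime, timezone

def log_model_decision(
    model: str,
    input_summary: str,
    output_summary: str,
    confidence: float | None,
    customer_segment: str,
    escalated_to_human: bool,
    downstream_reference: str | None,  # order id, ticket id, deal id: whatever joins to outcomes
) -> dict:
    """Build one event-level record per model decision.

    Printing keeps the sketch self-contained; a real version would append
    to a warehouse table or a log stream instead.
    """
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_summary": input_summary,
        "output_summary": output_summary,
        "confidence": confidence,
        "customer_segment": customer_segment,
        "escalated_to_human": escalated_to_human,
        "downstream_reference": downstream_reference,
        "human_review_grade": None,  # filled in later by the weekly reviewer workflow
    }
    print(json.dumps(event))
    return event

# Example: one decision from a hypothetical support-triage agent.
log_model_decision(
    model="claude-triage-v2",
    input_summary="refund request, order late by 9 days",
    output_summary="classified as refund_eligible, drafted apology plus refund offer",
    confidence=0.87,
    customer_segment="KR-enterprise",
    escalated_to_human=False,
    downstream_reference="ticket-48211",
)
```

The downstream_reference field is the piece most teams forget; it is what lets you join the AI's decisions to orders shipped or tickets resolved later.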

Closing thought, and a look ahead

The honest framing of AI ROI is that you are not measuring the AI. You are measuring the decision your company made to deploy it, against a baseline, with leading indicators wired to the lagging ones, and with intangibles named in advance so they can be defended later. The companies that do this aren't the ones with the best models. They're the ones who decided, before go-live, what "working" was going to mean.

At 5years+, much of our work with SMBs in Korean and Japanese markets sits in exactly this gap — the data analytics and measurement layer that turns a working AI system into a measurable business asset. If your team has deployed something and isn't yet sure how to prove it, a conversation about how we approach the instrumentation layer is usually a useful starting point, even if you only walk away with a clearer baseline for your own team to run.

In the next post, we'll look at the five most common reasons AI rollouts fail — many of them only visible once you start measuring honestly.

WRITTEN BY
Jake Hwang
Founder · 5years+ · EST. 2022

Founder of 5years+. Helping Korean and Japanese companies escape the repetitive grind and focus on growth — through AI agents, workflow automation, and product engineering. 52+ projects shipped on a stack centered around Claude API, n8n, and Next.js.

Found this useful? Want to bring real AI automation into your business? Let's map out a concrete plan together.