Key takeaways
- MIT's 2025 research found 95% of enterprise generative AI pilots delivered no measurable P&L impact. The cause is a workflow and learning gap, not weak models.
- The single biggest driver of AI's bottom-line impact is fundamental workflow redesign; high performers are nearly three times as likely to redesign workflows rather than layer AI on top.
- Generic, unsupervised tools stall because they never learn your process and have no senior reviewer to catch errors, which is fatal in finance, where a wrong number carries real consequences.
- The 5% that win buy or build for a specific workflow, keep a senior practitioner in the loop on every judgment call, and iterate the system over time.
- Finance AI adoption has plateaued at around 59% of finance leaders, which suggests the gap is no longer access to tools but turning adoption into value.
Ask a CFO why AI projects fail in finance and you will usually hear some version of "the model wasn't good enough" or "the data wasn't clean enough." The most rigorous evidence we have says otherwise. The problem is almost never the model. It is the gap between a capable tool and the messy, judgment-heavy workflow it was dropped into: the process was never redesigned around the tool, the output was never supervised by someone senior enough to catch its mistakes, and the system was never given a chance to learn your process over time.
That distinction matters because it changes what you do about it. If the model is the problem, you wait for a better one. If the workflow is the problem, you can fix it now — and the firms that have done exactly that are the small minority pulling real value out of AI while everyone else is stuck in pilot purgatory. This article unpacks the three failure modes the research keeps surfacing, then lays out what the 5% do differently and a pragmatic sequence any finance team can follow.
Why AI projects fail in finance: the workflow gap, not the model
In 2025, MIT's NANDA initiative published The GenAI Divide: State of AI in Business 2025, drawing on just over 300 public AI deployments, 52 structured leader interviews, and 153 senior leader survey responses. The headline finding has been quoted everywhere, usually badly. Here is what it actually says: 95% of enterprise generative AI pilots delivered no measurable P&L impact. Only about 5% crossed over into real, attributable value.
95% of enterprise generative AI pilots delivered no measurable P&L impact, per MIT's analysis of 300+ public deployments, 52 leader interviews, and 153 survey responses.
Source: MIT NANDA, The GenAI Divide: State of AI in Business 2025
The report is blunt about why, and it is not what most headlines implied. The barrier is not infrastructure, regulation, or talent. It is learning. In MIT's framing, the core issue is that most generative AI systems do not learn from or adapt to workflows. Generic tools shine in the hands of an individual because they are flexible, but they stall inside an enterprise because they never absorb the context, exceptions, and judgment that make a finance process work. The pilot demos beautifully and then dies on contact with reality.
The 95% failure rate for enterprise AI solutions represents the clearest manifestation of the GenAI Divide. Most systems do not retain feedback, adapt to context, or improve over time.
There is a telling misallocation underneath the failure rate, too. MIT found that more than half of generative AI budgets go to sales and marketing tools, while the biggest measurable ROI sits in unglamorous back-office and process automation — exactly the territory finance teams own. The money is chasing the demo, not the value. Three specific failure modes explain how that happens.
Failure mode 1: bolting AI onto a broken workflow
The most common and most expensive mistake is treating AI as a feature you sprinkle on top of an existing process. You take a month-end reconciliation that is already a tangle of spreadsheets, email approvals, and tribal knowledge, and you wire a model into one step of it. Nothing about the shape of the work changes. You have automated a fragment of a process that was never designed to be automated — and you are surprised when the time savings evaporate at the next manual handoff.
McKinsey's State of AI 2025 puts hard numbers on the alternative. Out of all the organizational attributes it tested, the redesign of workflows had the single biggest effect on whether a company captured EBIT impact from generative AI. Not the choice of model. Not the size of the budget. The willingness to rebuild the process around what AI can reliably do.
AI high performers are nearly three times as likely as other organizations to say they have fundamentally redesigned workflows, rather than layering AI on top.
Source: McKinsey, The State of AI 2025
Failure mode 2: raw automation with no senior judgment in the loop
The second failure mode is treating AI as a replacement for experienced people instead of a multiplier of them. In finance this is uniquely dangerous, because the cost of a confident, wrong output is not an awkward sentence — it is a misstated accrual, a mispriced add-back, or a covenant calculation that blows up in diligence. A model will produce a plausible number with total fluency. Only judgment knows whether it is right.
This is where the human-in-the-loop model earns its keep. The point is not to slow AI down with bureaucracy; it is to put the AI where it is fast and reliable — extraction, drafting, reconciliation, anomaly flagging — and put a senior practitioner where machines are weak: deciding what a number means, when an exception is real, and when an answer that looks clean is quietly wrong. Strip out that reviewer and you have not saved a salary. You have shipped your errors faster.
AI does not replace senior judgment in finance — it makes it faster, more thorough, and more consistent. The model handles volume; the experienced practitioner owns every call that carries consequence. Remove the human and you have automated the mistakes along with the work.
MIT's data quietly backs this up from another angle: AI solutions bought from specialized vendors and embedded into a real workflow succeeded about 67% of the time, while internal, generic builds succeeded only about a third as often. The difference is not raw horsepower. It is fit, supervision, and adaptation to the actual job. The same logic applies whether you are automating a close or running diligence; we explore the M&A version of it in "Where AI adds edge and where human judgment still wins."
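Read literally, "about a third as often" implies an internal-build success rate in the low twenties. The short calculation below makes that arithmetic explicit. It assumes the phrase means one third of the 67% vendor rate, which is an interpretation of the wording rather than a figure stated in the source.

```python
# Assumption: "about a third as often" means one third of the vendor rate.
vendor_success_rate = 0.67  # specialized vendor tools, the figure quoted above

# Implied success rate for internal, generic builds under that assumption.
internal_success_rate = vendor_success_rate / 3

print(f"Implied internal-build success rate: {internal_success_rate:.0%}")
```

On that reading, roughly one internal build in five succeeds, against two in three for workflow-fitted vendor tools.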
Failure mode 3: generic tools that never learn your process
The third failure mode is the one MIT puts at the center of the GenAI Divide. A general-purpose assistant is genuinely useful for one person drafting an email. The moment you ask it to be a repeatable finance process, its greatest strength — flexibility — becomes a liability. It does not remember your chart-of-accounts quirks, your entity structure, your recurring exceptions, or the corrections your controller made last quarter. Every run starts from zero. There is no compounding.
| Dimension | The 95% (stalled pilots) | The 5% (real P&L impact) |
|---|---|---|
| Starting point | Bolt AI onto the existing workflow | Map and redesign the workflow first |
| Human role | Replace juniors / remove the reviewer | Senior practitioner owns every judgment call |
| Tooling | Generic tool, used the same way as day one | Fit to a specific process; learns your context |
| Time horizon | One-off project with a launch date | Iterated as models and the business evolve |
| Where it's aimed | Whatever demos well (often sales & marketing) | High-volume back-office work with measurable ROI |
| Success metric | Tool deployed | Hours recovered, errors reduced, cost per output down |
Notice that none of the right-hand column is about a better model. It is about fit, ownership, and iteration. That is the whole game.
What the 5% do differently: workflow-first, human-led, iterated
Put the three failure modes in reverse and you have the winning pattern, and it is remarkably consistent across MIT, McKinsey, and Bain. The 5% redesign the workflow before they build anything; they keep an experienced human in the loop on every consequential decision; and they treat the system as something that improves over time, not a project that ends at go-live.
Private equity is a useful proving ground because the pressure to show returns is relentless. Bain's Global Private Equity Report 2025 — drawing on investors representing $3.2 trillion in assets — found that nearly 20% of portfolio companies had operationalized generative AI use cases and were seeing concrete results, while "the preponderance of use cases remain in pilot mode." The winners were the ones applying AI to specific, high-value workflows and sharing what worked across the portfolio. The rest were still testing.
Nearly 20% of PE portfolio companies had operationalized generative AI with concrete results, per Bain's 2025 report; most remained stuck in pilot mode.
Source: Bain & Company, Global Private Equity Report 2025
This is precisely the model OpsFi runs. Our team is deliberately AI-native — fluent in frontier tools — but every engagement starts with full workflow analysis and process mapping before any build begins, and a senior practitioner supervises the output rather than handing it to a junior or to raw automation. AI makes our experienced people faster, more thorough, and more consistent; it never replaces the judgment that finance work demands. Institutional rigor, modern leverage. That is the documented winning pattern, not a marketing line.
How human-in-the-loop protects accuracy, compliance, and audit trails
There is a second reason the human-in-the-loop model wins in finance specifically, beyond accuracy: defensibility. An auditor, a lender, or an acquirer does not just want the right number — they want to see how you got it. A black-box automation that produces a figure with no reviewer, no rationale, and no exception log is a liability in any room where the books are scrutinized.
- Accuracy — a senior reviewer catches the confident-but-wrong outputs that generic models produce, before they reach the ledger or the data room.
- Compliance — judgment calls on revenue recognition, accruals, and add-backs stay with a qualified person who can stand behind them.
- Audit trail — every material decision has a human owner and a documented rationale, not an unexplained machine output.
- Continuity — the practitioner who supervises the system also improves it, so corrections compound instead of repeating.
This is the difference between AI that survives diligence and AI that becomes a finding. Reviewed, attributable output is what makes a faster close defensible rather than risky — the same standard that separates a credible set of books from a fragile one.
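One way to picture the reviewed, attributable output described above is a small review gate in front of the ledger. The sketch below is illustrative only: the class names, the materiality threshold, and the confidence cutoff are all hypothetical, and a real control would follow your own policies. It shows the idea that a material or low-confidence figure cannot be posted without a named reviewer and a written rationale, and that every posting lands in an audit log.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AiOutput:
    """A figure proposed by a model (fields are illustrative)."""
    item: str
    amount: float
    confidence: float  # model-reported confidence between 0.0 and 1.0


@dataclass
class AuditEntry:
    item: str
    amount: float
    owner: str
    rationale: str
    logged_at: str


@dataclass
class ReviewGate:
    materiality: float = 10_000.0  # hypothetical materiality threshold
    min_confidence: float = 0.95   # hypothetical confidence cutoff
    audit_log: list[AuditEntry] = field(default_factory=list)

    def needs_senior_review(self, out: AiOutput) -> bool:
        # Material amounts and low-confidence outputs always go to a human.
        return abs(out.amount) >= self.materiality or out.confidence < self.min_confidence

    def post(self, out: AiOutput, reviewer: Optional[str] = None,
             rationale: str = "") -> AuditEntry:
        if self.needs_senior_review(out):
            if not reviewer or not rationale.strip():
                raise PermissionError(
                    f"{out.item}: needs a named senior reviewer and a rationale"
                )
            owner = reviewer
        else:
            # Immaterial, high-confidence items are accepted but still logged.
            owner = reviewer or "auto-accepted (below thresholds)"
            rationale = rationale or "Immaterial and above confidence cutoff"
        entry = AuditEntry(out.item, out.amount, owner, rationale,
                           datetime.now(timezone.utc).isoformat())
        self.audit_log.append(entry)
        return entry


gate = ReviewGate()
accrual = AiOutput("Q3 bonus accrual", 48_500.0, 0.99)

try:
    gate.post(accrual)  # material amount with no reviewer is rejected
except PermissionError as err:
    print(err)

entry = gate.post(accrual, reviewer="controller",
                  rationale="Agrees to approved bonus schedule")
print(entry.owner, len(gate.audit_log))
```

The thresholds are a design choice. A team that wants a human on every posting can set `materiality` to zero so that nothing is auto-accepted.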
A pragmatic sequence for finance teams: from diagnosis to running systems
If you want to land in the 5% rather than the 95%, the order of operations matters more than the toolset. Here is the sequence that mirrors what the research rewards.
1. Diagnose, don't shop. Start with the workflow, not the tool. Find the highest-volume, most time-intensive processes (document handling, reporting, reconciliations, review cycles) and measure where time and errors actually accrue.
2. Redesign the workflow. Decide what AI can reliably handle and what requires human judgment, then rebuild the process around that split. This is the step the 95% skip.
3. Fit the tool to the process. Choose or build for this workflow and your context, not the most fashionable platform, so the system can learn your exceptions over time.
4. Put a senior human in the loop. Assign an experienced practitioner to own every consequential output and supervise the system, not just operate it.
5. Validate against real data. Test on your actual operations, not a sandbox, before anything goes live.
6. Iterate continuously. Treat it as a living capability. As models improve and the business changes, the system and the workflow evolve with them.
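The diagnosis step in the sequence above can be as simple as ranking workflows by the hours they consume, including rework caused by errors. The sketch below is one possible way to do that. The workflow names echo the examples in the list, but every number is made up for illustration and the burden formula is an assumption, not a method taken from the research cited here.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    hours_per_month: float         # direct time spent on the process
    items_per_month: int           # volume of items processed
    error_rate: float              # share of items that need rework
    rework_hours_per_error: float  # time to fix one error

    @property
    def monthly_burden_hours(self) -> float:
        # Direct hours plus the hours lost to fixing errors.
        rework = self.items_per_month * self.error_rate * self.rework_hours_per_error
        return self.hours_per_month + rework


# Hypothetical figures; replace with measurements from your own operations.
workflows = [
    Workflow("Board reporting pack", 30, 12, 0.10, 2.0),
    Workflow("Bank reconciliations", 60, 900, 0.04, 0.5),
    Workflow("Invoice document handling", 45, 1200, 0.02, 0.25),
]

# Highest burden first: these are the candidates to redesign before buying anything.
ranked = sorted(workflows, key=lambda w: w.monthly_burden_hours, reverse=True)

for w in ranked:
    print(f"{w.name}: {w.monthly_burden_hours:.1f} hours/month")
```

Re-running the same measurement after go-live gives the success metrics the 5% track: hours recovered and errors reduced.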
The encouraging part is that access to tools is no longer the constraint. Gartner's 2025 survey of 183 CFOs and senior finance leaders found that about 59% of finance leaders report using AI in the finance function, barely up from 58% the year before, after a jump from 37% in 2023. The plateau suggests the bottleneck has moved: it is no longer adoption, it is execution. Workflow-first, human-led, iterated execution is what separates the teams getting value from the ones that just bought a license.
59% of finance leaders report using AI in 2025, essentially flat versus 58% in 2024, signaling an adoption-vs-value gap.
Source: Gartner, 2025 AI in Finance Survey
Two firms can buy the same tool on the same day. A year later, one has recovered thousands of hours with a defensible, reviewed process, and the other has a license it barely uses. The model was identical. Everything that mattered happened in the workflow and in who was watching it.
Sources
1. The GenAI Divide: State of AI in Business 2025 (full report). MIT NANDA.
2. MIT report: 95% of generative AI pilots at companies are failing. Fortune.
3. The state of AI in 2025: Agents, innovation, and transformation. McKinsey & Company.
4. Field Notes from the Generative AI Insurgency, Global Private Equity Report 2025. Bain & Company.
5. Gartner Survey Shows Finance AI Adoption Remains Steady in 2025. Gartner.
Frequently asked questions
Why do most AI projects fail in finance specifically?
Because finance work is judgment-heavy and consequence-heavy. MIT's 2025 research found 95% of enterprise generative AI pilots delivered no measurable P&L impact, and the root cause was a learning and workflow gap — tools bolted onto unchanged processes that never adapt to context. In finance, the absence of a senior reviewer compounds the problem, because a confident wrong number can become a misstatement or a diligence finding.
Is the 95% AI failure statistic real?
Yes. It comes from MIT NANDA's report The GenAI Divide: State of AI in Business 2025, based on just over 300 public AI deployments, 52 structured leader interviews, and 153 senior leader survey responses. The precise finding is that 95% of enterprise generative AI pilots delivered no measurable P&L impact — it is about business value, not whether the tools technically work.
What is 'AI pilot purgatory' and how do you escape it?
It is the state where an AI pilot demos well but never crosses into production value, so it lingers indefinitely. You escape it by redesigning the underlying workflow before building, fitting the tool to a specific process so it learns your context, keeping a senior practitioner in the loop, and iterating over time — the pattern shared by the roughly 5% of pilots that delivered real impact.
Does human-in-the-loop AI slow finance teams down?
No — it targets where humans add value. The AI handles high-volume work (extraction, drafting, reconciliation, anomaly flagging) at speed, and the senior practitioner owns the consequential judgment calls and the audit trail. McKinsey found workflow redesign with the right human roles is the biggest single driver of EBIT impact, not a drag on it.
Should we build AI tools in-house or buy from a specialist?
MIT's data favors specialized, workflow-fitted solutions: tools bought from specialized vendors and embedded into a real process succeeded about 67% of the time, versus internal generic builds succeeding only about a third as often. The deciding factor is fit and supervision, not whether you built it yourself. We break down the trade-offs in our companion guide on choosing an AI implementation partner.
Where should a finance team apply AI first for the best ROI?
Start with high-volume, repetitive back-office workflows. MIT found the biggest measurable returns sit in process and back-office automation, even though most budgets chase sales and marketing. Reconciliations, document processing, reporting, and review cycles are typically the cheapest, lowest-risk places to prove value before expanding.