
Unpacking the Reality of "Agentic AI" in Enterprises

This write-up is not intended to be just a high-level critique of LLM-based agentic AI in enterprise settings — such high-level commentary, highlighting shortcomings, already exists widely online. Its purpose is to go a layer deeper: to place an enamored user of ChatGPT-like tools in the shoes of enterprise stakeholders — product owners, security teams, and developers — and examine how the perceived strengths and inherent limitations of LLMs unfold in real-world enterprise workflows. The focus is on revealing where costs, risks, and failure modes emerge, and why the consumer experience does not translate into enterprise context.

Introduction

Before diving into why LLM-based agents fall short for bespoke, company-specific workflows, it’s important to separate two things: the personal experience we have as individual users of ChatGPT-like tools, and the actual realities of deploying LLM-based agentic AI in an enterprise setting. Enterprises expecting to replicate the ChatGPT “magic” through API calls alone are reaching too far.

The “magic” we feel when using ChatGPT doesn’t directly translate to specialized enterprise use cases. To see why, we need to unpack where that magic comes from, what strengths make it work, and how some of its weaknesses are acceptable in a casual, individual context. In an enterprise context, however, those same weaknesses can be catastrophic, while the strengths often can’t be transferred in the same way.

Instead of immediately highlighting shortcomings in the enterprise context, I’ll first outline the high-level mechanics of the LLM that everyday users interact with. That background matters because it creates the contrast needed to see the limits clearly. Over time, people do pick up on those mechanics, but when the discussion shifts to adjacent claims like agentic AI in enterprises, those mechanics rarely make it back into working memory for critical analysis.

It’s well understood by many that the LLM we interact with is the product of both pre-training and post-training phases. After pre-training, it operates purely as a next-token prediction engine, and even after post-training, that core mechanism doesn’t change. It cannot logically reason or perform consistently reliable calculations.

You might be thinking, “But I use reasoning models all the time, and I can upload CSV files for data analysis. What is this person talking about?” Those capabilities are the result of human-driven “grunt” work.

In fact, the pre-trained model, armed with its vast knowledge and strong pattern-recognition ability, is itself the clearest example of the technological leap made possible by the transformer and attention architecture introduced in 2017, eight years ago. Scaling laws delivered tremendous improvements until they hit a plateau (explained well by Cal Newport in a New Yorker article).

LLMs do perform well at tasks such as summarizing, generating new text, synthesizing multiple documents, rephrasing, and knowledge retrieval. Here, my focus will be on expectations related to performing actions, i.e., agentic behavior.

Bag of impressive tricks

Over the past couple of years, the primary frontier has shifted to post-training. Initially, supervised fine-tuning (SFT) combined with reinforcement learning from human feedback (RLHF) focused on improving instruction-following, tone, personality, safety, and helpfulness.

Subsequently, hundreds of thousands of step-by-step problem-solving examples (mostly written manually by experts) are fed into the model at the SFT stage to activate neural-network layers that guide next-token generation to mimic a chain-of-thought pattern. There is no upfront plan; success relies on predicted tokens hopefully landing correctly most of the time. If a user prompt falls in or close to the distribution of examples, outcomes are likely acceptable, but deviations can produce nonsensical outputs. Curating examples for every scenario is impossible, so large companies prioritize high-coverage use cases and carefully curate examples for them.

This also brings us to tool use: training examples at the SFT stage also include explicit calls to Python interpreters, calculators, web searches, and a few other tools. The payoff is high even with these few tools because they satisfy most users’ prompts: performing calculations correctly, grounding answers with web results, and generating table-stakes Python scripts for data analysis. (Note: users see a summarized and rephrased version of the thinking output; the actual chain-of-thought contains custom tokens recognized only by the orchestrator, which invokes tools, gathers results, and feeds them back into the context window for the next round of token generation.) Traditional reinforcement learning (not RLHF, which in my view is closer to supervised learning than to RL) is mostly useful in verifiable domains, so most recent gains are concentrated in coding, SQL, and math.
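The orchestrator mechanic described above can be sketched in a few lines. This is a toy illustration, not any provider’s actual protocol: the `<tool:…>` token format, the `calculator` tool, and the `generate` callback are all hypothetical stand-ins for the real custom tokens and model API.

```python
import re

def calculator(expression: str) -> str:
    # Deterministic tool: the orchestrator, not the model, does the math.
    # eval() is acceptable only in this toy sketch; never eval untrusted input.
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def orchestrate(generate, prompt: str, max_rounds: int = 5) -> str:
    """Repeatedly call the model; when its output contains a tool token,
    run the tool and append its result to the context for the next pass."""
    context = prompt
    output = ""
    for _ in range(max_rounds):
        output = generate(context)
        match = re.search(r"<tool:(\w+)>(.*?)</tool>", output)
        if match is None:
            return output  # no tool call: treat as the final answer
        name, arg = match.groups()
        result = TOOLS[name](arg)
        context += f"\n[tool {name} returned: {result}]"
    return output
```

The point of the sketch is that the “magic” lives in this outer loop: the model only emits tokens, and ordinary code detects them, runs the tool, and feeds results back.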

Despite all this effort, inherent unpredictability (aka hallucinations) persists, producing confident but incorrect answers. To improve the odds of success, approaches resembling trial and error are rebranded as “parallel reasoning,” with the hope that one path will yield results. Even two decades ago, many problems could have been brute-forced had we chosen to apply vast, guided computational power, but silly ROI discussions got in the way. Scaling laws were originally conceived as a way to enhance intelligence in pre-trained models, yet proponents now cite these human- and trial-and-error-driven methods as evidence that scaling still works. But “being less wrong” is insufficient for most applications. A 12-year-old may be less wrong than a 4-year-old, but that does not qualify them for most jobs. Unpredictability in LLMs remains an intrinsic trait, one that can be mitigated in specific scenarios but never fully eliminated.

In ChatGPT, these failures are contained to the individual user, who can see the result, detect errors, and take corrective action: a luxury not available to agents expected to operate autonomously in many enterprise agentic AI scenarios.

You can call all that effort innovative engineering, or be reductive and call it an elaborate effort to hack an LLM derived from the eight-year-old transformer innovation. Does it provide a magical experience to most users in more scenarios? Undoubtedly. So is all that effort worth it? Absolutely!

You could actually zoom out and look at all this effort from a company like OpenAI and rephrase this as “An enterprise investing billions and deploying extensive human resources to optimize LLM-based technology for a large volume of use cases across its vast user base, because the potential payoff appears worthwhile”. At that scale, applying a similar rationale, they may choose to invest in reinforcement learning to develop web-surfing agents, even though inherent unreliability and errors are inevitable.

However, these calculations shift if the enterprise is smaller, and even more so if you aim to use an LLM for specialized workflows serving only a small group of users.

Enterprise reality beyond flashy demos

Let’s sketch out a hypothetical enterprise scenario. While many claims and demos emphasize transforming entire enterprise workflows, we’ll instead consider a simple example. Imagine an AI agent tasked with autonomously completing a workflow that requires calling 3–4 different systems.

Before we explore this further, it’s worth contrasting the development effort required to stitch systems together manually with the value an LLM adds in automating such workflows.

Developer effort vs. LLM value in multi-system workflows

  1. Dev: Define the workflow and enumerate all systems involved.
  2. Dev: Implement API wrappers or helper functions for each system, since in most organizations, systems and data aren’t in a state that allows immediate integration with an agentic AI.
  3. Dev: Craft the system prompt describing the agent’s role, constraints, and output format.
  4. Dev: Define function schemas containing function names and their descriptions, and provide them to the LLM.
  5. Dev: Implement guardrails, safety checks, validation logic, and logging mechanisms around function calls.
  6. LLM: Proposes which function calls to make based on context.
  7. LLM: Generates structured outputs for those function calls (arguments like order IDs, ticket IDs).
  8. Dev: Executes the proposed function calls via API wrappers, enforcing guardrails and safety checks.
  9. Dev: Validates API responses, handles errors, retries if needed, and logs all actions for auditability.
  10. Dev: Sends results or updated context back to the LLM if the next step depends on them.
  11. LLM: Uses updated context to propose the next set of function calls.

Repeat steps 6–11 until the workflow is complete.
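The division of labor in steps 1–11 can be sketched as a single orchestration loop. Everything here is illustrative: the function names (`get_order`, `create_ticket`), the schema shape, and the `llm_propose` callback are hypothetical stand-ins for whatever a real function-calling API and enterprise systems would provide.

```python
import json

# Step 4 (Dev): hypothetical function schemas given to the LLM.
SCHEMAS = [
    {"name": "get_order", "description": "Fetch an order by ID",
     "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "description": "Open a support ticket",
     "parameters": {"order_id": "string", "summary": "string"}},
]

def run_workflow(llm_propose, executors, user_request, max_steps=8):
    """Steps 6-11: the LLM only proposes calls; developer-owned code
    validates, executes, logs, and feeds results back into the context."""
    context = {"request": user_request, "results": []}
    for _ in range(max_steps):
        proposal = llm_propose(context, SCHEMAS)   # steps 6-7: model proposes
        if proposal is None:
            break                                  # model signals completion
        name, args = proposal["name"], proposal["arguments"]
        if name not in executors:                  # steps 5/8: guardrail
            raise ValueError(f"LLM proposed unknown function: {name}")
        result = executors[name](**args)           # step 8: dev executes
        print(json.dumps({"call": name, "args": args}))  # step 9: audit log
        context["results"].append({name: result})  # step 10: feed back
    return context["results"]
```

Note how thin the model’s contribution is in this loop: two of the eleven steps. Every other line is developer-owned plumbing that must be written, tested, and maintained regardless.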

Most of the heavy lifting still falls on the developer: calling functions, implementing validations, guardrails, and safety checks, work made heavier, not lighter, by relying on LLMs. The model mainly proposes function calls and reshapes data into function-ready formats, but since it was never trained on company-specific workflows or terminology, its reliability in producing tokens that lead down the correct path is shaky, and hallucinations are inevitable. Even in a simple setup of four systems with four functions each, the LLM faces 16 choices plus their arguments, amplifying the odds of missed calls or corrupted data transformations. Model weights are frozen after post-training, so the absence of continuous learning remains a well-known impediment that prevents these systems from improving on past mistakes.
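The amplification effect is worth making concrete. Under the simplifying assumption that each model decision succeeds independently with some fixed probability (real failure modes are messier), end-to-end reliability decays geometrically with workflow length:

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Assuming independent decisions, end-to-end success compounds
    multiplicatively across every choice the model must get right."""
    return per_step_accuracy ** steps

# Even a generous 95% per-decision accuracy decays quickly:
for steps in (1, 4, 8, 16):
    print(steps, round(workflow_success_rate(0.95, steps), 3))
```

At 95% per-step accuracy, a 16-decision workflow completes correctly less than half the time, which is why “mostly right” demos do not survive contact with autonomous multi-step execution.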

And unlike traditional software, where execution has near-zero marginal cost, every LLM inference adds variable cost: each stateless API call must include the full system prompt, the current user instruction, the available function schemas, and optionally previous assistant outputs.
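A rough per-call cost model makes this concrete. The token counts and per-million-token prices below are illustrative placeholders, not any provider’s actual rates:

```python
def cost_per_call(system_tokens, schema_tokens, history_tokens, user_tokens,
                  output_tokens, in_price_per_m=2.50, out_price_per_m=10.00):
    """Every stateless call resends the full prompt. Prices are assumed
    placeholders in dollars per million tokens."""
    prompt = system_tokens + schema_tokens + history_tokens + user_tokens
    return prompt / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# A modest agent turn: 1,500-token system prompt, 2,000 tokens of schemas,
# 4,000 tokens of accumulated history, 500-token instruction, 800 generated.
per_call = cost_per_call(1500, 2000, 4000, 500, 800)
print(f"${per_call:.4f} per call")
print(f"${per_call * 1_000_000:,.0f} per million workflow steps")
```

Fractions of a cent per call look negligible until multiplied by the millions of steps an always-on agent fleet would execute; traditional orchestration code incurs none of this.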

Security and privacy risks deepen the problem. The attack surface spans from direct hacks to subtle exploits, like coercing a booking chatbot to reveal hidden mystery hotel names. The risk multiplies when LLMs parse third-party content from emails or chats, where prompt-injection and data-poisoning aren’t just remote possibilities.

Even if we assume zero hallucinations, what remains is an expensive LLM intent and data parser, a role traditional software handles more reliably. Developers still carry the orchestration and validation burden, while the “intelligence” layer adds failure modes without delivering significant value.

Proponents counter with remedies: better system prompts, stronger validators, richer context per API call, pre-chosen function schemas. But how many layers of ritual must developers maintain just to appease the quirks of an LLM? And what happens if the model gets deprecated and its replacement behaves differently? The entire setup risks collapsing like a house of cards.

Others argue LLMs shine when workflows scale. Yet the repetitive tasks they want replaced are usually well-scoped, like the simple example above, where traditional automation suffices. When workflows do grow more complex, unpredictability and errors compound, leaving developers firefighting with little visibility. And if complexity expands further, developers still need to define new functions, update schemas, wire execution paths, and maintain guardrails. The LLM won’t adapt on its own; it must be told explicitly each time. At that point, updating a traditional orchestration layer is usually simpler and sturdier. One might say that at extreme complexity an LLM could outperform, but “better” in that sense still doesn’t mean reliable. And if your system needs to prove repeatable behavior to auditors annually, good luck with that.

Proponents may point to tools like code interpreters, claiming they give agentic AI systems the flexibility to handle unforeseen scenarios. But there’s no guarantee the Python code generated by an LLM will always be reliable and parse inputs consistently.

And if the real value proposition is automating repetitive enterprise tasks, then most scenarios would reduce to a handful of recurring scripts. At that point, why not write them once, which would be more reliable and without ongoing output token costs? It also raises a troubling possibility: if an LLM provider silently tweaks the model to generate more verbose code, enterprises could end up paying materially more over billions of transactions.

Finally, proponents may point to RAG applications. While success is highly dependent on your chunking strategy and on which limited top-K results are fed into the context, you can build a reasonably effective information retrieval system or combine it with simple database lookups. In scenarios where the cost of error is low, like customer support chatbots, and the primary goal is to combine various bits of information in a coherent, human-sounding response or augment it with translation/transcription, this can provide some value in limited contexts, but that is not what I would call agentic.
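To see why chunking and top-K selection dominate RAG quality, here is a deliberately minimal sketch. The word-overlap scorer is a toy stand-in for a real embedding model, and the fixed-size chunker ignores overlap and sentence boundaries, which is exactly the kind of choice that makes or breaks retrieval:

```python
def chunk(text: str, size: int = 40) -> list[str]:
    # Naive fixed-size chunking: real systems tune size, overlap, and
    # boundaries, and those choices heavily influence retrieval quality.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by word overlap with the query (a toy stand-in for
    embedding similarity) and keep only the top K for the context window."""
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```

Whatever falls outside the top K never reaches the model at all, so answer quality is capped by retrieval quality: a pipeline property, not model intelligence.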

Summary of differences between Consumer and Enterprise Agentic AI

1. Knowledge Domain: Consumer tools lean on broad, public training data; enterprise agents must operate on company-specific workflows and terminology the model never saw.

2. Tooling & Verifiability: Consumer use is served by a few heavily curated tools (Python interpreters, calculators, web search) in verifiable domains; enterprises need many bespoke functions whose outputs are hard to verify automatically.

3. Execution Model: A consumer session is a single prompt-and-response exchange; enterprise workflows chain calls across multiple systems, where errors compound.

4. Supervision: The individual user sees every output and can correct it; an autonomous enterprise agent has no human in the loop at each step.

5. Failure Consequence: For a consumer, a hallucination is a personal inconvenience; in an enterprise, it can mean corrupted data, financial loss, or compliance exposure.

Think of an LLM like a skilled guitarist. Years of practice build core skills - chords, scales, and finger techniques - which corresponds to the model’s pre-training. A coach then refines their style and corrects mistakes, similar to supervised fine-tuning (post-training); at this stage, the guitarist is polished, but their skills are mostly fixed. Give them sheet music for a new song, and they can play it immediately while reading it, but won’t remember it permanently - that illustrates in-context learning. To truly add the song to their repertoire, they need repeated practice after post-training, which is like fine-tuning the model to acquire new knowledge or skills. Now imagine asking this guitarist, whose background is in metal, to perform a complex jazz improvisation live in front of an audience that expects flawless execution. They have the underlying musical skill, so they might pull it off occasionally, but without rehearsal specific to this piece and style, mistakes are likely. Similarly, plugging a generic LLM API into bespoke enterprise workflows may work sometimes with guidance from system prompts and function schemas. Fine-tuning on enterprise-specific processes, terminology, and patterns is one way to improve reliability, but it requires skill, effort, and cost - and even then, occasional errors are inevitable.

All is not necessarily lost. Just as OpenAI can justify investing in human-driven fine-tuning to optimize a few tools for broad user needs, companies like Salesforce, ServiceNow, Notion, and Figma can do the same. With their pricing power, they can justify investments in fine-tuning models on their proprietary data models and APIs to handle core use cases that often produce code-like outputs. For example, when a Figma user requests rounded corners, the system translates that request into underlying code and updates the canvas visually.

It may be more practical to give employees access to tools like ChatGPT, Claude, or Gemini, much like they were given Excel, so they can use them for their own custom needs, verify the outputs, and proceed as needed, provided appropriate security and privacy measures are in place.

Expectations for journalists who value nuance

I hope the points explained in detail so far give journalists a solid foundation to scrutinize hand-wavy, outlandish claims about bespoke agentic AI in enterprises. They should separate personal experiences with tools like ChatGPT from enterprise realities and press on key questions: How would it actually work? How reliable and safe is it? What ongoing burden remains on developers? What real value is produced, and at what cost and risk?

Big picture ahead

If my arguments hold, it would be valuable to assess the impact through a holistic evaluation of the entire LLM landscape. Most LLM use cases fall into four categories:

  1. Information retrieval/Search and data analysis (enabled by tools like Python code generation)
  2. Generating text, images, or videos
  3. Coding
  4. Enterprise agentic AI

Let’s briefly analyze the first 3 in terms of value to customers and revenue potential for providers.

Search and Data Analysis: Search delivers strong value despite occasional hallucinations, but a sustainable consumer business model has yet to emerge. Similarly, data analysis tasks like parsing CSVs or pulling data from the web are highly valuable, though for most users they arise infrequently and can usually be handled with free tools.

Text, Images and Video Generation: Text generation for casual use or marketing copy has become table stakes. Video generation, however, stands out: it can provide 3–5 orders of magnitude more value for industry users and creators by eliminating the need for cameras, travel, scene setup, teams, and editing. The size of the paying market for video generation remains to be seen. Use cases like game asset creation, where outputs are inherently subjective, are also strong candidates. Although evolving, this remains in its early stages.

Coding: This has emerged as the strongest domain for LLMs, largely because outputs are verifiable and backed by abundant, high-quality training data. Two primary user groups stand out: (1) seasoned software engineers and (2) “vibe coders.” The vibe coders are unlikely to produce anything meaningful or sustainable beyond disposable apps for personal use. Successful software requires far more than generating lines of code. Proficiency in data modeling, deployment, understanding code for bug fixing, and system design, among other skills, is essential. For seasoned engineers, risks remain: hallucinations and security vulnerabilities introduced through vectors ranging from supply-chain attacks to simple poor exception handling. However, even basic code completion, when combined with disciplined self- and peer review, delivers significant value. Let’s assume, generously, that 10M professional software engineers worldwide each pay $1,000 per month. That works out to $120B annually. With little real differentiation in core technology, suppose five providers (three U.S., two international) split the market evenly. Assuming 90% margins, each would average about $24B in revenue and roughly $21.6B in profit, about the same amount Meta burned through on its hardware division in a single year. In the absence of meaningful differentiation, margins will erode over time, likely shrinking to a quarter or lower.
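The back-of-the-envelope estimate above can be checked in a few lines; the inputs (10M paying engineers, $1,000/month, five even providers, 90% margins) are the article’s deliberately generous assumptions, not market data:

```python
# Reproducing the generous coding-market estimate from the text.
engineers = 10_000_000          # assumed paying professional engineers
price_per_month = 1_000         # assumed subscription price in dollars
annual_market = engineers * price_per_month * 12   # total addressable revenue
providers = 5                   # assumed even split among five providers
revenue_each = annual_market / providers
profit_each = revenue_each * 0.90                  # assumed 90% margin

print(f"market ${annual_market / 1e9:.0f}B, "
      f"per provider ${revenue_each / 1e9:.0f}B, "
      f"profit ${profit_each / 1e9:.1f}B")
```

Even under these best-case assumptions, per-provider profit sits near a single year of Meta’s hardware-division losses, which frames how much weight the agentic-enterprise story must carry in current valuations.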

I wonder how much of the current valuation and investment is predicated on the story of agentic AI in enterprises. If my theory is correct and agentic AI does not succeed as it is being marketed, can the remaining major use cases sustain those valuations? If there is a high likelihood that I am correct, insiders most probably know this as well. So what would they be doing in such a scenario while the world waits for another breakthrough at the level of Transformers?

They would likely double down on coding as a core use case while pursuing tactical opportunities. This would involve developing adjacent sub-products using targeted system-prompt engineering or quick fine-tuning of the underlying model, such as building a study companion, or creating large sets of curated examples for fine-tuning in specific domains, like healthcare, that are largely self-contained to simplify grounding and have the potential to impact a large number of users. Coding, healthcare, and underwhelming outcomes in enterprise settings: where have I heard that recently? I remember: an August 7th interview with the OpenAI COO by Alex Kantrowitz on the Big Technology Podcast.


Link to Article on Medium