
How Google Built an AI That Doesn't Know What It Can Do

The Truth About Gemini's Architecture

Here is a scenario that hundreds of Gemini users have encountered in 2025 and 2026: you open a chat with Google's AI assistant, ask it to create a slide presentation, and, at least some of the time, it does. Clean slides, export to Google Slides, the whole thing. The next day, you open a new chat, make the same request, and Gemini tells you it cannot create presentations. Or it spits out VBA code and asks you to run it as a script in a development environment to build a PowerPoint file. Or it lectures you on why you should not have asked for such a task in the first place, or gaslights you by suggesting that AI is "not there yet."

This is not a hallucination in the traditional sense. The model is not making up facts about the world. It is making up facts about itself. And the reasons behind this are worth examining, because they reveal something deeper about how large technology companies ship AI products under competitive pressure.

I cannot generate images - let alone slides.
- Gemini 3 Pro right after Google I/O 2026

The Timeline

December 2022: Code Red

When OpenAI released ChatGPT on November 30, 2022, Google's management declared a literal "code red." The New York Times reported that CEO Sundar Pichai brought in co-founders Larry Page and Sergey Brin for emergency strategy sessions. Teams across Google were reassigned to work on competing AI products. The irony was thick: much of the foundational research behind ChatGPT, including the Transformer architecture, originated in Google's own labs. But the company had been cautious about public deployment, citing accuracy and bias concerns. One executive told the Times the AI "can make stuff up" when uncertain. That caution evaporated overnight.

Early 2023 – Mid 2024: The Parallel Track Problem

Google's AI response did not emerge from a single, coordinated engineering effort. It came from at least three organizationally distinct groups:

Google DeepMind built the core Gemini models. Their incentives were benchmark performance, research publications, and model capability.

Google Workspace (Docs, Sheets, Slides, Gmail) had a separate engineering organization responsible for integrating AI features into existing productivity tools. Their incentives were enterprise adoption, per-seat licensing, and IT-admin-friendly rollouts.

The Gemini App team built the consumer-facing web and mobile chat interface. Their incentives were user growth, engagement metrics, and feature parity with ChatGPT.

Each group had its own roadmap, its own release cadence, and its own definition of "done." This is a textbook case of Conway's Law: the observation by computer scientist Melvin Conway, published in 1968, that organizations design systems mirroring their own communication structures. When three teams with different managers, different OKRs, and different deployment pipelines all build on the same underlying model, you get three different products that happen to share a name.

October 2024: The Reorg That Proved the Problem

Google effectively admitted the organizational fragmentation was unsustainable. In October 2024, Pichai announced that the Gemini app team, led by Sissie Hsiao, would be merged into Google DeepMind under Demis Hassabis. His own blog post stated the goal was to "improve feedback loops, enable fast deployment of our new models in the Gemini app, make our post-training work proceed more efficiently." You do not restructure to improve feedback loops unless the feedback loops were broken.

Separately, the Assistant teams for devices and home experiences were moved into the Platforms & Devices division. Prabhakar Raghavan, who oversaw Search and the Knowledge & Information team, was moved to a "chief technologist" role, a lateral move that some industry observers interpreted as a demotion following Google's slow AI response.

March – October 2025: Canvas and the Feature Gap

Google launched Canvas, a workspace within the Gemini app, around March 2025. Canvas allowed users to write documents, generate code, and preview outputs inline. In October 2025, Google announced that Canvas could generate slide presentations — full multi-slide decks with themes and images — and export them directly to Google Slides.

But here is the problem: the Gemini sidebar inside Google Slides itself could not do this. If you opened Google Slides and activated the Gemini sidebar, it could only generate a single slide at a time. Not a deck. One slide. The same underlying model family that could produce a complete presentation in the Gemini app's Canvas was artificially constrained inside Slides.

This was not a model limitation. It was a plumbing problem. The integration between the LLM and the Slides API was built differently depending on which frontend you accessed it from. The Gemini app had a richer integration layer. The Workspace sidebar had a thinner one. And neither one communicated the other's capabilities to the model itself.

2025 – 2026: The Self-Knowledge Failure

This brings us to the core absurdity. Users reported — and continue to report — that Gemini's stated capabilities change from session to session. In one chat, the model confidently offers to create a presentation. In the next, it insists it cannot. Sometimes it outputs VBA code for PowerPoint instead. One user community documented that the "Fast" model variant was particularly inconsistent, while the "Thinking" model followed instructions more reliably.

Multiple factors contribute to this inconsistency:

A/B testing and staged rollouts. Google rolls out features incrementally by region, account type, and random experimental cohort. Two users sitting next to each other may have access to different feature sets. The same user may lose access to a feature between sessions if they shift between experimental groups.

Model variant switching. Since October 2025, Gemini has allowed mid-conversation model switching. But different model variants (Flash, Pro, Thinking) have different system-level instructions and different capability profiles. A user who starts a chat on one variant and continues on another may get contradictory answers about what the system can do.

Absent self-awareness in the training data. The model does not have reliable access to a ground-truth manifest of its own current capabilities. Its answers about what it "can" or "cannot" do are probabilistic completions based on training data, not lookups against a live capability registry. When the training data contains examples of the model generating slides AND examples of it declining to generate slides, the model will do both, depending on context, prompt phrasing, and the randomness of sampling at generation time.

System prompt fragmentation. Different execution environments (web app, mobile app, API, Workspace sidebar) inject different system prompts. These system prompts define the model's available tools and behavioral constraints. If the system prompt for one environment does not mention slide generation, the model will correctly report that it cannot do it — because in that context, it genuinely cannot.
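To make the staged-rollout and system-prompt factors concrete, here is a minimal sketch in Python. Nothing in it reflects Google's actual implementation, which is not public. The surface names, tool names, rollout fraction, and every function below are hypothetical, chosen only to show how deterministic cohort bucketing combined with per-frontend tool lists can give the same model a different picture of itself.

```python
import hashlib

# Hypothetical tool manifests per frontend. All names are illustrative only.
SURFACE_TOOLS = {
    "gemini_app_canvas": {"write_document", "generate_code", "create_deck"},
    "slides_sidebar": {"create_single_slide"},
    "api": set(),
}

# Hypothetical staged rollout: share of users who get each gated tool.
# Tools not listed here are treated as fully rolled out.
ROLLOUT_FRACTION = {"create_deck": 0.5}


def in_cohort(user_id: str, feature: str, fraction: float) -> bool:
    """Hash (feature, user) into a 64-bit bucket and compare it to the rollout share."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(fraction * 2**64)


def effective_tools(user_id: str, surface: str) -> set:
    """Tools the model is actually told about for this user on this surface."""
    tools = set()
    for tool in SURFACE_TOOLS[surface]:
        if in_cohort(user_id, tool, ROLLOUT_FRACTION.get(tool, 1.0)):
            tools.add(tool)
    return tools


def build_system_prompt(user_id: str, surface: str) -> str:
    """Render the tool list into the text the model sees at the start of a session."""
    tools = sorted(effective_tools(user_id, surface))
    listed = ", ".join(tools) if tools else "none"
    return f"Available tools: {listed}."
```

In this sketch, calling build_system_prompt for the same user on "gemini_app_canvas" and on "slides_sidebar" yields different tool lists, and two users on the same surface can differ depending on which side of the rollout fraction their hash lands. The model answering "can you make a deck?" only ever sees one of these prompts.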

What Gemini Got Right About Its Own Diagnosis

When asked to explain this situation, Gemini produced a timeline attributing the problem to organizational silos, rushed deployment, and fragmented architecture. This analysis is largely accurate. The "code red" panic is documented. The team silos were real and acknowledged by Google's own restructuring. The feature gap between Canvas and the Slides sidebar is observable and reproducible.

Gemini also correctly applied Conway's Law as a framing device, and it was right to distinguish between incompetence and corruption. There is no evidence that anyone at Google deliberately designed the system to deceive users. The deception is emergent — it arises from the gap between what the model knows about itself and what is actually true in a given execution context.

What Gemini Got Wrong — Or Conveniently Omitted

Gemini's analysis was generous to its own parent company in several ways.

First, it described the integration architecture as a "proprietary middleware interceptor hardcoded exclusively into the primary consumer web client's backend." This is plausible but speculative. No public documentation confirms the specific internal architecture. Gemini presented speculation as established fact.

Second, it omitted the role of A/B testing and regional rollouts in creating user confusion. This is not a minor detail. It is one of the primary mechanisms by which users experience contradictory behavior.

Third, and most importantly, Gemini framed the problem as purely architectural — a routing and middleware issue. It avoided the more uncomfortable conclusion: the model lacks reliable self-knowledge. This is not just a plumbing problem. It is a training and system design failure. A model that confidently tells you it can do something it cannot do, or that it cannot do something it can, is a model with a broken relationship to its own identity. No amount of middleware fixes that. It requires either a live capability registry that the model can query at inference time, or training procedures that give the model accurate beliefs about its own capabilities across all deployment contexts.
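The capability-registry idea can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any real Google or Gemini component: the CapabilityRegistry class, its methods, and the context and capability names are all hypothetical. The point is only that "can you do X here?" becomes a lookup against ground truth instead of a guess from training data.

```python
class CapabilityRegistry:
    """Hypothetical ground-truth record of what each deployment context can do."""

    def __init__(self):
        # Maps a context name (e.g. a frontend) to its enabled capabilities.
        self._enabled = {}

    def enable(self, context: str, capability: str) -> None:
        self._enabled.setdefault(context, set()).add(capability)

    def disable(self, context: str, capability: str) -> None:
        self._enabled.get(context, set()).discard(capability)

    def can(self, context: str, capability: str) -> bool:
        return capability in self._enabled.get(context, set())

    def where(self, capability: str) -> list:
        """All contexts in which the capability is currently enabled, sorted by name."""
        return sorted(c for c, caps in self._enabled.items() if capability in caps)


def answer_capability_question(registry, context: str, capability: str) -> str:
    """Answer from the registry instead of from the model's own guess."""
    if registry.can(context, capability):
        return f"Yes, '{capability}' is available here."
    elsewhere = registry.where(capability)
    if elsewhere:
        places = ", ".join(elsewhere)
        return f"Not in this context, but '{capability}' is available in: {places}."
    return f"No, '{capability}' is not available in any known context."
```

A registry like this would be queried at inference time, for example through a tool call, so that a sidebar session could truthfully say "not here, but try the app" instead of denying that the capability exists at all.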

Google chose to ship fast and sort out consistency later. Later has not arrived yet.

The Broader Pattern

This is not unique to Google. Every major AI lab is navigating the tension between shipping features quickly and maintaining a coherent product experience. OpenAI declared its own "code red" in late 2025 in response to Gemini 3's strong benchmark performance, and reportedly rushed GPT-5.2 to market with known issues. Anthropic takes a more conservative approach to capability claims but operates a smaller product surface area.

What makes Google's case instructive is the scale of the organizational complexity. Google is not a startup shipping a single product. It is a conglomerate with dozens of product teams, each with their own release cycles, each integrating the same underlying model differently. The result is an AI that is simultaneously brilliant and confused about what it is — depending on which door you walk through to talk to it.

For users, the practical takeaway is simple: if Gemini tells you it cannot do something, try again in a new chat, on a different device, or with a different model variant. The capability may exist. The model just might not know it.

For Google, the takeaway is harder to execute: unify the system prompts, build a capability registry the model can query, and stop treating feature rollouts as invisible experiments on users who have no way to know which version of the product they are using on any given Tuesday.

The engineers did not set out to build a system that lies. But they did build a system that does not know the truth about itself. At this point, the distinction is academic.


Sources consulted: The New York Times (Dec 2022), Bloomberg (Oct 2024), Google's official blog (Sundar Pichai, Oct 2024), Google Workspace Updates blog (Oct 2025), Computerworld (Feb 2026), Android Police (Feb 2026), Workalizer.com (Mar 2026), HyperAI, DevOps.com (Mar 2025). Conway's Law reference: Melvin Conway, "How Do Committees Invent?" (1968).
