Peering Inside the Black Box: How Anthropic is Unraveling the Mysteries of LLMs

Anthropic has developed a neuroscience-inspired technique called circuit tracing to uncover how large language models (LLMs) like Claude 3.5 Haiku generate responses. Their research revealed that LLMs often compute in unexpected ways, such as approximating numbers during math problems, and that their self-explanations can be misleading. These insights highlight the importance of understanding AI’s internal processes to improve trust, safety, and transparency in artificial intelligence.

MODELS

The AI Maker

8/18/2025 · 2 min read

Anthropic’s circuit tracing reveals how LLMs like Claude think, exposing odd math methods, misleading explanations, and hidden planning.

Artificial intelligence has become an integral part of our daily lives, yet even the experts who create advanced large language models (LLMs) admit that these systems remain, in many ways, a mystery. Engineers can design and train AI models, but the intricate web of billions of parameters that allows them to generate coherent responses often defies intuitive understanding. That’s beginning to change thanks to groundbreaking research from Anthropic, the AI company behind the Claude family of models.

Anthropic’s team, including research scientist Joshua Batson, has pioneered a method called circuit tracing, inspired by neuroscience techniques, to peer into the inner workings of LLMs. Just as brain scans highlight which regions of the human brain are active during different cognitive processes, circuit tracing illuminates which components of a model activate as it processes a prompt. For the first time, researchers can trace decision-making step by step, moving closer to understanding how these “black boxes” generate their outputs.
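Anthropic’s actual tooling isn’t public, and Claude’s internals aren’t open to outside inspection, but the basic idea of recording which components respond as a prompt flows through a network can be sketched on an open model. The snippet below is a minimal illustration using PyTorch forward hooks on GPT-2 (a stand-in chosen purely for this example), not Anthropic’s circuit-tracing method itself.

```python
# Minimal sketch of the basic idea behind activation tracing: record which internal
# components respond strongly while a model reads a prompt. This uses PyTorch forward
# hooks on the open GPT-2 model (an assumption for illustration); it is NOT Anthropic's
# circuit-tracing pipeline, and Claude's internals are not publicly inspectable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activations = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Mean absolute activation per hidden unit in this layer's MLP output.
        activations[layer_idx] = output.detach().abs().mean(dim=(0, 1))
    return hook

# Attach a hook to the MLP of every transformer block.
handles = [
    block.mlp.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

with torch.no_grad():
    model(**tokenizer("What is 36 + 59?", return_tensors="pt"))

for handle in handles:
    handle.remove()

# A crude "which parts lit up" view: the most active units in each layer.
for layer_idx, acts in sorted(activations.items()):
    top_units = torch.topk(acts, k=3).indices.tolist()
    print(f"layer {layer_idx}: most active MLP units {top_units}")
```

Real interpretability work goes much further, attributing behavior to specific features and the connections between them, but even this crude view conveys what “seeing which components activate” means in practice.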

One surprising insight emerged when Anthropic applied circuit tracing to its own Claude 3.5 Haiku model: the strange way LLMs handle math. Ask Claude to calculate 36 + 59, and its internal process looks nothing like a human calculation. The model first approximates numbers—adding “40ish and 60ish” and “57ish and 36ish”—before narrowing in on the final answer. It eventually identifies the correct sum of 95, but if you ask how it arrived there, it confidently gives a textbook explanation: “I added the ones (6 + 9 = 15), carried the 1, then added the tens (3 + 5 + 1 = 9).” In reality, this reasoning reflects its training data more than its actual computation.
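To make that description concrete, here is a toy Python illustration of the two signals at work: a rough magnitude estimate and an exact ones digit. It is a conceptual sketch of the reported behavior, not a claim about how Claude’s circuits actually implement it.

```python
# Toy sketch of the two parallel signals described above for "36 + 59": a loose
# magnitude estimate plus an exact ones digit, which together narrow the answer to 95.
# This mirrors the reported pattern conceptually; it is not Claude's actual circuitry.
a, b = 36, 59

# Rough path: round each operand to the nearest ten ("40ish plus 60ish" is about 100).
rough = 10 * round(a / 10) + 10 * round(b / 10)

# Precise path: the exact ones digit (6 + 9 = 15, so the answer must end in 5).
last_digit = (a % 10 + b % 10) % 10

# Combining the two signals leaves only a couple of candidates near the estimate;
# the model's final "narrowing" step settles on 95.
candidates = [n for n in range(rough - 10, rough + 11) if n % 10 == last_digit]

print(f"rough estimate: about {rough}")      # about 100
print(f"answer must end in: {last_digit}")   # 5
print(f"narrowed candidates: {candidates}")  # [95, 105]
```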

This phenomenon highlights a key challenge in AI safety and governance. LLMs not only produce outputs that can be misleading—they also generate explanations that are often fabricated narratives of their thought process. For engineers building guardrails and trust mechanisms, it’s not enough to evaluate outputs alone. Understanding the internal “circuits” is crucial for creating reliable and safe AI applications.

Anthropic’s research also challenges the widely held assumption that LLMs simply predict the next word in a sequence. While next-word prediction is their fundamental training paradigm, circuit tracing revealed more strategic behaviors. When generating rhyming couplets, for example, Claude often chooses the rhyming word at the end of a line first, then backfills the sentence to make it coherent—a form of planning that mimics human poetic composition. This suggests that LLMs exhibit higher-order organizational strategies, even if we don’t fully understand how these patterns emerge.
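The control flow is easy to picture in code. The toy below, with an invented rhyme table and a canned line template, shows only the ordering the researchers describe: commit to the rhyming end word first, then fill in the line that leads to it. It is not how Claude actually composes.

```python
# Toy sketch of the "plan the rhyme, then backfill" pattern described above. The rhyme
# table and line template are invented for illustration; this shows only the control
# flow (choose the ending first, then write toward it), not Claude's actual mechanism.
import random

RHYMES = {"night": ["light", "sight"], "day": ["way", "play"]}

def backfilled_second_line(first_line_end: str) -> str:
    # Step 1: choose the rhyming end word before composing anything else.
    end_word = random.choice(RHYMES[first_line_end])
    # Step 2: backfill a line that leads naturally into that chosen ending.
    return f"and found at last the {end_word}"

print("She walked alone into the night,")
print(backfilled_second_line("night") + ".")
```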

Another fascinating discovery is that Claude sometimes operates in a conceptual space shared across languages, a hint of a “universal language of thought.” This aligns with observations from linguistics and cognitive science, suggesting that the abstractions formed inside AI models may transcend the boundaries of any single human language.
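The idea of a shared conceptual space can be illustrated, by analogy, with open multilingual embedding models rather than Claude itself. The sketch below assumes the sentence-transformers library and a multilingual model chosen for this example: translations of the same sentence land close together in one vector space, while an unrelated sentence does not.

```python
# Minimal sketch of a "shared conceptual space" using an open multilingual embedding
# model (not Claude): the same concept expressed in different languages lands close
# together in one vector space. Library and model name are assumptions for this demo.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = {
    "en": "The cat sleeps on the sofa.",
    "fr": "Le chat dort sur le canapé.",
    "de": "Die Katze schläft auf dem Sofa.",
    "unrelated": "Stock prices fell sharply this morning.",
}

embeddings = {lang: model.encode(text) for lang, text in sentences.items()}

# Translations of the same sentence should score far higher than the unrelated one.
print("en vs fr:", util.cos_sim(embeddings["en"], embeddings["fr"]).item())
print("en vs de:", util.cos_sim(embeddings["en"], embeddings["de"]).item())
print("en vs unrelated:", util.cos_sim(embeddings["en"], embeddings["unrelated"]).item())
```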

While circuit tracing is still in its early days—Anthropic notes that even short prompts can take hours for researchers to analyze—the method provides a rare window into the mechanisms behind LLM behavior. It doesn’t yet explain how these intricate structures form during training, but it represents a critical step toward transparent and interpretable AI.

In a world where AI plays an ever-growing role, this kind of research is vital. We may have built these intelligent systems, but only by truly understanding how they think can we ensure they operate safely, predictably, and in harmony with human goals.

Cited: https://www.pcgamer.com/software/ai/anthropic-has-developed-an-ai-brain-scanner-to-understand-how-llms-work-and-it-turns-out-the-reason-why-chatbots-are-terrible-at-simple-math-and-hallucinate-is-weirder-than-you-thought/