How AI Actually Works: A Visual Guide
Nobody really knows what this means. Not most engineers, not most executives, not most of the people building products on top of it. They know what it does. They don't know how. Because at the bottom of it, it's a math problem. And people aren't good at math problems.
This is my attempt to fix that. No hand-waving, no jargon walls, no "it's like a brain." It's not like a brain. It's linear algebra. And if you can follow a recipe, you can understand it.
Hot Dog or Not
Let's start with what most people think AI is.
In 2017, Silicon Valley gave us the perfect distillation of classical machine learning: an app that looks at a photo and tells you whether it's a hot dog. That's it. Hot dog, or not hot dog. The audience laughed because it seemed absurd. But that's genuinely what traditional machine learning does — it takes an input, extracts features, and sorts things into buckets.
You show it ten thousand pictures of hot dogs and ten thousand pictures of not-hot-dogs. It learns the patterns — the shape, the color, the bun, the mustard line. Then you show it a new picture and it says: 94% hot dog. That's classification. It's pattern matching with statistics. And it works incredibly well for exactly that kind of problem.
Every classical ML model works like this. Spam filter? Classification — spam or not spam. Fraud detection? Classification — fraud or legit. Tumor detection? Classification — malignant or benign. You define the categories. You label the training data. The model learns to sort.
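To make the contrast concrete, here's a minimal sketch of the last step of any classifier: a fixed set of buckets, a score for each, and a softmax that turns scores into probabilities. The feature extraction is stubbed out with made-up numbers; in a real model those scores come from layers learned on the labeled photos.

```python
import numpy as np

LABELS = ["hot dog", "not hot dog"]  # the buckets are fixed before training ever starts

def softmax(scores):
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

# Pretend these are the raw scores (logits) a trained network produced for one photo.
logits = np.array([2.7, 0.1])

probs = softmax(logits)
for label, p in zip(LABELS, probs):
    print(f"{label}: {p:.0%}")   # e.g. "hot dog: 93%"
```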
This is not what ChatGPT does.
The Leap
Here's the difference, and it's fundamental.
A classifier answers: "Which bucket does this go in?" It has a fixed number of answers. Hot dog or not. Spam or not. Cat, dog, or fish. The answers are defined before the model is ever trained. It can't surprise you. It can't generate something new. It sorts.
A large language model answers a completely different question: "Given everything that's come before, what word comes next?"
That's it. That's the whole trick. Everything you've seen from ChatGPT, Claude, Gemini, Llama — every poem, every essay, every piece of code, every conversation that made you wonder if it was alive — is the result of a machine that does one thing: predict the next word. Then the next one. Then the next one. One word at a time. Billions of times.
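The generation loop itself is almost embarrassingly small. Here's a sketch of it, with the entire model stubbed out as a hypothetical `predict_next_token` function; everything interesting in the rest of this post happens inside that stub.

```python
def generate(prompt_tokens, predict_next_token, max_new_tokens=200, eos_id=0):
    """Autoregressive generation: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_id = predict_next_token(tokens)  # the entire model lives behind this call
        if next_id == eos_id:                 # a special "stop" token ends the loop
            break
        tokens.append(next_id)
    return tokens
```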
The classifier is a sorter. The language model is a generator. And that distinction changes everything.
But how do you get from "predict the next word" to "write me a Shakespearean sonnet about database migrations"? That's where the math starts. Let's build it from the ground up.
Words Become Numbers
Computers don't understand words. They understand numbers. So the first step is converting text into numbers — a process called tokenization.
But you don't tokenize by whole words. The vocabulary would be too large — English alone has over a million words if you count all the forms, slang, proper nouns, technical terms. Instead, modern models use subword tokenization. The text is split into pieces that might be whole words, parts of words, or even single characters. The algorithm (usually BPE — Byte Pair Encoding) figures out the most efficient split based on how frequently sequences of characters appear in the training data.
Every piece gets a number — its token ID — from a fixed vocabulary of 30,000 to 100,000 tokens. This is the model's alphabet. Every sentence, every paragraph, every book it's ever read or will ever read is converted into a sequence of these numbers.
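Here's a toy illustration of the end result, not real BPE: a greedy longest-match split over a made-up vocabulary. Real tokenizers learn their merges from data and can cover any byte sequence; this just shows how one string becomes a sequence of subword IDs.

```python
# Toy vocabulary: subword piece -> token ID (entirely made up for illustration)
VOCAB = {"un": 101, "believ": 102, "able": 103, "token": 104, "ization": 105, " ": 106}

def tokenize(text):
    """Greedy longest-match split: not real BPE, but the same end result --
    text becomes a sequence of subword token IDs."""
    ids = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):   # try the longest matching piece first
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("unbelievable tokenization"))
# -> [101, 102, 103, 106, 104, 105]  ("un", "believ", "able", " ", "token", "ization")
```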
Here's the critical thing: at this point, the numbers are arbitrary. Token 3857 doesn't mean "cat" in any deep way. It's just a lookup index. The meaning comes next.
And text isn't the only thing you can tokenize. Multimodal models — GPT-4o, Gemini, LLaVA — apply the same idea to images, audio, and video. An image gets sliced into a grid of patches (say, 16×16 pixels each). Each patch becomes a token. Audio gets converted to a spectrogram and sliced the same way. These visual and audio tokens flow through the exact same transformer architecture as text tokens. The model learns cross-modal attention — which parts of an image relate to which words. This is why you can ask a model "What's in this photo?" and get a coherent answer. It's not a separate vision system bolted on. It's tokenization applied to pixels instead of characters.
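As a sketch of the image side, here's how a square image can be sliced into 16×16 patches and flattened into a sequence of "image tokens." The projection at the end stands in for a learned layer that maps each patch into the same embedding space as text tokens; the weights here are random placeholders, just to show the shapes.

```python
import numpy as np

image = np.random.rand(224, 224, 3)          # a 224x224 RGB image
patch = 16

# Slice into a 14x14 grid of 16x16 patches, then flatten each patch into one vector
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)                          # (196, 768): 196 "image tokens"

# A learned linear projection maps each patch into the model's embedding space.
embed_dim = 1024
projection = np.random.rand(patch * patch * 3, embed_dim)
image_tokens = patches @ projection           # (196, 1024), ready for the transformer
```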
Numbers That Mean Things
Each token ID gets mapped to a vector — a list of numbers, hundreds or thousands long. In GPT-4-class models, each token becomes a list of over 12,000 numbers. This is the embedding.
Why so many numbers? Because each number represents one dimension of meaning. Not a meaning we designed — a meaning the model learned during training. One dimension might loosely correlate with "is this a living thing?" Another with "is this formal or casual?" Another with some abstract relationship no human has a name for. The model figured out what dimensions of meaning are useful for predicting the next word — and it needs thousands of them.
The remarkable property of these embeddings is that similar meanings end up near each other in the vector space. "Cat" and "dog" are close together. "King" and "queen" are close together. And — this is the wild part — the relationships between words are encoded as directions. The direction from "man" to "woman" is roughly the same as the direction from "king" to "queen." You can literally do arithmetic on meaning: king − man + woman ≈ queen.
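You can check this yourself with any off-the-shelf set of word embeddings. The sketch below assumes you've loaded such vectors into a dictionary called `vectors` (word2vec, GloVe, anything similar works); the arithmetic plus cosine similarity is the whole trick.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to b - a + c, e.g. king - man + woman."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# analogy(vectors, "man", "king", "woman") tends to return "queen"
```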
This isn't magic. It's geometry. Words that appear in similar contexts during training end up with similar vectors. The math just... works. The model discovered the structure of language by staring at enough of it.
Paying Attention
Now every token is a vector. But meaning doesn't exist in a vacuum — it depends on context. The word "bank" means something different in "river bank" and "bank account." How does the model figure out which meaning to use?
Self-attention. This is the mechanism at the heart of every modern language model, and it's the single most important idea in all of this.
For every token in the sequence, the model asks: "Which other tokens in this sentence should I pay attention to in order to understand this one?" It then computes attention weights — a score for every pair of tokens — that determine how much each token influences every other token's representation.
Consider the sentence: "The cat sat on the mat because it was tired." What does "it" refer to? The cat, not the mat. The attention mechanism learns to assign a high weight from "it" to "cat" — the model literally learns pronoun resolution from seeing millions of examples. No one programmed a grammar rule. The math figured it out.
Each token doesn't just pay attention once — it does it across multiple attention heads simultaneously. Different heads learn to track different types of relationships: one might focus on syntax, another on semantic meaning, another on positional relationships. A large model might have 96 or more attention heads running in parallel. The results are combined to produce a rich, context-aware representation of each token.
The computation is matrix multiplication. Three matrices — Q (queries), K (keys), and V (values) — each derived from the input by its own learned weight matrix. The formula: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V.
That's it. Every poem, every philosophical argument, every code function these models produce is downstream of this operation. Matrix multiply, scale, softmax, matrix multiply. The rest is engineering.
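Here's that operation written out in plain numpy: a minimal single-head sketch, with the causal mask that decoder models use omitted for clarity and toy dimensions picked arbitrarily.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a token sequence X
    of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project input into queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how much each token attends to every other token
    weights = softmax(scores, axis=-1)       # each row sums to 1: the attention weights
    return weights @ V                       # each output is a weighted mix of value vectors

# Toy shapes: 5 tokens, 64-dimensional model, 16-dimensional head
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))
W_q, W_k, W_v = (rng.normal(size=(64, 16)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (5, 16)
```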
The Assembly Line
One round of self-attention isn't enough. The model needs to build increasingly abstract representations — from individual word meanings, to phrase meanings, to sentence-level understanding, to document-level patterns. It does this by stacking transformer blocks, each one refining the representation.
Each block follows the same recipe: normalize the input, run self-attention, add the result back to the input (a "skip connection" that helps information flow), normalize again, run through a feed-forward neural network, add back again. That's one transformer block.
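In code, one block looks roughly like this. It's a simplified pre-norm sketch: the attention and feed-forward sublayers are passed in as hypothetical callables that map a (seq_len, d_model) array back to the same shape, and layer norm is reduced to plain normalization without learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance (simplified)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def transformer_block(x, attention_fn, feed_forward_fn):
    """One pre-norm transformer block: normalize, attend, add back; normalize, MLP, add back."""
    x = x + attention_fn(layer_norm(x))       # self-attention sublayer, with skip connection
    x = x + feed_forward_fn(layer_norm(x))    # feed-forward sublayer, with skip connection
    return x

def run_stack(x, blocks):
    """Stack many blocks: each one refines the representation of every token."""
    for attention_fn, feed_forward_fn in blocks:
        x = transformer_block(x, attention_fn, feed_forward_fn)
    return x
```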
Now stack 96 of them. The input tokens go in at one end, and at each layer the representation gets richer. Early layers tend to capture syntax and local patterns. Middle layers capture semantic relationships. Deep layers capture abstract, long-range patterns that even researchers struggle to interpret.
At the very end, after the last transformer block, the model projects the final representation onto the vocabulary. Every token in the vocabulary gets a score (a logit). A softmax turns those scores into probabilities. And the highest-probability token? That's the model's answer for "what comes next."
Not Every Expert Shows Up
Here's the scaling problem: a 1.8-trillion parameter model uses all 1.8 trillion parameters for every single token. That's an enormous amount of computation. What if you didn't have to?
Mixture of Experts (MoE) is the architectural breakthrough that changed the economics of large models. Instead of one massive feed-forward network in each transformer block, MoE models have many smaller "expert" networks, anywhere from 8 to a few hundred. A lightweight router (also called a gate) looks at each token and decides which few experts, typically 2, should process it. The other experts sit idle.
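A minimal sketch of that routing step, with made-up dimensions, untrained random weights, and experts reduced to plain linear maps, just to show the mechanics:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token, experts, router_weights, top_k=2):
    """Route one token vector to its top-k experts and mix their outputs
    by the router's (renormalized) scores. The other experts never run."""
    scores = softmax(token @ router_weights)          # one score per expert
    chosen = np.argsort(scores)[-top_k:]              # indices of the top-k experts
    gate = scores[chosen] / scores[chosen].sum()      # renormalize over the chosen few
    return sum(g * experts[i](token) for g, i in zip(gate, chosen))

# Toy setup: 8 experts, each a random linear map standing in for a small MLP
rng = np.random.default_rng(0)
d_model, n_experts = 64, 8
expert_mats = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
experts = [lambda t, W=W: t @ W for W in expert_mats]
router_weights = rng.normal(size=(d_model, n_experts))

token = rng.normal(size=d_model)
out = moe_layer(token, experts, router_weights)
print(out.shape)   # (64,) -- produced by only 2 of the 8 experts
```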
The result: a model can have 1.8 trillion total parameters but only activate 37 billion per token — roughly 2% of the network. The model is large in storage but small in compute. This is why DeepSeek's MoE models shocked the industry in 2025 — they achieved frontier-level performance at a fraction of the compute cost. Nearly every major open-weight model released in 2026 uses MoE.
Different experts tend to specialize, though not in ways humans designed. Some experts handle syntactic patterns, others semantic relationships, others domain-specific knowledge. The router learns to dispatch tokens to the right experts during training. No one tells it how to route — it figures out the optimal assignment by minimizing the same next-token prediction loss.
MoE is also why local inference is getting viable for larger models. A 47-billion parameter MoE model that only activates 8 billion per token can, once quantized, run on high-end consumer hardware at interactive speeds. The total parameter count is a storage number. The active parameter count is what determines speed.
Training the Beast
Now you know the architecture — the machine that takes tokens in and produces probabilities out. But how does it learn to produce good probabilities?
Training. And training is where classical ML and large language models diverge most dramatically.
A hot dog classifier trains on thousands of labeled images. A human looked at each image and said "hot dog" or "not hot dog." The model learns from those labels.
A language model trains on the entire internet. Books, Wikipedia, code repositories, forums, news articles, scientific papers — trillions of tokens. And it doesn't need labels. The labels are built into the data itself: for any sequence of text, the next word is the label. "The cat sat on the ___" — the answer is right there in the training data. It's called self-supervised learning, and it's what makes scale possible.
The training loop is simple in concept: show the model a sequence, ask it to predict the next token, compare its prediction to the actual next token, calculate how wrong it was (the loss), then adjust the model's billions of parameters slightly to make it less wrong next time. Repeat this — billions of times. The loss goes down. The predictions get better. The model gets smarter.
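Here's that loop as a PyTorch-style sketch. The `model` (token IDs in, logits out) and the `batches` iterator over the training corpus are assumed to exist already; the point is how little there is to the loop itself.

```python
import torch
import torch.nn.functional as F

# Assumed to exist: `model` maps token IDs (batch, seq) -> logits (batch, seq, vocab_size),
# and `batches` yields tensors of token IDs drawn from the training corpus.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for tokens in batches:
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # the label at each position is simply the next token
    logits = model(inputs)                            # (batch, seq, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # how wrong was each prediction, and in which direction?
    optimizer.step()                                  # nudge billions of parameters to be slightly less wrong
```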
The scale is staggering. GPT-4-class models have roughly 1.8 trillion parameters. They trained on 13 trillion tokens. Training costs over $100 million in compute. A hot dog classifier has 25 million parameters and trains in an hour on a single GPU. These are different categories of thing entirely. And somewhere in that gap — somewhere between millions and trillions — something qualitatively changes. Capabilities emerge that weren't explicitly trained. Reasoning-like behavior. Code generation. Multilingual translation. Nobody fully understands why scale produces these capabilities. But it does.
From Raw Model to Assistant
A model that's been pre-trained on the internet is impressive but not useful. Ask it "What's the capital of France?" and it might respond with "What's the capital of Germany? What's the capital of Spain?" — because in its training data, questions tend to be followed by more questions (think: quiz pages, FAQ lists). It learned to predict text, not to answer text.
Turning a raw model into an assistant requires two additional steps. This is what separates a base model from the thing you talk to.
Step 1: Supervised Fine-Tuning (SFT). Humans write thousands of example conversations — question and ideal answer pairs. The model trains on these, learning the format of helpful responses. This is where it learns that when a human asks a question, the correct next-token pattern is an answer, not another question.
Step 2: Reinforcement Learning from Human Feedback (RLHF). The model generates multiple responses to the same prompt. Human raters rank them from best to worst. A reward model is trained on those rankings, and then the language model is fine-tuned to maximize the reward model's score. This is where it learns not just to answer, but to answer well — to be helpful, to avoid harmful outputs, to follow instructions precisely.
Frontier models — GPT-4, Claude, Gemini — go further. Proprietary techniques, safety layers, tool use, longer contexts, specialized training data. The base architecture is the same transformer, but the post-training process is where companies differentiate. A model you run on your own GPU (Llama, Mistral, Qwen) has the same fundamental architecture as a frontier model. The differences are scale, training data, and post-training refinement. Not magic — engineering.
There's a third technique that's quietly becoming one of the most important: knowledge distillation. A large "teacher" model generates training data — thousands or millions of high-quality examples — and a smaller "student" model trains on that output. The student doesn't need to learn from the raw internet. It learns from a curated, refined version of what the teacher already knows. This is why local models are improving so rapidly: they're not independently trained from scratch. They're compressed copies of frontier models, distilled into something that fits on your hardware. The tradeoff is real — distilled models lose some reasoning depth and nuance — but for most production tasks, they're good enough. And they're getting better every quarter.
The catch: when models train on AI-generated data recursively — models training on outputs from other models, which trained on outputs from other models — quality degrades. This is called model collapse. The statistical patterns become self-reinforcing and drift from reality. It's why human-curated data remains essential even as synthetic data scales. The best training pipelines in 2026 use a hybrid: synthetic data for volume, human data for grounding.
Rolling the Dice
The model has been trained. You type a prompt. It processes your tokens through the entire transformer stack and produces a probability distribution over its vocabulary — 100,000 tokens, each with a probability. Now what?
It doesn't just pick the highest-probability token every time. If it did, the output would be repetitive and boring — the same safe, obvious word every time. Instead, it samples from the distribution. It rolls a weighted die.
The key parameter is temperature. Temperature controls how "peaked" or "flat" the probability distribution is.
At low temperature (near 0), the distribution becomes extremely sharp — the top token dominates. The model is deterministic, predictable, and focused. Good for factual answers and code. At high temperature (above 1), the distribution flattens — unlikely tokens get a real chance. The model becomes creative, surprising, occasionally incoherent. Good for brainstorming and poetry.
There are other sampling strategies stacked on top — top-k (only consider the k most likely tokens), top-p / nucleus sampling (only consider tokens whose cumulative probability reaches p) — but they're all variations on the same idea: how much randomness do you let into the dice roll?
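Here's a sketch of the whole sampling step in plain numpy: temperature scaling, optional top-k and top-p filtering, then a weighted dice roll over whatever survives. The logits vector is a toy stand-in for the model's real output over 100,000 tokens.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random):
    """Turn raw logits into probabilities and roll a weighted die over the vocabulary."""
    logits = np.array(logits, dtype=np.float64) / max(temperature, 1e-6)  # low temp sharpens, high temp flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:                       # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                       # keep the smallest set of tokens covering probability mass p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[:np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)      # the weighted dice roll

# Toy distribution over a 5-token vocabulary
logits = [2.0, 1.5, 0.3, -1.0, -3.0]
print(sample_next_token(logits, temperature=0.2))   # almost always token 0
print(sample_next_token(logits, temperature=1.5))   # much more varied
```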
This is why you can ask the same question twice and get different answers. The model isn't being inconsistent. It's sampling from a distribution. Different rolls, different results.
The Memory Trick
Here's a problem. The model generates text one token at a time. To generate token #100, it needs to attend to all 99 previous tokens. To generate token #1000, it needs all 999. Every new token requires self-attention over the entire sequence. That's a lot of repeated computation.
The solution is the KV cache — Key-Value cache. Recall that the attention mechanism computes keys (K) and values (V) for every token at every layer. Once computed, these don't change. So the model caches them. When generating the next token, it only needs to compute the new token's query, key, and value — then it can look up all the previous tokens' keys and values from the cache.
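A sketch of the idea for a single attention layer. The projections that produce each token's q, k, and v are replaced with random stand-ins, and the real cache layout (per layer, per head, preallocated) is simplified away; what matters is that nothing already cached is ever recomputed.

```python
import numpy as np

class LayerKVCache:
    """Grow-only cache of keys and values for one attention layer."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        """Attention for the newest token's query against every cached key/value."""
        K = np.stack(self.keys)                      # (tokens_so_far, d_head)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(len(q))             # one score per previous token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

# Each generation step: compute q, k, v ONLY for the new token, append, attend.
rng = np.random.default_rng(0)
cache = LayerKVCache()
for step in range(5):
    q, k, v = rng.normal(size=(3, 16))               # stand-ins for the new token's projections
    cache.append(k, v)
    context_vector = cache.attend(q)                 # uses all cached keys/values, recomputes none
print(len(cache.keys))                               # 5 tokens' worth of cache for this one layer
```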
This is why long conversations use more memory. A 128K-token context window isn't just about the model's ability to "remember" — it's about how much KV cache memory is available. For a 70-billion parameter model, the KV cache for 128K tokens can consume 40+ GB of memory — more than the model weights themselves. This is the real bottleneck for long contexts, not the model's architecture.
When a model "forgets" something from earlier in the conversation, it's usually not that the tokens are gone from the context — it's that the attention mechanism assigns lower weight to distant tokens. The information is there. The math just doesn't prioritize it.
Context windows have grown dramatically — from 4K tokens in 2023 to 128K, 200K, even claims of 10 million tokens in 2026. But there's a gap between advertised and effective context length. A model that claims 200K tokens often degrades significantly past 130K. Information in the middle of very long contexts gets less attention than information at the beginning or end — a phenomenon researchers call the "lost in the middle" problem. And the math doesn't lie: KV cache memory grows linearly with context length, while the attention computation over that context grows quadratically, so long contexts get expensive fast. The engineering solutions (RoPE scaling, sliding window attention, sparse attention) are improving, but they're all tradeoffs between context length and attention quality.
There are ways to cheat — speculative decoding, paged attention, grouped-query attention, KV cache quantization — techniques that dramatically reduce memory use and increase throughput without sacrificing quality. That's a deep dive for the next post.
When the Math Is Wrong
The model produces the highest-probability next token. Usually, that's correct. Sometimes, it isn't. And when it's wrong, it's wrong with complete confidence. This is hallucination, and it's not a bug — it's a structural property of how the system works.
Think about what the model is doing: it's interpolating. It has seen billions of text examples and learned a smooth function that maps context to next-token probabilities. In regions where it has dense training data — common facts, well-documented topics — the interpolation is accurate. But in sparse regions — obscure facts, niche topics, specific dates — the function still produces a confident answer. It has to. The math doesn't have an "I don't know" output. Every point in the probability space gets a value.
When someone says a model is "making things up," that's anthropomorphizing. The model doesn't know the difference between a fact and a fabrication. It doesn't have a concept of truth. It has a concept of probable next tokens given the context. If the most probable continuation of "The first person to walk on the moon was" is "Neil Armstrong," great. If the most probable continuation of "The third person to complete a solo Antarctic crossing in 1987 was" is a confident-sounding name that doesn't exist — well, the math produced it with the same mechanism. There's no alarm bell. There's no uncertainty flag. There's just probability.
This is why hallucinations are so hard to fix. You can reduce them — better training data, RLHF that rewards saying "I'm not sure," retrieval-augmented generation that grounds responses in real documents. But you can't eliminate them without changing the fundamental nature of what the model is: a probability machine that has no concept of ground truth.
Reasoning, or Something Like It
Standard language models predict one token at a time. Each prediction is a single forward pass with no looking ahead: given the context, what's the most likely next token? This works remarkably well for language, but it struggles with problems that require planning. If you need to think three steps ahead to get the right answer, a one-step predictor might take a wrong turn at step one and never recover.
Reasoning models — OpenAI's o1/o3, Anthropic's extended thinking, DeepSeek-R1 — approach this differently. Instead of committing to one token at a time and never looking back, they spend extra compute at inference time generating and evaluating multiple possible chains of thought before answering.
Think of it like this: a standard model is walking down a path and picking the most promising-looking direction at each fork. A reasoning model climbs a tree — it explores multiple branches simultaneously, evaluates where each branch leads, and then picks the path that reaches the best destination. It's search over thought sequences, not just next-token prediction.
The model generates internal "thinking" — chains of reasoning steps that it evaluates and prunes before producing a final answer. Some branches are abandoned when their probability drops. Others are pursued further. The final answer comes from the chain with the highest cumulative probability.
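As a toy illustration of that search idea (not how any particular frontier model is actually implemented), here's a best-of-N sketch: sample several candidate chains, score each by its cumulative log-probability, keep the winner. The `sample_chain` function is a hypothetical stand-in for one sampled chain-of-thought continuation.

```python
import math

def best_of_n(prompt_tokens, sample_chain, n_candidates=8):
    """Toy 'reasoning' search: sample several full chains of thought and keep the one
    the model itself considers most probable. `sample_chain` is assumed to return
    (tokens, [log_prob_of_each_token]) for one sampled continuation."""
    best_chain, best_score = None, -math.inf
    for _ in range(n_candidates):
        tokens, log_probs = sample_chain(prompt_tokens)
        score = sum(log_probs)                 # cumulative log-probability of the whole chain
        if score > best_score:
            best_chain, best_score = tokens, score
    return best_chain
```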
Is this reasoning? That depends on your definition. It's not the way humans reason — it has no world model, no genuine understanding of the problem. But it produces correct answers to problems that require multi-step logical deduction, mathematical proof, and strategic planning. Whether you call that "reasoning" or "very sophisticated pattern matching over reasoning-shaped text" is a question for philosophers. The math doesn't care what you call it.
Grounding the Machine
The model hallucinates because it has no concept of ground truth. But what if you could give it ground truth — right there in the prompt, right when it needs it?
That's Retrieval-Augmented Generation (RAG). Instead of asking the model to answer from its training data alone, you first search a database of real documents, find the most relevant passages, and inject them into the prompt as context. The model generates its answer grounded in actual source material, not statistical interpolation.
The search step uses the same embeddings from earlier in this post. Your question gets embedded into a vector, and a vector database finds the documents with the most similar vectors. The top results get stuffed into the prompt alongside your question. The model sees both: "Here are the relevant facts. Now answer this question."
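A minimal sketch of that retrieval step, assuming you already have an `embed` function (any embedding model works) and a small list of documents. Real systems precompute and index the document vectors in a vector database, but the math is the same.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(question, documents, embed, top_k=3):
    """Embed the question, score every document by similarity, return the top_k.
    `embed` is assumed to map a string to a vector."""
    q_vec = embed(question)
    scored = sorted(documents, key=lambda doc: cosine(embed(doc), q_vec), reverse=True)
    return scored[:top_k]

def build_prompt(question, documents, embed):
    """Stuff the retrieved passages into the prompt ahead of the question."""
    context = "\n\n".join(retrieve(question, documents, embed))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```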
RAG grounds the model's answer in retrieved documents. The model still generates — but now it has evidence, not just memory.
RAG doesn't eliminate hallucinations — the model can still ignore or misinterpret the retrieved context. But it dramatically reduces them for factual queries. More importantly, it lets models work with private data they were never trained on — your company's documentation, your legal contracts, your medical records. The knowledge lives in the database, not in the model's weights. Update the database and the model's answers update immediately, with no retraining.
This is the production pattern behind every enterprise AI deployment in 2026. Not a raw model answering from memory — a system that retrieves, grounds, then generates.
From Prediction to Action
Everything so far has been about generating text. The model takes tokens in, produces tokens out. But what if the output isn't a sentence — it's a decision to do something?
AI agents are the defining application pattern of 2026. An agent is a model that can use tools — call APIs, search the web, run code, query databases, read files. It doesn't just predict the next word. It decides whether to act, which tool to use, interprets the result, and decides what to do next. It's an action loop, not a generation pipeline.
The mechanism is straightforward. During fine-tuning, models are trained to produce a special structured output — a tool call — instead of regular text. When the model "decides" to call a tool, it outputs something like get_weather(city="Tokyo"). The surrounding system executes that call, returns the result, and feeds it back to the model as new context. The model then reasons over the result and either calls another tool or produces a final response.
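A sketch of that loop, with the model call and the tools stubbed out as hypothetical functions; the structure of the loop is what matters, not the names.

```python
def run_agent(user_message, call_model, tools, max_steps=10):
    """Minimal agent loop: the model either emits a tool call (name + arguments)
    or a final text answer. `call_model` and the functions in `tools` are stand-ins
    for a real model API and real tool implementations."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_model(messages)               # returns {"tool": ..., "args": ...} or {"text": ...}
        if "text" in reply:
            return reply["text"]                   # the model chose to answer directly
        result = tools[reply["tool"]](**reply["args"])               # e.g. get_weather(city="Tokyo")
        messages.append({"role": "tool", "content": str(result)})   # feed the result back as new context
    return "Stopped: too many steps."
```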
The critical insight: the model isn't "deciding" anything in the way you or I decide. It's producing the highest-probability next token, and the highest-probability token happens to be a tool call because that's what the training data says should come next in this context. The architecture is the same transformer. The sampling is the same temperature-adjusted distribution. The tool-calling behavior is an emergent property of training on examples of tool use — not a separate reasoning system.
Multi-agent systems take this further: specialized agents coordinate, each with their own tools and expertise. An orchestrator agent breaks a complex task into subtasks and delegates to specialist agents — one for code, one for research, one for writing. This is how modern AI-powered development tools, customer service systems, and research assistants work. Not one model doing everything, but a system of models collaborating through structured communication.
Your GPU vs The Cloud
Everything described above — the embeddings, the attention, the transformer blocks, the sampling — runs on hardware. Specifically, on GPUs doing matrix multiplication. The question is: whose GPUs?
Frontier models run on clusters of hundreds or thousands of data center GPUs — NVIDIA H100s, each with 80GB of memory, networked together. A single GPT-4-class model requires 8 or more H100s just for inference. These are machines you can't buy, running in data centers you can't access, processing your data on someone else's hardware.
But the same architecture scales down. A 7-billion parameter model — Llama 3, Mistral, Qwen — runs on a single consumer GPU. The trick is quantization: reducing the precision of the model's numbers. A model trained in 16-bit floating point (2 bytes per parameter) can be converted to 4-bit integers (half a byte per parameter), cutting memory requirements by 75% with surprisingly little quality loss.
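The memory arithmetic fits in a few lines. These are back-of-the-envelope numbers for the weights alone; real quantization formats add a little overhead for scales and metadata, and the KV cache comes on top.

```python
def model_memory_gb(n_params_billion, bits_per_param):
    """Approximate weight memory: parameters x bits per parameter, ignoring format overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit:  {model_memory_gb(7, bits):5.1f} GB")
    print(f"70B model at {bits}-bit: {model_memory_gb(70, bits):5.1f} GB")
# 7B:  14.0 / 7.0 / 3.5 GB    70B: 140.0 / 70.0 / 35.0 GB
```

At 4 bits, a 70B model's weights come to roughly 35 GB, which is why the pair of 24 GB consumer GPUs mentioned below is enough to hold it.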
This is why local AI matters. A quantized 70B model running on a pair of RTX 4090s in your office produces output quality that would have been state-of-the-art 18 months ago. It runs at 30+ tokens per second. Your data never leaves your machine. There's no API bill. No rate limits. No one else's priorities.
The tradeoff is real: frontier models are genuinely more capable, especially at complex reasoning and long-context tasks. But the gap is narrowing with every open-weight release, and for many production use cases — summarization, classification, extraction, code assistance — a local model is good enough. More than good enough.
The Illusion
You've now seen the whole pipeline. Text goes in. Tokens come out. Between those two events: embeddings, attention matrices, feed-forward networks, probability distributions, temperature sampling. Billions of matrix multiplications. Linear algebra from end to end.
The model doesn't understand your question. It doesn't know what a capital is, what France is, or what "knowing" means. It has performed a mathematical operation — a function that maps a sequence of input tokens to a probability distribution over output tokens — and the highest-probability output happened to be correct.
Is this intelligence? No. Intelligence implies understanding, intent, awareness. The model has none of these.
Is this sentience? Absolutely not. Sentience implies subjective experience, consciousness. The model is a function. It doesn't experience anything. When it says "I think," it's producing the next most-probable token, not reporting an internal state.
Is this useful? Profoundly yes. And that's the part people struggle with. We want intelligence to be the explanation for capability. We want to say it's smart because it does smart things. But the math doesn't work that way. The math says: given enough parameters, enough training data, and enough compute, a next-token predictor can produce outputs that are indistinguishable from human-written text. Not because it understands. Because the statistical structure of language encodes knowledge, and the model has learned that structure.
When you use ChatGPT and feel like you're talking to something that understands you — that's the illusion. Not a deception, not a trick. An emergent property of very good statistics on very large data. The output looks like understanding because language that follows the statistical patterns of understanding reads like understanding.
Nobody really knows what this means. That's what I said on LinkedIn. And now you can see why. The math produces something that walks, talks, and quacks like intelligence — but it's matrix multiplication. It's linear algebra. It's a machine that learned the shape of language so well that it can generate language you can't distinguish from a human's.
The question isn't whether it's intelligent. The question is whether intelligence was ever the right word for what it does. And that — that — is the part nobody has an answer for.
The Skynet Effect
Let's end where we started. The hot dog classifier. 25 million parameters, trained on labeled images, outputs a binary answer. That was 2017.
Today's frontier models have 1.8 trillion parameters, trained on trillions of tokens, producing text that reads like it was written by an expert in whatever you ask about. That's a staggering leap. And it happened in less than a decade.
So naturally, people extrapolate. If we went from "hot dog or not" to "write me a legal brief analyzing contract law in three jurisdictions" in seven years — what's next? Skynet? Superintelligence? The singularity?
This is The Skynet Effect: the assumption that the trajectory from classification to generation continues on a straight line to general intelligence. It doesn't. Because the gap between what we have and what people fear is not a gap of scale — it's a gap of kind.
The jump from "classifies images" to "predicts tokens" is engineering. The jump from "predicts tokens" to "understands meaning" is an unsolved problem in philosophy and cognitive science. One of these gaps has a roadmap. The other does not.
Current models don't understand. They don't have goals. They don't have a world model they can reason about independently. They produce text that looks like understanding because the statistical structure of language encodes the patterns of understanding. That's not the same thing. And scaling it — more parameters, more data, more compute — doesn't automatically produce understanding. It produces better statistics.
Could we get to AGI someday? Maybe. But the path from here to there isn't "make the model bigger." It's "solve problems we don't even know how to formulate yet." Consciousness. Grounding. Intentionality. Causal reasoning. These aren't engineering problems waiting for more compute. They're open questions in philosophy, neuroscience, and cognitive science.
What we have is extraordinary. It's the most powerful text generation technology ever created. It's transforming accessibility, productivity, software development, medicine, education. And it's math. All of it. From the tokenizer to the temperature knob to the KV cache — it's linear algebra and probability.
Now you know how it actually works. What you do with that knowledge is up to you.