How an LLM Works
An LLM generates text by repeatedly predicting the most likely next token, based on the prompt and everything else in its context window.
Key terms
Prompt
The text the user sends in; the starting point of every request.
Tokens + context
Text is split into tokens and loaded into the active context window.
Predict + stream
The model predicts the next token in a loop and streams the answer.
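The loop described above can be sketched in a few lines. This is a toy stand-in, not a real model: the "predictor" here is a hand-written lookup table, and all tokens in it are invented for illustration. The shape of the loop is the point: predict, append to context, emit, repeat.

```python
# Toy stand-in for a real LLM: a lookup table of "most likely next token".
# Every entry here is made up purely for illustration.
VOCAB_NEXT = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<end>",
}

def generate(prompt_tokens, max_tokens=10):
    context = list(prompt_tokens)              # active context window
    output = []
    for _ in range(max_tokens):                # predict-next-token loop
        nxt = VOCAB_NEXT.get(context[-1], "<end>")
        if nxt == "<end>":                     # model signals it is done
            break
        output.append(nxt)
        context.append(nxt)                    # new token joins the context
        print(nxt, end=" ")                    # streaming: emit as produced
    return output

generate(["<start>"])
```

A real model replaces the lookup table with a probability distribution over a large vocabulary, but the control flow is the same one-token-at-a-time loop.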
Core loop
Predict next token
Inputs that matter
Prompt + context + prior tokens
Truth guarantee
None by default
Business impact
How this shapes cost, speed, risk, and control.
Cost sensitivity
Token-driven
Larger prompts and longer outputs both raise cost.
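Because billing is per token, cost is simple arithmetic. The sketch below uses hypothetical per-token prices (not any vendor's real rates) to show that both the prompt side and the output side contribute.

```python
# Hypothetical per-token prices, for illustration only:
PRICE_IN = 0.50 / 1_000_000    # $ per input (prompt) token, assumed
PRICE_OUT = 1.50 / 1_000_000   # $ per output token, assumed

def request_cost(input_tokens, output_tokens):
    # Cost scales linearly with tokens on both sides of the request.
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

small = request_cost(500, 200)        # short prompt, short answer
large = request_cost(20_000, 2_000)   # big context, long answer
```

Stuffing the context window "just in case" multiplies the input term on every request, which is why trimming context is often the cheapest optimization available.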
Latency
Grows with output length
Streaming hides the wait, but more tokens still mean more time.
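A back-of-envelope latency model makes the trade-off concrete. The throughput numbers below are assumptions for illustration, not measurements of any particular model.

```python
TTFT_S = 0.4          # time to first token, seconds (assumed)
TOKENS_PER_S = 50.0   # decode speed, tokens per second (assumed)

def total_latency(output_tokens):
    # Total time = wait for the first token + one-by-one generation.
    return TTFT_S + output_tokens / TOKENS_PER_S

# Streaming shows the first token after ~TTFT_S, so the answer *feels*
# fast, but a 500-token reply still takes the full duration to finish.
```

This is why capping output length (or asking for terse answers) is a latency lever, not just a cost lever.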
Quality
Depends on prompt + context
Better framing and better context data improve answers more than simply reaching for a bigger model.
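"Better context data" in practice often means putting the relevant facts directly into the prompt. A minimal sketch, with illustrative names and text of my own invention:

```python
def build_prompt(question, context_snippets):
    # Ground the model: supply the facts it should answer from,
    # rather than hoping they were in its training data.
    context = "\n".join(f"- {s}" for s in context_snippets)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

prompt = build_prompt(
    "What is our refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
```

With the relevant snippet in the context window, even a modest model can answer correctly; without it, no model size guarantees the right answer.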
Truth guarantee
None by default
The model predicts likely text; it does not fact-check itself.
What can go wrong
Common failure modes to watch for when this concept shows up in production.
Hallucination
The model generates plausible but unsupported content when evidence is missing or weak.
Missing context
Important information is not in the active prompt, so the model cannot use it.
Overconfident tone
Fluent language can read as certainty even when the underlying answer is weak.