AI Engineering Glossary

Plain-language definitions for the terms that come up most when building production AI and LLM systems: RAG, embeddings, agents, MCP, fine-tuning, evaluation, prompt injection, and more, each linked to a deeper guide.

Foundations

Large Language Model (LLM): A neural network, typically a transformer, trained on huge amounts of text to predict the next token. Everything else in this glossary is built on top of that one capability.
Generative AI: The broader category of models that produce new content (text, images, audio, code) rather than classifying or scoring existing input. LLMs are the dominant form of generative AI in software today.
Token / Tokenization: The unit an LLM reads and writes, and the unit providers bill by: roughly a few characters or a word fragment, not a whole word. Pricing, context limits, and latency are all expressed in tokens. Read the full guide on tokens →
Context Window: The maximum number of tokens a model can read in a single request. A bigger window doesn't mean every token in it gets used equally well. Read why longer context windows hurt accuracy →
System Prompt: The instructions set once, outside the user-visible conversation, that define a model's role, constraints, and tone for every turn that follows.
Inference: The act of running a trained model to produce an output, as opposed to training it. Inference cost and latency are what users actually experience.
Hallucination: A confident, fluent output that is factually wrong or unsupported by any source. It can get worse, not better, when a model is fine-tuned on facts it never learned during pre-training. See why fine-tuning can increase hallucination →
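
To make the Token and Context Window entries above concrete, here is a rough sketch of the arithmetic behind token-based pricing and limits. The 4-characters-per-token ratio is only a rule of thumb for English text, the prices are hypothetical parameters rather than any provider's rates, and the assumption that input and output share one window holds for many models but not all.

```python
import math

# Assumption: ~4 characters per token, a rough heuristic for English text.
# A real tokenizer will give different counts, especially for code or other languages.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (not a real tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,   # hypothetical price, in your currency
    output_price_per_million: float,  # hypothetical price, in your currency
) -> float:
    """Cost of one request when input and output tokens are priced separately."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Assumes the prompt and the generated output share a single window."""
    return input_tokens + max_output_tokens <= context_window
```

For a real budget, replace `estimate_tokens` with the tokenizer that matches your model, since the character heuristic can be off by a wide margin.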

Retrieval & Knowledge

Embedding: A vector representation of text, images, or other data, arranged so that semantically similar inputs end up close together in vector space. It is what makes "search by meaning" possible. Read the full guide on embeddings →
Vector Database: A database built to store embeddings and find the nearest ones to a query vector quickly, at scale, under filtering constraints. Pinecone, Qdrant, Weaviate, Milvus, and pgvector are common choices. Compare vector databases →
Retrieval-Augmented Generation (RAG): A pattern where relevant text is retrieved from an external source and injected into a model's prompt at request time, so it can answer using knowledge it wasn't trained on. Read what nobody tells you about RAG in production →
Chunking: Splitting source documents into smaller pieces before embedding them, so retrieval can return focused, relevant text instead of entire documents. Bad chunking is one of RAG's most common failure modes. Read about chunking failure modes →
Fine-Tuning: Continuing a model's training on your own examples so its weights, and therefore its behavior, change. It is for changing how a model acts, not for teaching it new facts. Read how to choose between RAG, fine-tuning, and prompting →
Prompt Engineering: Designing instructions, examples, and structure to get more out of a model's existing capability, without touching its weights or adding external data. Read why prompting is the cheapest rung on the ladder →
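
The Chunking, Embedding, and RAG entries in this section fit together as one pipeline, which the sketch below walks through end to end. It is a toy under stated assumptions: `toy_embed` is a stand-in that counts words rather than calling a real embedding model, the linear scan in `retrieve` stands in for a vector database, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (one simple chunking strategy)."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject retrieved text into the prompt at request time."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Word counts only match on shared vocabulary, so this toy cannot "search by meaning" the way a learned embedding does; it only shows where each step sits in the pipeline.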

Agents & Orchestration

AI Agent: An LLM given instructions, tools, and enough autonomy to decide which actions to take to complete a task, rather than just answering a single prompt. Read about building agents with the OpenAI Agents SDK →
Multi-Agent System: An architecture where several specialized agents, each with a narrow role, collaborate on a task instead of one agent handling everything itself. Read Agent Skills vs Multi-Agent Systems →
Agent Skills: Reusable packages of procedural knowledge (instructions, scripts, and conventions) that an agent loads on demand to perform a task well, as an alternative to splitting that knowledge across multiple agents. Read the full comparison →
Tool Use / Function Calling: The mechanism by which a model emits a structured request to invoke a function with specific arguments, rather than only generating text. See how this relates to MCP →
Model Context Protocol (MCP): An open protocol, introduced by Anthropic, that standardizes how AI applications connect to external tools, data, and prompt templates, so an integration can be built once and reused across applications. Read the full explainer on MCP →
Handoff: A mechanism that lets one agent delegate a task to another, more specialized agent mid-conversation, a core primitive in frameworks like the OpenAI Agents SDK. Read more about handoffs →
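
As an illustration of the Tool Use / Function Calling entry above, the sketch below shows the application side of the exchange: the model emits a structured request, and ordinary code validates it and runs the function. The JSON shape (`name` plus `arguments`) is a generic assumption, not any vendor's exact wire format, and `get_weather` is a hypothetical tool.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical tool; a real one would call a weather service.
    return f"Weather lookup requested for {city}"


# Registry of functions the model is allowed to request.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw: str) -> str:
    """Parse a model-emitted tool call and invoke the matching function."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**args)
```

The model never executes anything itself; it only proposes the call. Keeping an explicit registry means the application, not the model, decides which functions exist.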

Evaluation & Reliability

LLM Evaluation: The practice of systematically measuring whether an LLM application's outputs are actually good, with a representative dataset and consistent metrics, rather than judging from a handful of demo prompts. Read why your demo works but production fails →
LLM-as-a-Judge: Using a separate LLM call to score or grade the output of another LLM call against a rubric, useful for evaluating qualities that are hard to check with a deterministic rule. Read how to use it without fooling yourself →
Golden Dataset: A curated set of inputs with known-correct expected outputs, used as the stable benchmark you re-run every time you change a prompt, a model, or a pipeline. Read how to build one →
Lost in the Middle: The well-documented tendency of LLMs to pay less attention to information placed in the middle of a long input, even when nothing in the prompt has technically exceeded the context limit. Read the full breakdown →
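
The Golden Dataset and LLM Evaluation entries above describe a loop that can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical `system` callable that wraps your prompt, model, and pipeline, and using exact match as the metric; an LLM-as-a-judge scorer could be passed in as `metric` instead.

```python
from typing import Callable

# A tiny illustrative golden dataset: inputs paired with known-correct outputs.
GOLDEN = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "capital of Japan", "expected": "Tokyo"},
    {"input": "capital of Kenya", "expected": "Nairobi"},
]


def exact_match(output: str, expected: str) -> bool:
    """Deterministic metric: case- and whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def evaluate(
    system: Callable[[str], str],
    dataset: list[dict],
    metric: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run every golden example through the system and return the pass rate."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    passed = sum(metric(system(ex["input"]), ex["expected"]) for ex in dataset)
    return passed / len(dataset)
```

Re-running `evaluate` with the same dataset after each prompt or model change turns "it seems better" into a number you can compare.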

Security

Prompt Injection: An attack where text an LLM reads, whether typed directly or hidden in a document, email, or web page, contains instructions that hijack the model's behavior. There is no patch that removes the vulnerability entirely. Read the full guide on prompt injection →
Lethal Trifecta: Simon Willison's framing for when an agent becomes a serious data-theft risk: access to private data, exposure to untrusted content, and the ability to communicate externally, all at once. Read the full breakdown →
Jailbreak: A prompt crafted to make a model ignore its own safety instructions or stated rules. It is distinct from prompt injection, which hijacks a model through content it processes rather than through instructions a user writes directly.
Guardrails: Checks, often other model calls or deterministic rules, that validate inputs or outputs before they reach a user or a sensitive action, used to contain what an agent is allowed to do. Read how guardrails fit into the Agents SDK →
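
The Lethal Trifecta and Guardrails entries above can be combined into a simple deterministic check, sketched below. The class and function names are hypothetical, and the three flags are a coarse self-declared model of an agent's capabilities; nothing here detects those capabilities automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool      # e.g. inbox, internal documents
    sees_untrusted_content: bool  # e.g. web pages, inbound email
    can_send_externally: bool     # e.g. HTTP requests, outbound email


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three risk factors are present at once."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_content
        and caps.can_send_externally
    )


def guard_external_send(caps: AgentCapabilities) -> None:
    """Deterministic guardrail: refuse external sends when the trifecta is complete."""
    if has_lethal_trifecta(caps):
        raise PermissionError("external send blocked: agent has the full lethal trifecta")
```

Removing any one of the three capabilities makes the check pass, which mirrors the framing: the risk comes from the combination, not from any single capability.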

Efficiency & Cost

Quantization: Reducing the numerical precision a model's weights are stored and computed in, which shrinks memory use and speeds up inference at a small, often negligible, accuracy cost. Read the full guide on model optimization →
Pruning: Removing weights or structures from a trained model that contribute little to its output, producing a smaller model with similar accuracy. Read more about pruning →
Knowledge Distillation: Training a smaller "student" model to mimic the outputs of a larger "teacher" model, producing a cheaper model that keeps most of the teacher's capability on the tasks that matter. Read more about distillation →
Latency: The time between sending a request to a model and receiving the full response, driven heavily by how many tokens go in and come out.
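
To illustrate the Quantization entry above, here is a minimal sketch of symmetric 8-bit quantization on a plain list of weights. It assumes one scale factor for the whole list; production schemes typically work per channel or per block on tensors, and the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    if not weights:
        return [], 1.0
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the rounding error is the accuracy cost."""
    return [q * scale for q in quantized]
```

Each weight now needs one byte instead of four (for 32-bit floats), and the reconstruction error per weight is at most about half the scale.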