From LLM to Agentic AI: A Practical Guide to the Terms That Actually Matter

AI² terminology guide · May 2026

Vlad Larichev · Alexey Samoshilov

The AI conversation in industry is drowning in terminology. Vendor pitches, keynotes, and social posts reuse labels like “LLM,” “AI agent,” “agentic,” “RAG,” and “multi-agent” as if they were interchangeable. They are not—and the distinctions matter when you invest, procure, or deploy in manufacturing and engineering environments.

This guide establishes a practical, vendor-neutral vocabulary spanning foundation models, LLMs, retrieval, agents, agentic systems, orchestration, governance, and economics. Vlad Larichev and Alexey Samoshilov co-authored it for engineering, management, and research audiences, so teams can align on what each layer does, how risk scales, and where the integration work actually lives.

A common language for Industrial AI

AI²’s aim is to reduce confusion and enable meaningful dialogue across disciplines. When everyone agrees what “RAG,” “tool use,” or “agentic” refers to, you can move faster on architecture reviews, vendor due diligence, and governance—without talking past each other.

The path from LLM → agent → agentic system is a progression in capability, complexity, and operational risk. Clear definitions are not academic: they determine budgets, staffing, auditability, and what “production ready” should mean on your shop floor.

Foundation model

A foundation model is a large neural network trained on broad data at scale so it can serve as a reusable base for many downstream tasks. The term was popularised by the Stanford ecosystem studying risks and opportunities of such models—see the Stanford Center for Research on Foundation Models (CRFM) for the original framing.

Traditional ML models were often trained for a single task (classify, forecast, detect). Foundation models learn general patterns and are adapted with prompts, retrieval, fine-tuning, or tools rather than always retraining from scratch.

They may be text-only (many LLMs), vision, audio, or multi-modal. For industry, “we build on a foundation model” usually means you are composing on a shared base—not claiming bespoke pretraining for every feature.

Large Language Model (LLM)

An LLM is a text foundation model: it consumes a prompt as tokens and emits text, one token at a time. Architecturally, frontier LLMs almost always build on the Transformer idea introduced in Attention Is All You Need—parallel attention over token sequences.

Training optimises next-token prediction across massive corpora; emergent capabilities (summarisation, code, multi-step reasoning) arise from scale and data diversity—not from a magical separate module.

What an LLM can do alone: generate or transform text, draft procedures, explain code, and chain reasoning inside the context window. What it cannot do alone: access live enterprise systems, reliably know your private manuals without you supplying them, or safely act in OT without a controlled tool layer.

Prompt and prompt engineering

A prompt is the input text (user task, examples, retrieved passages, and instructions). Prompt engineering is the practice of structuring prompts, roles, and examples so outputs are reliable, measurable, and testable—not a one-off creative writing exercise.

Common patterns include zero-shot instructions, few-shot exemplars, chain-of-thought style reasoning steps, and separating stable policy (system prompt) from per-task user content.
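
To make the separation concrete, here is a minimal Python sketch: stable policy in the system message, few-shot exemplars, then per-task user content. The message roles follow the common chat-API convention, and `send_to_llm` is a placeholder for whichever provider client your stack uses.

```python
# A minimal sketch: stable policy in the system message, few-shot
# exemplars, then the per-task user content. `send_to_llm` is a
# placeholder for whichever provider client your stack uses.

FEW_SHOT = [
    {"role": "user", "content": "Summarise: 'Bearing temp exceeded 90 C twice on line 3.'"},
    {"role": "assistant", "content": "Line 3 bearing overheated (>90 C) on two occasions."},
]

def build_messages(system_policy: str, user_task: str) -> list[dict]:
    """Assemble system policy, exemplars, and the user task in order."""
    return [{"role": "system", "content": system_policy},
            *FEW_SHOT,
            {"role": "user", "content": user_task}]

messages = build_messages(
    system_policy="You are a maintenance assistant. Answer only from supplied evidence.",
    user_task="Summarise today's vibration alarms on press 12.",
)
# send_to_llm(messages)  # placeholder: plug in your provider's client
```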

System prompt

The system prompt is the developer-controlled instruction layer that sets role, scope, tone, refusals, and safety posture across a session. Think of it as the job description that keeps a general-purpose model inside your operational boundary.

In regulated environments, the system prompt should be versioned, reviewed, and treated as part of your compliance story alongside logging and access control.

Context window and tokens

The context window caps how many tokens the model can attend to in one request—prompt, retrieved text, tool outputs, and completion combined. There is no durable cross-session memory unless your application stores and re-injects state.

Tokens are the billing and latency unit: rough heuristics are ~0.75 words per token in English, but code and other languages differ. Long manuals and multi-agent loops burn tokens quickly—cost and latency belong in the architecture review, not only in finance after launch.
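
A back-of-envelope estimate makes the budgeting point tangible. The sketch below applies the rough ~0.75 words-per-token heuristic from above; it is not a tokenizer, and code or non-English text will deviate.

```python
# Back-of-envelope budgeting with the rough ~0.75 words/token heuristic
# for English prose; real tokenizers differ, especially for code.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    return round(len(text.split()) / words_per_token)

manual_section = "word " * 6000          # a 6,000-word manual section
print(estimate_tokens(manual_section))   # ~8,000 tokens before any completion
```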

Hallucination

Hallucination means fluent but false or ungrounded outputs. Models optimise likelihood, not truth—so confident wrong torque values, invented citations, or plausible nonsense remain failure modes even as models improve.

Industrial response: combine grounding (RAG, citations, structured checks), output validation, constrained formats, and human review for safety-critical or compliance-bound outputs.

RAG (Retrieval-Augmented Generation)

RAG retrieves relevant chunks from your knowledge base at query time, injects them into the prompt, and asks the model to answer with that evidence in scope. It addresses freshness and proprietary knowledge without always retraining weights.

Quality depends on chunking, embeddings, indexing, re-ranking, and evaluation—not on the logo on the slide. A demo on five PDFs is not proof against fifty thousand messy work instructions.

How a RAG request flows

Documents are split into chunks, embedded, and stored in a vector index. At query time the user question is embedded, similar chunks are retrieved (often re-ranked), injected into the prompt with clear delimiters or citations, and the model answers conditioned on that evidence. If retrieval misses the right passage, the model may still sound authoritative—measure retrieval hit-rate and answer faithfulness, not only BLEU scores.
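
The flow is easier to see in code. This toy sketch uses a deterministic hash-based `embed` purely for illustration; a real pipeline uses a trained embedding model, a vector index, and usually a re-ranking step.

```python
# A toy end-to-end RAG flow. `embed` is a deterministic stand-in purely
# for illustration; production systems use a trained embedding model,
# a vector index, and usually a re-ranking step.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash words into a fixed-size, normalised vector."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

chunks = ["Pump P-101 service interval: 2000 operating hours.",
          "Valve V-7 bolt torque specification: 45 Nm."]
index = [(chunk, embed(chunk)) for chunk in chunks]   # the "vector store"

query = "What is the service interval for pump P-101?"
q_vec = embed(query)
best_chunk, _ = max(index, key=lambda item: float(q_vec @ item[1]))

prompt = (f"Answer using only this evidence:\n"
          f"<evidence>{best_chunk}</evidence>\nQuestion: {query}")
# send_to_llm(prompt)  # the model answers conditioned on the evidence
```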

Fine-tuning

Fine-tuning continues training on a smaller domain dataset to shift behaviour or style—distinct from RAG, which supplies facts at inference time. Parameter-efficient methods (e.g., LoRA/QLoRA) reduce cost versus full fine-tunes.

Instruction tuning and preference alignment (RLHF-style methods; see InstructGPT for the classic formulation) improve instruction-following and safety tone—but they do not replace governed tool access for plant actions.

Use fine-tuning when prompts + retrieval cannot reach required formats, tone, or domain syntax; keep expectations grounded in data governance and retraining pipelines.
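
As a hedged illustration, a parameter-efficient setup with the open-source peft library looks roughly like this; the base model name and target modules are examples that vary by model, and the dataset and training loop are omitted.

```python
# Illustrative LoRA setup with the open-source `peft` library. The base
# model name and target modules are examples and model-specific; the
# dataset and training loop are omitted.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections vary by model
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # trains a small fraction of the weights
```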

Embeddings and vector databases

Embeddings map text or media into vectors where semantic similarity becomes geometric proximity. Vector databases (e.g., Pinecone, Weaviate, Qdrant, Milvus, or Postgres with pgvector) accelerate nearest-neighbour search at scale.

They power RAG, semantic search over maintenance notes, clustering of defect narratives, and hybrid retrieval with keyword filters.
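
A small sketch with the open-source sentence-transformers library shows similarity-as-geometry in practice; the model name is an example, and the brute-force comparison stands in for what a vector database does with indexing at scale.

```python
# Similarity as geometry, sketched with the open-source
# `sentence-transformers` library. The model name is an example; the
# brute-force search stands in for a vector database's indexed lookup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model
notes = ["Pump P-101 leaking at shaft seal.",
         "Conveyor belt misalignment on line 4.",
         "Replaced gearbox oil; vibration back to normal."]
note_vecs = model.encode(notes, normalize_embeddings=True)

query_vec = model.encode("seal failure on a pump", normalize_embeddings=True)
scores = util.cos_sim(query_vec, note_vecs)[0]
print(notes[int(scores.argmax())])  # the semantically closest note
```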

AI agent

An AI agent pairs an LLM with tools, policies, and orchestration so the system can take bounded actions—not only describe them. Typical tools: ERP/CMMS/PLM APIs, SQL, document search, ticket creation, calculators, and controlled code execution.

The critical vendor question is not “do you use GPT-4?” but which tools exist, with what permissions (read-only vs write), and how actions are audited and rate-limited.

Contrast: a standalone LLM might explain that a pump is due for service and list SAP PM fields to fill. An agent with approved write tools—inside your policy envelope—can draft or create the work order, attach procedures, and notify the crew, leaving an auditable trail.

Tool use / function calling

Tool use (function calling) exposes structured actions to the model as JSON-schema-like contracts. The model proposes calls; your runtime executes them and returns observations—preserving a hard security boundary.

This is the bridge from reasoning to doing: the same pattern underpins maintenance copilots, procurement assistants, and document-to-workflow automations.
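
Stripped to essentials, the pattern looks like the sketch below. The schema format and the model’s “proposal” are illustrative rather than any specific vendor’s API; the point is that the runtime executes, never the model.

```python
# The function-calling loop, stripped to essentials. The schema format
# and the model's "proposal" are illustrative, not a specific vendor
# API; the key point is that the runtime executes, never the model.
import json

TOOL_SCHEMA = {
    "name": "create_work_order",
    "description": "Draft a CMMS work order (write access, audited).",
    "parameters": {"type": "object",
                   "properties": {"asset_id": {"type": "string"},
                                  "summary": {"type": "string"}},
                   "required": ["asset_id", "summary"]},
}

def execute(name: str, args: dict) -> dict:
    """The runtime performs the action behind a hard security boundary."""
    if name != "create_work_order":
        raise ValueError(f"unknown tool: {name}")
    return {"work_order_id": "WO-12345", "status": "draft"}  # stub result

# Pretend the model, shown TOOL_SCHEMA, proposed this structured call:
proposal = json.loads('{"name": "create_work_order", "arguments": '
                      '{"asset_id": "P-101", "summary": "Shaft seal leak"}}')
observation = execute(proposal["name"], proposal["arguments"])
print(observation)  # returned to the model as the tool result
```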

Ecosystem libraries (LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, and cloud agent builders) accelerate scaffolding; your moat is contracts on tools, observability, and who can approve what in production.

Agentic AI and agentic systems

Agentic AI refers to systems with meaningful autonomy: multi-step planning, delegation across specialised agents, tool loops, and recovery paths. It is not synonymous with “has a chat UI.”

Industrial illustration: an anomaly triggers a diagnostic agent (logs + manuals), a planning agent checks production impact, a procurement agent checks spares, a compliance agent checks permits, and a coordinator proposes a plan with human approval before execution.

Multi-agent systems versus agentic systems

Multi-agent systems (MAS) are a classical field—any society of autonomous agents interacting in an environment. Agentic stacks today usually mean LLM agents coordinating via language, tools, and orchestrators rather than purely hand-authored rules.

Ask vendors whether “multi-agent” refers to mature control-theoretic MAS or LLM-driven coordination—governance and testability differ materially.

Orchestration

Orchestration decides which agent runs when, how state is passed, where humans intervene, and what happens on failure. Graph frameworks (e.g., LangGraph-style designs), workflow engines, and explicit state machines increase traceability versus opaque prompt spaghetti.

For auditability in plants, the orchestration layer is often as important as model choice.
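
Here is a minimal sketch of what an explicit state machine can look like, with stub functions standing in for agents or tools; every transition is visible and loggable, which is exactly what plant auditors ask for.

```python
# A minimal explicit state machine: named steps, visible transitions,
# and a traceable log. The step functions are stubs standing in for
# agents or tool calls; the human checkpoint lives in `approve`.
from typing import Callable

def diagnose(state: dict) -> str:
    state["diagnosis"] = "shaft seal wear"        # stub diagnostic agent
    return "plan"

def plan(state: dict) -> str:
    state["plan"] = "replace seal, 2 h outage"    # stub planning agent
    return "approve"

def approve(state: dict) -> str:
    state["approved"] = True                      # human approval gate
    return "done"

STEPS: dict[str, Callable[[dict], str]] = {
    "diagnose": diagnose, "plan": plan, "approve": approve,
}

state, node = {"alarm": "vibration high on P-101"}, "diagnose"
while node != "done":
    print(f"step={node}")        # every transition is logged for audit
    node = STEPS[node](state)
print(state)
```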

Guardrails

Guardrails span input filters, schema validation for tool args, retrieval allow-lists, output verifiers, spending caps, and role-based tool permissions. They turn “powerful” into “permissible” in safety- and quality-critical domains.
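
Here is one hedged sketch of the schema-validation and permission pieces, with pydantic as an example validator; field names and roles are illustrative.

```python
# A guardrail sketch: schema-validate tool arguments and enforce a
# role-based write permission before anything executes. Field names,
# roles, and pydantic itself are illustrative choices.
from pydantic import BaseModel, Field, ValidationError

class WorkOrderArgs(BaseModel):
    asset_id: str = Field(pattern=r"^[A-Z]-\d{3}$")  # e.g. "P-101"
    summary: str = Field(max_length=200)

WRITE_ROLES = {"maintenance_planner"}  # role-based tool permissions

def guarded_create(raw_args: dict, caller_role: str) -> dict:
    if caller_role not in WRITE_ROLES:
        raise PermissionError(f"role '{caller_role}' lacks write access")
    try:
        args = WorkOrderArgs(**raw_args)  # reject malformed model output
    except ValidationError as err:
        return {"error": str(err)}        # fed back to the model, not executed
    return {"status": "queued", "asset": args.asset_id}

print(guarded_create({"asset_id": "P-101", "summary": "Seal leak"},
                     caller_role="maintenance_planner"))
```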

Human-in-the-loop (HITL)

HITL calibrates autonomy to consequence: human-in-the-loop for approvals, human-on-the-loop for supervised autonomy, human-out-of-the-loop only where hazards and verification are provably bounded.

Inference

Inference is forward-pass execution of a trained model—what happens on every user request. Latency and cost scale with model size, context length, and the number of serial LLM steps in an agent workflow.

Grounding

Grounding ties answers to evidence: citations to manuals, cross-checks against MES/ERP fields, or knowledge-graph constraints. It is the primary mitigation for hallucination risk when numbers or procedures matter.

MCP (Model Context Protocol)

The Model Context Protocol (Anthropic, 2024) standardises how clients connect to tool/data “servers,” reducing one-off integrations as your agent surface area grows across PLM, MES, CMMS, and ITSM.

Tokens and pricing

Commercial APIs typically meter input and output tokens separately; agent loops multiply calls. Token budgets belong next to SLAs and unit economics in the business case.
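
The arithmetic is simple but worth writing down. The prices below are illustrative placeholders, not any provider’s rate card.

```python
# Back-of-envelope unit economics. Prices are illustrative placeholders,
# not any provider's rate card; input and output are metered separately
# and serial agent steps multiply calls.
PRICE_IN, PRICE_OUT = 3.00, 15.00  # example $ per million tokens

def request_cost(tokens_in: int, tokens_out: int, llm_calls: int = 1) -> float:
    per_call = tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT
    return per_call * llm_calls

print(f"${request_cost(4000, 800):.4f}")               # one chat turn
print(f"${request_cost(4000, 800, llm_calls=8):.4f}")  # 8-step agent loop
```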

Model sizes: frontier, mid-tier, and small

Frontier models maximise quality for hard reasoning; mid-tier models balance cost and capability; small or edge models support latency, offline, or data-sovereignty constraints. Heterogeneous routing (cheap model first, escalate on uncertainty) is increasingly common.
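
Below is a routing sketch with stubbed models and a stand-in confidence score; real systems trigger escalation with calibrated scores, output validators, or self-consistency checks.

```python
# A heterogeneous routing sketch: try the small model first, escalate on
# uncertainty. Both models and the confidence score are stubs; real
# systems use calibrated scores, validators, or self-consistency checks.
def small_model(question: str) -> tuple[str, float]:
    return "Check the lubrication schedule.", 0.55     # (answer, confidence)

def frontier_model(question: str) -> str:
    return "Inspect drive-end bearing; replace within 200 h."  # stub

def route(question: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer                  # cheap path: good enough
    return frontier_model(question)    # escalate only when unsure

print(route("Why is vibration rising on pump P-101?"))
```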

Open source versus closed source models

Closed API models offload ops but raise data-handling questions. Open-weight models you host yourself shift responsibility to your platform team but can satisfy air-gapped or residency requirements—trade-offs are organisational, not only technical.

Physical AI

Physical AI is AI bound to the physical world through sensing, control, and actuation—production lines, robots, inspection cells, energy systems, and mobility. It intersects Industrial AI where decisions must meet timing, determinism, interlocks, and safety integrity levels.

Language-only stacks do not replace PLC/SCADA discipline; they augment planning, vision, diagnostics, and HMI experiences when interfaces and guardrails are engineered deliberately.

Putting the stack together

Layer 1: foundation model.
Layer 2: prompts/system policies and optional fine-tuning.
Layer 3: knowledge (RAG, graphs, structured stores).
Layer 4: tools/function calling into enterprise systems.
Layer 5: single-agent loops.
Layer 6: multi-agent orchestration.
Layer 7: guardrails, monitoring, HITL, and audit logs.

Teams that overspend on Layer 1 while skipping retrieval, tool governance, or tests usually underperform teams with a smaller model inside a disciplined stack.

Vendor conversation decoder

“We have agents” → which tools, with which permissions?
“Agentic platform” → how many agents, what orchestration, where are checkpoints?
“RAG included” → evaluation harness, chunking, re-ranking, failure telemetry?
“Powered by GPT-4/Claude/Gemini” → how integrated—not which sticker.

Conclusion

Shared definitions let engineering, management, and research spend time on decisions and risk—not on translating buzzwords. Start with the stack mental model above, insist on demonstrations against your data and systems, and scale autonomy only where verification matches impact.

The AI² – Association for Industrial AI is an independent practitioner network advancing responsible Industrial AI. Explore membership and programmes at Join AI². Have a term we should extend? Use Contact—this guide will evolve with the field.
