Tool Calling in Production — Building Reliable AI Agents with External Functions
Tool calling is the mechanism that transforms an LLM from a chatbot into an agent capable of action: calling APIs, querying databases, running code. This article covers how to define tools, handle errors, execute calls in parallel and monitor them in production, with full code examples.
Instead of just answering questions, the model can call an API, query a database, run a calculation or check the state of an external system, and only then formulate a response based on real data. This is the mechanism behind the "intelligence" of modern AI assistants: rather than hallucinating missing facts, the model requests data that actually exists.
How exactly does it work? The model generates JSON with a function name and arguments, but does not execute anything. You receive that JSON, execute the function in your own code (with appropriate permissions, validation and logging) and return the result to the model. The model sees the result and formulates its final response. This is what makes tool calling controllable: the LLM never has direct access to your systems, and every call passes through code you own.
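The round trip described above can be sketched without any SDK. The dictionary below imitates the shape of a tool call in the OpenAI Chat Completions format (a `function` object with a `name` and a JSON-encoded `arguments` string); the `get_weather` tool, its handler and the `call_abc123` ID are hypothetical stand-ins, not part of any real system.

```python
import json

# What the model emits: a request to call a function, not an execution.
# (Shape modelled on the OpenAI Chat Completions format; all values are made up.)
model_output = {
    "id": "call_abc123",
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": '{"city": "Warsaw", "unit": "celsius"}',  # JSON-encoded string
    },
}


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical handler; a real one would call a weather API."""
    return {"city": city, "temperature": 21, "unit": unit}


HANDLERS = {"get_weather": get_weather}


def execute_tool_call(tool_call: dict) -> dict:
    """Run the requested function in our own code and build the reply for the model."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])      # decode the argument string
    result = HANDLERS[fn["name"]](**args)   # the execution happens on our side
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],    # links the result to the request
        "content": json.dumps(result),
    }


tool_message = execute_tool_call(model_output)
print(tool_message["content"])
```

The model only ever sees `tool_message`; permissions, validation and logging all belong inside `execute_tool_call` or the handler itself.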
[Figure: the tool calling mechanism. Cycle: User → LLM → Tool → LLM → Response]
Key insight: the model doesn't "decide what to execute" just once. In complex agents the loop (LLM → tool call → result → LLM) can repeat several times. Set an iteration limit (typically 10) to avoid infinite loops when the model gets stuck cycling between tools.
Defining Tools — JSON Schema and Pydantic Validation
The quality of tool descriptions directly affects whether the model invokes them correctly. Too terse a description → the model doesn't know when to use the tool or passes wrong arguments. Too generic → the model uses the tool everywhere, even when it shouldn't. Pydantic eliminates boilerplate when defining JSON Schema and validates arguments before passing them to the handler.
    from pydantic import BaseModel, Field
    from openai import OpenAI
    import json

    client = OpenAI()

    # --- Input schemas (Pydantic → JSON Schema automatically) ---
    class SearchProductsInput(BaseModel):
        query: str = Field(description="Product search phrase in the catalogue")
        max_results: int = Field(default=5, ge=1, le=20, description="Max results to return")
        category: str | None = Field(default=None, description="Category filter (optional)")

    class GetOrderStatusInput(BaseModel):
        order_id: str = Field(description="Order ID in format ORD-XXXXXX")

    # --- Tool registry (schema + description for the model) ---
    TOOLS = [
        {
            "type": "function",
            "function": {
                "name": "search_products",
                "description": (
                    "Searches products in the catalogue. Use when the user asks about "
                    "availability or wants to compare products. "
                    "Do NOT use for order status checks."
                ),
                "parameters": SearchProductsInput.model_json_schema(),
            },
        },
        {
            "type": "function",
            "function": {
                "name": "get_order_status",
                "description": (
                    "Retrieves order status by ID. "
                    "Use only when the user provides an order number (ORD-XXXXXX)."
                ),
                "parameters": GetOrderStatusInput.model_json_schema(),
            },
        },
    ]

    # --- Tool implementations (your backend code) ---
    def search_products(query: str, max_results: int = 5, category: str | None = None) -> list[dict]:
        # Here: Elasticsearch / Algolia / SQL query
        return [{"id": "P001", "name": f"Product: {query}", "price": 99.99, "stock": 12}]

    def get_order_status(order_id: str) -> dict:
        # Here: order database query
        return {"order_id": order_id, "status": "shipped", "delivery": "2026-06-07"}

    TOOL_HANDLERS: dict = {
        "search_products": search_products,
        "get_order_status": get_order_status,
    }

    # --- Agent loop (max 10 iterations) ---
    def run_agent(user_message: str) -> str:
        messages = [{"role": "user", "content": user_message}]
        for _ in range(10):
            resp = client.chat.completions.create(
                model="gpt-4o", messages=messages, tools=TOOLS, tool_choice="auto"
            )
            msg = resp.choices[0].message
            if not msg.tool_calls:
                return msg.content  # no tool requested: this is the final answer
            messages.append(msg)
            for tc in msg.tool_calls:
                result = TOOL_HANDLERS[tc.function.name](**json.loads(tc.function.arguments))
                messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
        return "Agent iteration limit reached."
Three rules for a good tool description: (1) say when to use AND when NOT to — the model needs contrast to decide well; (2) describe argument formats (e.g. "ORD-XXXXXX") — reduces parsing errors; (3) one tool = one responsibility — don't combine "search and order" into a single function.
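The earlier point that Pydantic validates arguments before they reach the handler can be sketched as follows. This is a minimal illustration, assuming Pydantic v2; the `REGISTRY` mapping, the `validated_call` helper and the six-digit `pattern` for order IDs are assumptions made for the example, since the article only specifies the format as "ORD-XXXXXX".

```python
import json

from pydantic import BaseModel, Field, ValidationError


class GetOrderStatusInput(BaseModel):
    # Assumed pattern: "ORD-" followed by exactly six digits
    order_id: str = Field(pattern=r"^ORD-\d{6}$", description="Order ID in format ORD-XXXXXX")


def get_order_status(order_id: str) -> dict:
    # Placeholder for an order database query
    return {"order_id": order_id, "status": "shipped"}


# Hypothetical registry pairing each tool name with its schema and handler
REGISTRY = {"get_order_status": (GetOrderStatusInput, get_order_status)}


def validated_call(name: str, raw_arguments: str) -> str:
    """Validate the model's JSON arguments, then call the handler."""
    schema, handler = REGISTRY[name]
    try:
        args = schema.model_validate_json(raw_arguments)
    except ValidationError as e:
        # Return a readable error so the model can correct its arguments
        problems = [
            {"field": ".".join(str(part) for part in err["loc"]), "message": err["msg"]}
            for err in e.errors()
        ]
        return json.dumps({"error": "Invalid arguments.", "details": problems})
    return json.dumps(handler(**args.model_dump()))


print(validated_call("get_order_status", '{"order_id": "ORD-123456"}'))
print(validated_call("get_order_status", '{"order_id": "12345"}'))
```

Because the same model produces both the JSON Schema sent to the LLM and the runtime check, the two cannot drift apart.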
Parallel Tool Calls — When the Model Needs Several Tools at Once
GPT-4o and Claude can return several tool_calls in a single response. Executing them one after another wastes time: independent queries can run concurrently with asyncio.gather. With 3 tools at 500 ms latency each, sequential execution takes about 1.5 s and parallel execution about 500 ms.
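The latency arithmetic above can be checked with a small simulation. The sketch below replaces real tools with `asyncio.sleep`, scaled down to 100 ms per call so it runs quickly; `fake_tool` and the three tool names are hypothetical.

```python
import asyncio
import time


async def fake_tool(name: str, latency_s: float) -> str:
    """Stand-in for an I/O-bound tool (HTTP call, database query)."""
    await asyncio.sleep(latency_s)
    return f"{name}: done"


async def run_sequential(calls):
    # Each call waits for the previous one to finish
    return [await fake_tool(name, latency) for name, latency in calls]


async def run_parallel(calls):
    # All calls start together; total time is roughly the slowest call
    return await asyncio.gather(*(fake_tool(name, latency) for name, latency in calls))


async def measure(runner, calls):
    t0 = time.perf_counter()
    results = await runner(calls)
    return list(results), time.perf_counter() - t0


CALLS = [("stock", 0.1), ("price", 0.1), ("shipping", 0.1)]


async def main():
    seq_results, seq_s = await measure(run_sequential, CALLS)
    par_results, par_s = await measure(run_parallel, CALLS)
    return seq_results, seq_s, par_results, par_s


seq_results, seq_s, par_results, par_s = asyncio.run(main())
print(f"sequential: {seq_s:.2f}s, parallel: {par_s:.2f}s")
```

Both runners return results in the same order as the input calls, so switching to `asyncio.gather` does not change how results are matched to their tool calls.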
[Figure: tool call patterns. The 3 main tool calling patterns]
    import asyncio
    import json
    from openai import AsyncOpenAI

    client = AsyncOpenAI()

    async def execute_tool_async(tool_call, handlers: dict) -> dict:
        handler = handlers[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        # Async handlers (httpx, asyncpg) are awaited directly;
        # sync handlers run in a thread pool so they don't block the event loop
        if asyncio.iscoroutinefunction(handler):
            result = await handler(**args)
        else:
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(None, lambda: handler(**args))
        return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}

    async def run_agent_parallel(user_message: str, tools: list, handlers: dict) -> str:
        messages = [{"role": "user", "content": user_message}]
        for _ in range(10):
            resp = await client.chat.completions.create(
                model="gpt-4o", messages=messages, tools=tools, tool_choice="auto"
            )
            msg = resp.choices[0].message
            if not msg.tool_calls:
                return msg.content
            messages.append(msg)
            # Run all tool calls from this turn concurrently
            tool_results = await asyncio.gather(
                *[execute_tool_async(tc, handlers) for tc in msg.tool_calls]
            )
            messages.extend(tool_results)
        return "Iteration limit reached."
When the model decides to run in parallel: the model itself detects independent queries. You can force it to call at least one tool with `tool_choice="required"`, or one specific tool by passing its function name, but you cannot force a specific combination of parallel calls. Beyond that, you only control which tools are available.
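The options for steering tool use can be summarised in a small helper. The returned values follow the `tool_choice` parameter of the OpenAI Chat Completions API; the `build_tool_choice` function itself is a hypothetical convenience wrapper, not part of the SDK.

```python
from typing import Optional, Union


def build_tool_choice(mode: str = "auto", function_name: Optional[str] = None) -> Union[str, dict]:
    """Build a `tool_choice` value.

    "auto"     -> the model decides whether to call a tool
    "required" -> the model must call at least one tool
    "none"     -> the model must not call any tool
    A function name forces that one specific tool.
    """
    if function_name is not None:
        return {"type": "function", "function": {"name": function_name}}
    if mode not in {"auto", "required", "none"}:
        raise ValueError(f"Unsupported tool_choice mode: {mode!r}")
    return mode


print(build_tool_choice())                                  # model decides
print(build_tool_choice("required"))                        # some tool must be called
print(build_tool_choice(function_name="get_order_status"))  # this exact tool
```

None of these values lets you request a particular pair of tools in one turn; if you need calls to happen strictly one at a time, the same API also accepts `parallel_tool_calls=False`.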
Error Handling — What Happens When a Tool Fails
Tool errors fall into three types: retryable (temporary network issue, API timeout — retry with backoff), permanent (invalid arguments, missing permissions — don't retry, return a clear error for the model) and unexpected (code bug — log it, return a generic error). Key: always return something to the model via the `tool` role — never break the loop without a result.
    import json
    import time
    import logging

    logger = logging.getLogger(__name__)

    class ToolError(Exception):
        def __init__(self, message: str, retryable: bool = False):
            self.retryable = retryable
            super().__init__(message)

    def execute_with_retry(tool_call, handlers: dict, max_retries: int = 2) -> str:
        name = tool_call.function.name
        handler = handlers.get(name)
        if handler is None:
            return json.dumps({"error": f"Tool '{name}' not found.", "code": "UNKNOWN_TOOL"})
        try:
            args = json.loads(tool_call.function.arguments)
        except json.JSONDecodeError as e:
            # Permanent error: malformed arguments from the model, so don't retry
            return json.dumps({"error": f"Invalid JSON arguments: {e}", "code": "INVALID_ARGUMENTS"})
        for attempt in range(max_retries + 1):
            try:
                result = handler(**args)
                return json.dumps(result)
            except ToolError as e:
                # Permanent error, or retries exhausted: report it to the model
                if not e.retryable or attempt == max_retries:
                    logger.error("Tool %s failed: %s", name, e)
                    return json.dumps({"error": str(e), "code": "TOOL_ERROR"})
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
            except Exception:
                # Unexpected bug: log the details, return a generic message
                logger.exception("Unexpected error in tool %s", name)
                return json.dumps({"error": "Internal tool error.", "code": "INTERNAL_ERROR"})
        return json.dumps({"error": "Max retries exceeded.", "code": "MAX_RETRIES"})

    def run_agent_safe(messages: list, tools: list, handlers: dict, llm) -> str:
        for _ in range(10):
            resp = llm.chat.completions.create(
                model="gpt-4o", messages=messages, tools=tools, tool_choice="auto"
            )
            msg = resp.choices[0].message
            if not msg.tool_calls:
                return msg.content
            messages.append(msg)
            for tc in msg.tool_calls:
                # Every tool call gets a result, even when it fails
                result = execute_with_retry(tc, handlers)
                messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
        return "Agent iteration limit reached."
Important: when a tool returns JSON with an "error" field, the model will usually decide what to do: try with different arguments, ask the user for clarification or admit it can't complete the task. Trust it, give it a clear message, and don't force behaviour in the loop code.
Monitoring Tool Calls in Production
Without monitoring you don't know which tools are slowest, which fail most often and what each agent session costs in tokens. Log every call with: tool name, latency, argument and result sizes, success/failure. That's enough to identify bottlenecks and set up alerts.
    import time
    import json
    import logging
    from dataclasses import dataclass

    logger = logging.getLogger(__name__)

    @dataclass
    class ToolMetric:
        tool_name: str
        success: bool
        latency_ms: float
        args_bytes: int
        result_bytes: int
        error: str | None = None

    def monitored_call(tool_call, handlers: dict) -> tuple[str, ToolMetric]:
        name = tool_call.function.name
        args_json = tool_call.function.arguments
        t0 = time.perf_counter()
        try:
            result = handlers[name](**json.loads(args_json))
            result_json = json.dumps(result)
            ms = (time.perf_counter() - t0) * 1000
            metric = ToolMetric(
                name, True, round(ms, 1),
                len(args_json.encode()), len(result_json.encode()),
            )
            logger.info("tool=%s latency_ms=%.1f ok", name, ms)
            return result_json, metric
        except Exception as e:
            # Failures are measured too, and still produce a result for the model
            err_json = json.dumps({"error": str(e)})
            ms = (time.perf_counter() - t0) * 1000
            metric = ToolMetric(
                name, False, round(ms, 1),
                len(args_json.encode()), len(err_json.encode()), str(e),
            )
            logger.error("tool=%s latency_ms=%.1f error=%s", name, ms, e)
            return err_json, metric
Key metrics to monitor: latency p50/p95 per tool (outliers indicate dependency issues), error rate per tool (above roughly 5% usually points to a problem with arguments or the dependency), average tool calls per session (too many suggests the model is looping), and token cost per session = prompt tokens × input price + completion tokens × output price.
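The metrics listed above can be derived from the logged records with a few lines of standard-library code. This is a sketch only: `percentile` uses the simple nearest-rank method, the trimmed-down `ToolMetric` keeps just the fields needed here, and the prices passed to `session_cost` are placeholders, assuming input and output tokens are billed at separate rates.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolMetric:
    tool_name: str
    success: bool
    latency_ms: float


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(metrics: list) -> dict:
    """Group metrics by tool and compute call count, p50/p95 latency and error rate."""
    by_tool = defaultdict(list)
    for m in metrics:
        by_tool[m.tool_name].append(m)
    summary = {}
    for name, items in by_tool.items():
        latencies = [m.latency_ms for m in items]
        failures = sum(1 for m in items if not m.success)
        summary[name] = {
            "calls": len(items),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "error_rate": failures / len(items),
        }
    return summary


def session_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Token cost of a session; prices are per one million tokens (placeholder values)."""
    return (prompt_tokens * input_price_per_1m
            + completion_tokens * output_price_per_1m) / 1_000_000


metrics = [
    ToolMetric("search_products", True, 100.0),
    ToolMetric("search_products", True, 120.0),
    ToolMetric("search_products", True, 110.0),
    ToolMetric("search_products", False, 900.0),
]
report = summarize(metrics)
print(report["search_products"])
print(session_cost(10_000, 2_000, 2.5, 10.0))
```

In this sample the single 900 ms failure leaves p50 untouched but dominates p95, which is why the two percentiles are tracked together.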
Security — Prompt Injection and Input Validation
Users may try to manipulate the model into calling a tool with dangerous arguments. Two layers of defence: argument validation (Pydantic checks types and ranges — the model can't pass a negative `max_results` if the schema has `ge=1`) and authorisation in the handler (check whether the logged-in user has permission — the model doesn't know who's logged in, you do). Never trust tool call arguments without permission checks.
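The second layer, authorisation in the handler, might look like the sketch below. The `RequestContext` class, the in-memory `ORDERS` store and the `call_tool_for_user` helper are hypothetical; the point is that the user identity comes from your session layer and is never taken from the model's arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    user_id: str  # set by your auth/session layer, never by the model


# Hypothetical in-memory order store standing in for a database
ORDERS = {
    "ORD-000001": {"owner": "alice", "status": "shipped"},
    "ORD-000002": {"owner": "bob", "status": "processing"},
}


class PermissionDenied(Exception):
    pass


def get_order_status(ctx: RequestContext, order_id: str) -> dict:
    order = ORDERS.get(order_id)
    # Same error for missing orders and other users' orders,
    # so the response does not reveal which order IDs exist
    if order is None or order["owner"] != ctx.user_id:
        raise PermissionDenied(f"Order {order_id} not found for this user.")
    return {"order_id": order_id, "status": order["status"]}


HANDLERS = {"get_order_status": get_order_status}


def call_tool_for_user(ctx: RequestContext, name: str, args: dict) -> dict:
    """Inject the trusted context; the model supplies only `args`."""
    try:
        return HANDLERS[name](ctx, **args)
    except PermissionDenied as e:
        return {"error": str(e), "code": "FORBIDDEN"}


alice = RequestContext(user_id="alice")
print(call_tool_for_user(alice, "get_order_status", {"order_id": "ORD-000001"}))
print(call_tool_for_user(alice, "get_order_status", {"order_id": "ORD-000002"}))
```

Even if a prompt injection convinces the model to ask for another customer's order, the handler refuses, because the check depends only on data the model cannot influence.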
Tool calling patterns at a glance:

| Pattern | Round trips | When to use | Note |
|---|---|---|---|
| Single tool | +1 | Simple tasks: fetch info, check status | Easiest to debug |
| Parallel tools | +1 | Independent operations at once | asyncio.gather — saves latency |
| Sequential chain | +N | Result of A needed to decide on B | Each link = extra LLM call |
| Forced choice | +1 | Guarantee a specific tool is used | tool_choice with function name |
---
I build AI agents with tool calling for companies — from simple assistants with product catalogue access to complex pipelines with multiple tools, authorisation logic and production monitoring. Get in touch — I start with an analysis of your use cases and tool architecture design.
