AI & Automation 14 min

Tool Calling in Production — Building Reliable AI Agents with External Functions

Tool calling is the mechanism that turns an LLM from a chatbot into an agent capable of action: calling APIs, querying databases, running code. This article covers how to define tools, handle errors, execute calls in parallel and monitor them in production, with full code examples.

Instead of just answering questions, a model with tools can call an API, query a database, run a calculation or check the state of an external system, and only then formulate a response based on real data. This is the mechanism behind the "intelligence" of modern AI assistants: instead of hallucinating missing facts, the model can request data that actually exists.

How exactly does it work? The model generates JSON with a function name and arguments, but does not execute anything. You receive that JSON, execute the function with your own code (with appropriate permissions, validation and logging) and return the result to the model. The model sees the result and formulates its final response. This is what keeps tool calling controllable: the LLM never has direct access to your systems, so every action passes through code you wrote and can restrict.

/// TOOL CALLING MECHANISM

Cycle: User → LLM → Tool → LLM → Response

01 User message: a question that requires external data or an action
02 LLM decision: the model decides whether to answer directly or call a tool
03 Tool call JSON: { name, arguments }; the model proposes, it does not execute
04 Execution: your code runs the function and returns the result
05 LLM synthesis: the model sees the result and formulates the final response

+1: round trip per tool call
~200ms: LLM overhead for the decision
≤10: max iterations in the agent loop
100%: your code controls execution

Key insight: the model doesn't "decide what to execute" just once. In complex agents the loop (LLM → tool call → result → LLM) can repeat several times. Set an iteration limit (typically 10) to avoid infinite loops when the model gets stuck cycling between tools.
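A fixed iteration limit only stops a stuck agent after every round has been paid for. The sketch below shows one possible extra guard. It assumes that an identical call (same tool, same arguments) repeated more than a couple of times means the model is cycling. `LoopGuard`, `max_repeats` and the threshold of 2 are illustrative assumptions, not part of any SDK.

```python
import json
from collections import Counter


class LoopGuard:
    """Tracks agent-loop progress and flags repeated identical tool calls."""

    def __init__(self, max_iterations: int = 10, max_repeats: int = 2):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats  # assumption: 2 identical calls are still fine
        self.iterations = 0
        self._seen = Counter()

    def start_iteration(self) -> bool:
        """Return False once the iteration budget is used up."""
        self.iterations += 1
        return self.iterations <= self.max_iterations

    def allow(self, name: str, arguments_json: str) -> bool:
        """Return False when the same call was seen more than max_repeats times."""
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as one call
        key = name + ":" + json.dumps(json.loads(arguments_json), sort_keys=True)
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats


guard = LoopGuard(max_iterations=10, max_repeats=2)
call = ("get_order_status", '{"order_id": "ORD-123456"}')
print([guard.allow(*call) for _ in range(3)])  # [True, True, False]
```

When `allow` returns False, one reasonable reaction is to skip execution and return an error object as the tool result, so the model still gets a reply for that `tool_call_id`.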

Defining Tools — JSON Schema and Pydantic Validation

The quality of tool descriptions directly affects whether the model invokes them correctly. Too terse a description → the model doesn't know when to use the tool or passes wrong arguments. Too generic → the model uses the tool everywhere, even when it shouldn't. Pydantic eliminates boilerplate when defining JSON Schema, and the same schema classes can validate arguments before they reach the handler.

tools_definition.py
```python
from pydantic import BaseModel, Field
from openai import OpenAI
import json

client = OpenAI()

# ─── Input schemas (Pydantic → JSON Schema automatically) ────────────────────
class SearchProductsInput(BaseModel):
    query: str = Field(description="Product search phrase in the catalogue")
    max_results: int = Field(default=5, ge=1, le=20, description="Max results to return")
    category: str | None = Field(default=None, description="Category filter (optional)")

class GetOrderStatusInput(BaseModel):
    order_id: str = Field(description="Order ID in format ORD-XXXXXX")

# ─── Tool registry (schema + description for the model) ──────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Searches products in the catalogue. Use when the user asks about availability or wants to compare products. Do NOT use for order status checks.",
            "parameters": SearchProductsInput.model_json_schema(),
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Retrieves order status by ID. Use only when the user provides an order number (ORD-XXXXXX).",
            "parameters": GetOrderStatusInput.model_json_schema(),
        },
    },
]

# ─── Tool implementations (your backend code) ────────────────────────────────
def search_products(query: str, max_results: int = 5, category: str | None = None) -> list[dict]:
    # Here: Elasticsearch / Algolia / SQL query
    return [{"id": "P001", "name": f"Product: {query}", "price": 99.99, "stock": 12}]

def get_order_status(order_id: str) -> dict:
    # Here: order database query
    return {"order_id": order_id, "status": "shipped", "delivery": "2026-06-07"}

TOOL_HANDLERS: dict = {"search_products": search_products, "get_order_status": get_order_status}

# ─── Agent loop (max 10 iterations) ──────────────────────────────────────────
def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS, tool_choice="auto"
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            result = TOOL_HANDLERS[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    return "Agent iteration limit reached."
```

Three rules for a good tool description: (1) say when to use AND when NOT to — the model needs contrast to decide well; (2) describe argument formats (e.g. "ORD-XXXXXX") — reduces parsing errors; (3) one tool = one responsibility — don't combine "search and order" into a single function.
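The agent loop in tools_definition.py passes the model's arguments straight to the handler. The sketch below shows how the same Pydantic classes could validate those arguments first and turn failures into an error the model can read. It assumes Pydantic v2. The `SCHEMAS` registry, the `call_validated` helper, the error codes and the exact order ID pattern (six uppercase letters or digits) are illustrative assumptions.

```python
import json
from pydantic import BaseModel, Field, ValidationError


class GetOrderStatusInput(BaseModel):
    # Assumption: "ORD-XXXXXX" means six uppercase letters or digits
    order_id: str = Field(pattern=r"^ORD-[A-Z0-9]{6}$", description="Order ID in format ORD-XXXXXX")


def get_order_status(order_id: str) -> dict:
    # Placeholder for a real order database query
    return {"order_id": order_id, "status": "shipped"}


SCHEMAS = {"get_order_status": GetOrderStatusInput}
HANDLERS = {"get_order_status": get_order_status}


def call_validated(name: str, raw_arguments: str) -> str:
    """Validate the model's JSON arguments against the schema, then call the handler."""
    schema = SCHEMAS.get(name)
    if schema is None:
        return json.dumps({"error": f"Tool '{name}' not found.", "code": "UNKNOWN_TOOL"})
    try:
        # Parses the JSON and checks types, ranges and patterns in one step
        args = schema.model_validate_json(raw_arguments)
    except ValidationError as e:
        details = [f"{'.'.join(str(p) for p in err['loc'])}: {err['msg']}" for err in e.errors()]
        return json.dumps({"error": "Invalid arguments.", "code": "INVALID_ARGUMENTS", "details": details})
    return json.dumps(HANDLERS[name](**args.model_dump()))


print(call_validated("get_order_status", '{"order_id": "ORD-123456"}'))
print(call_validated("get_order_status", '{"order_id": "12"}'))
```

Returning the validation messages as the tool result gives the model a chance to correct the arguments on the next round instead of crashing the loop.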

Parallel Tool Calls — When the Model Needs Several Tools at Once

GPT-4o and Claude can return a list of multiple tool_calls in a single response. If you execute them sequentially (one after another), you waste time: independent queries can run concurrently with asyncio.gather. With 3 tools at 500ms latency each, sequential execution takes about 1.5s and parallel execution about 500ms.
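The latency difference can be reproduced without any API. The following sketch uses `asyncio.sleep` as a stand-in for I/O-bound tools and scales the latency down to 50ms per call so it runs quickly. `fake_tool`, `get_stock` and the helper names are hypothetical.

```python
import asyncio
import time


async def fake_tool(name: str, latency_s: float) -> dict:
    """Stand-in for an I/O-bound tool; the sleep simulates network latency."""
    await asyncio.sleep(latency_s)
    return {"tool": name}


async def run_sequential(calls: list[tuple[str, float]]) -> tuple[list[dict], float]:
    """Await each call before starting the next; total time is the sum of latencies."""
    t0 = time.perf_counter()
    results = [await fake_tool(name, latency) for name, latency in calls]
    return results, time.perf_counter() - t0


async def run_parallel(calls: list[tuple[str, float]]) -> tuple[list[dict], float]:
    """Start all calls at once; total time is roughly the slowest latency."""
    t0 = time.perf_counter()
    results = await asyncio.gather(*(fake_tool(name, latency) for name, latency in calls))
    return list(results), time.perf_counter() - t0


CALLS = [("search_products", 0.05), ("get_order_status", 0.05), ("get_stock", 0.05)]

if __name__ == "__main__":
    _, seq = asyncio.run(run_sequential(CALLS))
    _, par = asyncio.run(run_parallel(CALLS))
    print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")
```

`asyncio.gather` returns results in the order the calls were passed in, so each result can still be matched to its `tool_call_id`.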

/// TOOL CALL PATTERNS

3 main tool calling patterns

Single tool
Query → Tool → Result
Simple queries, one operation
Round trips: +1
Latency: ~tool latency

Parallel tools
Query → [Tool A || Tool B] → Results
Independent operations, gathered together
Round trips: +1
Latency: ~max(latencies)

Sequential chain
Query → Tool A → Tool B → Result
The result of A is needed to call B
Round trips: +N
Latency: ~sum(latencies)

parallel_tools.py
```python
import asyncio
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def execute_tool_async(tool_call, handlers: dict) -> dict:
    handler = handlers[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    # If the handler is async (httpx, asyncpg): await it; otherwise use run_in_executor
    if asyncio.iscoroutinefunction(handler):
        result = await handler(**args)
    else:
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, lambda: handler(**args))
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}

async def run_agent_parallel(user_message: str, tools: list, handlers: dict) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):
        resp = await client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools, tool_choice="auto"
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        # Run all tool calls from this response concurrently
        tool_results = await asyncio.gather(
            *[execute_tool_async(tc, handlers) for tc in msg.tool_calls]
        )
        messages.extend(tool_results)
    return "Iteration limit reached."
```

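When the model decides to run in parallel: the model itself detects independent queries. You can force it to use some tool with `tool_choice="required"`, or force one specific tool by passing its function name in `tool_choice`, but you can't force it to call a specific pair. In the OpenAI API you can also switch parallel calls off with `parallel_tool_calls=False`.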

Error Handling — What Happens When a Tool Fails

Tool errors fall into three types: retryable (temporary network issue, API timeout — retry with backoff), permanent (invalid arguments, missing permissions — don't retry, return a clear error for the model) and unexpected (code bug — log it, return a generic error). Key: always return something to the model via the `tool` role — never break the loop without a result.

tool_error_handler.py
```python
import json
import time
import logging

logger = logging.getLogger(__name__)

class ToolError(Exception):
    def __init__(self, message: str, retryable: bool = False):
        self.retryable = retryable
        super().__init__(message)

def execute_with_retry(tool_call, handlers: dict, max_retries: int = 2) -> str:
    name = tool_call.function.name
    handler = handlers.get(name)
    if handler is None:
        return json.dumps({"error": f"Tool '{name}' not found.", "code": "UNKNOWN_TOOL"})
    # Malformed arguments are a permanent error: report them instead of raising
    try:
        args = json.loads(tool_call.function.arguments)
    except json.JSONDecodeError as e:
        return json.dumps({"error": f"Invalid JSON arguments: {e}", "code": "INVALID_ARGUMENTS"})
    for attempt in range(max_retries + 1):
        try:
            result = handler(**args)
            return json.dumps(result)
        except ToolError as e:
            # Permanent error or retries exhausted: give the model a clear message
            if not e.retryable or attempt == max_retries:
                logger.error("Tool %s failed: %s", name, e)
                return json.dumps({"error": str(e), "code": "TOOL_ERROR"})
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
        except Exception:
            # Unexpected bug: log the traceback, hide the details from the model
            logger.exception("Unexpected error in tool %s", name)
            return json.dumps({"error": "Internal tool error.", "code": "INTERNAL_ERROR"})
    return json.dumps({"error": "Max retries exceeded.", "code": "MAX_RETRIES"})

def run_agent_safe(messages: list, tools: list, handlers: dict, llm) -> str:
    for _ in range(10):
        resp = llm.chat.completions.create(model="gpt-4o", messages=messages, tools=tools, tool_choice="auto")
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            # Every tool call gets a result message, even when it failed
            result = execute_with_retry(tc, handlers)
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return "Agent iteration limit reached."
```

Important: when a tool returns JSON with an "error" field, the model will usually decide what to do: try with different arguments, ask the user for clarification or admit it can't complete the task. Trust it, give it a clear message, and don't force behaviour in the loop code.
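The same rule (always return a `tool` message) matters even more when calls run concurrently, because one exception raised inside `asyncio.gather` would otherwise abort the whole batch. A minimal sketch follows, assuming async handlers and a per-call timeout. `safe_tool_message`, `run_all`, the demo handlers and the error codes are illustrative names.

```python
import asyncio
import json


async def safe_tool_message(tool_call_id: str, name: str, arguments_json: str,
                            handlers: dict, timeout_s: float = 5.0) -> dict:
    """Run one async tool and always return a `tool` message, even on failure."""
    handler = handlers.get(name)
    if handler is None:
        content = {"error": f"Tool '{name}' not found.", "code": "UNKNOWN_TOOL"}
    else:
        try:
            args = json.loads(arguments_json)
            content = await asyncio.wait_for(handler(**args), timeout=timeout_s)
        except asyncio.TimeoutError:
            content = {"error": f"Tool '{name}' timed out.", "code": "TIMEOUT"}
        except json.JSONDecodeError:
            content = {"error": "Arguments are not valid JSON.", "code": "INVALID_ARGUMENTS"}
        except Exception:
            content = {"error": "Internal tool error.", "code": "INTERNAL_ERROR"}
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(content)}


async def run_all(calls: list[tuple[str, str, str]], handlers: dict, timeout_s: float) -> list[dict]:
    """Execute (tool_call_id, name, arguments_json) triples concurrently."""
    return await asyncio.gather(
        *(safe_tool_message(cid, name, args, handlers, timeout_s) for cid, name, args in calls)
    )


# Hypothetical handlers for the demo: one healthy, one slow, one buggy
async def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

async def slow_tool() -> dict:
    await asyncio.sleep(0.2)
    return {"done": True}

async def broken_tool() -> dict:
    raise RuntimeError("bug in handler")

DEMO_HANDLERS = {"get_order_status": get_order_status, "slow_tool": slow_tool, "broken_tool": broken_tool}
DEMO_CALLS = [
    ("call_1", "get_order_status", '{"order_id": "ORD-123456"}'),
    ("call_2", "slow_tool", "{}"),
    ("call_3", "broken_tool", "{}"),
]

if __name__ == "__main__":
    for message in asyncio.run(run_all(DEMO_CALLS, DEMO_HANDLERS, timeout_s=0.05)):
        print(message["tool_call_id"], message["content"])
```

Synchronous handlers would still need `run_in_executor`, as in parallel_tools.py. This sketch leaves that out to keep the error paths visible.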

Monitoring Tool Calls in Production

Without monitoring you don't know which tools are slowest, which fail most often and what each agent session costs in tokens. Log every call with: tool name, latency, argument and result sizes, success/failure. That's enough to identify bottlenecks and set up alerts.

tool_monitoring.py
```python
import time
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ToolMetric:
    tool_name: str
    success: bool
    latency_ms: float
    args_bytes: int
    result_bytes: int
    error: str | None = None

def monitored_call(tool_call, handlers: dict) -> tuple[str, ToolMetric]:
    name = tool_call.function.name
    args_json = tool_call.function.arguments
    t0 = time.perf_counter()
    try:
        result = handlers[name](**json.loads(args_json))
        result_json = json.dumps(result)
        ms = (time.perf_counter() - t0) * 1000
        metric = ToolMetric(name, True, round(ms, 1), len(args_json.encode()), len(result_json.encode()))
        logger.info("tool=%s latency_ms=%.1f ok", name, ms)
        return result_json, metric
    except Exception as e:
        # Failures are measured too, and the error is returned as the tool result
        err_json = json.dumps({"error": str(e)})
        ms = (time.perf_counter() - t0) * 1000
        metric = ToolMetric(name, False, round(ms, 1), len(args_json.encode()), len(err_json.encode()), str(e))
        logger.error("tool=%s latency_ms=%.1f error=%s", name, ms, e)
        return err_json, metric
```

Key metrics to monitor: latency p50/p95 per tool (outliers indicate dependency issues), error rate per tool (>5% = problem with arguments or the dependency), average tool calls per session (too many = model is looping), token cost per session = prompt tokens × input price + completion tokens × output price (providers typically price the two differently).
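These aggregates can be computed from the logged calls with the standard library alone. A minimal sketch follows, assuming each record is a dict with the `tool_name`, `latency_ms` and `success` fields of `ToolMetric` (for example via `dataclasses.asdict`). The nearest-rank percentile method, the helper names and the prices in the usage lines are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(metrics: list[dict]) -> dict:
    """Aggregate call count, latency percentiles and error rate per tool."""
    by_tool = defaultdict(list)
    for m in metrics:
        by_tool[m["tool_name"]].append(m)
    summary = {}
    for name, rows in by_tool.items():
        latencies = [r["latency_ms"] for r in rows]
        failures = sum(1 for r in rows if not r["success"])
        summary[name] = {
            "calls": len(rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "error_rate": failures / len(rows),
        }
    return summary


def session_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Token cost of one session; prices are per one million tokens."""
    return (prompt_tokens * input_price_per_1m + completion_tokens * output_price_per_1m) / 1_000_000


# Made-up sample data: five calls, one slow failure
SAMPLE = [
    {"tool_name": "search_products", "latency_ms": ms, "success": ms < 500}
    for ms in (100.0, 120.0, 110.0, 900.0, 105.0)
]
print(summarize(SAMPLE)["search_products"])
# Hypothetical prices, not any provider's real rates
print(session_cost(1200, 300, input_price_per_1m=2.0, output_price_per_1m=8.0))
```

With only a handful of samples the p95 is simply the slowest call, so alert thresholds are more meaningful once each tool has a larger window of calls behind it.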

Security — Prompt Injection and Input Validation

Users may try to manipulate the model into calling a tool with dangerous arguments. Two layers of defence: argument validation (Pydantic checks types and ranges — the model can't pass a negative `max_results` if the schema has `ge=1`) and authorisation in the handler (check whether the logged-in user has permission — the model doesn't know who's logged in, you do). Never trust tool call arguments without permission checks.
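The second layer, authorisation in the handler, can be sketched as follows. The identity comes from your session, never from the model's arguments. The in-memory `ORDERS` store, `handlers_for`, `dispatch` and the error code are hypothetical names used only for illustration.

```python
import json
from functools import partial

# Hypothetical in-memory order store: order_id -> owner and status
ORDERS = {
    "ORD-000001": {"owner": "alice", "status": "shipped"},
    "ORD-000002": {"owner": "bob", "status": "processing"},
}


def get_order_status(user_id: str, order_id: str) -> dict:
    """Return the order status only if the order belongs to the authenticated user."""
    order = ORDERS.get(order_id)
    # Same reply for "missing" and "not yours", so the tool does not reveal which IDs exist
    if order is None or order["owner"] != user_id:
        return {"error": "Order not found.", "code": "NOT_FOUND"}
    return {"order_id": order_id, "status": order["status"]}


def handlers_for(user_id: str) -> dict:
    """Bind the authenticated user into each handler; the model supplies only order_id."""
    return {"get_order_status": partial(get_order_status, user_id)}


def dispatch(handlers: dict, name: str, arguments_json: str) -> str:
    """Execute a tool call with session-bound handlers."""
    args = json.loads(arguments_json)
    # Drop any user_id the model tries to pass; identity comes from the session only
    args.pop("user_id", None)
    return json.dumps(handlers[name](**args))


alice_handlers = handlers_for("alice")
print(dispatch(alice_handlers, "get_order_status", '{"order_id": "ORD-000001"}'))
print(dispatch(alice_handlers, "get_order_status", '{"order_id": "ORD-000002"}'))
```

Building the handler table per session means a prompt-injected request for someone else's order fails in your code, regardless of what the model was persuaded to ask for.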

| Pattern | Round trips | When to use | Note |
|---|---|---|---|
| Single tool | +1 | Simple tasks: fetch info, check status | Easiest to debug |
| Parallel tools | +1 | Independent operations at once | asyncio.gather saves latency |
| Sequential chain | +N | Result of A needed to decide on B | Each link = extra LLM call |
| Forced choice | +1 | Guarantee a specific tool is used | tool_choice with function name |

---

I build AI agents with tool calling for companies — from simple assistants with product catalogue access to complex pipelines with multiple tools, authorisation logic and production monitoring. Get in touch — I start with an analysis of your use cases and tool architecture design.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

Senior Full-Stack Engineer & AI Architect

8+ years building AI systems, automations, and scalable web applications that reduce costs and improve operational efficiency.
