Can the model call a tool with wrong arguments?

Yes, especially when the schema is poorly described or the tool has many optional parameters. Three ways to reduce this: (1) a precise `description` with a format example (e.g. "ORD-XXXXXX"), (2) Pydantic validation with `ge`/`le`/`pattern` constraints, (3) `strict: true` in the function schema, which constrains the model's output to the schema: it must provide every required field and can't add unknown ones. In OpenAI's strict mode the schema itself must list every property in `required` and set `additionalProperties: false`.
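As a rough illustration of option (3), the sketch below tightens a hand-written JSON Schema the way strict mode expects: every property listed in `required` and `additionalProperties` set to `false`. The helper name `make_strict` and the example order ID are assumptions made for illustration, not part of any SDK, and the helper only handles a flat object schema.

```python
import copy


def make_strict(parameters: dict) -> dict:
    """Return a copy of a flat object schema with strict-mode constraints applied.

    Hypothetical helper for illustration; nested objects are not handled.
    """
    strict = copy.deepcopy(parameters)
    strict["additionalProperties"] = False
    # Strict mode expects every declared property to be listed as required.
    strict["required"] = list(strict.get("properties", {}))
    return strict


order_status_parameters = {
    "type": "object",
    "properties": {
        "order_id": {
            "type": "string",
            "description": "Order ID in format ORD-XXXXXX, e.g. ORD-004217",
        }
    },
}

tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Retrieves order status by ID.",
        "strict": True,
        "parameters": make_strict(order_status_parameters),
    },
}
```

The input schema is left untouched, so the same dictionary can still be used for non-strict calls.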

What are the differences between OpenAI function calling and Anthropic tool use?

The APIs are similar but differ in details. OpenAI returns `tool_calls` on the assistant message, and you send each result back as a message with `role: "tool"`. Anthropic returns `tool_use` content blocks, and you send results back as `tool_result` content blocks inside a user message. The loop logic is identical. Behaviour varies by model and version, but Claude (Anthropic) tends to be more cautious and is less likely to call a tool "just in case", while GPT-4o tends to reach for tools more readily, which you can moderate through the `description` field.
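To make the difference concrete, here is a small sketch of how the same tool result is wrapped for each API. The function names and the IDs (`call_abc`, `toolu_abc`) are made up for illustration; only the message shapes matter.

```python
import json


def openai_tool_result(tool_call_id: str, result: dict) -> dict:
    """Chat Completions: one message with role "tool" per tool call."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}


def anthropic_tool_result(tool_use_id: str, result: dict) -> dict:
    """Messages API: a user message carrying a tool_result content block."""
    return {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": tool_use_id, "content": json.dumps(result)}
        ],
    }


result = {"order_id": "ORD-123456", "status": "shipped"}
print(openai_tool_result("call_abc", result))
print(anthropic_tool_result("toolu_abc", result))
```

Keeping these two builders behind one interface is one way to share the rest of the agent loop between providers.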

How do I set a timeout for a tool call?

In the handler, not in the agent loop. Example with httpx: `async with httpx.AsyncClient(timeout=5.0) as client: ...`. If the call exceeds the timeout, catch `httpx.TimeoutException`, raise `ToolError("timeout", retryable=True)` and let execute_with_retry handle it. If the retries fail as well, return the timeout error to the model, which can then decide to inform the user or try again.
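A minimal sketch of a handler-level timeout. To stay runnable without a network it wraps a simulated slow call in `asyncio.wait_for` instead of using httpx; `slow_lookup` and the 0.05 s budget are stand-ins, and `ToolError` mirrors the class defined in the error handling section.

```python
import asyncio


class ToolError(Exception):
    def __init__(self, message: str, retryable: bool = False):
        self.retryable = retryable
        super().__init__(message)


async def slow_lookup(order_id: str) -> dict:
    # Stand-in for a slow network call.
    await asyncio.sleep(1.0)
    return {"order_id": order_id, "status": "shipped"}


async def get_order_status(order_id: str, timeout_s: float = 0.05) -> dict:
    # The timeout lives inside the handler, so the agent loop stays generic.
    try:
        return await asyncio.wait_for(slow_lookup(order_id), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise ToolError(
            f"get_order_status timed out after {timeout_s}s", retryable=True
        ) from None


def demo() -> str:
    try:
        asyncio.run(get_order_status("ORD-123456"))
    except ToolError as e:
        return f"{e} (retryable={e.retryable})"
    return "ok"


print(demo())
```

Marking the error as retryable leaves the retry decision to the caller rather than hiding it in the handler.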

How many tools can an agent have?

Technically unlimited, but practically: above 15–20 tools the model's decision quality drops (too large a schema in context). Better approach for large toolsets: dynamic tool selection (embeddings + cosine similarity against the user's question → Top-5 relevant tools) or specialist sub-agents with smaller toolsets.
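The dynamic selection idea can be sketched as follows. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and `get_weather` is a hypothetical extra tool added only to give the ranking something to reject.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_tools(question: str, tools: dict, top_k: int = 5) -> list:
    """Rank tool names by similarity between the question and each description."""
    q = embed(question)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:top_k]


TOOL_DESCRIPTIONS = {
    "search_products": "Searches products in the catalogue by phrase and category",
    "get_order_status": "Retrieves the status of an order by order ID",
    "get_weather": "Returns the weather forecast for a city",
}

print(select_tools("What is the status of my order ORD-123456?", TOOL_DESCRIPTIONS, top_k=1))
```

Only the selected tools' schemas are then passed in the `tools` parameter for that request, which keeps the context small.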

How do I test agents with tools?

Three levels: (1) unit tests for the handler — test the function independently from the LLM (fast, deterministic), (2) mock tool calls — replace API responses with pre-defined tool_call lists and verify the agent loop handles them correctly, (3) end-to-end tests with the real model and a set of "golden" scenarios (user message → expected final answer) — run these infrequently as a regression suite.
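Level (2) might look like the sketch below: a scripted fake model replays pre-defined messages, and the test checks that the loop dispatches the tool call and returns the final answer. The loop is a simplified version of the one in this article that takes the model as a callable; `scripted_llm` and `tool_call` are hypothetical test helpers.

```python
import json
from types import SimpleNamespace


def run_agent(llm, handlers: dict, user_message: str, max_iters: int = 10) -> str:
    # Minimal loop; `llm` is any callable that takes messages and returns a message object.
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iters):
        msg = llm(messages)
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            result = handlers[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    return "Agent iteration limit reached."


def scripted_llm(script: list):
    """Fake model: returns the pre-defined messages in order, one per call."""
    replies = iter(script)
    return lambda messages: next(replies)


def tool_call(call_id: str, name: str, **args):
    return SimpleNamespace(id=call_id, function=SimpleNamespace(name=name, arguments=json.dumps(args)))


calls = []


def fake_get_order_status(order_id: str) -> dict:
    calls.append(order_id)
    return {"order_id": order_id, "status": "shipped"}


script = [
    SimpleNamespace(content=None, tool_calls=[tool_call("call_1", "get_order_status", order_id="ORD-123456")]),
    SimpleNamespace(content="Your order has shipped.", tool_calls=None),
]
answer = run_agent(scripted_llm(script), {"get_order_status": fake_get_order_status}, "Where is ORD-123456?")
assert answer == "Your order has shipped."
assert calls == ["ORD-123456"]
```

Because no network or model is involved, tests like this are fast and deterministic enough to run on every commit.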


2026-06-04 · AI & Automation · 14 min

Tool Calling in Production — Building Reliable AI Agents with External Functions

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

Tool calling transforms LLMs from chatbots into agents capable of action. Instead of just answering questions, the model can call an API, query a database, run a calculation or check the state of an external system — and only then formulate a response based on real data. This is the mechanism behind the "intelligence" of modern AI assistants: instead of hallucinating missing facts, the model requests data that actually exists.

This article covers how to define tools, handle errors, execute calls in parallel and monitor them in production, with full code examples.

How exactly does it work? The model generates JSON with a function name and arguments, but does not execute it. You receive that JSON, execute the function with your own code (with appropriate permissions, validation, logging) and return the result to the model. The model sees the result and formulates its final response. This is what keeps tool calling controllable: the LLM never has direct access to your systems, and every call passes through code you own.

/// TOOL CALLING MECHANISM

Cycle: User → LLM → Tool → LLM → Response

1. User message: a question requiring external data or an action
2. LLM decision: the model decides whether to answer directly or call a tool
3. Tool call JSON: { name, arguments }; the model proposes, does not execute
4. Execution: your code runs the function and returns the result
5. LLM synthesis: the model sees the result and formulates the final answer

round trip per tool call

~200ms: LLM overhead per decision
≤10: max iterations in the agent loop
100%: your code controls execution

Key insight: the model doesn't "decide what to execute" just once. In complex agents the loop (LLM → tool call → result → LLM) can repeat several times. Set an iteration limit (typically 10) to avoid infinite loops when the model gets stuck cycling between tools.

Defining Tools — JSON Schema and Pydantic Validation

The quality of tool descriptions directly affects whether the model invokes them correctly. Too terse a description → the model doesn't know when to use the tool or passes wrong arguments. Too generic → the model uses the tool everywhere, even when it shouldn't. Pydantic eliminates boilerplate when defining JSON Schema and validates arguments before passing them to the handler.

tools_definition.py

from pydantic import BaseModel, Field
from openai import OpenAI
import json

client = OpenAI()

# ─── Input schemas (Pydantic → JSON Schema automatically) ────────────────────
class SearchProductsInput(BaseModel):
    query: str = Field(description="Product search phrase in the catalogue")
    max_results: int = Field(default=5, ge=1, le=20, description="Max results to return")
    category: str | None = Field(default=None, description="Category filter (optional)")

class GetOrderStatusInput(BaseModel):
    order_id: str = Field(description="Order ID in format ORD-XXXXXX")

# ─── Tool registry (schema + description for the model) ──────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Searches products in the catalogue. Use when the user asks about availability or wants to compare products. Do NOT use for order status checks.",
            "parameters": SearchProductsInput.model_json_schema(),
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Retrieves order status by ID. Use only when the user provides an order number (ORD-XXXXXX).",
            "parameters": GetOrderStatusInput.model_json_schema(),
        },
    },
]

# ─── Tool implementations (your backend code) ────────────────────────────────
def search_products(query: str, max_results: int = 5, category: str | None = None) -> list[dict]:
    # Here: Elasticsearch / Algolia / SQL query
    return [{"id": "P001", "name": f"Product: {query}", "price": 99.99, "stock": 12}]

def get_order_status(order_id: str) -> dict:
    # Here: order database query
    return {"order_id": order_id, "status": "shipped", "delivery": "2026-06-07"}

TOOL_HANDLERS: dict = {"search_products": search_products, "get_order_status": get_order_status}

# ─── Agent loop (max 10 iterations) ──────────────────────────────────────────
def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS, tool_choice="auto"
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            result = TOOL_HANDLERS[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    return "Agent iteration limit reached."

Three rules for a good tool description: (1) say when to use AND when NOT to — the model needs contrast to decide well; (2) describe argument formats (e.g. "ORD-XXXXXX") — reduces parsing errors; (3) one tool = one responsibility — don't combine "search and order" into a single function.

Parallel Tool Calls — When the Model Needs Several Tools at Once

GPT-4o and Claude can return a list of multiple tool_calls in a single response. If you execute them sequentially (one after another), you waste time unnecessarily — independent queries can be executed in parallel with asyncio.gather. The difference with 3 tools at 500ms latency each: sequential = 1.5s, parallel = ~500ms.
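The latency arithmetic can be checked with a small experiment. The sketch below uses `asyncio.sleep` as a stand-in for three I/O-bound tools and scales the 500 ms latency down to 50 ms so it runs quickly; `fake_tool` is a hypothetical placeholder.

```python
import asyncio
import time


async def fake_tool(latency_s: float) -> float:
    # Stand-in for an I/O-bound tool call.
    await asyncio.sleep(latency_s)
    return latency_s


async def sequential(latencies: list) -> float:
    t0 = time.perf_counter()
    for lat in latencies:
        await fake_tool(lat)
    return time.perf_counter() - t0


async def parallel(latencies: list) -> float:
    t0 = time.perf_counter()
    await asyncio.gather(*(fake_tool(lat) for lat in latencies))
    return time.perf_counter() - t0


latencies = [0.05, 0.05, 0.05]  # scaled down from 500 ms each
seq_s = asyncio.run(sequential(latencies))
par_s = asyncio.run(parallel(latencies))
print(f"sequential={seq_s:.2f}s parallel={par_s:.2f}s")
```

Sequential time grows with the sum of the latencies, while the gathered version is bounded by the slowest call.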

/// TOOL CALLING PATTERNS

3 main tool calling patterns

Single tool (→): Query → Tool → Result. Simple queries, one operation. Round trips: +1. Latency: ~tool latency.

Parallel tools (⇉): Query → [Tool A || Tool B] → Results. Independent operations, gathered together. Round trips: +1. Latency: ~max(latencies).

Sequential chain (↻): Query → Tool A → Tool B → Result. Result of A needed to call B. Round trips: +N. Latency: ~sum(latencies).

parallel_tools.py

import asyncio
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def execute_tool_async(tool_call, handlers: dict) -> dict:
    handler = handlers[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    # If handler is async (httpx, asyncpg): await; otherwise run_in_executor
    if asyncio.iscoroutinefunction(handler):
        result = await handler(**args)
    else:
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, lambda: handler(**args))
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}

async def run_agent_parallel(user_message: str, tools: list, handlers: dict) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):
        resp = await client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools, tool_choice="auto"
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        tool_results = await asyncio.gather(
            *[execute_tool_async(tc, handlers) for tc in msg.tool_calls]
        )
        messages.extend(tool_results)
    return "Iteration limit reached."

When the model decides to run in parallel: the model itself detects independent queries. You can force tool use with `tool_choice="required"`, or force one specific function by naming it in `tool_choice`, but you can't force it to call a specific pair. You can also switch parallel calls off with `parallel_tool_calls=False`.
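For reference, the values that `tool_choice` accepts in the Chat Completions API can be sketched like this. The helper `tool_choice_for` is a hypothetical convenience wrapper; the request dictionary only shows the relevant keyword arguments.

```python
from typing import Optional


def tool_choice_for(mode: str, tool_name: Optional[str] = None):
    """Build the `tool_choice` value for a Chat Completions request."""
    if mode in ("auto", "required", "none"):
        return mode
    if mode == "function" and tool_name:
        # Forces the model to call exactly this function.
        return {"type": "function", "function": {"name": tool_name}}
    raise ValueError(f"unsupported mode: {mode!r}")


request_kwargs = {
    "model": "gpt-4o",
    "tool_choice": tool_choice_for("function", "get_order_status"),
    "parallel_tool_calls": False,  # ask for at most one tool call per response
}
print(request_kwargs)
```

Forcing a named function suits pipelines where the next step is already known; `"auto"` remains the sensible default for open-ended conversations.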

Error Handling — What Happens When a Tool Fails

Tool errors fall into three types: retryable (temporary network issue, API timeout — retry with backoff), permanent (invalid arguments, missing permissions — don't retry, return a clear error for the model) and unexpected (code bug — log it, return a generic error). Key: always return something to the model via the `tool` role — never break the loop without a result.

tool_error_handler.py

import json
import time
import logging

logger = logging.getLogger(__name__)

class ToolError(Exception):
    def __init__(self, message: str, retryable: bool = False):
        self.retryable = retryable
        super().__init__(message)

def execute_with_retry(tool_call, handlers: dict, max_retries: int = 2) -> str:
    name = tool_call.function.name
    handler = handlers.get(name)
    if handler is None:
        return json.dumps({"error": f"Tool '{name}' not found.", "code": "UNKNOWN_TOOL"})
    # The model can emit malformed JSON; report it instead of crashing the loop
    try:
        args = json.loads(tool_call.function.arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "Arguments are not valid JSON.", "code": "INVALID_ARGUMENTS"})
    for attempt in range(max_retries + 1):
        try:
            result = handler(**args)
            return json.dumps(result)
        except ToolError as e:
            if not e.retryable or attempt == max_retries:
                logger.error("Tool %s failed: %s", name, e)
                return json.dumps({"error": str(e), "code": "TOOL_ERROR"})
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
        except Exception:
            logger.exception("Unexpected error in tool %s", name)
            return json.dumps({"error": "Internal tool error.", "code": "INTERNAL_ERROR"})
    return json.dumps({"error": "Max retries exceeded.", "code": "MAX_RETRIES"})

def run_agent_safe(messages: list, tools: list, handlers: dict, llm) -> str:
    for _ in range(10):
        resp = llm.chat.completions.create(model="gpt-4o", messages=messages, tools=tools, tool_choice="auto")
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            # Every tool call gets a result message, even when it failed
            result = execute_with_retry(tc, handlers)
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return "Agent iteration limit reached."

Important: when a tool returns JSON with an "error" field, the model will usually decide what to do: try with different arguments, ask the user for clarification or admit it can't complete the task. Trust it, give it a clear message, and don't force behaviour in the loop code.

Monitoring Tool Calls in Production

Without monitoring you don't know which tools are slowest, which fail most often and what each agent session costs in tokens. Log every call with: tool name, latency, argument and result sizes, success/failure. That's enough to identify bottlenecks and set up alerts.

tool_monitoring.py

import time
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ToolMetric:
    tool_name: str
    success: bool
    latency_ms: float
    args_bytes: int
    result_bytes: int
    error: str | None = None

def monitored_call(tool_call, handlers: dict) -> tuple[str, ToolMetric]:
    name = tool_call.function.name
    args_json = tool_call.function.arguments
    t0 = time.perf_counter()
    try:
        result = handlers[name](**json.loads(args_json))
        result_json = json.dumps(result)
        ms = (time.perf_counter() - t0) * 1000
        metric = ToolMetric(name, True, round(ms, 1), len(args_json.encode()), len(result_json.encode()))
        logger.info("tool=%s latency_ms=%.1f ok", name, ms)
        return result_json, metric
    except Exception as e:
        err_json = json.dumps({"error": str(e)})
        ms = (time.perf_counter() - t0) * 1000
        metric = ToolMetric(name, False, round(ms, 1), len(args_json.encode()), len(err_json.encode()), str(e))
        logger.error("tool=%s latency_ms=%.1f error=%s", name, ms, e)
        return err_json, metric

Key metrics to monitor: latency p50/p95 per tool (outliers indicate dependency issues), error rate per tool (>5% = problem with arguments or the dependency), average tool calls per session (too many = model is looping), token cost per session = prompt tokens × input price + completion tokens × output price (providers usually price the two differently).
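One way to turn logged calls into these numbers is sketched below. It uses a trimmed-down `ToolMetric` and a nearest-rank percentile; the sample latencies are invented for the demonstration and the per-token prices are parameters you would take from your provider's price list.

```python
import math
from dataclasses import dataclass


@dataclass
class ToolMetric:
    tool_name: str
    success: bool
    latency_ms: float


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile; p is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(metrics: list) -> dict:
    latencies = [m.latency_ms for m in metrics]
    failures = sum(1 for m in metrics if not m.success)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": failures / len(metrics),
    }


def session_cost(prompt_tokens: int, completion_tokens: int,
                 input_price: float, output_price: float) -> float:
    # Prices are per token; input and output are usually billed at different rates.
    return prompt_tokens * input_price + completion_tokens * output_price


metrics = [
    ToolMetric("search_products", True, 10.0),
    ToolMetric("search_products", True, 20.0),
    ToolMetric("search_products", False, 30.0),
    ToolMetric("search_products", True, 40.0),
]
print(summarize(metrics))
```

In production you would group the metrics by `tool_name` first and feed each group to `summarize`, so a slow tool is not hidden by fast ones.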

Security — Prompt Injection and Input Validation

Users may try to manipulate the model into calling a tool with dangerous arguments. Two layers of defence: argument validation (Pydantic checks types and ranges — the model can't pass a negative `max_results` if the schema has `ge=1`) and authorisation in the handler (check whether the logged-in user has permission — the model doesn't know who's logged in, you do). Never trust tool call arguments without permission checks.
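The two layers can be sketched in a single handler. The in-memory `ORDERS` store and the `Session` class are hypothetical stand-ins for your database and auth layer, and the regular expression assumes the X's in ORD-XXXXXX are digits.

```python
import re
from dataclasses import dataclass

ORDER_ID = re.compile(r"ORD-\d{6}")  # assumption: six digits after the prefix

# Hypothetical in-memory store: order ID -> owning user ID.
ORDERS = {"ORD-123456": "user-1", "ORD-654321": "user-2"}


@dataclass
class Session:
    user_id: str  # set by your auth layer, never by the model


def get_order_status(order_id: str, session: Session) -> dict:
    # Layer 1: validate the argument format the model produced.
    if not ORDER_ID.fullmatch(order_id):
        return {"error": "order_id must match ORD-XXXXXX", "code": "INVALID_ARGUMENT"}
    # Layer 2: authorise against the logged-in user, not against anything in the prompt.
    # Unknown and foreign orders get the same answer so IDs cannot be probed.
    if ORDERS.get(order_id) != session.user_id:
        return {"error": "Order not found for this account.", "code": "NOT_FOUND"}
    return {"order_id": order_id, "status": "shipped"}


me = Session(user_id="user-1")
print(get_order_status("ORD-123456", me))
print(get_order_status("ORD-654321", me))
```

The session is injected by the agent loop when it dispatches the call; it is never part of the schema the model sees, so the model cannot supply or override it.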

Pattern	Round trips	When to use	Note
Single tool	+1	Simple tasks: fetch info, check status	Easiest to debug
Parallel tools	+1	Independent operations at once	asyncio.gather — saves latency
Sequential chain	+N	Result of A needed to decide on B	Each link = extra LLM call
Forced choice	+1	Guarantee a specific tool is used	tool_choice with function name

---

I build AI agents with tool calling for companies — from simple assistants with product catalogue access to complex pipelines with multiple tools, authorisation logic and production monitoring. Get in touch — I start with an analysis of your use cases and tool architecture design.


