A small, honest Python SDK for AI agents on local models. Streaming. Multi-turn by default. Markdown skills and memory. Auto-tuned per model. Single dependency. Actually faster than writing your own loop (HTTP connection reuse).
Not magic. The "framework catches what models can't" thesis is disproven for modern models — 0/40 adversarial cases triggered a real guardrail rescue. What it IS: the plumbing you'd write yourself, but done once, tested, and measured.
Token-by-token streaming with semantic events: TokenEvent, ToolCallEvent, ToolResultEvent, RunCompleteEvent. Works with tool-using agents, not just chat.
Queries Ollama /api/show on init. Small models get stripped defaults. Context window from the model, not a guess. Opt-out with auto_tune=False.
Multi-turn just works. Pluggable strategies: SlidingWindow, TokenWindow, UnlimitedHistory. Optional session persistence across process restarts.
Markdown files in .freeagent/memory/. Single memory tool (read/write/append/search/list). Auto-load files in system prompt. Daily logs.
Markdown SKILL.md with frontmatter. Bundled defaults. User skills extend or override. +25% accuracy on qwen3:4b (measured).
agent.trace() shows every model call, tool call, retry, and validation event with relative timestamps. The debugger you always wanted.
freeagent ask qwen3:8b "hello" — one-shot with live streaming. freeagent chat — REPL. freeagent models — list. No extra deps (stdlib argparse).
Ollama, vLLM, OpenAI-compat (LM Studio, LocalAI, TGI). Provider protocol means any new backend is one file.
Connect to MCP servers via stdio or HTTP. Tool schemas auto-converted. Description truncation for small models. pip install freeagent-sdk[mcp]
```python
from freeagent import Agent

agent = Agent(model="llama3.1:8b")
print(agent.run("What is Python?"))
```
```python
from freeagent import Agent, tool
from freeagent.tools import system_info, calculator

@tool
def weather(city: str) -> dict:
    """Get weather for a city.

    city: The city name
    """
    return {"temp": 72, "condition": "sunny"}

agent = Agent(
    model="qwen3:8b",
    tools=[weather, system_info, calculator],
)
print(agent.run("Weather in Portland and disk space?"))
```
Memory is on by default. The agent gets a memory tool. Conversation is multi-turn by default. Telemetry is captured automatically.
pip install freeagent-sdk gives you a freeagent command.
```
$ freeagent ask qwen3:8b "What's the capital of France?"
Paris is the capital of France.
```

Streams tokens as they arrive. Exits when done.
```
$ freeagent chat qwen3:8b
freeagent > What is Python?
Python is a high-level programming language...
freeagent > When was it released?
Python was first released in 1991...
freeagent > /trace
Trace for run 2 (qwen3:8b, native):
  + 0ms run_start "When was it released?"
  ...
freeagent > /exit
```

Slash commands: /clear resets the conversation, /trace shows the last run, /exit quits.
```
$ freeagent models
Name              Size   Modified
gemma4:e2b        7.2GB  2026-04-06
qwen3:8b          5.2GB  2025-12-09
llama3.1:latest   4.9GB  2025-12-12
```
Print the installed FreeAgent version.
Show the trace of the last recorded run.
Token-level streaming with semantic events. Works for both chat and tool-using agents.
```python
from freeagent import Agent, TokenEvent, ToolCallEvent, ToolResultEvent

agent = Agent(model="qwen3:8b", tools=[weather])

for event in agent.run_stream("Weather in Tokyo?"):
    if isinstance(event, TokenEvent):
        print(event.text, end="", flush=True)
    elif isinstance(event, ToolCallEvent):
        print(f"\n[Calling {event.name}...]")
    elif isinstance(event, ToolResultEvent):
        print(f"[{event.name} -> {'ok' if event.success else 'fail'}]")
```
```python
async for event in agent.arun_stream("Weather in Tokyo?"):
    if isinstance(event, TokenEvent):
        print(event.text, end="", flush=True)
```
model, mode. Fired once at the start of each run.
iteration. Fired at the start of each agent loop cycle.
text, iteration. Each token as it arrives from the model.
name, args. Fired when the model requests a tool call.
name, result, success, duration_ms. After tool execution.
tool_name, errors. Fired when a tool call fails validation.
tool_name, retry_count. Fired on tool call retry.
response, elapsed_ms, metrics. Fired once when the run finishes.
Multi-turn conversations work out of the box. The agent remembers prior turns automatically using pluggable strategies.
```python
from freeagent import Agent

agent = Agent(model="qwen3:8b", tools=[weather])
agent.run("What's the weather in Tokyo?")
agent.run("Convert that to Celsius")  # remembers Tokyo was 85°F
```
```python
from freeagent import Agent, SlidingWindow, TokenWindow

# Default: SlidingWindow(max_turns=20)
agent = Agent(model="qwen3:8b")

# Token-based budget (small context models)
agent = Agent(model="qwen3:4b", conversation=TokenWindow(max_tokens=3000))

# Stateless mode (each run() independent)
agent = Agent(model="qwen3:8b", conversation=None)
```
```python
# Saves to .freeagent/sessions/my-chat.json
agent = Agent(model="qwen3:8b", session="my-chat")
agent.run("Hello!")

# Later, in a new process — restores conversation
agent = Agent(model="qwen3:8b", session="my-chat")
```
Default. Keep the last N turns. Predictable token usage. SlidingWindow(max_turns=20)
Keep history that fits a token budget. Fills from newest to oldest. TokenWindow(max_tokens=3000)
Keep everything. Use with caution on small models — it will overflow the context window.
Subclass ConversationManager. Implement prepare(), commit(), clear().
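As a sketch of what a custom strategy might look like: the method names prepare(), commit(), and clear() come from the docs, but the signatures (what prepare() receives and returns) are assumptions here, and the class stands alone rather than subclassing the real ConversationManager. This toy strategy keeps the first turn (often the task statement) plus the most recent N turns:

```python
# Hypothetical custom strategy. Method names match the docs above;
# signatures and the message dict shape are illustrative assumptions.
class KeepFirstAndRecent:
    def __init__(self, max_recent: int = 6):
        self.max_recent = max_recent
        self._history = []  # list of {"role": ..., "content": ...} dicts

    def prepare(self, new_messages):
        """Return the message list to send for this run."""
        if len(self._history) <= self.max_recent + 1:
            kept = list(self._history)
        else:
            # First turn + the most recent N turns
            kept = [self._history[0]] + self._history[-self.max_recent:]
        return kept + list(new_messages)

    def commit(self, messages):
        """Record the turn after a successful run."""
        self._history.extend(messages)

    def clear(self):
        self._history.clear()
```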
FreeAgent queries Ollama's /api/show on init to detect model capabilities and auto-tune the framework. Small models get stripped defaults. Context window comes from the model, not a guess. Engine selection uses real capabilities, not a hardcoded list.
```python
from freeagent import Agent

# Auto-tuned: 2B-effective model → strips skills + memory tool
agent = Agent(model="gemma4:e2b")
print(agent.model_info.parameter_size)  # 5.1B
print(agent.model_info.is_small)        # True (MoE pattern)
print(len(agent.skills))                # 0
print(agent.config.context_window)      # 131072 (real)

# Auto-tuned: 8B model → keeps full defaults
agent = Agent(model="qwen3:8b")
print(agent.model_info.parameter_size)  # 8.2B
print(len(agent.skills))                # 2
print(agent.config.context_window)      # 40960 (real)
```
```python
# Force bundled skills on a small model
agent = Agent(model="gemma4:e2b", bundled_skills=True, memory_tool=True)

# Disable auto-tune entirely
agent = Agent(model="qwen3:8b", auto_tune=False)
```
Strip bundled skills and memory tool. Tiny models get overwhelmed by the extra context.
gemma3n:eXb, gemma4:eXb — treated as small regardless of actual param count (effective size matters).
Keep full defaults. This is the sweet spot where skills help and the memory tool doesn't overwhelm.
Set from model_info.context_length. qwen3:4b has 262k, llama3.1:latest has 131k — no more guessing.
Uses capabilities.includes("tools"), not a hardcoded model name list.
Auto-tune silently no-ops for vLLM/OpenAI-compat providers. Defaults are used as specified.
Every run is automatically traced. agent.trace() shows a complete timeline of what happened — model calls, tool calls, retries, validation errors — with relative timestamps.
```python
from freeagent import Agent, tool

@tool
def adder(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

agent = Agent(model="qwen3:4b", tools=[adder])
agent.run("What is 47 + 23?")
print(agent.trace())
```
User input, final response, elapsed ms, iteration count.
Iteration number, content preview, tool call count. Timings show the model's generation latency.
Tool name, arguments, success, duration_ms, result preview.
When a tool call fails validation — tool name and specific error messages.
When the agent retries after a validation error, with the retry count.
When the circuit breaker fires or the run hits its timeout.
```python
# One-line summary
print(agent.last_run.summary())
# Run 1: qwen3:4b (native) 12191ms, 2 iters, 1 tool calls

# Markdown report
print(agent.last_run.to_markdown())

# Raw trace events
for te in agent.last_run.trace_events:
    print(te.timestamp, te.event_type, te.data)
```
Hooks let you observe and modify agent behavior at every stage. Register via decorator or direct call. Hooks never crash the agent — exceptions are silently caught.
before_run · after_run
before_model · after_model
before_tool · after_tool
on_validation_error · on_retry · on_error
on_loop · on_max_iter · on_timeout
memory_load · memory_save · memory_update
```python
@agent.on("before_tool")
def log_tool(ctx):
    print(f"→ Calling {ctx.tool_name}({ctx.args})")

@agent.on("after_tool")
def cache_result(ctx):
    if ctx.result and ctx.result.success:
        agent.memory.set(
            f"cache.{ctx.tool_name}",
            ctx.result.data
        )

@agent.on("on_error")
def alert(ctx):
    send_to_slack(f"Agent error: {ctx.error}")
```
```python
@agent.on("before_tool")
def use_cache(ctx):
    # Skip the actual tool call if we have cached data
    cached = agent.memory.get(f"cache.{ctx.tool_name}")
    if cached:
        ctx.skip = True  # tool won't execute

@agent.on("after_run")
def sanitize(ctx):
    # Override the final response
    ctx.override_response = ctx.response.replace("password", "***")
```
```python
from freeagent import log_hook, cost_hook

# Logging — prints every lifecycle event
logger = log_hook(verbose=True)
agent.on("before_run", logger)
agent.on("before_tool", logger)
agent.on("after_tool", logger)
agent.on("after_run", logger)

# Cost tracking — counts tool calls per tool
track, stats = cost_hook()
agent.on("before_tool", track)
agent.run("check disk space")
print(stats())  # → {"calls": 1, "tools": {"system_info": 1}, "errors": 0}
```
File-backed JSON store that persists between runs. Auto-loads on agent start, auto-saves after each run. Memory context is injected into the system prompt so the model knows what it remembers.
```python
agent = Agent(
    model="llama3.1:8b",
    tools=[weather],
    memory_path="~/.freeagent/memory.json",  # persists to disk
)

# Pre-load preferences
agent.memory.set("user.name", "Alice", source="user")
agent.memory.set("user.units", "metric", source="user")

# The model sees this in its system prompt:
# ## Your Memory (facts you remember):
# - user.name: Alice
# - user.units: metric

agent.run("What's the weather?")
# Model knows to use metric because it's in memory
```
```python
from freeagent import Memory

mem = Memory(path="~/.freeagent/memory.json")

mem.set("user.name", "Alice")  # create/update
mem.get("user.name")           # → "Alice"
mem.get("missing", "default")  # → "default"
mem.has("user.name")           # → True
mem.delete("user.name")        # → True
mem.search("user.")            # → all keys starting with "user."
mem.all()                      # → full dict
mem.keys()                     # → list of keys
len(mem)                       # → entry count
"user.name" in mem             # → True

# Each entry tracks metadata:
# created_at, updated_at, access_count, source
```
```python
# Auto-cache tool results in memory
@agent.on("after_tool")
def auto_cache(ctx):
    if ctx.result and ctx.result.success:
        agent.memory.set(
            f"cache.{ctx.tool_name}.{hash(str(ctx.args))}",
            ctx.result.data,
            source="tool"
        )

# Use cache to skip redundant calls
@agent.on("before_tool")
def check_cache(ctx):
    key = f"cache.{ctx.tool_name}.{hash(str(ctx.args))}"
    if agent.memory.has(key):
        ctx.skip = True  # don't re-run
```
Write a function with type hints and a docstring. FreeAgent builds the JSON schema, Ollama spec, and ReAct description automatically.
```python
@tool
def lookup_user(username: str) -> dict:
    """Look up a user by username.

    username: The username to look up
    """
    return {"name": "Alice", "role": "engineer"}

# Auto-generated:
lookup_user.name                    # → "lookup_user"
lookup_user.schema()                # → JSON schema from type hints
lookup_user.to_ollama_spec()        # → Ollama tool format
lookup_user.to_react_description()  # → human-readable for ReAct
```
Keep schemas flat. One required field is ideal. Use strings over enums. Provide defaults. Every field you add is a chance for a small model to fail.
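As a sketch, here is what a tool following those guidelines might look like: flat schema, one required field, a plain string instead of an enum, and a default for everything else. The no-op decorator below exists only so the snippet runs without the SDK installed; in real use it would be freeagent's @tool, and the tool body is an illustrative stand-in.

```python
# Stand-in for the SDK's @tool decorator so this sketch runs standalone.
def tool(fn):
    return fn

# Flat schema, one required field (query), string over enum, default limit.
@tool
def search_notes(query: str, limit: int = 5) -> dict:
    """Search saved notes by keyword.

    query: The keyword to search for
    limit: Max results to return (default 5)
    """
    notes = ["buy milk", "ship release", "review PR"]  # toy data
    hits = [n for n in notes if query in n][:limit]
    return {"hits": hits, "count": len(hits)}
```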
Skills are markdown directories with SKILL.md files containing YAML frontmatter. They get injected into the system prompt automatically.
```markdown
---
name: nba-analyst
description: Basketball statistics expert
version: 1.0
tools: [search, calculator]
---

You are an NBA analyst. Always cite your sources.
When comparing players, use per-game averages.
```
```python
agent = Agent(
    model="qwen3:8b",
    tools=[search, calculator],
    skills=["./my-skills"],  # directory of skill folders
)
```
general-assistant and tool-user load automatically. ~157 tokens total.
Extend bundled skills. Duplicate names override (last wins).
build_skill_context(skills, max_chars=N) — truncates when over budget.
Built-in frontmatter parser handles the subset we need. Zero extra deps.
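A subset parser of this kind fits in a few lines. The sketch below is an illustration of the approach for the frontmatter shown above ('key: value' scalars plus inline lists), not the SDK's actual implementation:

```python
def parse_frontmatter(text):
    """Split '---'-delimited frontmatter from a markdown body.

    Handles only a YAML subset: 'key: value' scalars and inline
    lists like 'tools: [search, calculator]'. Illustrative sketch.
    """
    meta = {}
    if not text.startswith("---"):
        return meta, text
    head, _, body = text[3:].partition("\n---")
    for line in head.strip().splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list: split on commas, drop empties
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    return meta, body.lstrip("\n")
```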
All providers implement the same 3-method interface: chat(), chat_with_tools(), chat_with_format().
Default. Connects to localhost:11434. Native tool calling + constrained JSON via GBNF.
OpenAI-compatible with vLLM defaults. VLLMProvider(model="qwen3-8b")
Any OpenAI-compatible server: LM Studio, LocalAI, TGI. Custom API keys and headers.
```python
from freeagent import Agent, VLLMProvider

provider = VLLMProvider(model="qwen3-8b")
agent = Agent(model="qwen3-8b", provider=provider, tools=[my_tool])
```
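To illustrate the "one file per backend" claim, a toy provider might look like the following. The three method names match the protocol described above; the exact signatures and the return shape (a dict with content and tool_calls) are assumptions for illustration only:

```python
import json

class EchoProvider:
    """Toy backend that 'answers' by echoing the last user message.
    Method names follow the provider protocol; signatures are assumed."""

    def chat(self, messages, **options):
        last_user = next(m for m in reversed(messages) if m["role"] == "user")
        return {"content": f"echo: {last_user['content']}", "tool_calls": []}

    def chat_with_tools(self, messages, tools, **options):
        # A real backend would translate `tools` into its own API format.
        return self.chat(messages, **options)

    def chat_with_format(self, messages, schema, **options):
        # A real backend would constrain output to `schema` (e.g. via GBNF).
        return {"content": json.dumps({}), "tool_calls": []}
```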
The OpenAI-compat provider includes automatic recovery for common small-model issues:
Strips <think>...</think> from qwen3/deepseek responses.
Recovers arguments from code fences, embedded JSON in text, malformed strings.
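The general recovery technique can be sketched as a fallback chain: try a clean parse, then look inside a code fence, then grab the first JSON-looking span in free text. This is an illustration of the idea, not the SDK's actual code:

```python
import json
import re

def recover_json_args(raw):
    """Best-effort extraction of a JSON object from messy model output."""
    # 1. Clean parse
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        pass
    # 2. JSON inside a ``` or ```json code fence
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # 3. First {...} span embedded in surrounding text
    brace = re.search(r"\{.*\}", raw, re.DOTALL)
    if brace:
        try:
            return json.loads(brace.group(0))
        except json.JSONDecodeError:
            pass
    return None
```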
Built into every agent — no setup: agent.metrics
```python
agent.run("What's the weather?")

print(agent.metrics)               # quick summary
print(agent.metrics.tool_stats())  # per-tool breakdown
agent.metrics.to_json("m.json")    # export
```
Optional OpenTelemetry: pip install freeagent-sdk[otel] — traces and metrics flow automatically.
The key insight: asking a small model to think AND produce JSON in one shot fails. Split reasoning (free text) from structured output (constrained JSON). The model thinks naturally, then gives just the arguments with grammar constraints.
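A rough sketch of that two-phase pattern, with a stand-in complete() in place of a real model call. In a real phase 2 the backend would receive the schema (e.g. via Ollama's structured-output format parameter) to grammar-constrain the response; everything below is illustrative:

```python
import json

def complete(prompt, schema=None):
    """Stand-in for a model call. A real one would hit the backend,
    passing `schema` to constrain output when it is given."""
    if schema is None:
        return "The user wants weather for Tokyo, so I should call weather."
    return json.dumps({"city": "Tokyo"})

def call_tool_two_phase(user_msg, tool_schema):
    # Phase 1: unconstrained reasoning in free text
    thought = complete(f"Plan how to answer: {user_msg}")
    # Phase 2: constrained JSON — just the arguments, nothing else
    args_json = complete(
        f"Given your plan ({thought}), output the tool arguments as JSON.",
        schema=tool_schema,
    )
    return json.loads(args_json)
```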
Model says "gret"? → "Did you mean 'greet'?"
"42" → 42, "true" → True. Auto-fixed.
"Missing field 'city'. Schema: {city: string}." Concrete errors, not generic retries.
Same tool+args 3x = stuck. Max iterations = stop. Timeout = partial result.
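The first two guardrails reduce to small, well-known techniques: close-match lookup for misspelled tool names and type coercion for stringly-typed arguments. A sketch assuming nothing about the validator's real internals:

```python
import difflib

def suggest_tool(name, known):
    """Fuzzy-match a misspelled tool name ('gret' → suggest 'greet')."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.6)
    return matches[0] if matches else None

def coerce(value, target):
    """Coerce string-typed args models tend to emit ('42' → 42)."""
    if isinstance(value, target):
        return value
    if target is bool and isinstance(value, str):
        return value.strip().lower() in ("true", "1", "yes")
    return target(value)
```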
Tested with the same eval suite across 4 models and 100+ runs. Full results in evaluation/.
Conversation manager delivers 87% on multi-turn conversations out of the box. SlidingWindow default needs no configuration.
FreeAgent boosts llama3.1 tool calling accuracy by +13% through fuzzy name matching and type coercion.
gemma4:e2b (2B) achieves 80% on multi-turn via text-based ReAct. Matches llama3.1 (8B) at 1/4 the size. No parse errors.
Bundled tool-user skill improves qwen3:4b by +20%. Skills are neutral for larger models — they don't need the guidance.
All models understand the single-tool action pattern (4-5/5 usage rate). Write operations had a .md extension bug (now fixed).
All failures are accuracy issues (wrong answer, wrong tool), never framework errors. The guardrails work. 100+ eval runs, zero crashes.
```
freeagent/
├── pyproject.toml
├── README.md
├── freeagent/
│   ├── __init__.py          ← Agent, tool, Memory, hooks exports
│   ├── agent.py             ← Agent class w/ hooks + memory integration
│   ├── hooks.py             ← 13 events, HookRegistry, log_hook, cost_hook
│   ├── memory.py            ← Memory class, MemoryEntry, persistence
│   ├── tool.py              ← @tool decorator, schema gen
│   ├── config.py            ← AgentConfig, model profiles
│   ├── messages.py          ← Message types, error feedback
│   ├── validator.py         ← fuzzy match, coercion, field checks
│   ├── circuit_breaker.py   ← loop detect, iteration limits
│   ├── engines/
│   │   └── __init__.py      ← NativeEngine + ReactEngine
│   ├── providers/
│   │   └── ollama.py        ← stdlib-only Ollama client
│   └── tools/
│       ├── system_info.py   ← disk, cpu, os
│       ├── calculator.py    ← safe math
│       └── shell.py         ← sandboxed commands
└── examples/
    ├── 01_hello.py
    ├── 02_builtin_tools.py
    ├── 03_custom_tool.py
    ├── 04_hooks.py          ← NEW: hooks demo
    └── 05_memory.py         ← NEW: memory demo
```
Token-by-token streaming via agent.run_stream() / arun_stream(). Works for tool-using agents, not just chat. Semantic events: TokenEvent, ToolCallEvent, etc.
Queries Ollama /api/show. Small models get stripped defaults. Context window from the model. Engine selection from real capabilities.
agent.trace() full timeline with relative timestamps. run_start, model_call_*, tool_*, validation_*, run_end.
freeagent ask, freeagent chat, freeagent models, freeagent trace, freeagent version. Stdlib argparse, no extra deps.
Multi-turn by default. SlidingWindow, TokenWindow, UnlimitedHistory, session persistence.
Markdown-backed files, single memory tool, auto_load, daily logs, caching.
Markdown SKILL.md with frontmatter, bundled defaults, user extensions.
Ollama (streaming), vLLM (streaming), OpenAI-compat (streaming).
Stdio + streamable HTTP transports, schema conversion.
Built-in metrics, optional OTEL export, per-tool stats, full trace events.
First-class Pydantic schema support via agent.run_structured(). Already works under the hood via the constrained JSON path.
Agent-as-tool composition. Spawn a specialist from inside a tool. Shared conversation context or isolated.
allow / deny / ask per tool. Safety rails for production deployments.