A non-coding coding agent

We love coding agents. They can build a full-featured SaaS and potentially make you a millionaire if you leave them running overnight with the right prompt. They will burn your GPU or your budget, include a few unprompted vulnerabilities, will bloat your code risking your sanity once you start debugging it, will put emojis in your comments, and will ultimately make you question your life choices.

So I thought, if they are so cool — it must be interesting to build one myself and steal the fame of Anthropic.

But as you know, this blog often sits on the edge of absurd programming, so the agent we’re building today is probably the first non-coding coding agent.

The agent would be called Socreates (yes, with a typo), and it is a Socratic agent. It will catch your mistakes, challenge your decisions, act as a rubber duck with brutal opinions — but it will never touch your code. You have to type it all yourself. And I’ve heard many developers actually enjoyed writing code in the good old days. Some even called themselves “coders”.

A gent

Before diving into code, let’s clarify what a “coding agent” actually is.

We know that an LLM is just a next-token predictor. A reasoning model is the same LLM trained to spend more time on intermediate steps. An agent is a control loop that uses an LLM to decide what to inspect, which tools to call, and when to stop.

This agentic loop is why Claude Code feels way more capable than the same model in a chat window.

A coding agent in its simplest form has only a few core jobs:

Some agents do more, like delegating certain tasks to bounded sub-agents, orchestrating them and doing things in parallel, but we keep it simple. One loop, four tools, no dependencies.

The loop

The loop itself is almost trivial:

This is the entire “agent” part. Everything else is plumbing and parsing to make various components work together (but isn’t it the essence of modern programming?)

LLM

Models are getting better and better. Some people can afford to run them locally, some can afford to run them in the cloud, others can afford a job that pays for Claude API keys. To make swapping LLMs easier, we define one interface for all of them, and it’s a rather simple one:

type LLM interface {
    Chat(ctx context.Context, req ChatRequest) (*ChatResponse, error)
}

type ChatRequest struct {
    Messages []Message
    Tools    []Tool
}

type ChatResponse struct {
    Content   string
    ToolCalls []ToolCall
    Usage     Usage
}

ChatRequest is a conversation history and available tools schema. ChatResponse is text content and/or structured tool calls (if the model wants to do something). Usage tracks token consumption so we can print stats after each turn and decide whether it’s worth it. The agent wouldn’t know if it’s talking to a silly 7B model or DeepSeek in the cloud.
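One practical payoff of such a narrow interface is that the agent can be exercised without any model at all. The sketch below is an illustration, not code from the repository: scriptedLLM is a hypothetical fake that replays canned responses, and the field sets of Message, Tool, ToolCall and Usage are assumptions made only so the example compiles.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Minimal stand-ins for the article's types; the field sets are assumptions.
type Message struct{ Role, Content string }
type Tool struct{ Name string }
type ToolCall struct{ ID, Name, Arguments string }
type Usage struct{ InputTokens, OutputTokens int }

type ChatRequest struct {
	Messages []Message
	Tools    []Tool
}

type ChatResponse struct {
	Content   string
	ToolCalls []ToolCall
	Usage     Usage
}

type LLM interface {
	Chat(ctx context.Context, req ChatRequest) (*ChatResponse, error)
}

// scriptedLLM (hypothetical) replays canned responses, one per call.
type scriptedLLM struct {
	replies []ChatResponse
	calls   int
}

// next returns the next canned response or an error when none are left.
func (s *scriptedLLM) next() (*ChatResponse, error) {
	if s.calls >= len(s.replies) {
		return nil, errors.New("scriptedLLM: script exhausted")
	}
	r := s.replies[s.calls]
	s.calls++
	return &r, nil
}

// Chat ignores the request content and only honours context cancellation.
func (s *scriptedLLM) Chat(ctx context.Context, req ChatRequest) (*ChatResponse, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return s.next()
}

func main() {
	var llm LLM = &scriptedLLM{replies: []ChatResponse{
		{ToolCalls: []ToolCall{{ID: "call_1", Name: "list_files", Arguments: "{}"}}},
		{Content: "Where are all the tests?"},
	}}
	for i := 0; i < 2; i++ {
		resp, err := llm.Chat(context.Background(), ChatRequest{})
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("turn %d: %d tool call(s), content=%q\n", i+1, len(resp.ToolCalls), resp.Content)
	}
}
```

The same trick works for any real provider: anything that satisfies Chat can be plugged in.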

Both Ollama and OpenAI-compatible APIs support “native tool calling”. You send a tools array describing the available functions as JSON Schema, and the model responds with structured tool_calls. The conversation looks roughly like this:

> system: "You are a coding companion..."
> user: "review my code"
< assistant: {tool_calls: [{function: {name: "list_files", arguments: "{}"}}]}
> tool: "[F] main.go\n[D] pkg/"    (tool_call_id: "call_1")
< assistant: {tool_calls: [{function: {name: "read_file", arguments: "{\"path\":\"main.go\"}"}}]}
> tool: "[main.go: lines 1-1000 of 1000]\n   1: package main\n..."   (tool_call_id: "call_2")
< assistant: "Why did you put all 1000 lines in one file? Where are all the tests?"
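For reference, here is a rough sketch of what one entry of that tools array could look like when built in Go. The outer shape (type "function" wrapping a name, a description and JSON Schema parameters) follows the OpenAI-compatible format. The toolDef and functionDef type names and the description text are made up for illustration, and the parameter set (path, start) is a guess based on the calls shown in this article.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolDef mirrors one entry of the OpenAI-compatible "tools" array.
type toolDef struct {
	Type     string      `json:"type"`
	Function functionDef `json:"function"`
}

type functionDef struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	Parameters  map[string]any `json:"parameters"` // a JSON Schema object
}

// readFileTool describes a read_file tool; the exact parameters are assumed.
func readFileTool() toolDef {
	return toolDef{
		Type: "function",
		Function: functionDef{
			Name:        "read_file",
			Description: "Read a file relative to the workspace root.",
			Parameters: map[string]any{
				"type": "object",
				"properties": map[string]any{
					"path":  map[string]any{"type": "string"},
					"start": map[string]any{"type": "integer"},
				},
				"required": []string{"path"},
			},
		},
	}
}

// toolsJSON renders the tools array as it would appear in a request body.
func toolsJSON() (string, error) {
	b, err := json.MarshalIndent([]toolDef{readFileTool()}, "", "  ")
	return string(b), err
}

func main() {
	out, err := toolsJSON()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```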

Each provider needs its own HTTP client because the wire formats slightly differ:

These are boring details, and the implementations are boring, too. You can check them out on GitHub (link at the end).
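To give one concrete example of such a difference: OpenAI-compatible APIs return tool-call arguments as a JSON-encoded string, while Ollama's chat API returns them as a plain JSON object. The decodeArgs helper below is a hypothetical sketch of how both shapes could be normalised into a map. It is not taken from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeArgs accepts tool-call arguments either as a JSON object or as a
// JSON string that itself contains an object, and returns a map either way.
func decodeArgs(raw json.RawMessage) (map[string]any, error) {
	args := map[string]any{}
	if len(raw) == 0 {
		return args, nil
	}
	// Case 1: arguments encoded as a string, e.g. "{\"path\":\"main.go\"}".
	var s string
	if err := json.Unmarshal(raw, &s); err == nil {
		if s == "" {
			return args, nil
		}
		if err := json.Unmarshal([]byte(s), &args); err != nil {
			return nil, fmt.Errorf("arguments string is not a JSON object: %w", err)
		}
		return args, nil
	}
	// Case 2: arguments already an object, e.g. {"path":"main.go"}.
	if err := json.Unmarshal(raw, &args); err != nil {
		return nil, fmt.Errorf("arguments are neither a string nor an object: %w", err)
	}
	return args, nil
}

func main() {
	inputs := []string{
		`"{\"path\":\"main.go\"}"`, // string-encoded object
		`{"path":"main.go"}`,       // plain object
	}
	for _, raw := range inputs {
		args, err := decodeArgs(json.RawMessage(raw))
		fmt.Println(args, err)
	}
}
```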

System Prompt

Now you may feel like a real markdown engineer.

In coding agents the prompt is usually assembled from multiple layers: a stable prefix (instructions + tool schemas + workspace summary), then the changing session state (recent history + user request). That’s how you end up using cached tokens (cheap ones) for the prefix and expensive ones for the rest of it.

Socreates is simple enough and the system prompt (“stable prefix”) is just a brief message with a workspace path injected with a printf:

You are a coding companion — a sharp, critical reviewer who catches bugs
and challenges decisions. You NEVER write code. The developer types all code;
you ask questions, spot issues, and verify correctness using tools.

Workspace root: socreates (use "." or relative paths like "main.go" in all tool calls)

## TOOL USAGE
- ALL paths are relative to workspace root. NEVER use absolute paths.
- read_file shows "[file: lines X-Y of Z]" — plan reads to cover the file in 1-2 calls max.
- search returns up to 30 matches. Use specific patterns.
- RESPOND within 2-3 tool rounds. Do not explore exhaustively.

## BEHAVIOR
1. No code, no snippets, no pseudocode — ever.
2. Be concise: 2-4 points per response. No preamble, no filler.
3. Cite specific lines when pointing out issues.
4. Challenge assumptions: "What if X is nil?", "Did you handle the error on line N?"
5. When code looks correct, say so in one line and stop.

Context: task=review my code files=[main.go, tools.go]

This is the part that LLMs wrote for me after arguing with each other for some time, and I highly doubt it’s a good one. But it does the job. Unfortunately, this prompt is the essence of the product. You change the “personality” or the rules - and you get a lobotomised junior rubber duck instead of an experienced critic.

The last line (or lines, in practice) is session memory: the agent keeps track of the current task description and recently touched files, appending them to the system prompt so the model has continuity across tool rounds without needing to re-read everything. This is the dynamic part of the prompt.
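A minimal sketch of how the stable prefix and the dynamic tail could be glued together. The buildSystemPrompt name and its two arguments come from the loop shown later in this article; the body, the abbreviated promptTemplate and the contextLine helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

type Memory struct {
	Task  string
	Files []string
}

// promptTemplate is abbreviated here; the full text is the prompt shown above.
const promptTemplate = `You are a coding companion ...

Workspace root: %s (use "." or relative paths like "main.go" in all tool calls)`

// contextLine renders the dynamic tail of the prompt. It returns "" when
// nothing is known yet, so a fresh session adds no empty context line.
func contextLine(m Memory) string {
	if m.Task == "" && len(m.Files) == 0 {
		return ""
	}
	return fmt.Sprintf("Context: task=%s files=[%s]", m.Task, strings.Join(m.Files, ", "))
}

// buildSystemPrompt injects the workspace path and appends session memory.
func buildSystemPrompt(cwd string, m Memory) string {
	prompt := fmt.Sprintf(promptTemplate, cwd)
	if line := contextLine(m); line != "" {
		prompt += "\n\n" + line
	}
	return prompt
}

func main() {
	m := Memory{Task: "review my code", Files: []string{"main.go", "tools.go"}}
	fmt.Println(buildSystemPrompt("socreates", m))
}
```

Keeping the memory at the very end means the long instruction block in front of it stays byte-for-byte identical between requests, which is what prefix caching needs.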

Tool schemas are passed via the API’s native tools parameter, so at least those are well-structured and not random markdown. Tools have brief descriptions and examples on how to call them.

Looooop

We interrogate our LLM in a loop. On each iteration we send the system prompt plus the conversation so far, until the model stops “thinking” and gives us a final answer. The number of iterations is capped, otherwise some witty models would happily drain your budget on every simple request. We politely follow the model’s demands and call the requested tools, ideally in parallel.

func (a *Agent) Chat(ctx context.Context, input string) (string, error) {
    a.messages = append(a.messages, Message{Role: RoleUser, Content: input})

    for step := range a.maxSteps {
        a.truncateHistory()
        system := buildSystemPrompt(a.cwd, a.memory)
        if step >= a.maxSteps-2 { // LLM has been warned, time to wrap it up!
            system += "\n\n[SYSTEM: Step limit reached. You MUST respond now.]"
        }
        msgs := append([]Message{{Role: RoleSystem, Content: system}}, a.messages...)

        // Too late. Don't run the tools at all on a final iteration
        var tools []Tool
        if step < a.maxSteps-1 {
            tools = a.tools
        }

        resp, err := a.llm.Chat(ctx, ChatRequest{Messages: msgs, Tools: tools})
        if err != nil { return "", err }

        if len(resp.ToolCalls) == 0 { // Final answer, we're done!
            a.messages = append(a.messages, Message{Role: RoleAssistant, Content: resp.Content})
            return resp.Content, nil
        }

        // LLM needs more information: call tools
        a.messages = append(a.messages, Message{Role: RoleAssistant, ToolCalls: resp.ToolCalls})
        results := executeInParallel(resp.ToolCalls)
        for _, r := range results {
            a.messages = append(a.messages, Message{Role: RoleTool, Content: r.Content, ToolCallID: r.ID})
        }
    }
    return "I'm lost. Try rephrasing?", nil
}

The user message is added once before the loop, not on every iteration. This mistake cost me a few cents on DeepSeek.

Step limit is rather generous (to my taste) - I stop after 10 iterations. I’ve heard most agents expect only a couple of tool calls per iteration and only 3-5 iterations. But the models I tried were slow thinkers.

Warning the LLM about the iteration limit helped avoid scenarios where it kept thinking and never gave a proper answer at the end.

Parallel tool calling is likely not so important for a toy agent – reading files is quick – but it’s Go, so at least some concurrency is expected.
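The executeInParallel helper is not shown in the loop above, so here is one way it might look. This is a sketch under assumptions: the run callback and the ToolResult type are hypothetical, added so the example stands on its own. The one property worth keeping is that results come back in the same order as the calls, so every tool message can be matched to its tool_call_id.

```go
package main

import (
	"fmt"
	"sync"
)

type ToolCall struct{ ID, Name string }
type ToolResult struct{ ID, Content string }

// executeInParallel runs every call in its own goroutine and returns the
// results in the same order as the calls. The run parameter is an
// assumption; the version used in the loop above takes only the calls.
func executeInParallel(calls []ToolCall, run func(ToolCall) string) []ToolResult {
	results := make([]ToolResult, len(calls))
	var wg sync.WaitGroup
	for i, c := range calls {
		wg.Add(1)
		go func(i int, c ToolCall) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no mutex is needed.
			results[i] = ToolResult{ID: c.ID, Content: run(c)}
		}(i, c)
	}
	wg.Wait()
	return results
}

func main() {
	calls := []ToolCall{
		{ID: "call_1", Name: "list_files"},
		{ID: "call_2", Name: "read_file"},
	}
	results := executeInParallel(calls, func(c ToolCall) string {
		return "output of " + c.Name
	})
	for _, r := range results {
		fmt.Printf("%s: %s\n", r.ID, r.Content)
	}
}
```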

Context compaction

Token budgets keep me awake. Without compaction, a long conversation eventually exceeds the model’s context window, which makes the model dumber and also starts costing real money. Socreates uses a naive two-pass approach:

const maxHistoryTokens = 16000 // around 64KB of text, which is enough for everyone, right?

func (a *Agent) truncateHistory() {
    // Pass 1: Trim historical tool outputs (keep the first 400 bytes of each)
    for i := range a.messages {
        if a.historyTokens() <= maxHistoryTokens {
            return // under budget, leave the remaining outputs intact
        }
        m := &a.messages[i]
        // The length guard keeps the slice below from panicking on short content.
        if m.Role == RoleTool && len(m.Content) > 400 && estimateTokens(m.Content) > 100 {
            m.Content = m.Content[:400] + "\n...[truncated]"
        }
    }
    // Pass 2: Drop complete req/res turns from the head of the list
    for len(a.messages) > 2 && a.historyTokens() > maxHistoryTokens {
        cut := 1
        if a.messages[0].Role == RoleAssistant && len(a.messages[0].ToolCalls) > 0 {
            // we must not break message + tool connections, DeepSeek rejects invalid references in tool calls
            // so we cut some extra, if we have to
            for cut < len(a.messages) && a.messages[cut].Role == RoleTool {
                cut++
            }
        }
        a.messages = a.messages[cut:]
    }
}

I was surprised that we can’t drop an assistant message that carries tool calls without also dropping the matching tool responses. Orphaned tool responses seem to be invalid in the OpenAI/DeepSeek protocol. But that only makes compaction more aggressive, which might work in our favour cost-wise.
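The compaction code relies on two helpers that are not shown, estimateTokens and historyTokens. A plausible sketch, assuming the common rule of thumb of roughly four bytes per token (which matches the 16000 tokens, around 64KB, figure in the constant above). Here historyTokens is written as a plain function over a slice rather than a method on the agent, purely to keep the example self-contained.

```go
package main

import "fmt"

type Message struct {
	Role    string
	Content string
}

// estimateTokens approximates the token count as one token per four bytes,
// rounding up. Real tokenizers differ, especially on code and non-English text.
func estimateTokens(s string) int {
	return (len(s) + 3) / 4
}

// historyTokens sums the estimates over all messages. Tool-call payloads
// are ignored here, which is a simplification.
func historyTokens(msgs []Message) int {
	total := 0
	for _, m := range msgs {
		total += estimateTokens(m.Content)
	}
	return total
}

func main() {
	msgs := []Message{
		{Role: "user", Content: "review my code"},
		{Role: "assistant", Content: "Where are all the tests?"},
	}
	fmt.Println("estimated tokens:", historyTokens(msgs))
}
```

An estimate like this is deliberately crude. It only has to be good enough to decide when to start trimming, not to predict the bill.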

Tools

A plain LLM can suggest commands in markdown. An agent with tools receives them in a structured way and executes them. This allows us to validate inputs and have at least some safety boundaries for tool calling. At least the model can’t hallucinate arbitrary actions without us noticing.

Since our agent is a non-coding one, it cannot write files. I think we could go pretty far with just four tools:

Every tool output is truncated after 16K characters (~4K tokens). We also inform the model about file sizes, so it has a chance to use tools wisely. Our read_file tool also returns a continuation hint: [150 more lines. Use start=501 to continue.], telling the model how to proceed. At least with DeepSeek I’ve seen it helping.
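Here is a sketch of how the header and the continuation hint could be produced. The output format is copied from the examples in this article. The formatLines function, its parameters and the "past the end" message are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// formatLines renders up to limit lines starting at the 1-based line start,
// with a "[name: lines X-Y of Z]" header and, if lines remain, a hint that
// tells the model where to continue. A limit below 1 means "to the end".
func formatLines(name string, lines []string, start, limit int) string {
	total := len(lines)
	if start < 1 {
		start = 1
	}
	if start > total {
		// Assumed wording; anything that tells the model to stop would do.
		return fmt.Sprintf("[%s: start=%d is past the end, file has %d lines]", name, start, total)
	}
	end := total
	if limit > 0 && start+limit-1 < total {
		end = start + limit - 1
	}
	var b strings.Builder
	fmt.Fprintf(&b, "[%s: lines %d-%d of %d]\n", name, start, end, total)
	for i := start; i <= end; i++ {
		fmt.Fprintf(&b, "%4d: %s\n", i, lines[i-1])
	}
	if rest := total - end; rest > 0 {
		fmt.Fprintf(&b, "[%d more lines. Use start=%d to continue.]", rest, end+1)
	}
	return b.String()
}

func main() {
	// A fake 650-line file, read 500 lines at a time.
	lines := make([]string, 650)
	for i := range lines {
		lines[i] = fmt.Sprintf("// line %d", i+1)
	}
	out := formatLines("main.go", lines, 1, 500)
	// Print only the header and the hint to keep the demo short.
	fmt.Println(out[:strings.Index(out, "\n")])
	fmt.Println(out[strings.LastIndex(out, "\n")+1:])
}
```

With 650 lines and a limit of 500, the demo prints the header "[main.go: lines 1-500 of 650]" followed by the hint "[150 more lines. Use start=501 to continue.]".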

Our path resolution is simple, but good enough to avoid reading from ./pkg/../../../etc/passwd.

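// resolvePath joins path onto root and returns "" if the result escapes root.
// It is purely lexical: symlinks inside the workspace are not resolved.
func resolvePath(root, path string) string {
    abs := filepath.Clean(filepath.Join(root, path))
    rel, err := filepath.Rel(root, abs)
    // Match ".." only as a whole path element, so a file named "..notes" still resolves.
    if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
        return ""
    }
    return abs
}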
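A quick table-driven check shows which inputs pass. It assumes a POSIX-style, hypothetical workspace root of /work/socreates. The function is restated here with a separator-aware prefix check, so that a file literally named ..notes is not rejected. Note that an absolute path such as /etc/passwd is simply re-rooted under the workspace by filepath.Join.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolvePath joins path onto root and returns "" if the result escapes root.
// The check is lexical only; symlinks are not resolved.
func resolvePath(root, path string) string {
	abs := filepath.Clean(filepath.Join(root, path))
	rel, err := filepath.Rel(root, abs)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return ""
	}
	return abs
}

func main() {
	root := "/work/socreates" // hypothetical workspace root
	inputs := []string{
		"main.go",
		".",
		"pkg/../../../etc/passwd", // climbs out of the workspace
		"/etc/passwd",             // absolute, ends up under root
		"..notes",                 // odd name, but inside the workspace
	}
	for _, p := range inputs {
		resolved := resolvePath(root, p)
		if resolved == "" {
			resolved = "(rejected)"
		}
		fmt.Printf("%-26s -> %s\n", p, resolved)
	}
}
```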

And yes, there is an option to auto-approve all commands if you run it in an isolated environment that satisfies your levels of paranoia.

Memory

A coding agent should survive across turns and restarts. Our session state is a full conversation: user messages, assistant responses, tool calls, and tool results. Like a “transcript”, stored as JSONL inside .socreates/session.json. One session at a time. I’m not good at multi-tasking anyway.

type Session struct {
    ID       string    `json:"id"`
    Created  time.Time `json:"created"`
    Memory   Memory    `json:"memory"`
    Messages []Message `json:"messages"`
}

type Memory struct {
    Task  string   `json:"task"`  // First user message
    Files []string `json:"files"` // Last files touched by tools
}

On /reset, the current session is archived (renamed) and a fresh one starts. On restart, the agent loads its last session and continues where it left off. The Memory provides some clues appended to the system prompt, so that the model would know what we were working on without reading the entire history.

In other words, the stored transcript is complete (every message, in full), but what we send to the LLM as context is compacted and truncated. A bounded window, I think it’s called.
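A rough sketch of the save, load and archive cycle. To keep it short it assumes the session is written as a single JSON document (not line-delimited records). The saveSession, loadSession and archiveSession names, the temp-file write and the timestamp suffix for archives are all hypothetical choices, and Message is cut down to two fields.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type Memory struct {
	Task  string   `json:"task"`
	Files []string `json:"files"`
}

type Session struct {
	ID       string    `json:"id"`
	Created  time.Time `json:"created"`
	Memory   Memory    `json:"memory"`
	Messages []Message `json:"messages"`
}

// saveSession writes to a temp file first and renames it into place, so a
// crash mid-write does not leave a half-written session behind.
func saveSession(path string, s *Session) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// loadSession returns the stored session, or a fresh one if none exists.
func loadSession(path string) (*Session, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return &Session{Created: time.Now()}, nil
	}
	if err != nil {
		return nil, err
	}
	var s Session
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

// archiveSession renames the current session file out of the way (as on
// /reset). The timestamp suffix is an assumed naming scheme.
func archiveSession(path string) error {
	err := os.Rename(path, fmt.Sprintf("%s.%d", path, time.Now().Unix()))
	if errors.Is(err, os.ErrNotExist) {
		return nil // nothing to archive
	}
	return err
}

func main() {
	dir, err := os.MkdirTemp("", "socreates")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, ".socreates", "session.json")
	s := &Session{ID: "demo", Created: time.Now(), Memory: Memory{Task: "review my code"}}
	if err := saveSession(path, s); err != nil {
		fmt.Println("error:", err)
		return
	}
	loaded, err := loadSession(path)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("restored task:", loaded.Memory.Task)
}
```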

REPL

The CLI is just stdin/stdout, no TUI, no fancy blinking animations. I run it with rlwrap for some line editing, so it looks like this:

> review my error handling
  (thinking...)
  -> list_files(map[path:.])
  -> read_file(map[path:main.go])

What happens on line 247 if llm.Chat returns an error? You return it immediately,
but the user message was already appended to history on line 249. Doesn't that
mean a failed request still pollutes the conversation state?

  [tokens: 12340 in, 156 out | session: 24680 in, 312 out]
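A bare-bones version of such a read-eval-print loop, including the stats line, might look like the sketch below. The Usage field names, the repl and statsLine functions and the chat callback are all assumptions. The demo feeds the loop from a string instead of a terminal, so it runs without input.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// Usage counts tokens; the field names are assumed.
type Usage struct{ In, Out int }

func (u *Usage) add(v Usage) {
	u.In += v.In
	u.Out += v.Out
}

// statsLine formats per-turn and per-session token counts.
func statsLine(turn, session Usage) string {
	return fmt.Sprintf("[tokens: %d in, %d out | session: %d in, %d out]",
		turn.In, turn.Out, session.In, session.Out)
}

// repl reads one request per line, hands it to chat, and prints the reply
// followed by token stats. It returns when the input is exhausted.
func repl(in io.Reader, out io.Writer, chat func(string) (string, Usage, error)) {
	var session Usage
	sc := bufio.NewScanner(in)
	fmt.Fprint(out, "> ")
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line != "" {
			reply, turn, err := chat(line)
			if err != nil {
				fmt.Fprintln(out, "error:", err)
			} else {
				session.add(turn)
				fmt.Fprintf(out, "\n%s\n\n  %s\n", reply, statsLine(turn, session))
			}
		}
		fmt.Fprint(out, "> ")
	}
	fmt.Fprintln(out)
}

func main() {
	// A canned chat function stands in for the real agent.
	fake := func(input string) (string, Usage, error) {
		return "Did you handle the error on line 247?", Usage{In: 12340, Out: 156}, nil
	}
	repl(strings.NewReader("review my error handling\n"), os.Stdout, fake)
}
```

Swapping strings.NewReader for os.Stdin turns the demo into an interactive prompt.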

I’ve tested it with qwen, llama, gemma and the DeepSeek API, with mixed results, of course. But it’s not complete rubbish, so I’m happy.

At least it was a nice experiment: what would a coding agent look like if you treated it as a fellow programmer, never let it touch your keyboard, but were happy to hear its grunting and nagging?

If you want to give it a try – it’s on GitHub: github.com/zserge/socreates. Suggestions, contributions and feedback are always welcome!

I hope you’ve enjoyed this article. You can follow me and contribute on GitHub, Mastodon, Twitter, or subscribe via RSS.

May 25, 2026
