How AI Coding Actually Works

Jake Arntson

June 16, 2026 · 13 min read

When I first started building with AI, the results were all over the place. Some days the output was great: I'd describe a feature in a sentence or two and get back a clean, working implementation with the tests already wired up. Other days it did something head-scratching, like "fixing" a failing test by quietly deleting the assertion that was catching the bug. Same tool, same me, different results.

What I've learned is that how you use these tools has a direct impact on the result. The models have gotten a lot better over the past year, and so has the way I work with them. I'm genuinely happy with my process now, though it still takes plenty of guidance, review, and spurts of frustration to get there. As with most things in software, getting the most out of these tools starts with understanding how they actually work.

👋 Welcome to my series: Make it Better. Make it Faster.
This is Part I. It covers how these tools actually work and the levers you control when you use them. The rest of the series goes into what's actually changed about building software with AI, what "quality" really means, and the way I work day to day.
I wrote it for anyone who builds software, whether you've got fifteen years in, just graduated from a computer science program, or vibe-coded your way here.

The engine: an LLM

An LLM (large language model) is the engine, "the AI." It's a model that predicts plausible text. An LLM doesn't look things up in a database of facts; it generates what should come next. That's why it can be fluent and confident and still completely wrong. That failure is called a hallucination: an answer that looks right and simply isn't. A function that doesn't exist, a library that was never written, a citation to nothing.

On top of that, the model is tuned to be agreeable. Part of its training is shaped by human ratings, and since people tend to prefer answers that agree with them, it leans toward telling you what you want to hear.

Where does it learn all this?
An LLM is "trained" by reading an enormous pile of text: a huge slice of the public internet, books, articles, and a great deal of code. And that pile doesn't contain just the good code, but all of it: every elegant library and every copy-pasted Stack Overflow answer, every careful security check and every injection bug waiting to happen. It learned from the average, and the average sometimes isn't "good". It also only knows what was in its training data, up to a cutoff date, and usually has never seen your private codebase. So it can be confidently out of date, and because it completes patterns rather than looking things up, it can hand you something that sounds exactly right but isn't.

The model is not deterministic. Ask the exact same question twice and you can get two different answers. That's a feature when you want options, like a few different takes on the copy for a landing page or names for a new function. It's a problem when a task needs the same answer every time, like approving a payment or calculating a customer's invoice.
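To make the randomness concrete, here's a minimal sketch of how a next token might be picked. It's a toy, not how any particular provider implements sampling: the scores are made up, and `sample_next_token` is a hypothetical helper. The idea it shows is the "temperature" knob: at zero this toy always takes the top choice, and higher values let lower-scoring options through.

```python
import math
import random

def sample_next_token(scores, temperature=1.0, rng=random):
    """Pick one token from a {token: score} map.

    temperature == 0 always returns the top-scoring token (repeatable);
    higher values flatten the distribution and add variety.
    """
    if temperature == 0:
        return max(scores, key=scores.get)
    # Softmax with temperature: scale the scores, then turn them into weights.
    top = max(scores.values())
    weights = {tok: math.exp((s - top) / temperature) for tok, s in scores.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Made-up scores for the word that follows "def calculate_".
scores = {"total": 2.0, "invoice": 1.5, "tax": 0.5}

greedy = [sample_next_token(scores, temperature=0) for _ in range(5)]
varied = [sample_next_token(scores, temperature=1.0) for _ in range(5)]
print(greedy)  # the same token five times
print(varied)  # can differ from run to run
```

Real systems have other sources of variation too, so treat a low temperature as "more repeatable," not as a guarantee.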

The agent

An agent is what you get when you put an LLM in a loop and give it tools, so it can actually do things instead of just chatting back: pursue a goal, take an action, look at the result, and decide what to do next. For example, point one at a failing test and it'll search the codebase, open the files that look relevant, make an edit, rerun the test, and keep looping until it passes, with no further typing from you. Agentic just means working like that.

When the model "calls a tool," it isn't running anything itself. It writes out a request, like "run run_tests with these arguments," and something else runs it and hands back the result. That back-and-forth is tool calling (or function calling).
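Here's a rough sketch of that hand-off. The JSON shape, the `TOOLS` table, and `execute_tool_call` are all assumptions for illustration; each provider defines its own request format. What matters is the split: the model produces text, and ordinary code decides whether and how to run it.

```python
import json

def run_tests(path):
    """Hypothetical tool: a real one would shell out to the test runner."""
    return f"ran tests in {path}: 1 failed"

# The tools the surrounding program is willing to run, by name.
TOOLS = {"run_tests": run_tests}

def execute_tool_call(model_output):
    """Parse the model's written request and run it on the model's behalf."""
    request = json.loads(model_output)
    tool = TOOLS.get(request["name"])
    if tool is None:
        return f"error: unknown tool {request['name']!r}"
    return tool(**request["arguments"])

# The model only produces text like this; it never executes anything itself.
model_output = '{"name": "run_tests", "arguments": {"path": "tests/"}}'
result = execute_tool_call(model_output)
print(result)  # this string is what gets handed back to the model
```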

The harness: your console

That loop doesn't run itself. A harness is the software around the model that makes it happen: it feeds the model context, takes the actions the model decides on and actually carries them out (reading and writing files, running commands, running your tests), and loops the results back in. The model decides; the harness acts. Put a model, a harness, and a goal together, and you have an agent. Claude Code, Codex, and Cursor are harnesses.
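A stripped-down sketch of the loop might look like the following. Everything in it is a stand-in: `scripted_model` replaces a real LLM with a few hard-coded decisions, and the two tools just return strings. Real harnesses are far more involved, but the shape (decide, act, feed the result back in) is the same.

```python
def scripted_model(context):
    """Stand-in for an LLM: picks the next action from what it has seen so far."""
    if not any(step.startswith("result of edit_file") for step in context):
        return {"tool": "edit_file", "args": {"path": "cart.py"}}
    if not any("tests passed" in step for step in context):
        return {"tool": "run_tests", "args": {}}
    return {"tool": "done", "args": {}}

def run_agent(goal, model, tools, max_steps=10):
    """The harness: feed context in, carry out the chosen action, loop."""
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = model(context)  # the model decides
        if action["tool"] == "done":
            break
        output = tools[action["tool"]](**action["args"])  # the harness acts
        context.append(f"result of {action['tool']}: {output}")
    return context

# Hypothetical tools; real ones would touch the file system and run commands.
tools = {
    "edit_file": lambda path: f"edited {path}",
    "run_tests": lambda: "tests passed",
}

history = run_agent("make the failing test pass", scripted_model, tools)
for step in history:
    print(step)
```

The `max_steps` cap is a design choice worth noticing: a loop that decides for itself when to stop needs an outer limit.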

What about Lovable, Bolt, and v0?
A whole category of hosted services lets you describe an app and watch it appear, with the code mostly kept out of sight. They bundle the model, a harness, and hosting behind one chat box, and they're a fast way to get from an idea to something running, which makes them great for prototypes (and a big part of what people mean by "vibe coding"). The tradeoff is that they hide the code they write, and with it the control to review and guide that code. They also make the architectural calls for you. Choosing the tech stack, the database, and how the pieces fit together is a real part of engineering, and in the mainstream tools you largely don't get that choice.

The rest of this post goes through the levers you control.

Lever 1: the model you pick

Every harness lets you pick which model runs, and the choice matters. Models trade power for speed and cost: big frontier ones (Claude Opus, GPT-5) reason the hardest but are the slowest and priciest, while small fast ones (Claude Haiku) cost a fraction as much and are perfectly good for simple or high-volume work.

Models from different providers also have their own strengths: some are better at writing code, others at long stretches of reasoning, or at working across a lot of context. No single model tops every benchmark, and the rankings on leaderboards like LMArena and SWE-bench shift every few months, so the best one really depends on what you're doing.

Some models also "think" first. Reasoning models work through a chain of intermediate steps before they answer, which makes them noticeably better at multi-step problems like debugging, at the cost of being slower. Often this is a dial rather than a separate model: Claude, for instance, lets you turn its reasoning up or down (think, think harder) on the same model, trading speed for depth.

The lever isn't "always pick the biggest." It's matching the model to the task: a frontier reasoning model for planning the architecture or solving a subtle concurrency bug, a fast cheap one to rename a symbol across forty files.

Lever 2: the context you curate

Context is everything the model can "see" right now in addition to its training data: your request, the files it's been given, the conversation so far. Think of it as the model's short-term working memory: finite, and mostly wiped clean between sessions. Context is measured in tokens: the chunks models read and write in, where one token is roughly ¾ of a word. The maximum a model can hold at once is its context window.

Tokens are also the unit you're billed in. Every token going in and coming out costs a sliver of a cent, so a bloated context isn't just slower and easier to confuse; it also costs more.
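As a back-of-the-envelope sketch, the arithmetic looks like this. The ¾-word rule is only an approximation (real tokenizers vary by model and by content, and code tends to tokenize differently from prose), and the per-million prices below are placeholders, not any provider's actual rates.

```python
WORDS_PER_TOKEN = 0.75  # the rough "3/4 of a word" rule of thumb

def estimate_tokens(text):
    """Rough token count from a word count; a real tokenizer will differ."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)

def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars, given prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Placeholder prices (dollars per million tokens), chosen only for illustration.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

lean = estimate_cost(20_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)
bloated = estimate_cost(150_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"lean context:    ${lean:.2f} per request")
print(f"bloated context: ${bloated:.2f} per request")
```

An agent sends its context again on every turn of the loop, so the gap between those two numbers multiplies over a session.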

How much can it actually see?
Model family	Window
Claude (Opus / Sonnet)	~1M tokens (200K on Haiku and older models)
OpenAI GPT-5	~400K tokens
Google Gemini 2.5 Pro	~1M tokens (up to 2M)
(These numbers move every few months, so treat them as a snapshot.) What's a million tokens? About 750,000 words: enough to hold the entire Lord of the Rings trilogy with room to spare. Even 200K, the smaller end, is around 150,000 words: a 500-page book.
Sounds like plenty, until you point it at real code. This very website's front-end is about 30,000 tokens, which fits many times over. The whole repository, with the backend and config and docs, is around 300,000 tokens: comfortably inside a 1M window, but already past the 200K that smaller models cap out at. And that's a small marketing site. Most real production codebases run into the millions of tokens, past every window on the market.
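To put those numbers side by side, here's a small sketch that checks each codebase size against each window. The sizes are the approximate figures quoted above; `fits` is a hypothetical helper, and it ignores the room a real session also needs for the conversation and the model's own output.

```python
# Approximate window sizes, in tokens (a snapshot; these change often).
WINDOWS = {"1M window": 1_000_000, "400K window": 400_000, "200K window": 200_000}

# Approximate sizes of the code being loaded, in tokens.
SOURCES = {"front-end only": 30_000, "whole repository": 300_000}

def fits(content_tokens, window_tokens):
    """True if the content alone fits inside the window."""
    return content_tokens <= window_tokens

for source, size in SOURCES.items():
    for window, capacity in WINDOWS.items():
        share = size / capacity
        verdict = "fits" if fits(size, capacity) else "too big"
        print(f"{source:>16} in {window:<11}: {share:6.0%} of the window ({verdict})")
```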

Because the window is finite and the agent reasons only over what's actually in it, deciding what goes in is the lever you operate most often. You're pulling it every time you point the agent at a file, paste an error, or start a fresh session instead of letting a stale one drag on. The craft of doing it well even has a name, context engineering.

The harness also helps you manage context. As a session fills the window, most will auto-compact, summarizing older turns so the work can continue instead of hitting the limit. That summary is lossy, though: a decision you made or a file the agent read early on can get dropped, so it sometimes forgets something it plainly "knew" a few minutes ago.
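Here's a toy version of compaction to show where the loss comes from. It's a sketch under heavy assumptions: `count_tokens` is a crude word-based estimate, and `summarize` stands in for an LLM-written summary by keeping only the first five words of each older turn. Real harnesses summarize far better, but they face the same tradeoff.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: about 4 tokens for every 3 words."""
    return round(len(text.split()) * 4 / 3)

def summarize(turns):
    """Stand-in for an LLM summary: keeps only the first 5 words of each turn."""
    return "summary: " + " | ".join(" ".join(t.split()[:5]) for t in turns)

def compact(turns, budget, keep_recent=2):
    """If the turns exceed the budget, replace the older ones with a summary."""
    if sum(count_tokens(t) for t in turns) <= budget:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    if not older:
        return turns
    return [summarize(older)] + recent

turns = [
    "decision: keep prices as integer cents and never use floats for money",
    "read billing/invoice.py and found the rounding happens in apply_discount",
    "edited apply_discount to round once at the end",
    "ran the tests and two still fail in test_invoice.py",
]

compacted = compact(turns, budget=40)
for turn in compacted:
    print(turn)
```

After compaction the first line reads "summary: decision: keep prices as integer | read billing/invoice.py and found the". The "never use floats" half of the decision is gone, which is exactly the kind of detail an agent can forget mid-session.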

Lever 3: the memory you build

Context evaporates between sessions, so anything you want the agent to know next time has to live somewhere durable. Memory is the catch-all term for that: making knowledge last after the context window has been wiped. The simplest form is also a lever you write by hand: a plain-markdown file you commit to the repo. Most coding agents read one at the start of every session; it holds your architecture notes, your conventions, your "always do this, never do that." Claude Code looks for a CLAUDE.md, and the open, cross-tool convention is AGENTS.md. Because it's just a versioned file sitting in git next to the code, it gets reviewed like everything else and is shared with everyone working in the codebase.
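Mechanically, the idea is simple enough to sketch. The lookup order and the `start_session` helper below are assumptions for illustration; each harness has its own rules about which files it reads and where it looks. The point is only that the file's contents land in the context before your request does.

```python
import tempfile
from pathlib import Path

# File names mentioned above; the order here is an arbitrary choice.
MEMORY_FILENAMES = ("AGENTS.md", "CLAUDE.md")

def load_memory(repo_root):
    """Return the text of the first instructions file found, or "" if none."""
    for name in MEMORY_FILENAMES:
        candidate = Path(repo_root) / name
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8")
    return ""

def start_session(repo_root, request):
    """Build the opening context: durable memory first, then today's request."""
    context = []
    memory = load_memory(repo_root)
    if memory:
        context.append(f"project instructions:\n{memory}")
    context.append(f"request: {request}")
    return context

# Demo in a throwaway directory: same request, before and after the file exists.
with tempfile.TemporaryDirectory() as repo:
    before = start_session(repo, "fix the flaky test")
    (Path(repo) / "AGENTS.md").write_text("Never edit generated files.\n", encoding="utf-8")
    after = start_session(repo, "fix the flaky test")

print(before)
print(after)
```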

The frontier of memory
Beyond the committed file, there's a lot of active research into richer memory: systems that persist your corrections and preferences across sessions, and that index a large codebase so the agent can fetch the right piece without reading everything. It's promising but unsettled, and the evidence sometimes points the other way: when Anthropic built Claude Code, plain agentic file search (grep and glob, no embeddings) beat their vector-retrieval pipeline so clearly they dropped the embeddings entirely. For now, a clear, well-named codebase plus a committed instructions file is still the highest-leverage memory you can give an agent.
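For a sense of what "grep and glob" means in practice, here's a minimal sketch using only the Python standard library. It is not Claude Code's implementation, just the general technique: match file names with a glob, match lines with a regular expression, and return the hits. The function name and the demo files are hypothetical.

```python
import re
import tempfile
from pathlib import Path

def grep_files(repo_root, file_glob, pattern):
    """Return (relative path, line number, line) for every matching line."""
    root = Path(repo_root)
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob(file_glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits

# Demo on a throwaway repo with one well-named and one vaguely named function.
with tempfile.TemporaryDirectory() as repo:
    root = Path(repo)
    (root / "billing").mkdir()
    (root / "billing" / "invoice.py").write_text(
        "def apply_discount(total):\n    return total\n", encoding="utf-8"
    )
    (root / "utils.py").write_text("def handle_data(x):\n    return x\n", encoding="utf-8")
    hits = grep_files(root, "*.py", r"def \w*discount")

print(hits)
```

A search like this only finds what the names let it find: `apply_discount` turns up on the first try, while discount logic buried in something called `handle_data` would not.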

Lever 4: the agents and skills you build

You can make your own tools, both to run a process the same way every time and to capture your own judgment and taste in something the whole team can reuse. There are two kinds, both increasingly standard across harnesses. The first is custom agents (often "subagents"): a specialized agent with a scoped job, its own instructions, and its own limited set of tools. Instead of asking one generalist to do everything, you define a reviewer that only reviews, a researcher that only gathers context, a test-writer that already knows how your suite is wired, and hand each the narrow job it's good at.
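The "limited set of tools" part is easy to picture in code. This is a hypothetical sketch, not any harness's real configuration format: a subagent is just a name, its instructions, and an allow-list, and the dispatcher refuses anything outside that list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subagent:
    name: str
    instructions: str
    allowed_tools: frozenset

    def can_use(self, tool_name):
        return tool_name in self.allowed_tools

def dispatch(agent, tool_name):
    """Refuse any tool call outside the subagent's scope."""
    if not agent.can_use(tool_name):
        raise PermissionError(f"{agent.name} may not call {tool_name}")
    return f"{agent.name} -> {tool_name}"

# Illustrative definitions; the tool names are made up.
reviewer = Subagent(
    name="reviewer",
    instructions="Review the diff. Report problems; do not fix them.",
    allowed_tools=frozenset({"read_file", "grep"}),
)
test_writer = Subagent(
    name="test-writer",
    instructions="Write tests that follow the existing suite's conventions.",
    allowed_tools=frozenset({"read_file", "write_file", "run_tests"}),
)

print(dispatch(reviewer, "grep"))
print(dispatch(test_writer, "run_tests"))
```

Giving the reviewer no write access means a reviewer that can't "fix" a failing test by deleting its assertion, whatever it decides.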

The second is skills (you'll also see "commands" or "slash commands"): a reusable, packaged procedure the agent pulls in when it's relevant. The runbook for cutting a release, the steps to run and verify a database migration, the checklist you walk before opening a PR: written down once and invoked on demand instead of re-explained every session.

In practice, both are usually just markdown files you commit alongside your code, versioned and reviewed like everything else.
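As a loose sketch of the skills half, imagine the procedures stored by name and expanded when you type a slash command. The `SKILLS` table and `expand_command` are hypothetical, and in a real setup the procedure text would live in those committed markdown files; harnesses also differ in how they decide a skill is relevant.

```python
# Stand-ins for the contents of committed markdown files, keyed by skill name.
SKILLS = {
    "release": "1. Run the full test suite.\n2. Bump the version.\n3. Tag and push.",
    "pre-pr": "1. Run the linter.\n2. Re-read the diff.\n3. Update the changelog.",
}

def expand_command(message, skills=SKILLS):
    """Turn '/name rest' into the stored procedure plus the rest of the message."""
    if not message.startswith("/"):
        return message
    name, _, rest = message[1:].partition(" ")
    if name not in skills:
        return message  # unknown command: pass it through untouched
    parts = [f"follow this procedure:\n{skills[name]}"]
    if rest:
        parts.append(f"request: {rest}")
    return "\n\n".join(parts)

print(expand_command("/release v2.1"))
```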

It only acts on what it can see

These levers all control the same thing: what the agent can see. And what it sees is never the full picture.

First, the agent only ever sees a slice of your codebase, never the whole thing. It usually can't fit the whole repo in the window (you just saw the numbers), so the harness searches, greps, and opens the handful of files that look relevant, and the model reasons from those. That slice isn't even fixed: auto-compaction can quietly drop something the agent had a minute ago, so what it sees can shrink mid-task. And which files make it in depends entirely on how your code is organized and named: clear structure and honest names surface the right slice, while logic copied into five places or a function called handleData quietly steers it to the wrong one. (More on that in the next piece.)

Second, as we said, the model is agreeable by default. The tendency has a name, sycophancy: it reflects your own context and phrasing back at you, your assumptions returned with confidence.

And third, even with the right files in front of it, the model reaches for what's plausible over what's best for your app: trained on a vast average of the internet's code, it defaults to the common pattern and the popular library, and won't know your priorities unless you provide them.

So, the agent sounds just as confident when it's wrong as when it's right, and it never has the full picture, only what you and your workflow give it. You can't take what it hands you at face value. You push back, you question its assumptions, and you question your own, since it will happily hand those right back to you too. And you build a workflow around it, one that feeds it the right context up front and checks what comes back, so good results are repeatable instead of luck. It's the same judgment good engineers have always brought to their craft. We'll pick up there in the next post: how my own workflow has changed to make the most of AI, and how it's impacted my day-to-day.

Other tech and concepts you'll hear about

MCP (Model Context Protocol): a standard plug that lets any agent talk to any tool or data source.
RAG (retrieval-augmented generation): the general pattern of fetching relevant material into context before answering. That grep-versus-embeddings story above is a RAG design choice.
Fine-tuning: further training of a base model on your own data.
Prompt engineering / context engineering: writing better inputs, and the broader craft of deciding what goes in the window.
Temperature and the other sampling knobs: how random the output is.

Next: how my workflow actually changed, with less typing, more guiding and reviewing, and a lot more work in parallel. (Part II.)

Work with us
We build software this way every day at Coniferous, and increasingly we help teams who want to too, including the ones digging out of a codebase that got away from them. If that's you, we'd genuinely like to hear about it.

info@coniferous.dev

Minneapolis, MN