Integrating AI into development
A practical guide, based on a presentation and an academic paper, to using AI in development without drowning in noise, hallucinations, or context overflow.
Integrating AI into development, with judgment
AI speeds you up a lot in analysis, prototyping, and documentation. It lets you shift your cognitive focus: you can spend much less time writing code and more time thinking about which path to choose when implementing a new solution. The problem starts when we use the chat as a universal dumpster: we throw everything into the same thread, mix goals together, and then act surprised when the output degrades or doesn't match the solution we actually wanted.
AI does not replace technical thinking; it is a tool that should help amplify it. If there is no clear objective, the model fills the gaps with its statistical average, and that almost always leads to noise, overengineering, or solutions that don't fit what we really need.
- Operational speed: short iterations and fast proof-of-concepts.
- Shift in focus: less boilerplate, more architecture and decisions.
- Technical quality: better documentation and exploration of alternatives.
What context is and how LLMs process it
Context is everything that goes into the model on each call: system instructions, history, user messages, tools, and previous results.
It's key to understand that the model doesn't "memorize" anything: it doesn't have persistent memory between turns the way humans do. On every iteration, the full conversation context is sent again.
Each iteration is a new request. The system sends the full context package again, which is why the payload grows on every turn: more tokens, more cost, and more chances for noise.
If you don't manage that load, the model doesn't "remember better" just because you chat longer: it only receives more accumulated text. In other words, it's not real memory; it's continuous rereading of history, with risk of truncation and loss of focus or key information.
- New turn = new request to the LLM.
- History gets resent (user + assistant + relevant tool calls).
- When the context window approaches the limit, information gets trimmed or summarized.
- More turns do not mean more stable memory; many times they mean more noise.
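The points above can be sketched in a few lines. This is an illustrative simulation, not a real client: `send_turn` and the word-based token estimate are assumptions made for the example, but they show why the payload grows on every turn.

```python
# Minimal sketch of how a chat client resends the full history on every
# turn. The token count is a rough word-based estimate, purely illustrative.

def estimate_tokens(messages):
    """Crude token estimate: roughly one token per word."""
    return sum(len(m["content"].split()) for m in messages)

def send_turn(history, user_message):
    """Append the user message and build the full payload for this call."""
    history.append({"role": "user", "content": user_message})
    payload = list(history)  # the ENTIRE history goes out again
    # ... a real client would call the model API with `payload` here ...
    reply = {"role": "assistant", "content": "stub reply to: " + user_message}
    history.append(reply)
    return payload

history = []
sizes = []
for turn in ["plan the endpoint", "now add validation", "refactor the handler"]:
    payload = send_turn(history, turn)
    sizes.append(estimate_tokens(payload))

print(sizes)  # strictly increasing: each turn pays again for all previous turns
```

Every call carries the full history, so cost and noise scale with conversation length, not with the size of the current question.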
Finite context and noise
In long chats, three effects show up: window overflow, anchoring to previous mistakes, and accumulated verbosity. This doesn't just reduce precision; it also makes quality unstable between runs.
Another key idea is that even if we don't hit the model's input token limit, performance usually gets dramatically worse long before that because of accumulated noise.
If the model made a wrong assumption early and we keep correcting it inside the same thread, it often stays biased by that dirty state and is hard to recover. This hits even harder once the model has already produced a solution, especially in code: if the initial output wasn't right, the model tends to patch it incrementally rather than regenerate it from scratch.
- Too much irrelevant history = less focus on the current task.
- Chains of corrections = higher chance of contradictions.
- The output may sound confident even when the foundation is contaminated.
How a long conversation degrades
What the paper "LLMs Get Lost in Multi-Turn Conversations" says
The paper confirms a field intuition: when you move from a full single-turn prompt to an underspecified multi-turn interaction, average performance drops sharply. The interesting part is that most of that drop is explained by increased unreliability, not just loss of capability.
They also test recap and snowball strategies (re-sending consolidated information); these help partially, but they still don't match the clean single-turn scenario. In other words, they patch the problem without fixing it at the root.


How to improve context using separate chats
The most effective and cheapest improvement is to separate conversations by objective. One chat for planning, another for implementation, another for review. This reduces cross-contamination and gives you shorter, more verifiable prompts.
If the model gets lost, instead of patching endlessly in the same thread, consolidate everything into a clean prompt and restart. In practice, that usually performs better than correcting forever.
- 1 task = 1 chat.
- If you switch module or endpoint, start a new chat.
- Before retrying: consolidate requirements into a single summary.
- Avoid pasting irrelevant code from other areas.
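The "consolidate, then restart" step can be made mechanical. The sketch below is illustrative: the field names and prompt structure are assumptions, not a real API, but they show how to collapse a messy thread into one clean, self-contained prompt for a fresh chat.

```python
# Illustrative sketch of consolidating a thread's outcome into a single
# clean prompt before restarting in a new chat.

def consolidate(goal, requirements, constraints, rejected_approaches):
    """Build one self-contained prompt from what the messy thread taught us."""
    lines = [f"Goal: {goal}", "", "Requirements:"]
    lines += [f"- {r}" for r in requirements]
    lines += ["", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    if rejected_approaches:
        lines += ["", "Do NOT use these approaches (already tried and rejected):"]
        lines += [f"- {a}" for a in rejected_approaches]
    return "\n".join(lines)

prompt = consolidate(
    goal="Add pagination to GET /users",
    requirements=["cursor-based", "page size 50", "stable ordering by id"],
    constraints=["no breaking changes to the response envelope"],
    rejected_approaches=["offset/limit pagination"],
)
print(prompt)
```

Listing rejected approaches explicitly matters: it transfers the one useful thing the dirty thread produced without carrying over its noise.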
Chat-per-objective strategy
From context to the agent harness
To go a step further, we need one more concept. The layer that really defines consistency is the agent harness: the system of context, tools, and guardrails surrounding the model.
When I say harness, I mean it almost literally: an operational harness, a structure that holds, guides, and limits the agent so it doesn't drift, just as a safety harness prevents dangerous movements when working at height.
The difference between random results and repeatable results is usually not the model itself. It's usually about how much useful context you feed into each iteration and how much noise you keep out.
- Harness = context + tools + validation + living rules.
- You don't design it once: you iterate on it with every real error that shows up.
- The goal is not for the agent to be brilliant once, but reliable every time.
- In larger teams, the harness stops being optional and becomes an operational requirement.
Skills and AGENTS.md
AGENTS.md acts as the repo's base contract: architecture, conventions, verification commands, and limits. Skills are specialized modules that the agent loads only when it touches that layer (frontend, backend, testing, etc.).
Practical result: less repetition in prompts, less style drift, and more speed to get into the task. It's documentation for humans and agents at the same time.
- Standardizes technical decisions and style.
- Reduces unnecessary tokens in every request.
- Allows faster onboarding for new agents or team members.
- Lowers the chance that the model invents patterns.
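A minimal AGENTS.md might look like the sketch below. The section names and rules are purely illustrative assumptions, not a standard; the point is the shape of the contract: architecture, conventions, verification commands, and limits.

```markdown
# AGENTS.md (illustrative skeleton; sections and rules are examples)

## Architecture
- Domain logic lives in `core/`; adapters in `adapters/`.

## Conventions
- TypeScript strict mode; no `any` in new code.
- Errors are returned, never thrown, in the domain layer.

## Verification
- `npm run lint && npm run typecheck && npm test` must pass before "done".

## Limits
- Never edit files under `generated/`.
- Ask before adding a new dependency.
```

Skills then layer on top of this base contract, loading only when the agent touches the relevant area.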
The 4 levers of the agent harness
A solid harness combines four levers that reinforce each other. If one is missing, the system still works, but it loses stability as complexity grows.
- Custom Rules: stack, conventions, anti-patterns, and clear repository limits.
- MCP Servers: access to real knowledge and tools outside the local codebase.
- Skills: knowledge and execution on demand without loading everything into context all the time.
- Spec-Driven Development: specify before implementing to reduce ambiguity.
- Combined, the four levers give better quality control and lower iteration cost.
Delegation with agents: why it improves context
If we still want to improve agent performance, we can use sub-agents that a main agent (the orchestrator) delegates specific tasks to.
What does that buy us? Each subtask's tokens go to a model instance that lives only for that task, so the planning, implementation, and review tokens never pile up in a single thread.
With an orchestrator and sub-agents, you separate roles: one analyzes and plans, another executes, another verifies. Each sub-agent is born with clean context and a narrow focus, solves its task, and then ends.
The orchestrator only knows what it asked for and the result of each sub-agent, without accumulating all the noise from the intermediate iterations.
This architecture avoids the "god" agent that accumulates everything: tickets, debates, old code, previous mistakes, and cross-cutting decisions.
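The orchestrator/sub-agent split can be sketched as below. `call_model` is a stand-in for a real LLM call (an assumption for the example); the point is that each sub-agent starts with fresh, narrow context and only its final result flows back to the orchestrator.

```python
# Sketch of an orchestrator delegating to short-lived sub-agents.

def call_model(role, task, context):
    """Placeholder for a real LLM call; returns a fake result for illustration."""
    return f"[{role}] result for: {task}"

def run_subagent(role, task):
    # Each sub-agent is born with clean context: only its role and its task.
    context = {"role": role, "task": task}
    return call_model(role, task, context)

def orchestrate(ticket):
    results = {}
    results["plan"] = run_subagent("planner", f"plan: {ticket}")
    results["code"] = run_subagent("implementer", f"implement: {results['plan']}")
    results["review"] = run_subagent("reviewer", f"verify: {results['code']}")
    # The orchestrator keeps only these three results, not the sub-agents'
    # intermediate reasoning, failed attempts, or tool chatter.
    return results

out = orchestrate("add rate limiting to /login")
print(sorted(out))
```

Note the asymmetry: context flows down as a narrow task description, and only a distilled result flows back up.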
Orchestrator and sub-agents
Feedback loop and hooks: where quality gets enforced
The harness defines what the agent knows, but the feedback loop defines when it can call a task done. If there is no automated validation, everything depends on goodwill.
Tests, lint, typecheck, and builds are not post-processing; they are part of the agent's self-correction loop.
- Every validation failure returns actionable context for correction.
- Stop hooks can block completion if checks don't pass.
- With strong feedback, repetitive manual supervision goes down.
- Speed goes up when the agent converges on its own, not when it types faster.
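The loop above can be sketched as follows. `run_checks` and `fix` are toy stand-ins (assumptions for the example) for real test/lint/typecheck commands and model calls; the shape that matters is the stop condition, which refuses to declare the task done until all checks pass.

```python
# Sketch of a self-correction loop with a stop-hook-style gate.

def run_checks(code):
    """Pretend validation: here, just require a docstring and no TODOs."""
    failures = []
    if '"""' not in code:
        failures.append("lint: missing docstring")
    if "TODO" in code:
        failures.append("test: unresolved TODO")
    return failures

def fix(code, failures):
    """Stand-in for asking the model to fix code, fed the failure messages."""
    if any("docstring" in f for f in failures):
        code = '"""Fixed."""\n' + code
    if any("TODO" in f for f in failures):
        code = code.replace("TODO", "DONE")
    return code

def agent_loop(code, max_iters=5):
    for _ in range(max_iters):
        failures = run_checks(code)
        if not failures:
            return code, True   # the gate: done only when every check passes
        code = fix(code, failures)  # each failure returns actionable context
    return code, False          # budget exhausted; escalate to a human

final, done = agent_loop("def f():\n    pass  # TODO")
print(done)
```

The `max_iters` budget is the other half of the design: a loop that can never stop is as dangerous as one that stops too early.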
Scaling across teams: global standard + repo-level tuning
At scale, the challenge isn't only technical; it's consistency across stacks, versions, and legacy repos. That's why it's worth separating organizational base rules from local project rules.
A common foundation covers security and quality; a local layer adapts to each repo, so legacy contexts and each team's particular workflows don't break.
- Org rules: security, approved libraries, review standards.
- Repo rules: concrete architecture, local conventions, and exceptions.
- Automated CI review to prioritize findings before human review.
Comparison of approaches to handle context
Not all approaches behave the same. The big differences show up in scalability, consistency, and cognitive cost.
| Approach | Context | Advantages | Risks |
|---|---|---|---|
| 1 agent for everything | Single long cumulative thread | Simple to start | Gets dirty fast, reliability drops, more hallucination |
| 1 chat per objective | Separated by task | Better focus and recovery from mistakes | Requires manual discipline |
| AGENTS.md + skills | Base rules + modular knowledge | Technical consistency and less re-explaining | Needs living documentation maintenance |
| Orchestrator + delegation | Distributed context by role | Scales better, less noise per agent | Higher initial operational complexity |
| Complete agent harness | Rules + MCPs + skills + specs + feedback | High reliability and repeatable results | Requires continuous maintenance discipline |
Concrete ways to solve the problem
This checklist gave me the highest return when working with AI without breaking flow.
- Define the goal and acceptance criteria before asking for code.
- Separate Discovery -> Plan -> Implementation -> Verification.
- Use recap or consolidation before retrying in a new chat.
- Store stable patterns in AGENTS.md and skills.
- Keep rules modular and avoid unnecessary always-on instructions.
- Delegate subtasks to specialized agents when complexity grows.
- Measure resolution time, rework, and discarded-response rate.
Risks and limits we shouldn't sugarcoat
The paper itself points out simulation limitations. We shouldn't sell it as absolute truth, but as strong evidence aligned with real usage experience.
Security and responsibility still stay on the human side: no credentials in prompts and nothing goes to production without serious technical review.
- Don't share API keys, tokens, or sensitive data.
- Don't delegate decisions you can't audit.
- Don't confuse speed of output with quality of architecture.
Sources
- Presentation: Software Development with AI and Agents
- Base document: Software Development with AI and Agents
- Claude Docs: Context windows. Source for the context accumulation diagram across turns.
- Paper: LLMs Get Lost in Multi-Turn Conversations. Multi-turn simulation study focused on aptitude vs. reliability.
- Post on X: The Coding Agent Harness. Practical context-engineering framework with 4 levers, a feedback loop, and adoption patterns at scale.
- Blog: Context Rot. Complementary reading on degradation from accumulating irrelevant context.
- Video: ChatGPT Has Alzheimer's. Help It.
- Video: The Skills System That Changed How I Work with AI
- Video: How to Be TONY STARK with AI