Integrating AI into development
A practical guide, based on a presentation and an academic paper, to using AI in development without drowning in noise, hallucinations, or context overflow.
Integrating AI into development, with judgment
AI speeds you up a lot in analysis, prototyping, and documentation. It lets you shift your cognitive focus: you can spend much less time writing code and more time thinking about which path to choose when implementing a new solution. The problem starts when we use the chat as a universal dumpster: we throw everything into the same thread, mix goals together, and then act surprised when the output degrades or doesn't match the solution we actually wanted.
AI does not replace technical thinking; it is a tool that should help amplify it. If there is no clear objective, the model fills the gaps with its statistical average, and that almost always leads to noise, overengineering, or solutions that don't fit what we really need.
- Operational speed: short iterations and fast proof-of-concepts.
- Shift in focus: less boilerplate, more architecture and decisions.
- Technical quality: better documentation and exploration of alternatives.
What context is and how LLMs process it
Context is everything that goes into the model on each call: system instructions, history, user messages, tools, and previous results.
It's key to understand that the model doesn't "memorize" anything: it doesn't have persistent memory between turns the way humans do. On every iteration, the full conversation context is sent again.
Each iteration is a new request. The system sends the full context package again, which is why the payload grows on every turn: more tokens, more cost, and more chances for noise.
If you don't manage that load, the model doesn't "remember better" just because you chat longer: it only receives more accumulated text. In other words, it's not real memory; it's continuous rereading of history, with risk of truncation and loss of focus or key information.
- New turn = new request to the LLM.
- History gets resent (user + assistant + relevant tool calls).
- When the context window approaches the limit, information gets trimmed or summarized.
- More turns do not mean more stable memory; many times they mean more noise.
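The points above can be sketched in a few lines. This is an illustrative simulation, not a real client: `send_turn` and the word-based token estimate are assumptions made for the example, but they show why the payload grows on every turn.

```python
# Minimal sketch of how a chat client resends the full history on every
# turn. The token count is a rough word-based estimate, purely illustrative.

def estimate_tokens(messages):
    """Crude token estimate: roughly one token per word."""
    return sum(len(m["content"].split()) for m in messages)

def send_turn(history, user_message):
    """Append the user message and build the full payload for this call."""
    history.append({"role": "user", "content": user_message})
    payload = list(history)  # the ENTIRE history goes out again
    # ... a real client would call the model API with `payload` here ...
    reply = {"role": "assistant", "content": "stub reply to: " + user_message}
    history.append(reply)
    return payload

history = []
sizes = []
for turn in ["plan the endpoint", "now add validation", "refactor the handler"]:
    payload = send_turn(history, turn)
    sizes.append(estimate_tokens(payload))

print(sizes)  # strictly increasing: each turn pays again for all previous turns
```

Every call carries the full history, so cost and noise scale with conversation length, not with the size of the current question.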
Finite context and noise
In long chats, three effects show up: window overflow, anchoring to previous mistakes, and accumulated verbosity. This doesn't just reduce precision; it also makes quality unstable between runs.
Another key idea is that even if we don't hit the model's input token limit, performance usually gets dramatically worse long before that because of accumulated noise.
If the model made a wrong assumption early and we keep correcting it inside the same thread, it often stays biased by that dirty state and is hard to recover. This hits even harder once the model has already produced a solution, especially in code: if the initial output wasn't right, the model tends to patch it incrementally rather than regenerate it from scratch.
- Too much irrelevant history = less focus on the current task.
- Chains of corrections = higher chance of contradictions.
- The output may sound confident even when the foundation is contaminated.
How a long conversation degrades
What the paper "LLMs Get Lost in Multi-Turn Conversations" says
The paper confirms a field intuition: when you move from a full single-turn prompt to an underspecified multi-turn interaction, average performance drops sharply. The interesting part is that most of that drop is explained by increased unreliability, not just loss of capability.
They also test recap and snowball strategies (re-sending consolidated information); these help partially, but they still don't match the clean single-turn scenario. In other words, they patch the problem without fixing it at the root.


How to improve context using separate chats
The most effective and cheapest improvement is to separate conversations by objective. One chat for planning, another for implementation, another for review. This reduces cross-contamination and gives you shorter, more verifiable prompts.
If the model gets lost, instead of patching endlessly in the same thread, consolidate everything into a clean prompt and restart. In practice, that usually performs better than correcting forever.
- 1 task = 1 chat.
- If you switch module or endpoint, start a new chat.
- Before retrying: consolidate requirements into a single summary.
- Avoid pasting irrelevant code from other areas.
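The "consolidate, then restart" step can be made mechanical. The sketch below is illustrative: the field names and prompt structure are assumptions, not a real API, but they show how to collapse a messy thread into one clean, self-contained prompt for a fresh chat.

```python
# Illustrative sketch of consolidating a thread's outcome into a single
# clean prompt before restarting in a new chat.

def consolidate(goal, requirements, constraints, rejected_approaches):
    """Build one self-contained prompt from what the messy thread taught us."""
    lines = [f"Goal: {goal}", "", "Requirements:"]
    lines += [f"- {r}" for r in requirements]
    lines += ["", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    if rejected_approaches:
        lines += ["", "Do NOT use these approaches (already tried and rejected):"]
        lines += [f"- {a}" for a in rejected_approaches]
    return "\n".join(lines)

prompt = consolidate(
    goal="Add pagination to GET /users",
    requirements=["cursor-based", "page size 50", "stable ordering by id"],
    constraints=["no breaking changes to the response envelope"],
    rejected_approaches=["offset/limit pagination"],
)
print(prompt)
```

Listing rejected approaches explicitly matters: it transfers the one useful thing the dirty thread produced without carrying over its noise.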
Chat-per-objective strategy
From context to the agent harness
To go a step further, we need one more concept. The layer that really defines consistency is the agent harness: the system of context, tools, and guardrails surrounding the model.
When I say harness, I mean it almost literally: an operational harness, a structure that holds, guides, and limits the agent so it doesn't drift, just as a safety harness prevents dangerous movements when working at height.
The difference between random results and repeatable results is usually not the model itself. It's usually about how much useful context you feed into each iteration and how much noise you keep out.
- Harness = context + tools + validation + living rules.
- You don't design it once: you iterate on it with every real error that shows up.
- The goal is not for the agent to be brilliant once, but reliable every time.
- In larger teams, the harness stops being optional and becomes an operational requirement.
Skills and AGENTS.md
AGENTS.md acts as the repo's base contract: architecture, conventions, verification commands, and limits. Skills are specialized modules that the agent loads only when it touches that layer (frontend, backend, testing, etc.).
Practical result: less repetition in prompts, less style drift, and more speed to get into the task. It's documentation for humans and agents at the same time.
- Standardizes technical decisions and style.
- Reduces unnecessary tokens in every request.
- Allows faster onboarding for new agents or team members.
- Lowers the chance that the model invents patterns.
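A minimal AGENTS.md might look like the sketch below. The section names and rules are purely illustrative assumptions, not a standard; the point is the shape of the contract: architecture, conventions, verification commands, and limits.

```markdown
# AGENTS.md (illustrative skeleton; sections and rules are examples)

## Architecture
- Domain logic lives in `core/`; adapters in `adapters/`.

## Conventions
- TypeScript strict mode; no `any` in new code.
- Errors are returned, never thrown, in the domain layer.

## Verification
- `npm run lint && npm run typecheck && npm test` must pass before "done".

## Limits
- Never edit files under `generated/`.
- Ask before adding a new dependency.
```

Skills then layer on top of this base contract, loading only when the agent touches the relevant area.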
The 4 levers of the agent harness
A solid harness combines four levers that reinforce each other. If one is missing, the system still works, but it loses stability as complexity grows.
- Custom Rules: stack, conventions, anti-patterns, and clear repository limits.
- MCP Servers: access to real knowledge and tools outside the local codebase.
- Skills: knowledge and execution on demand without loading everything into context all the time.
- Spec-Driven Development: specify before implementing to reduce ambiguity.
- Combined, the four levers give better quality control and lower iteration cost.
Delegation with agents: why it improves context
If we still want to improve agent performance, we can use sub-agents that a main agent (the orchestrator) delegates specific tasks to.
What does that buy us? Each subtask's tokens go to a model instance that lives only for that task, so the planning, implementation, and review tokens never pile up in a single thread.
With an orchestrator and sub-agents, you separate roles: one analyzes and plans, another executes, another verifies. Each sub-agent is born with clean context and a narrow focus, solves its task, and then ends.
The orchestrator only knows what it asked for and the result of each sub-agent, without accumulating all the noise from the intermediate iterations.
This architecture avoids the "god" agent that accumulates everything: tickets, debates, old code, previous mistakes, and cross-cutting decisions.
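The orchestrator/sub-agent split can be sketched as below. `call_model` is a stand-in for a real LLM call (an assumption for the example); the point is that each sub-agent starts with fresh, narrow context and only its final result flows back to the orchestrator.

```python
# Sketch of an orchestrator delegating to short-lived sub-agents.

def call_model(role, task, context):
    """Placeholder for a real LLM call; returns a fake result for illustration."""
    return f"[{role}] result for: {task}"

def run_subagent(role, task):
    # Each sub-agent is born with clean context: only its role and its task.
    context = {"role": role, "task": task}
    return call_model(role, task, context)

def orchestrate(ticket):
    results = {}
    results["plan"] = run_subagent("planner", f"plan: {ticket}")
    results["code"] = run_subagent("implementer", f"implement: {results['plan']}")
    results["review"] = run_subagent("reviewer", f"verify: {results['code']}")
    # The orchestrator keeps only these three results, not the sub-agents'
    # intermediate reasoning, failed attempts, or tool chatter.
    return results

out = orchestrate("add rate limiting to /login")
print(sorted(out))
```

Note the asymmetry: context flows down as a narrow task description, and only a distilled result flows back up.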
Orchestrator and sub-agents
Feedback loop and hooks: where quality gets enforced
The harness defines what the agent knows, but the feedback loop defines when it can call a task done. If there is no automated validation, everything depends on goodwill.
Tests, lint, typecheck, and builds are not post-processing; they are part of the agent's self-correction loop.
- Every validation failure returns actionable context for correction.
- Stop hooks can block completion if checks don't pass.
- With strong feedback, repetitive manual supervision goes down.
- Speed goes up when the agent converges on its own, not when it types faster.
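The loop above can be sketched as follows. `run_checks` and `fix` are toy stand-ins (assumptions for the example) for real test/lint/typecheck commands and model calls; the shape that matters is the stop condition, which refuses to declare the task done until all checks pass.

```python
# Sketch of a self-correction loop with a stop-hook-style gate.

def run_checks(code):
    """Pretend validation: here, just require a docstring and no TODOs."""
    failures = []
    if '"""' not in code:
        failures.append("lint: missing docstring")
    if "TODO" in code:
        failures.append("test: unresolved TODO")
    return failures

def fix(code, failures):
    """Stand-in for asking the model to fix code, fed the failure messages."""
    if any("docstring" in f for f in failures):
        code = '"""Fixed."""\n' + code
    if any("TODO" in f for f in failures):
        code = code.replace("TODO", "DONE")
    return code

def agent_loop(code, max_iters=5):
    for _ in range(max_iters):
        failures = run_checks(code)
        if not failures:
            return code, True   # the gate: done only when every check passes
        code = fix(code, failures)  # each failure returns actionable context
    return code, False          # budget exhausted; escalate to a human

final, done = agent_loop("def f():\n    pass  # TODO")
print(done)
```

The `max_iters` budget is the other half of the design: a loop that can never stop is as dangerous as one that stops too early.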
Scaling across teams: global standard + repo-level tuning
At scale, the challenge isn't only technical; it's consistency across stacks, versions, and legacy repos. That's why it's worth separating organizational base rules from local project rules.
A common foundation covers security and quality; a local layer adapts to each repo, so legacy contexts and each team's particular workflows don't break.
- Org rules: security, approved libraries, review standards.
- Repo rules: concrete architecture, local conventions, and exceptions.
- Automated CI review to prioritize findings before human review.
Comparison of approaches to handle context
Not all approaches behave the same. The big differences show up in scalability, consistency, and cognitive cost.
| Approach | Context | Advantages | Risks |
|---|---|---|---|
| 1 agent for everything | Single long cumulative thread | Simple to start | Gets dirty fast, reliability drops, more hallucination |
| 1 chat per objective | Separated by task | Better focus and recovery from mistakes | Requires manual discipline |
| AGENTS.md + skills | Base rules + modular knowledge | Technical consistency and less re-explaining | Needs living documentation maintenance |
| Orchestrator + delegation | Distributed context by role | Scales better, less noise per agent | Higher initial operational complexity |
| Complete agent harness | Rules + MCPs + skills + specs + feedback | High reliability and repeatable results | Requires continuous maintenance discipline |
Concrete ways to solve the problem
This checklist gave me the highest return when working with AI without breaking flow.
- Define the goal and acceptance criteria before asking for code.
- Separate Discovery -> Plan -> Implementation -> Verification.
- Use recap or consolidation before retrying in a new chat.
- Store stable patterns in AGENTS.md and skills.
- Keep rules modular and avoid unnecessary always-on instructions.
- Delegate subtasks to specialized agents when complexity grows.
- Measure resolution time, rework, and discarded-response rate.
Risks and limits we shouldn't sugarcoat
The paper itself points out simulation limitations. We shouldn't sell it as absolute truth, but as strong evidence aligned with real usage experience.
Security and responsibility still stay on the human side: no credentials in prompts and nothing goes to production without serious technical review.
- Don't share API keys, tokens, or sensitive data.
- Don't delegate decisions you can't audit.
- Don't confuse speed of output with quality of architecture.
Sources
- Presentation: Software Development with AI and Agents
- Base document: Software Development with AI and Agents
- Claude Docs: Context windows. Source for the context accumulation diagram across turns.
- Paper: LLMs Get Lost in Multi-Turn Conversations. Multi-turn simulation study focused on aptitude vs. reliability.
- Post on X: The Coding Agent Harness. Practical context-engineering framework with 4 levers, a feedback loop, and adoption patterns at scale.
- Blog: Context Rot. Complementary reading on degradation from accumulating irrelevant context.
- Video: ChatGPT Has Alzheimer's. Help It.
- Video: The Skills System That Changed How I Work with AI
- Video: How to Be TONY STARK with AI