Here's a pattern you might recognize. A few days ago you ran a set of prompts and got excellent results. Today you run the exact same prompts and you're fighting the agent just to get close.
Same prompts. Same account. Same model. Worse output. If you've used Toffu (or any AI agent) for more than a month, you've probably felt some version of this. The agent that nailed your weekly report in January needs three corrections to get it right in June. It "forgets" your brand voice mid-task. It resurrects a budget you changed weeks ago.
The instinct is to conclude the agent got dumber. It didn't. Its context got noisier. And unlike a model problem, this one you can actually fix - in about two minutes, using Toffu itself.
What context rot is
Every AI agent works from a context window: the working memory it reads before answering you. In Toffu's case that includes your custom instructions, relevant memories, and the conversation so far. The model reads all of it, every single time, and weighs all of it.
Picture a whiteboard. Early in a project it holds the plan. Three months later it holds the plan, four dead ideas, a budget that changed twice, and "ASK DAVE??" in the corner. Nobody erased anything. Now imagine a very diligent intern who re-reads the entire whiteboard before every task and treats everything on it as equally current. That's your agent.
The mechanics back this up: as context grows, models attend to it less reliably, and contradictory instructions don't cancel out - they compete. The model picks one, and which one can vary run to run. That's why rot feels like randomness: "sometimes it's formal, sometimes it's casual, I never know which I'll get."
The rot accumulates in two places.
Rot source #1: custom instructions
Custom instructions are standing directives - "always report in USD", "our tone is direct, no exclamation marks". They're the most powerful lever in Toffu and the most common rot site, because of one asymmetry: instructions get added constantly and removed almost never.
Three failure patterns show up over and over:
- Contradictions. "Always use formal tone" from January sits next to "keep it casual, we're rebranding" from April. Both are live. The agent obeys one of them, semi-randomly, and you experience it as inconsistency.
- Expired one-offs. "For the March launch, ignore the budget cap" was correct for three weeks in March. Saved as a permanent instruction, it's now a standing order to ignore budget caps.
- Misfiled facts. A metric, a campaign ID, last quarter's CPA target - data points that were true once, frozen into directives. Instructions should say how to behave; facts that age belong elsewhere.
The symptom of instruction rot is the agent following a rule you forgot existed. If Toffu does something weird and your reaction is "why would it do that?", there's a decent chance you (or a teammate) told it to, months ago.
This is half the answer to the pattern at the top of this post. The prompts were identical, but prompts aren't the whole input - they run on top of every standing instruction in every scope, yours and the shared ones. If anything got appended in the days between the great run and the frustrating one - by you, by a teammate, by a one-off exception that got saved - the same prompts are now executing against different standing orders. Prompt parity is not context parity.
Rot source #2: the 200-message thread
The second rot site is the marathon conversation. You know the one: it started as a campaign launch, then you pivoted the strategy, renamed everything, pasted in a CSV, drafted four versions of the copy, and kept going because the thread "has all the context."
It does have all the context. That's the problem. The abandoned strategy is still in there. The three rejected drafts are still in there. The old campaign name appears forty times and the new one six. When the model generates its next answer, all of that exerts pull. This is why long threads regress: you correct the budget, and twenty messages later the old number creeps back. The thread outvotes you. It also gets slower and mushier with every tangent.
And here's the other half. The excellent results from a few days ago happened inside a particular conversation, with everything that conversation had accumulated - the corrections, the examples, the back-and-forth that tuned the output. The prompts alone never contained the magic; the thread did. Replay the same prompts in a different context and the magic doesn't come with them.
The rule of thumb: a thread is for a task, not for a relationship. When the task changes, the thread should too.
The fix: one command
You could audit all this by hand. You don't have to. Toffu ships with a system skill called context-rot-cleanup that does it for you.
Trigger it by asking:
Run the context-rot-cleanup skill.
Or just describe the symptom in your own words - "you've gotten worse lately, figure out why", "clean up my context", "why did these exact prompts work on Sunday and not today" - and Toffu will load the skill itself.
Here's what it does:
- Audits your custom instructions across every scope you have (user, team, company) and sorts each line into keep, rewrite, delete, or extract into a skill - flagging contradictions, expired one-offs, duplicates, facts masquerading as directives, and step-by-step procedures that should be skills instead of weight on every message.
- Audits memory for stale entries: last quarter's metrics stored as eternal truth, notes about campaigns that ended, facts that contradict current instructions.
- Shows you a cleanup table with a verdict and a reason per line, plus the full proposed replacement. It never applies anything without your approval.
- Applies the cleanup once you approve, and echoes back exactly what's now stored so there's no mystery state.
The whole exchange takes about two minutes, most of which is you reading the table.
If the rot is the conversation itself rather than your stored context, the same skill handles that too. At the end of a long thread, ask:
This thread is getting long. Run context-rot-cleanup and give me a
handoff summary so I can start fresh.
Toffu produces a handoff block - decisions made, current settings, open work, next step - that you paste as the first message of a new conversation. Fresh window, zero rot, nothing lost.
A handoff carries in-flight state. A recipe that finally works deserves a better container - the first habit below.
Habits that keep it from coming back
Cleanup beats no cleanup, but hygiene beats both. One habit matters more than the rest.
When a thread finally produces the result you wanted - the report format that landed, the audience breakdown that worked, the post structure the team liked - save it before you walk away:
That last result was exactly right. Save what worked here as a skill
so any new chat can use it.
This is the escape from the mega-thread trap. Threads grow to 200 messages because nobody wants to lose the context that finally works. A skill keeps the win and leaves the rot behind: the working procedure becomes a clean, reusable block any fresh conversation can load. It's also how the pattern from the opener stops happening - the good result lives in a skill, not in one lucky thread you can't replay. If the recipe runs on a cadence, graduate it to a scheduled task.
The rest of the hygiene:
- One thread per task. New campaign, new thread. When in doubt, hand off and start fresh.
- Put things in the right drawer. A proven procedure is a skill. A standing rule is an instruction. A fact about the world is a memory. If it's only true this week, it's a message.
- Treat repeated corrections as a smell. If you've corrected Toffu twice on the same thing, don't correct it a third time - run the cleanup. Something stored is fighting you.
- Schedule the audit. Ask Toffu to remind you to run
context-rot-cleanupmonthly. (The skill deliberately won't rewrite instructions from inside a scheduled run - you stay in the loop on what gets erased.)
"Dumber" is usually "drowning"
When an agent's quality drops over months of use, the model is almost never what changed. What changed is everything you and your team piled into its working memory along the way. The difference between the agent and the whiteboard is that this one cleans itself when you ask.
Run the context-rot-cleanup skill.
Two minutes. Then go see what your prompts were actually capable of all along.



