Token Cost Engineering: How PEtFiSh Saves 20% on Every Long Session
May 2026 · Research Report
If you run AI coding agents on real tasks, you've noticed: long sessions get expensive. Not because the model is slow, but because compaction — the mechanism that summarizes conversation history when context fills up — is shockingly costly. Each compaction event burns 50K-80K tokens in overhead.
We ran two controlled experiments to understand and reduce this cost. The results surprised us.
The Problem: Why v0.11.0 Regressed 37%
PEtFiSh v0.11.0 introduced a tiered architecture for agent rules: instead of one 1037-line inline file, rules were split into a 57-line entry point plus 7 on-demand sub-files. Cleaner, more maintainable.
But A/B testing revealed a 36.6% token regression. The reason: dynamically loaded rules land in uncached conversation context. They accumulate with each tool call, inflating the context window faster, triggering more compactions (2→3), each costing 50-80K tokens.
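A toy model makes the mechanism concrete. The sketch below assumes context grows by a fixed amount per message and is reset to a small summary whenever it crosses a threshold; every number in it (threshold, per-message growth, retained size) is an illustrative assumption chosen to reproduce the 2→3 pattern, not a measurement from the experiments.

```python
def count_compactions(messages, tokens_per_message, threshold, retained):
    """Count compactions in a toy model with linear context growth."""
    context = 0
    compactions = 0
    for _ in range(messages):
        context += tokens_per_message
        if context >= threshold:
            compactions += 1
            context = retained  # summary that survives the compaction
    return compactions


# Illustrative assumptions only.
THRESHOLD = 150_000
RETAINED = 20_000

# Rules in the cached prefix: only conversation content accumulates.
inline_compactions = count_compactions(21, 16_000, THRESHOLD, RETAINED)

# Rules loaded on demand: each message drags extra uncached rule text along.
tiered_compactions = count_compactions(21, 22_000, THRESHOLD, RETAINED)

print(inline_compactions, tiered_compactions)
```

Under these assumptions a modest per-message increase is enough to tip a 21-message session into one extra compaction, which is where the 50-80K tokens of overhead come from.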
The fix wasn't "go back to inline." It was understanding where rules live in the LLM's memory architecture.
Experiment 1: System Prompt Injection
We built two plugins using OpenCode's experimental.chat.system.transform hook to move rules back into the cached system prompt prefix:
- All-rules — inject all 7 rule files (~9.4K tokens) into system prompt. 71 lines of code, zero config.
- Smart-rules — dynamically match rules to the active topic. 131 lines, requires a mapping registry.
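The all-rules approach boils down to concatenating every rule file into one block that is identical on every request. The sketch below shows only that assembly step in Python; it is not the plugin's code and does not touch OpenCode's hook API. The directory layout, file names, and the `build_rules_block` helper are hypothetical. Sorting the files is a deliberate choice here, on the assumption that a byte-identical prefix is what keeps the prompt cache warm.

```python
import tempfile
from pathlib import Path


def build_rules_block(rule_dir: Path) -> str:
    """Concatenate every rule file into one block, in a stable (sorted) order."""
    parts = []
    for path in sorted(rule_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        parts.append(f"## {path.stem}\n{body}")
    return "\n\n".join(parts)


# Demo with two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    rule_dir = Path(tmp)
    (rule_dir / "b-testing.md").write_text("Run tests before committing.\n", encoding="utf-8")
    (rule_dir / "a-style.md").write_text("Prefer small functions.\n", encoding="utf-8")
    rules_block = build_rules_block(rule_dir)
    rules_block_again = build_rules_block(rule_dir)
```

Because the output depends only on the files on disk, two calls produce the same string, so the block can sit at the front of the system prompt unchanged from one request to the next.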
Results (21 messages, 3 topics, claude-sonnet-4)
| Metric | Baseline (v0.10.x) | All-Rules Plugin | Delta |
|---|---|---|---|
| Total tokens | 586,917 | 475,039 | -19.1% |
| Input tokens | 455,533 | 327,834 | -28.0% |
| Compactions | 2 | 1 | -50% |
| Peak context | 152,990 | 145,530 | -4.9% |
Smart-rules achieved 12.3% savings but proved fragile — silent failures on missing mappings, false-positive keyword matching, manual maintenance burden. For rule sets under 30K tokens, all-rules wins on every dimension.
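Both failure modes are easy to reproduce with a naive keyword registry. The sketch below is a hypothetical stand-in for that kind of matcher, not the Smart-rules plugin itself; the registry contents and the `match_rules` helper are invented for illustration.

```python
from typing import List

# Hypothetical registry: topic keyword -> rule file name.
RULE_REGISTRY = {
    "test": "testing-rules.md",
    "deploy": "deploy-rules.md",
}


def match_rules(message: str) -> List[str]:
    """Return rule files whose keyword appears anywhere in the message."""
    text = message.lower()
    return [rule for keyword, rule in RULE_REGISTRY.items() if keyword in text]


# False positive: "latest" contains the substring "test".
false_positive = match_rules("Show me the latest commit")

# Silent failure: no keyword maps to this topic, so no rules are loaded
# and nothing signals that anything went wrong.
silent_miss = match_rules("Refactor the database schema")
```

Tightening the matcher (word boundaries, a fallback rule set) fixes these two cases, but each fix adds to the registry that someone has to maintain by hand.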
Key Insight
The overhead of injecting all rules (~9.4K tokens) into the system prompt is negligible. What matters is that cached prefix content doesn't count toward compaction threshold accumulation. One fewer compaction = 50-80K tokens saved. The economics are overwhelming.
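As a back-of-envelope check, the figures from the results table can be set against the quoted per-compaction overhead. The sketch below uses only numbers stated in this report; treating one avoided compaction as a flat 50-80K tokens is a simplifying assumption.

```python
baseline_total = 586_917
all_rules_total = 475_039

tokens_saved = baseline_total - all_rules_total
pct_delta = round((all_rules_total - baseline_total) / baseline_total * 100, 1)

# Share of the saving that a single avoided compaction could explain,
# at the low and high ends of the quoted 50-80K overhead range.
low_share = round(50_000 / tokens_saved, 2)
high_share = round(80_000 / tokens_saved, 2)

print(tokens_saved, pct_delta, low_share, high_share)
```

On these numbers, one avoided compaction would account for somewhere between roughly 45% and 72% of the total saving.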
Experiment 2: Topic-Aware Compaction
A separate study asked: when compaction does fire, can PEtFiSh's topic management make it smarter?
The fish-trail topic system already tracks what you're working on — which topics are active, their relationships, their summaries. We built a Phase 2 plugin that restructures the compaction prompt using this topic data, telling the model: "here are 3 topics, compress each separately, prioritize the active one."
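The restructuring step can be pictured as a small prompt builder. The sketch below is an assumption-laden illustration: the `Topic` fields, the wording of the instructions, and the `build_compaction_prompt` helper are hypothetical and do not reflect the actual fish-trail schema or the plugin's prompt.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    name: str
    summary: str
    active: bool = False


def build_compaction_prompt(topics):
    """List topics with per-topic compression hints, active topic first."""
    ordered = sorted(topics, key=lambda t: not t.active)  # stable sort
    lines = [f"There are {len(ordered)} topics in this session. Summarize each one separately."]
    for t in ordered:
        hint = "ACTIVE, keep the most detail" if t.active else "background, compress aggressively"
        lines.append(f"- {t.name} ({hint}): {t.summary}")
    return "\n".join(lines)


# Hypothetical session state.
topics = [
    Topic("docs-update", "README wording pass."),
    Topic("auth-refactor", "Moving session checks into middleware.", active=True),
    Topic("ci-flake", "Retry logic for the flaky integration test."),
]
prompt = build_compaction_prompt(topics)
print(prompt)
```

The point of the structure is that the model sees topic boundaries and priorities explicitly instead of having to infer them from an interleaved transcript.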
Results (21 messages, 3 interleaved topics, claude-sonnet-4)
| Metric | Baseline | Topic Plugin | Delta |
|---|---|---|---|
| Total tokens | 857,115 | 683,522 | -20.3% |
| API calls | 140 | 89 | -36.4% |
| Wall time | 49 min | 30 min | -39.4% |
| Cache reads | 10.6M | 5.3M | -49.9% |
| Recall quality | Pass | Pass | No loss |
The Surprise: Behavioral Change
We expected savings from better compression ratios. That's not what happened.
The primary mechanism is behavioral change. When the model receives topic-structured context, it produces more focused responses: fewer intermediate tool calls (4.2/msg vs 6.7/msg) and more consolidated answers. This cascades: fewer API calls → fewer cache reads → shorter wall time.
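The per-message rates line up with the results table if they are read as API calls divided by the 21 messages in the session. That reading is an assumption; the arithmetic is:

```python
MESSAGES = 21
baseline_calls = 140
plugin_calls = 89

baseline_rate = round(baseline_calls / MESSAGES, 1)
plugin_rate = round(plugin_calls / MESSAGES, 1)
call_delta_pct = round((plugin_calls - baseline_calls) / baseline_calls * 100, 1)

print(baseline_rate, plugin_rate, call_delta_pct)
```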
This is why we shelved Phase 3 (pre-computed summaries that skip the LLM): it can't trigger this behavioral effect. The model needs to process topic-structured context during compaction, not just receive a pre-built summary.
What We Learned
- Compaction frequency dominates token cost. Everything else — prompt size, output length, caching strategy — is secondary. Reduce compactions and costs drop dramatically.
- Cached prefix is free real estate. Rules in system prompt cost almost nothing (cache reads are ~10x cheaper than input tokens). Rules in conversation context are a ticking time bomb toward the next compaction.
- Topic structure changes model behavior. Not just compression quality — the model actually becomes more efficient when it has structured context about what it's doing.
- Simple beats clever. All-rules (71 lines, zero config) beat Smart-rules (131 lines, registry dependency) on both cost and reliability. Don't optimize what doesn't need optimizing.
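The "free real estate" point can be put in rough numbers. The sketch below takes the ~9.4K-token rule set and the ~10x cache-read discount from this report; the dollar price per million input tokens is a placeholder assumption, and the one-time cost of writing the cache entry is ignored.

```python
RULE_TOKENS = 9_400
REQUESTS = 21                      # one read of the rules per message (simplification)
INPUT_PRICE_PER_MTOK = 3.00        # placeholder price, not a quoted rate
CACHE_READ_DISCOUNT = 10           # cache reads ~10x cheaper, per this report


def rules_cost(tokens, requests, price_per_mtok):
    """Cost of re-reading the same rule tokens on every request."""
    return tokens * requests * price_per_mtok / 1_000_000


uncached_cost = rules_cost(RULE_TOKENS, REQUESTS, INPUT_PRICE_PER_MTOK)
cached_cost = rules_cost(RULE_TOKENS, REQUESTS, INPUT_PRICE_PER_MTOK / CACHE_READ_DISCOUNT)

print(f"uncached: ${uncached_cost:.4f}  cached: ${cached_cost:.4f}")
```

The direct price gap is only part of the story: uncached rule text also pushes the session toward its next compaction, which the earlier sections identify as the larger cost.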
Limitations
- Tested on claude-sonnet-4 only. Other models may differ.
- 21-message sessions (3 topics). Larger sessions may show different patterns.
- Single-user scenarios. Multi-window concurrent sessions untested.
- OpenCode's plugin hooks are marked experimental, though 11+ external projects use them in production.
Try It
Both plugins ship with PEtFiSh. The system prompt plugin is included in the companion pack. The topic-aware compaction plugin is included in the context pack (fish-trail).
    # Install PEtFiSh with both plugins
    curl -fsSL https://raw.githubusercontent.com/kylecui/petfish.ai/master/remote-install.sh \
      | bash -s -- --pack companion,context --detect
Full research data, A/B test harness, and raw results are in the GitHub repo:
- Experiment 1 (system prompt injection): evals/v011-sysprompt-plugin-report/PAPER.md
- Experiment 2 (topic-aware compaction): research/topic-aware-compaction/06_outputs/research-report.md
All experiments ran on claude-sonnet-4 via the github-copilot provider in OpenCode.
><(((^>