컨텍스트 압축 및 캐싱

Hermes Agent는 이중 압축 시스템과 Anthropic 프롬프트 캐싱을 사용하여 긴 대화 전반에 걸쳐 컨텍스트 창 사용을 효율적으로 관리합니다.

소스 파일: agent/context_engine.py(ABC), agent/context_compressor.py(기본 엔진), agent/prompt_caching.py, gateway/run.py(세션 위생), run_agent.py(_compress_context 검색)

플러그형 컨텍스트 엔진

컨텍스트 관리는 contextEngine ABC(agent/context_engine.py)를 기반으로 구축되었습니다. 내장된 contextCompressor은 기본 구현이지만 플러그인은 이를 대체 엔진(예: 무손실 컨텍스트 관리)으로 대체할 수 있습니다.

context:
  engine: "compressor"    # default — built-in lossy summarization
  engine: "lcm"           # example — plugin providing lossless context

엔진은 다음을 담당합니다.

압축이 실행되어야 하는 시기 결정(should_compress())
압축 수행 중(compress())
선택적으로 에이전트가 호출할 수 있는 도구를 노출합니다(예: lcm_grep)
API 응답에서 토큰 사용량 추적

선택은 config.yaml의 컨텍스트.engine을 통해 구성 기반으로 이루어집니다. 해결 순서:

plugins/context_engine/<name>/ 디렉터리를 확인하세요.
일반 플러그인 시스템 확인(register_context_engine())
내장된 contextCompressor으로 대체

플러그인 엔진은 자동으로 활성화되지 않습니다. 사용자는 플러그인 이름에 context.engine을 명시적으로 설정해야 합니다. 기본 "compressor"은 항상 내장을 사용합니다.

hermes plugins → Provider Plugins → context Engine을 통해 구성하거나 config.yaml을 직접 편집하세요.

컨텍스트 엔진 플러그인을 빌드하려면 컨텍스트 엔진 플러그인을 참조하세요.

이중 압축 시스템

Hermes에는 독립적으로 작동하는 두 개의 별도 압축 레이어가 있습니다.

                     ┌──────────────────────────┐
  Incoming message   │   Gateway Session Hygiene │  Fires at 85% of context
  ─────────────────► │   (pre-agent, rough est.) │  Safety net for large sessions
                     └─────────────┬────────────┘
                                   │
                                   ▼
                     ┌──────────────────────────┐
                     │   Agent ContextCompressor │  Fires at 50% of context (default)
                     │   (in-loop, real tokens)  │  Normal context management
                     └──────────────────────────┘

1. 게이트웨이 세션 위생(85% 임계값)

gateway/run.py에 위치합니다(Session hygiene: auto-compress 검색). 이것은 안전망입니다. 에이전트가 메시지를 처리하기 전에 실행됩니다. 세션 시 API 오류를 방지합니다. 턴 사이에 너무 커지는 경우(예: Telegram/Discord에서 밤새 축적되는 경우)

임계값: 모델 컨텍스트 길이의 85%로 고정됨
토큰 소스: 지난 차례의 실제 API 보고 토큰을 선호합니다. 뒤로 넘어지다 대략적인 문자 기반 견적(estimate_messages_tokens_rough)
실행: len(history) >= 4 및 압축이 활성화된 경우에만
목적: 에이전트 자체 압축기에서 벗어난 세션을 포착합니다.

게이트웨이 위생 임계값은 의도적으로 에이전트의 압축기보다 높습니다. 50%(에이전트와 동일)로 설정하면 매 턴마다 조기 압축이 발생합니다. 긴 게이트웨이 세션에서.

2. Agent 컨텍스트Compressor(50% 임계값, 구성 가능)

agent/context_compressor.py에 위치합니다. 이것은 기본 압축입니다. 정확한 정보에 액세스하여 에이전트의 도구 루프 내에서 실행되는 시스템 API에서 보고한 토큰 수입니다.

구성

모든 압축 설정은 compression 키 아래의 config.yaml에서 읽습니다.

compression:
  enabled: true              # Enable/disable compression (default: true)
  threshold: 0.50            # Fraction of context window (default: 0.50 = 50%)
  target_ratio: 0.20         # How much of threshold to keep as tail (default: 0.20)
  protect_last_n: 20         # Minimum protected tail messages (default: 20)

# Summarization model/provider configured under auxiliary:
auxiliary:
  compression:
    model: null              # Override model for summaries (default: auto-detect)
    provider: auto           # Provider: "auto", "openrouter", "nous", "main", etc.
    base_url: null           # Custom OpenAI-compatible endpoint

매개변수 세부사항

매개변수	기본값	범위	설명
`threshold`	`0.50`	0.0-1.0	프롬프트 토큰 ≥ `threshold × 컨텍스트_length`인 경우 압축이 트리거됩니다.
`target_ratio`	`0.20`	0.10-0.80	테일 보호 토큰 예산 제어: `threshold_tokens × target_ratio`
`protect_last_n`	`20`	≥1	최근 메시지의 최소 개수는 항상 보존됩니다.
`protect_first_n`	`3`	(하드코딩됨)	시스템 프롬프트 + 첫 번째 교환이 항상 보존됨

계산된 값(기본적으로 컨텍스트 모델의 경우)

context_length       = 200,000
threshold_tokens     = 200,000 × 0.50 = 100,000
tail_token_budget    = 100,000 × 0.20 = 20,000
max_summary_tokens   = min(200,000 × 0.05, 12,000) = 10,000

압축 알고리즘

contextCompressor.compress() 메서드는 4단계 알고리즘을 따릅니다.

1단계: 이전 도구 결과 정리(저렴함, LLM 호출 없음)

보호된 꼬리 외부의 이전 도구 결과(>200자)는 다음으로 대체됩니다.

[Old tool output cleared to save context space]

이는 장황한 도구에서 상당한 토큰을 절약하는 저렴한 사전 패스입니다. 출력(파일 내용, 터미널 출력, 검색 결과).

2단계: 경계 결정

┌─────────────────────────────────────────────────────────────┐
│  Message list                                               │
│                                                             │
│  [0..2]  ← protect_first_n (system + first exchange)        │
│  [3..N]  ← middle turns → SUMMARIZED                        │
│  [N..end] ← tail (by token budget OR protect_last_n)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

꼬리 보호는 토큰 예산 기반입니다. 끝에서 뒤로 걸어갑니다. 예산이 소진될 때까지 토큰을 축적합니다. 고정으로 돌아감 protect_last_n 예산으로 더 적은 수의 메시지를 보호할 수 있는지 계산합니다.

tool_call/tool_result 그룹이 분할되지 않도록 경계가 정렬됩니다. _align_boundary_backward() 메소드는 연속적인 도구 결과를 통과합니다. 그룹을 그대로 유지하면서 부모 보조 메시지를 찾으세요.

3단계: 구조화된 요약 생성

Summary model 컨텍스트 length {#phase-3-generate-structured-summary}

요약 모델에는 최소한 주 에이전트 모델만큼 큰컨텍스트 창이 있어야 합니다. 전체 중간 섹션은 단일 call_llm(task="compression") 호출을 통해 요약 모델로 전송됩니다. 요약 모델의 context가 더 작은 경우 API는 context 길이 오류를 반환합니다. _generate_summary()이 이를 포착하고 경고를 기록한 후 None을 반환합니다. 그런 다음 압축기는요약 없이 중간 회전을 삭제하여 대화 내용을 자동으로 잃습니다. 이는 압축 품질 저하의 가장 일반적인 원인입니다.

중간 회전은 구조화된 보조 LLM을 사용하여 요약됩니다. 템플릿:

## Goal
[What the user is trying to accomplish]

## Constraints & Preferences
[User preferences, coding style, constraints, important decisions]

## Progress
### Done
[Completed work — specific file paths, commands run, results]
### In Progress
[Work currently underway]
### Blocked
[Any blockers or issues encountered]

## Key Decisions
[Important technical decisions and why]

## Relevant Files
[Files read, modified, or created — with brief note on each]

## Next Steps
[What needs to happen next]

## Critical Context
[Specific values, error messages, configuration details]

요약 예산은 압축되는 콘텐츠의 양에 따라 확장됩니다.

수식: content_tokens × 0.20(_SUMMARY_RATIO 상수)
최소: 2,000개 토큰
최대: min(context_length × 0.05, 12,000) 토큰

4단계: 압축된 메시지 수집

압축된 메시지 목록은 다음과 같습니다.

헤드 메시지(첫 번째 압축 시 시스템 프롬프트에 메모가 추가됨)
요약 메시지(연속적인 동일 역할 위반을 방지하기 위해 선택된 역할)
테일 메시지(수정되지 않음)

분리된 tool_call/tool_result 쌍은 _sanitize_tool_pairs()에 의해 정리됩니다.

제거된 호출을 참조하는 도구 결과 → 제거됨
결과가 제거된 도구 호출 → 스텁 결과가 삽입됨

반복적인 재압축

후속 압축에서는 이전 요약이 다음과 같이 LLM으로 전달됩니다. 처음부터 요약하기보다는 업데이트하라는 지침이 있습니다. 이는 보존합니다 여러 압축에 대한 정보 — 항목이 "진행 중"에서 "완료"로 이동합니다. 새로운 진행 상황이 추가되고 더 이상 사용되지 않는 정보가 제거됩니다.

압축기 인스턴스의 _previous_summary 필드는 마지막 요약을 저장합니다. 이를 위해 텍스트.

전/후 예

압축 전(45개 메시지, ~ 토큰)

[0] system:    "You are a helpful assistant..." (system prompt)
[1] user:      "Help me set up a FastAPI project"
[2] assistant: <tool_call> terminal: mkdir project </tool_call>
[3] tool:      "directory created"
[4] assistant: <tool_call> write_file: main.py </tool_call>
[5] tool:      "file written (2.)"... 30 more turns of file editing, testing, debugging...
[38] assistant: <tool_call> terminal: pytest </tool_call>
[39] tool:      "8 passed, 2 failed\n..."  ( output)
[40] user:      "Fix the failing tests"
[41] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[42] tool:      "import pytest\n..."  ()
[43] assistant: "I see the issue with the test fixtures..."
[44] user:      "Great, also add error handling"

압축 후(25개 메시지, ~45,000개 토큰)

[0] system:    "You are a helpful assistant...
               [Note: Some earlier conversation turns have been compacted...]"
[1] user:      "Help me set up a FastAPI project"
[2] assistant: "[CONTEXT COMPACTION] Earlier turns were compacted...

               ## Goal
               Set up a FastAPI project with tests and error handling

               ## Progress
               ### Done
               - Created project structure: main.py, tests/, requirements.txt
               - Implemented 5 API endpoints in main.py
               - Wrote 10 test cases in tests/test_api.py
               - 8/10 tests passing

               ### In Progress
               - Fixing 2 failing tests (test_create_user, test_delete_user)

               ## Relevant Files
               - main.py — FastAPI app with 5 endpoints
               - tests/test_api.py — 10 test cases
               - requirements.txt — fastapi, pytest, httpx

               ## Next Steps
               - Fix failing test fixtures
               - Add error handling"
[3] user:      "Fix the failing tests"
[4] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[5] tool:      "import pytest\n..."
[6] assistant: "I see the issue with the test fixtures..."
[7] user:      "Great, also add error handling"

프롬프트 캐싱(인류적)

출처: agent/prompt_caching.py

캐싱을 통해 다중 턴 대화에서 입력 토큰 비용을 ~75% 절감합니다. 대화 접두어. Anthropic의 cache_control 중단점을 사용합니다.

전략: system_and_3

Anthropic은 요청당 최대 4개의 cache_control 중단점을 허용합니다. 헤르메스 "system_and_3" 전략을 사용합니다:

Breakpoint 1: System prompt           (stable across all turns)
Breakpoint 2: 3rd-to-last non-system message  ─┐
Breakpoint 3: 2nd-to-last non-system message   ├─ Rolling window
Breakpoint 4: Last non-system message          ─┘

작동 방식

apply_anthropic_cache_control() 메시지를 전체 복사하고 삽입합니다. cache_control 마커:

# Cache marker format
marker = {"type": "ephemeral"}
# Or for 1-hour TTL:
marker = {"type": "ephemeral", "ttl": "1h"}

마커는 콘텐츠 유형에 따라 다르게 적용됩니다.

콘텐츠 유형	마커가 가는 곳
문자열 내용	`[{"type": "text", "text":..., "cache_control":...}]`로 변환됨
목록 콘텐츠	마지막 요소의 dict에 추가됨
없음/비어 있음	`msg["cache_control"]`로 추가됨
도구 메시지	`msg["cache_control"]`으로 추가됨(네이티브 Anthropic에만 해당)

캐시 인식 디자인 패턴

안정적인 시스템 프롬프트: 시스템 프롬프트는 중단점 1이며 캐시됩니다. 모든 차례. 대화 도중에 변형을 피하세요(압축하면 메모가 추가됩니다) 첫 번째 압축에서만).
메시지 순서가 중요합니다: 캐시 히트에는 접두사 일치가 필요합니다. 추가 또는 중간에 메시지를 제거하면 이후의 모든 캐시가 무효화됩니다.
압축 캐시 상호 작용: 압축 후 캐시가 무효화됩니다. 압축된 영역의 경우 시스템 프롬프트 캐시는 유지됩니다. 롤링 3-메시지 창은 1~2턴 내에 캐싱을 다시 설정합니다.
TTL 선택: 기본값은 5m(5분)입니다. 장기 실행에는 1h을 사용하세요. 사용자가 차례 사이에 휴식을 취하는 세션입니다.

프롬프트 캐싱 활성화

프롬프트 캐싱은 다음과 같은 경우 자동으로 활성화됩니다.

모델은 Anthropic Claude 모델입니다(모델 이름으로 감지됨).
제공자는 cache_control(네이티브 Anthropic API 또는 OpenRouter)을 지원합니다.

# config.yaml — TTL is configurable (must be "5m" or "1h")
prompt_caching:
  cache_ttl: "5m"

CLI는 시작 시 캐싱 상태를 표시합니다.

💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL)

상황에 따른 압력 경고

중간 상황 압력 경고가 제거되었습니다(run_agent.py의 반복 예산 블록 참조: "중간 압력 경고 없음 - 모델이 복잡한 작업을 조기에 '포기'하게 만들었습니다."). 사전 경고 단계 없이 프롬프트 토큰이 구성된 compression.threshold(기본값 50%)에 도달하면 압축이 실행됩니다. 게이트웨이 세션 위생은 모델 컨텍스트 창의 85%에서 보조 안전망으로 실행됩니다.

플러그형 컨텍스트 엔진​

이중 압축 시스템​

1. 게이트웨이 세션 위생(85% 임계값)​

2. Agent 컨텍스트Compressor(50% 임계값, 구성 가능)​

구성​

매개변수 세부사항​

계산된 값(기본적으로 컨텍스트 모델의 경우)​

압축 알고리즘​

1단계: 이전 도구 결과 정리(저렴함, LLM 호출 없음)​

2단계: 경계 결정​

3단계: 구조화된 요약 생성​

4단계: 압축된 메시지 수집​

반복적인 재압축​

전/후 예​

압축 전(45개 메시지, ~ 토큰)​

압축 후(25개 메시지, ~45,000개 토큰)​

프롬프트 캐싱(인류적)​

전략: system_and_3​

작동 방식​

캐시 인식 디자인 패턴​

프롬프트 캐싱 활성화​

상황에 따른 압력 경고​