AgentDepots – CYBER INTEL SYSTEMS

Intelligence analysis, threat assessment, mission planning, and compliance review all share a common problem: they require the right experts in the right sequence, working without gaps or miscommunication. AgentDepots (AD) deploys purpose-built AI teams on demand — a researcher to gather and verify, an analyst to assess and synthesize, a security reviewer to validate, a technical writer to produce the deliverable — all coordinated automatically, in the correct order, with built-in quality gates before anything moves forward. Every team is assembled specifically for the task at hand and stands down when complete, which means no persistent attack surface, no lingering data exposure, and no unnecessary access. For classified or sensitive workflows, each AgentDepots instance is fully isolated, with its own identity layer, cryptographic credentials, and access boundaries — making it deployable in multi-tenant environments where strict separation between programs, agencies, or classification levels is mandatory. When time, accuracy, and accountability matter at the mission level, AgentDepots delivers a disciplined, auditable team that executes on orders — every time.

Project Orchestration Benchmark (POB)

It’s a scoring system defined specifically to measure how well an AI framework handles the full lifecycle of a project — not just whether it can answer a single question correctly.

The 100-point score is made up of 5 weighted categories:

Category	Weight	What it measures
Plan Quality	20%	Valid dependency graph, correct decomposition, constraint compliance
Assignment Quality	15%	Right task assigned to right agent/role
Coordination	25%	Rework loops, failure recovery, state consistency
Deliverable Quality	30%	Required outputs present, tests pass, human rubric quality
Efficiency	10%	Latency, token cost, wasted retries

The key insight behind the name is that traditional benchmarks (like GSM8K for math, BoolQ for Q&A) only measure single-task capability. POB wraps those same datasets inside a multi-agent orchestration workflow to measure how well the framework — not just the underlying LLM — coordinates, plans, and delivers results. That’s why AgentDepots scores 79.31 POB even on a local model, while faster frameworks score much lower despite calling the same model.

AgentDepots Benchmark Report — Container Fair POB v1

Setup: Same machine · Same Docker image · Same CPU/memory limits · Same dataset slice · Model: glm-4.7-flash 27B (local with RTX GeForce 4090 GPU)
Frameworks tested: AgentDepots, LangChain, LangGraph, AutoGen, ZenML
Scenarios: AG News (topic classification), BoolQ (yes/no QA), GSM8K (math)
Seeds: 42, 43, 44 · Samples per run: 100 · Total runs per framework: 9

Overall Ranking (POB Score 0–100)

Rank	Framework	POB Score	95% CI	Accuracy	Avg Latency	Errors
1	AgentDepots	79.31	75.1–83.6	83.8%	11.9s	0
2	LangChain	70.98	47.6–94.4	50.8%	3.9s	0
3	LangGraph	63.74	40.8–86.7	40.2%	4.0s	0
4	ZenML*	35.02	13.3–56.7	39.8%	4.1s	900/900*
5	AutoGen*	34.99	13.4–56.6	39.8%	4.1s	900/900*

Framework	Root cause
AutoGen*	AutoGen’s agent runtime requires its own conversation-loop infrastructure (GroupChat, AssistantAgent, etc.) that was not compatible with containerized single-API-endpoint setup used for the fair run. The runner fell back to a plain LLM call and tagged it `autogen_adapter_fallback_used`.
ZenML*	ZenML is fundamentally a pipeline/MLOps orchestration tool, not an LLM agent framework. It has no native LLM inference loop — the runner could only use a lightweight shim (`zenml_light_adapter_used`) that calls the model directly.

*Why this matters for the 900/900 error count:

900 samples = 3 scenarios × 100 samples × 3 seeds
Every single sample was flagged because every single AutoGen and ZenML run used the fallback adapter — it wasn’t intermittent, it was the whole run

The scores they achieved (~39.8% accuracy) are essentially bare LLM accuracy with no orchestration — because they never actually executed any orchestration. Their POB score bottoms out at ~35 because they score zero on native coordination and plan quality metrics.

Per-Scenario POB Breakdown

Scenario	AgentDepots	LangChain	LangGraph	ZenML	AutoGen
AG News (topic)	78.77	70.51	60.75	31.49	31.45
BoolQ (yes/no QA)	75.86	91.88	85.35	55.73	55.62
GSM8K (math)	83.30	50.55	45.11	17.84	17.89

POB Sub-Metric Breakdown

Metric (weight)	AgentDepots	LangChain	LangGraph
Plan Quality (20%)	91/100	63/100	54/100
Assignment Quality (15%)	67/100	98/100	97/100
Coordination (25%)	95/100	79/100	74/100
Deliverable Quality (30%)	87/100	47/100	34/100
Efficiency (10%)	12/100	99/100	98/100

Key Findings

AgentDepots wins on accuracy by a large margin: 83.8% vs 50.8% (LangChain) vs 40.2% (LangGraph)
AgentDepots dominates Coordination and Deliverable Quality — the two highest-weighted categories — which is why it wins despite being slower
AgentDepots’ latency is higher (~12s vs ~4s): it does multi-step reasoning and coordination per task; the others are faster but produce wrong or empty answers
No framework had invalid DAGs or state inconsistencies — clean runs all around for AgentDepots, LangChain, LangGraph
Early Gemini run (3 samples, gemini-2.0-flash-lite): abandoned — LangGraph hit 429 rate-limit errors on all 3 samples; that run is excluded from the main comparison

POB Go/No-Go Gate Status (claiming “Orchestration Leader”)

Gate	Threshold	AgentDepots Result	Pass?
Top shared-applicability score	Rank #1	79.31 ✓	Yes
No severe failure concentration	No category collapse	Plan/Coord/Deliv all strong	Yes
Deliverable Quality ≥ 0.80 on hard	≥ 80	87/100 on GSM8K (hard)	Yes
Recovery success ≥ 0.70	≥ 70%	0 errors across 900 samples	Yes
Reproduced across seeds	≥ 2 seed batches	Seeds 42, 43, 44 all consistent	Yes

Conclusion: AgentDepots meets all five go/no-go criteria for claiming orchestration leadership.

Coming soon.