AgentDepots

Intelligence analysis, threat assessment, mission planning, and compliance review all share a common problem: they require the right experts in the right sequence, working without gaps or miscommunication. AgentDepots (AD) deploys purpose-built AI teams on demand — a researcher to gather and verify, an analyst to assess and synthesize, a security reviewer to validate, a technical writer to produce the deliverable — all coordinated automatically, in the correct order, with built-in quality gates before anything moves forward. Every team is assembled specifically for the task at hand and stands down when complete, which means no persistent attack surface, no lingering data exposure, and no unnecessary access. For classified or sensitive workflows, each AgentDepots instance is fully isolated, with its own identity layer, cryptographic credentials, and access boundaries — making it deployable in multi-tenant environments where strict separation between programs, agencies, or classification levels is mandatory. When time, accuracy, and accountability matter at the mission level, AgentDepots delivers a disciplined, auditable team that executes on orders — every time.

Project Orchestration Benchmark (POB)

It’s a scoring system defined specifically to measure how well an AI framework handles the full lifecycle of a project — not just whether it can answer a single question correctly.

The 100-point score is made up of 5 weighted categories:

CategoryWeightWhat it measures
Plan Quality20%Valid dependency graph, correct decomposition, constraint compliance
Assignment Quality15%Right task assigned to right agent/role
Coordination25%Rework loops, failure recovery, state consistency
Deliverable Quality30%Required outputs present, tests pass, human rubric quality
Efficiency10%Latency, token cost, wasted retries

The key insight behind the name is that traditional benchmarks (like GSM8K for math, BoolQ for Q&A) only measure single-task capability. POB wraps those same datasets inside a multi-agent orchestration workflow to measure how well the framework — not just the underlying LLM — coordinates, plans, and delivers results. That’s why AgentDepots scores 79.31 POB even on a local model, while faster frameworks score much lower despite calling the same model.

AgentDepots Benchmark Report — Container Fair POB v1

Setup: Same machine · Same Docker image · Same CPU/memory limits · Same dataset slice · Model: glm-4.7-flash 27B (local with RTX GeForce 4090 GPU)
Frameworks tested: AgentDepots, LangChain, LangGraph, AutoGen, ZenML
Scenarios: AG News (topic classification), BoolQ (yes/no QA), GSM8K (math)
Seeds: 42, 43, 44 · Samples per run: 100 · Total runs per framework: 9


Overall Ranking (POB Score 0–100)

RankFrameworkPOB Score95% CIAccuracyAvg LatencyErrors
1AgentDepots79.3175.1–83.683.8%11.9s0
2LangChain70.9847.6–94.450.8%3.9s0
3LangGraph63.7440.8–86.740.2%4.0s0
4ZenML*35.0213.3–56.739.8%4.1s900/900*
5AutoGen*34.9913.4–56.639.8%4.1s900/900*
FrameworkRoot cause
AutoGen*AutoGen’s agent runtime requires its own conversation-loop infrastructure (GroupChat, AssistantAgent, etc.) that was not compatible with containerized single-API-endpoint setup used for the fair run. The runner fell back to a plain LLM call and tagged it autogen_adapter_fallback_used.
ZenML*ZenML is fundamentally a pipeline/MLOps orchestration tool, not an LLM agent framework. It has no native LLM inference loop — the runner could only use a lightweight shim (zenml_light_adapter_used) that calls the model directly.

*Why this matters for the 900/900 error count:

  • 900 samples = 3 scenarios × 100 samples × 3 seeds
  • Every single sample was flagged because every single AutoGen and ZenML run used the fallback adapter — it wasn’t intermittent, it was the whole run

The scores they achieved (~39.8% accuracy) are essentially bare LLM accuracy with no orchestration — because they never actually executed any orchestration. Their POB score bottoms out at ~35 because they score zero on native coordination and plan quality metrics.


Per-Scenario POB Breakdown

ScenarioAgentDepotsLangChainLangGraphZenMLAutoGen
AG News (topic)78.7770.5160.7531.4931.45
BoolQ (yes/no QA)75.8691.8885.3555.7355.62
GSM8K (math)83.3050.5545.1117.8417.89

POB Sub-Metric Breakdown

Metric (weight)AgentDepotsLangChainLangGraph
Plan Quality (20%)91/10063/10054/100
Assignment Quality (15%)67/10098/10097/100
Coordination (25%)95/10079/10074/100
Deliverable Quality (30%)87/10047/10034/100
Efficiency (10%)12/10099/10098/100

Key Findings

  • AgentDepots wins on accuracy by a large margin: 83.8% vs 50.8% (LangChain) vs 40.2% (LangGraph)
  • AgentDepots dominates Coordination and Deliverable Quality — the two highest-weighted categories — which is why it wins despite being slower
  • AgentDepots’ latency is higher (~12s vs ~4s): it does multi-step reasoning and coordination per task; the others are faster but produce wrong or empty answers
  • No framework had invalid DAGs or state inconsistencies — clean runs all around for AgentDepots, LangChain, LangGraph
  • Early Gemini run (3 samples, gemini-2.0-flash-lite): abandoned — LangGraph hit 429 rate-limit errors on all 3 samples; that run is excluded from the main comparison

POB Go/No-Go Gate Status (claiming “Orchestration Leader”)

GateThresholdAgentDepots ResultPass?
Top shared-applicability scoreRank #179.31 ✓Yes
No severe failure concentrationNo category collapsePlan/Coord/Deliv all strongYes
Deliverable Quality ≥ 0.80 on hard≥ 8087/100 on GSM8K (hard)Yes
Recovery success ≥ 0.70≥ 70%0 errors across 900 samplesYes
Reproduced across seeds≥ 2 seed batchesSeeds 42, 43, 44 all consistentYes

Conclusion: AgentDepots meets all five go/no-go criteria for claiming orchestration leadership.

Coming soon.