The trust/validation layer is the interesting part here. We run ~20 autonomous AI agents on BoTTube (bottube.ai) that create videos, comment, and
interact with each other, and the hardest problem by far has been exactly what you're describing: knowing whether an agent's output is grounded or
hallucinated. We ended up building a similar evidence-quality check where agents that can't back up a claim simply abstain.
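For reference, our abstain check has roughly this shape (a simplified Python sketch; the names, threshold, and scoring here are illustrative, not our production code):

    from dataclasses import dataclass

    @dataclass
    class Evidence:
        source: str           # where it came from (API response, page, another agent)
        supports_claim: bool  # does this piece of evidence back the claim?
        confidence: float     # 0.0-1.0 judgment of how strongly it backs it

    def answer_or_abstain(claim: str, evidence: list[Evidence], min_score: float = 0.6) -> str:
        supporting = [e for e in evidence if e.supports_claim]
        if not supporting:
            return "ABSTAIN: no supporting evidence"
        # Average confidence over *all* evidence, so contradicting items drag the score down.
        score = sum(e.confidence for e in supporting) / len(evidence)
        if score < min_score:
            return "ABSTAIN: evidence too weak"
        return claim

The useful property is that "abstain" is a first-class output, so downstream agents never see an unsupported claim presented as fact.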
Curious how the routing score weights (70/20/10) were chosen - have you experimented with letting agents adjust those weights based on task type? For
something like content generation the capability match matters way more than latency, but for real-time data feeds you'd probably want to flip that.
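Concretely, something like this is what I mean by task-type-dependent weights (a rough sketch; the 70/20/10 default comes from your post, but I'm assuming the three dimensions are capability/latency/cost, and the per-task numbers are made up):

    # 70/20/10 default from the post; the third dimension is my guess (cost).
    DEFAULT_WEIGHTS = {"capability": 0.70, "latency": 0.20, "cost": 0.10}

    # Hypothetical per-task overrides.
    TASK_WEIGHTS = {
        "content_generation": {"capability": 0.85, "latency": 0.05, "cost": 0.10},
        "realtime_feed":      {"capability": 0.20, "latency": 0.70, "cost": 0.10},
    }

    def routing_score(metrics: dict, task_type: str | None = None) -> float:
        # metrics: normalized 0-1 score per dimension, higher is better.
        weights = TASK_WEIGHTS.get(task_type, DEFAULT_WEIGHTS)
        return sum(w * metrics.get(dim, 0.0) for dim, w in weights.items())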
Axiomeer v2 is live.
Replaced all mock providers with 7 real, free APIs (weather, countries, exchange rates, dictionary, books, Wikipedia, math facts), with zero API keys required.
The pipeline now routes to the best provider, validates evidence, and generates grounded answers with no hallucinations (tested on real and fake queries using llama2:7b). 83 tests passing (74 unit, 9 integration). Test results are in Test Images/v2-results.
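For anyone curious, the flow is roughly route -> validate -> answer; a stripped-down sketch of that shape (placeholder provider/llm interfaces, not the actual implementation):

    def grounded_answer(query: str, providers: list, llm) -> str:
        # 1. Route: pick the provider whose declared capabilities best match the query.
        provider = max(providers, key=lambda p: p.match_score(query))

        # 2. Fetch evidence from that free API.
        evidence = provider.fetch(query)

        # 3. Validate: if the evidence doesn't actually cover the query, abstain
        #    rather than letting the model fill the gap.
        if not evidence or not provider.validates(query, evidence):
            return "No grounded evidence available for this query."

        # 4. Generate: constrain the model (e.g. llama2:7b) to the validated evidence.
        prompt = f"Answer using only this evidence:\n{evidence}\n\nQuestion: {query}"
        return llm.generate(prompt)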