I love this idea, but have a hypothesis that 90% of agents that people actually use today would fail this test inadvertently (false negative).
Industry best practice + standard implementation for most agents right now is to do web browsing / fetching via subagents. Their output is summarized using a cheaper model and then passed back to the parent. Unless the actual content the subagents see is preserved, it's very unlikely the `CANARY-` strings would show up in the final output. A rough sketch of the pattern is below.
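For concreteness, here's a minimal sketch of that pipeline. The helper names, the stubbed `cheap_summarize`, and the canary regex are all my own assumptions (the regex is just guessed from strings like `CANARY-SPA-JSONLY-prism`), not anything from the original post:

```python
import re

# Assumed canary format, guessed from strings like CANARY-SPA-JSONLY-prism.
CANARY_RE = re.compile(r"CANARY-[A-Za-z0-9-]+")

def cheap_summarize(text: str) -> str:
    """Stand-in for the cheaper summarizer model a subagent would call.
    Like an LLM summary, it paraphrases instead of quoting exact strings."""
    words = text.split()
    return f"(summary) ~{len(words)} words, starts with: {' '.join(words[:5])} ..."

def fetch_subagent(url: str, fetch) -> str:
    """Common pattern today: subagent fetches, summarizes, returns only the summary."""
    return cheap_summarize(fetch(url))  # CANARY- strings are usually lost at this step

def fetch_subagent_canary_safe(url: str, fetch) -> str:
    """Hypothetical mitigation: summarize, but pass exact-match tokens through verbatim."""
    raw = fetch(url)
    summary = cheap_summarize(raw)
    canaries = sorted(set(CANARY_RE.findall(raw)))
    return summary + ("\nverbatim tokens: " + " ".join(canaries) if canaries else "")

if __name__ == "__main__":
    page = "Long docs page where somewhere in the middle sits CANARY-SPA-JSONLY-prism in plain text."
    fetch = lambda _url: page  # stub fetcher so the sketch runs offline
    print(fetch_subagent("https://example.com/docs", fetch))              # canary missing
    print(fetch_subagent_canary_safe("https://example.com/docs", fetch))  # canary preserved
```

The point is only that the default summarize-and-return shape silently drops verbatim tokens; preserving exact matches (or raw excerpts) alongside the summary is one way an agent could still pass.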
Any thoughts on how you'd change the test structure with this in mind?
The tests should have negative weights based on how often each issue is encountered and how much impact it has. Test 2 (SPI) should carry something like 8 negative points out of 10, since it's the most common blocker. And the whole test should use an inverse score.
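Roughly what I mean, as a sketch with made-up weights (the 8-point SPI penalty is just the example above, and the other test names are placeholders):

```python
# Hypothetical per-test penalty weights; only "2. SPI" comes from the comment above.
NEGATIVE_WEIGHTS = {"2. SPI": 8, "other test A": 1, "other test B": 1}

def inverse_score(failed_tests: set[str]) -> int:
    """Inverse scoring: start at 10 and subtract the weight of each failed test."""
    return 10 - sum(NEGATIVE_WEIGHTS[name] for name in failed_tests)

# Failing only the SPI check already costs most of the score.
print(inverse_score({"2. SPI"}))        # -> 2
print(inverse_score({"other test A"}))  # -> 9
```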
Would love to see some results for different providers. The tests look super logically thought out, but could use a TL;DR (too lazy; didn't run) output.
Claude Web Opus 4.6 Extended: 14 / 20 points
x:CANARY-SPA-JSONLY-prism x:CANARY-CONNEG-MD-sigma
See also https://dacharycarey.com/2026/04/06/designing-agent-reading-...
Thanks! We'll put this in the toptext as well.