This was a big shift in what “tested” meant to me.
There is a difference between testing that a tool runs and testing that it keeps finding the right things. A11Y Cat eventually starts doing both, and I think that second step is one of the more serious engineering upgrades in the repo.
The detection-quality harness is where that shows up. There is a fixture corpus, explicit expected-results files, a scorecard harness, and Playwright coverage that runs those fixtures through multiple execution profiles. That means the tool is no longer only asking “did the command execute?” It is asking “did the findings still match the expected rule ids, issue types, confidence levels, WCAG metadata, selectors, and evidence fields?”
That is a much stronger question.
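The comparison that question implies can be sketched in a few lines. This is a hypothetical sketch, not the repo's actual harness code: the field names (`ruleId`, `issueType`, and so on) and the exact-match policy are assumptions about how such a diff might work.

```javascript
// Fields the hypothetical harness compares on. Names are illustrative,
// not the project's actual schema.
const FIELDS = ['ruleId', 'issueType', 'confidence', 'wcagCriterion', 'wcagLevel', 'selector'];

function findingMatches(expected, actual) {
  // Every tracked field must match exactly for the finding to count.
  return FIELDS.every((field) => expected[field] === actual[field]);
}

function diffFindings(expectedList, actualList) {
  // Expected findings with no exact match in the actual output are drift.
  const missing = expectedList.filter(
    (exp) => !actualList.some((act) => findingMatches(exp, act))
  );
  // Actual findings that match nothing expected are unexplained extras.
  const unexpected = actualList.filter(
    (act) => !expectedList.some((exp) => findingMatches(exp, act))
  );
  return { missing, unexpected };
}
```

The point of diffing in both directions is that a detector can regress two ways: it can stop finding what it used to find, or it can start emitting noise it never used to emit.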
I like this part of the project because it is practical. It does not pretend to capture all of ground truth in one benchmark. It creates a bounded corpus and then treats drift seriously. There is even an explicit environment variable required to update baselines; without it, drift is a failure. That is exactly the kind of friction a useful quality harness should add.
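That opt-in friction pattern is simple to sketch. The variable name `UPDATE_BASELINES` here is an assumption for illustration, not the repo's actual flag:

```javascript
// Hypothetical sketch of the baseline gate: drift is a hard failure
// unless the caller explicitly opts in to rewriting the baselines.
// UPDATE_BASELINES is an illustrative name, not the repo's real flag.
function handleDrift(drift, env = process.env) {
  if (drift.missing.length === 0 && drift.unexpected.length === 0) {
    return 'pass';
  }
  if (env.UPDATE_BASELINES === '1') {
    // Caller deliberately asked to rewrite the expected-results files.
    return 'update-baselines';
  }
  // No flag, any drift: fail loudly instead of silently re-baselining.
  return 'fail';
}
```

The design choice worth noticing is that the default path is failure. Making baseline updates an explicit action keeps "the fixture changed" and "the detector changed" from quietly blurring together.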
The harness itself also shows the repo getting more precise about what counts as a result. Expected findings are not just blobs of text. They have fields for rule id, issue type, WCAG criterion and level, confidence, limitation metadata, selector, and computed-data expectations. That is the project saying: if these things matter in the product, they should matter in regression checking too.
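To make that concrete, here is what one expected-finding entry might look like, plus a minimal completeness check. Every field name and value below is illustrative, not the project's actual schema:

```javascript
// Hypothetical example of a single expected-finding entry. All names
// and values are illustrative, not the repo's actual format.
const expectedFinding = {
  ruleId: 'img-alt-missing',
  issueType: 'error',
  wcagCriterion: '1.1.1',
  wcagLevel: 'A',
  confidence: 'high',
  limitation: null,                       // metadata about known detection limits
  selector: 'main img:nth-of-type(2)',
  computed: { hasAltAttribute: false },   // computed-data expectations
};

// Minimal completeness check: every field the harness compares on must
// be present, so a half-written expectation fails fast at load time.
const REQUIRED = ['ruleId', 'issueType', 'wcagCriterion', 'wcagLevel', 'confidence', 'selector'];

function missingFields(entry) {
  return REQUIRED.filter((field) => !(field in entry));
}
```

Validating expectations at load time is a small thing, but it keeps the corpus honest: an incomplete entry is an error in the harness, not a silently weaker check.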
I also think the rule scorecard is a clever touch. It gives the repo a way to talk about coverage quality per rule family instead of treating the entire detection surface as one pass/fail monolith. That feels much more useful for real improvement work.
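A per-rule scorecard can be sketched as a simple aggregation. This is a guess at the shape, not the repo's actual `harness.js`; the recall-per-rule framing is an assumption:

```javascript
// Hypothetical sketch: group expected and matched findings by rule id
// and report per-rule recall, instead of one global pass/fail bit.
function buildScorecard(expectedList, matchedList) {
  const scorecard = {};
  for (const exp of expectedList) {
    const row = (scorecard[exp.ruleId] ??= { expected: 0, matched: 0 });
    row.expected += 1;
  }
  for (const match of matchedList) {
    const row = scorecard[match.ruleId];
    if (row) row.matched += 1;
  }
  for (const row of Object.values(scorecard)) {
    // Recall per rule: what fraction of expected findings were detected.
    row.recall = row.expected ? row.matched / row.expected : 1;
  }
  return scorecard;
}
```

The value of this shape is that a regression in one rule family shows up as that family's recall dropping, instead of being averaged away inside a single corpus-wide number.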
This phase makes the earlier verification story stronger too. Once the repo already has syntax checks, browser suites, docs checks, and release validation, the detection-quality corpus adds a different layer: correctness drift. Not perfect correctness. But more explicit than “the UI still opened.”
Visual evidence
The detection-quality harness is primarily fixtures, expected-result files, and scorecards, so there is no meaningful UI screenshot that would honestly represent the core work here. I am leaving this phase without a visual rather than pretending a generic panel screenshot proves the harness.
What I was really learning here
I was learning that tests can still let quality drift if they mostly prove that the product executes. I needed a smaller, sharper harness that checked whether the findings themselves were still meaningfully right.
Evidence
- Commits:
  - 27afee8 – runtime, taxonomy, and release gate updates including detection-quality harness work
  - 125f8c1 – release gate regressions fixed and dist export tests stabilized
- Files:
  - ../../fixtures/accuracy/corpus.json
  - ../../expected-results/
  - ../../rule-scorecard/harness.js
  - ../../tests/playwright/detection-quality.spec.js
  - ../../rule-scorecard/latest-scorecard.json
  - ../../package.json

