Most QA benchmarks count bugs found. We measure whether a tool understands why a bug matters: its business impact, its workflow cascade, and its domain context.
A tool that flags a missing WHERE clause is useful. A tool that also tells you that missing clause will delete reminders for every user in your system, break three downstream services, and violate your data isolation policy belongs to a different category of intelligence.
Workflow Bench is our framework for measuring this deeper understanding. It evaluates QA tools not just on what they catch, but on how much they understand about the system they are testing.
Each bug is evaluated through three lenses: what category it belongs to, how deeply a tool understands it, and what business outcomes that understanding enables.
QEFix Taxonomy
A vertical-specific classification of the top 10 bug categories for each software domain. Instead of generic labels, bugs are mapped to categories like "Scheduling Logic Failures" or "Token Lifecycle Issues."
Functional QA Intelligence Index
A composite score measuring four dimensions of QA intelligence: requirements traceability, workflow awareness, domain knowledge, and learning from history.
Quality Outcomes
The practical impact: production bug escape rate, mean time to detection, workflow coverage breadth, and false positive ratio.
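As a rough sketch of how these outcome metrics could be computed from an evaluation run, assuming hypothetical aggregate counts (the field and function names below are illustrative, not part of the framework):

```typescript
// Hypothetical aggregate counts collected during one evaluation run.
interface RunStats {
  bugsEscapedToProd: number; // known bugs that shipped despite QA
  bugsKnown: number;         // total known bugs in the run
  flagsRaised: number;       // issues the tool reported
  flagsConfirmed: number;    // reports that turned out to be real bugs
  workflowsCovered: number;  // end-to-end workflows exercised
  workflowsTotal: number;    // end-to-end workflows in scope
}

// Production bug escape rate: share of known bugs the tool failed to stop.
const escapeRate = (s: RunStats): number =>
  s.bugsEscapedToProd / s.bugsKnown;

// False positive ratio: share of the tool's reports that were noise.
const falsePositiveRatio = (s: RunStats): number =>
  (s.flagsRaised - s.flagsConfirmed) / s.flagsRaised;

// Workflow coverage breadth.
const workflowCoverage = (s: RunStats): number =>
  s.workflowsCovered / s.workflowsTotal;
```

Mean time to detection would be measured separately, as elapsed time from the introducing commit to the first flag.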
The Functional QA Intelligence Index measures how deeply a tool understands a code change: not just whether it spots a bug, but whether it grasps the requirements, workflows, domain rules, and historical patterns involved.
FQI = 0.25 × RI + 0.30 × WI + 0.25 × DI + 0.20 × LI

Workflow Intelligence (WI) is weighted highest (30%) because cross-service cascade detection is the hardest capability and the most valuable in production.
Requirements Intelligence (RI): Does the tool connect code changes back to user stories and acceptance criteria?
Workflow Intelligence (WI): Does the tool identify which end-to-end user workflows are affected?
Domain Intelligence (DI): Does the tool understand the business rules and domain constraints at risk?
Learning Intelligence (LI): Does the tool learn from past incidents and prioritize by real business impact?
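The composite score is a straight weighted sum. A minimal sketch, assuming each dimension is scored on a 0-to-1 scale (the interface and function names are hypothetical):

```typescript
// Per-dimension scores on a 0-1 scale.
interface Dimensions {
  ri: number; // Requirements Intelligence
  wi: number; // Workflow Intelligence
  di: number; // Domain Intelligence
  li: number; // Learning Intelligence
}

// Weights from the FQI formula above.
const FQI_WEIGHTS = { ri: 0.25, wi: 0.3, di: 0.25, li: 0.2 };

// FQI = 0.25 × RI + 0.30 × WI + 0.25 × DI + 0.20 × LI
function fqiScore(d: Dimensions): number {
  return (
    FQI_WEIGHTS.ri * d.ri +
    FQI_WEIGHTS.wi * d.wi +
    FQI_WEIGHTS.di * d.di +
    FQI_WEIGHTS.li * d.li
  );
}
```

Because the weights sum to 1, a tool that scores perfectly on every dimension gets an FQI of 1.0, and a tool that only traces requirements caps out at 0.25.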
The benchmark uses 15 real production bugs from 5 open-source repositories. Each bug was traced from a fix commit back to the PR that introduced it, the same methodology used by Greptile and Entelligence.
Repositories span five languages and five software verticals, ensuring the framework is tested against diverse domain logic, from scheduling workflows to authentication flows to monitoring pipelines.
Open source scheduling infrastructure
Error tracking and performance monitoring
Monitoring and observability platform
Identity and access management
Community discussion platform
Here is a real bug from Cal.com. The example below shows the difference between surface-level detection and the kind of deep, workflow-aware analysis our framework measures.
The deleteMany query in the booking cancellation handler lacks a proper WHERE clause, causing it to delete workflow reminders for ALL users instead of just the cancelling user's reminders.
Found: deleteMany call in cancellation handler has no WHERE filter on userId. Generated test: 'Verify that cancelling booking A does not delete reminders for booking B.' Did not identify downstream impacts on notification service or calendar sync.
CRITICAL: Unscoped deleteMany will cascade across all tenants. Identified 3 downstream workflow impacts: (1) notification pipeline will send false cancellations to ~all active users, (2) Google Calendar sync will remove events for unrelated bookings, (3) analytics aggregation will report incorrect cancellation metrics. Traced to user story US-142: 'Cancel booking without side effects.' Business rule BR-017 violated: 'Data mutations must be scoped to the acting user's tenant.'
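To make the bug concrete, here is a minimal in-memory sketch of the unscoped versus scoped deletion. The real code uses a Prisma deleteMany call; the types and function names below are hypothetical stand-ins, not Cal.com's actual code:

```typescript
interface Reminder {
  id: number;
  userId: number;
  bookingId: number;
}

// Buggy shape: no filter, so every reminder in the table is deleted.
// Analogous to deleteMany() with no WHERE clause.
function cancelBookingUnscoped(reminders: Reminder[]): Reminder[] {
  return [];
}

// Fixed shape: deletion scoped to the acting user's booking.
// Analogous to deleteMany({ where: { userId, bookingId } }).
function cancelBookingScoped(
  reminders: Reminder[],
  userId: number,
  bookingId: number
): Reminder[] {
  return reminders.filter(
    (r) => !(r.userId === userId && r.bookingId === bookingId)
  );
}
```

The scoped version leaves other users' reminders untouched, which is exactly the property the surface-level tool's generated test checks and the business rule BR-017 demands.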
We are running the full Workflow Bench evaluation against leading QA tools. Results will include actual tool outputs, scores, and verifiable PR links, just like the benchmarks from Greptile and Entelligence.