Workflow Bench™ Preview

How We Measure QA Intelligence

Most QA benchmarks count bugs found. We measure whether a tool understands why a bug matters: its business impact, workflow cascade, and domain context.

Bug detection rate alone is a misleading metric

A tool that flags a missing WHERE clause is useful. A tool that also tells you the missing clause will delete reminders for every user in your system, break three downstream services, and violate your data isolation policy is operating in a different category of intelligence.

Workflow Bench is our framework for measuring this deeper understanding. It evaluates QA tools not just on what they catch, but on how much they understand about the system they are testing.

5
Open-Source Repos
15
Real Production Bugs
4
Scoring Dimensions
12
Sub-Metrics Evaluated

Three-Layer Evaluation Framework

Each bug is evaluated through three lenses: what category it belongs to, how deeply a tool understands it, and what business outcomes that understanding enables.

01

What to Test

QEFix Taxonomy

A vertical-specific classification of the top 10 bug categories for each software domain. Instead of generic labels, bugs are mapped to categories like "Scheduling Logic Failures" or "Token Lifecycle Issues."

50 categories across 5 verticals
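The taxonomy's shape can be sketched as a vertical-to-categories mapping. The structure and helper below are illustrative: only the two category names quoted above come from the framework itself, and the vertical keys are assumptions.

```typescript
// Illustrative shape of the QEFix taxonomy: each vertical maps to its top
// 10 bug categories. Only the two quoted category names are real; the
// vertical keys and the rest of the structure are assumptions.
type Vertical = "scheduling" | "identity";

const qefixTaxonomy: Record<Vertical, string[]> = {
  scheduling: ["Scheduling Logic Failures" /* ...9 more per vertical */],
  identity: ["Token Lifecycle Issues" /* ...9 more per vertical */],
};

// Check whether a bug category belongs to a vertical's taxonomy.
function inTaxonomy(vertical: Vertical, category: string): boolean {
  return qefixTaxonomy[vertical].includes(category);
}

console.log(inTaxonomy("scheduling", "Scheduling Logic Failures"));
```

The point of the vertical-specific shape is that a lookup always happens within one domain's top-10 list, never against a flat pool of generic labels.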
02

How Deep Is the Understanding

Functional QA Intelligence Index

A composite score measuring four dimensions of QA intelligence: requirements traceability, workflow awareness, domain knowledge, and learning from history.

4 dimensions, 12 sub-metrics
03

What Business Outcomes Result

Quality Outcomes

The practical impact: production bug escape rate, mean time to detection, workflow coverage breadth, and false positive ratio.

4 outcome metrics
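Each outcome metric reduces to a simple ratio or average. A minimal sketch, with hypothetical field names (the framework's actual metric definitions may differ):

```typescript
// Hypothetical count-based inputs for one benchmark run; the field names
// are illustrative, not the framework's schema.
interface RunStats {
  bugsEscapedToProd: number;     // bugs missed that reached production
  totalBugs: number;
  detectionTimesHours: number[]; // time from introduction to detection
  workflowsCovered: number;
  totalWorkflows: number;
  falsePositives: number;
  totalFlags: number;
}

function outcomes(s: RunStats) {
  return {
    escapeRate: s.bugsEscapedToProd / s.totalBugs,
    meanTimeToDetection:
      s.detectionTimesHours.reduce((a, b) => a + b, 0) /
      s.detectionTimesHours.length,
    workflowCoverage: s.workflowsCovered / s.totalWorkflows,
    falsePositiveRatio: s.falsePositives / s.totalFlags,
  };
}

const o = outcomes({
  bugsEscapedToProd: 3,
  totalBugs: 15,
  detectionTimesHours: [2, 4, 6],
  workflowsCovered: 8,
  totalWorkflows: 10,
  falsePositives: 5,
  totalFlags: 25,
});
console.log(o);
```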

Four Dimensions of QA Intelligence

The Functional QA Intelligence Index measures how deeply a tool understands a code change: not just whether it spots a bug, but whether it grasps the requirements, workflows, domain rules, and historical patterns involved.

COMPOSITE FORMULA
FQI = 0.25 × RI + 0.30 × WI + 0.25 × DI + 0.20 × LI

Workflow Intelligence is weighted highest (30%) because cross-service cascade detection is the hardest capability and the most valuable in production.
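As a sketch, the composite is just a weighted sum. The function below mirrors the formula above, assuming each dimension score is already normalized to [0, 1]; producing those scores is the benchmark's job, not this snippet's.

```typescript
// Dimension scores, each assumed normalized to [0, 1].
interface DimensionScores {
  ri: number; // Requirements Intelligence
  wi: number; // Workflow Intelligence
  di: number; // Domain Intelligence
  li: number; // Learning Intelligence
}

function fqi({ ri, wi, di, li }: DimensionScores): number {
  // Weights mirror the composite formula; WI is weighted highest.
  return 0.25 * ri + 0.3 * wi + 0.25 * di + 0.2 * li;
}

// A tool strong on domain knowledge but weaker at learning from history:
const score = fqi({ ri: 0.8, wi: 0.7, di: 0.9, li: 0.6 });
console.log(score.toFixed(3));
```

Because WI carries the largest weight, two tools with identical detection rates can land several points apart on FQI purely on cascade-detection depth.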

Requirements Intelligence

25%

Does the tool connect code changes back to user stories and acceptance criteria?

Story-to-Test Traceability · Acceptance Criteria Coverage · Negative Scenario Generation

Workflow Intelligence

30%

Does the tool identify which end-to-end user workflows are affected?

End-to-End Workflow Coverage · Cross-Service Cascade Detection · State Transition Awareness

Domain Intelligence

25%

Does the tool understand the business rules and domain constraints at risk?

Business Rule Awareness · Domain Constraint Validation · Regulatory Compliance Awareness

Learning Intelligence

20%

Does the tool learn from past incidents and prioritize by real business impact?

Historical Pattern Recognition · Test Impact Analysis · Risk Prioritization Accuracy

Test Dataset

The benchmark uses 15 real production bugs from 5 open-source repositories. Each bug was traced from a fix commit back to the PR that introduced it, the same methodology used by Greptile and Entelligence.

Repositories span five languages and five software verticals, ensuring the framework is tested against diverse domain logic, from scheduling workflows to authentication flows to monitoring pipelines.

📅
Cal.com (TypeScript)

Open source scheduling infrastructure

10 bug categories
3 bugs tested
🔍
Sentry (Python)

Error tracking and performance monitoring

10 bug categories
3 bugs tested
📊
Grafana (Go)

Monitoring and observability platform

10 bug categories
3 bugs tested
🔐
Keycloak (Java)

Identity and access management

10 bug categories
3 bugs tested
💬
Discourse (Ruby)

Community discussion platform

10 bug categories
3 bugs tested
ILLUSTRATIVE EXAMPLE, SIMULATED OUTPUT

What Deeper Analysis Looks Like

Here is a real bug from Cal.com. The example below shows the difference between surface-level detection and the kind of deep, workflow-aware analysis our framework measures.

Unscoped deleteMany deletes all WorkflowReminders

The deleteMany query in the booking cancellation handler lacks a proper WHERE clause, causing it to delete workflow reminders for ALL users instead of just the cancelling user's reminders.

CRITICAL
Booking cancellation flow · Reminder notification pipeline · Calendar sync for all users
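A minimal sketch of the defect and its fix, using a hypothetical in-memory reminder store in place of Cal.com's actual Prisma handler:

```typescript
// Hypothetical in-memory stand-in for the WorkflowReminder table; the real
// handler issues a Prisma deleteMany with the same scoping issue.
interface Reminder {
  id: number;
  bookingUid: string;
}

let reminders: Reminder[] = [
  { id: 1, bookingUid: "booking-A" },
  { id: 2, bookingUid: "booking-B" },
];

// BUGGY: no filter at all -- the equivalent of deleteMany({}) with no
// WHERE clause, wiping reminders for every booking in the system.
function cancelBookingBuggy(): void {
  reminders = [];
}

// FIXED: the mutation is scoped to the booking actually being cancelled.
function cancelBookingFixed(uid: string): void {
  reminders = reminders.filter((r) => r.bookingUid !== uid);
}

cancelBookingFixed("booking-A");
console.log(reminders); // booking-B's reminder survives
```

The one-line scoping difference between the two functions is the entire gap the analyses below are measured against: one mutation touches the acting user's data, the other touches everyone's.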
SURFACE-LEVEL DETECTION

Found: deleteMany call in cancellation handler has no WHERE filter on userId. Generated test: 'Verify that cancelling booking A does not delete reminders for booking B.' Did not identify downstream impacts on notification service or calendar sync.

Spots the code defect
No workflow cascade analysis
No business rule connection
No historical pattern matching
DEEP WORKFLOW-AWARE ANALYSIS

CRITICAL: Unscoped deleteMany will cascade across all tenants. Identified 3 downstream workflow impacts: (1) notification pipeline will send false cancellations to ~all active users, (2) Google Calendar sync will remove events for unrelated bookings, (3) analytics aggregation will report incorrect cancellation metrics. Traced to user story US-142: 'Cancel booking without side effects.' Business rule BR-017 violated: 'Data mutations must be scoped to the acting user's tenant.'

Spots the code defect
Traces cascade across 3 downstream services
Connects to 3 business rules
Links to user story and acceptance criteria
CASCADE IMPACTS IDENTIFIED
1
Notification service sends cancellation emails to wrong users
2
Calendar integrations remove events for unrelated bookings
3
Analytics reports show inflated cancellation rates
COMING SOON

Real Benchmark Results

We are running the full Workflow Bench evaluation against leading QA tools. Results will include actual tool outputs, scores, and verifiable PR links, just like the benchmarks from Greptile and Entelligence.