WorkflowBench™ Preview

How We Measure QA Intelligence

Most QA benchmarks count bugs found. We measure whether a tool understands why a bug matters: its business impact, its workflow cascade, and its domain context.

Think of It This Way

Every industry needs a standard classification for what goes wrong. Security has one. Functional testing didn't -- until now.

OWASP Top 10
CYBERSECURITY

The universal standard for classifying the most critical security vulnerabilities. Every security team, compliance framework, and pen-test report references it.

A01 Broken Access Control
A03 Injection
A05 Security Misconfiguration
A07 Auth Failures
IMPACT

Standardized security testing globally. Referenced by PCI-DSS, SOC2, ISO 27001.

QEFix™ Top 10
QUALITY ENGINEERING FUNCTIONAL ISSUES EXPLORER

QEFix™ (Quality Engineering Functional Issues eXplorer) is the first standard classification of the most critical functional issues per SaaS vertical. Domain-specific, not generic. 25+ verticals covered.

PM-07 Workflow State Machine Errors
DT-01 Event Ingestion Failures
CC-02 Permission Escalation
IA-03 Token Lifecycle Issues
IMPACT

Standardizing functional testing across SaaS verticals. Domain-specific bug prioritization.

Before OWASP, security testing was ad-hoc. Before QEFix™, functional testing was ad-hoc. Both bring order by defining what matters most for your specific context.

Bug detection rate alone is a misleading metric

A tool that flags a missing WHERE clause is useful. A tool that also tells you that missing clause will delete reminders for every user in your system, break three downstream services, and violate your data isolation policy -- that is a different category of intelligence.

WorkflowBench™ evaluates QA tools not just on what they catch, but on how much they understand about the system they are testing.

5
Open-Source Repos
15
Real Production Bugs
4
Scoring Dimensions
12
Sub-Metrics Evaluated

Three-Layer Evaluation Framework

Each bug is evaluated through three lenses: what category it belongs to, how deeply a tool understands it, and what business outcomes that understanding enables.

1

What to Test

QEFix™ Taxonomy

QEFix™ (Quality Engineering Functional Issues eXplorer) classifies the top 10 bug categories for each SaaS vertical. A scheduling app breaks differently than an auth platform.

50 categories across 5 verticals
2

How Deep Is the Understanding

FQI™ Intelligence Index

FQI™ (Functional QA Intelligence) measures four dimensions of QA intelligence: requirements traceability, workflow awareness, domain knowledge, and learning from history.

4 dimensions, 12 sub-metrics
3

What Business Outcomes Result

Quality Outcomes

The practical impact: production bug escape rate, time to detection, workflow coverage, and false positive ratio.

5 outcome metrics
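Two of these outcomes reduce to simple ratios over release counts. The `ReleaseStats` shape and function names below are illustrative assumptions, not the benchmark's actual implementation:

```typescript
// Sketch: plausible definitions for two outcome metrics, assuming
// simple per-release event counts. Names are illustrative.
interface ReleaseStats {
  bugsCaughtPreRelease: number;
  bugsEscapedToProduction: number;
  flagsRaised: number;
  falsePositives: number;
}

// Share of real bugs that reached production despite testing.
function productionEscapeRate(s: ReleaseStats): number {
  return s.bugsEscapedToProduction /
    (s.bugsCaughtPreRelease + s.bugsEscapedToProduction);
}

// Share of raised flags that turned out not to be real defects.
function falsePositiveRatio(s: ReleaseStats): number {
  return s.falsePositives / s.flagsRaised;
}

const stats: ReleaseStats = {
  bugsCaughtPreRelease: 18,
  bugsEscapedToProduction: 2,
  flagsRaised: 25,
  falsePositives: 5,
};
// escape rate 0.1, false positive ratio 0.2
```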

Four Dimensions of QA Intelligence

The FQI™ (Functional QA Intelligence) Index measures how deeply a tool understands a code change -- not just whether it spots a bug, but whether it grasps the requirements, workflows, domain rules, and historical patterns involved.

FQI™ Score
Requirements 25%
Workflow 30%
Domain 25%
Learning 20%
FQI™ = 0.25·RI + 0.30·WI + 0.25·DI + 0.20·LI

Workflow Intelligence is weighted highest (30%) because cross-service cascade detection is the hardest capability and the most valuable in production.
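The weighted sum above can be sketched directly. The sub-scores are assumed here to share a 0-100 scale, and the interface and function names are illustrative, not part of the FQI™ specification:

```typescript
// Sketch: combining the four FQI sub-scores into the composite index.
// Weights come from the formula above; everything else is assumed.
interface SubScores {
  requirements: number; // RI
  workflow: number;     // WI
  domain: number;       // DI
  learning: number;     // LI
}

function fqiScore(s: SubScores): number {
  return 0.25 * s.requirements + 0.30 * s.workflow
       + 0.25 * s.domain + 0.20 * s.learning;
}

// A tool strong on detection but weak on cascade analysis is pulled
// down hard by the 30% workflow weight:
const score = fqiScore({ requirements: 80, workflow: 40, domain: 70, learning: 60 });
// → 61.5
```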

Requirements Intelligence

25%

Does the tool connect code changes back to user stories and acceptance criteria?

Story-to-Test Traceability · Acceptance Criteria Coverage · Negative Scenario Generation

Workflow Intelligence

30%

Does the tool identify which end-to-end user workflows are affected?

E2E Workflow Coverage · Cross-Service Cascade Detection · State Transition Awareness

Domain Intelligence

25%

Does the tool understand the business rules and domain constraints at risk?

Business Rule Awareness · Domain Constraint Validation · Regulatory Compliance Awareness

Learning Intelligence

20%

Does the tool learn from past incidents and prioritize by real business impact?

Historical Pattern Recognition · Test Impact Analysis · Risk Prioritization Accuracy

Test Dataset

The benchmark uses 15 real production bugs from 5 open-source GitHub repositories. Each bug was traced from a fix commit back to the PR that introduced it.

Repositories span five languages and five software verticals, ensuring the framework is tested against diverse domain logic -- from scheduling workflows to authentication flows to monitoring pipelines.

📅
Cal.com (TypeScript)

Open-source scheduling infrastructure

10 bug categories
3 bugs tested
🔍
Sentry (Python)

Error tracking and performance monitoring

10 bug categories
3 bugs tested
📊
Grafana (Go)

Monitoring and observability platform

10 bug categories
3 bugs tested
🔐
Keycloak (Java)

Identity and access management

10 bug categories
3 bugs tested
💬
Discourse (Ruby)

Community discussion platform

10 bug categories
3 bugs tested
ILLUSTRATIVE EXAMPLE, SIMULATED OUTPUT

What Deeper Analysis Looks Like

Below is a real bug from an open-source repo. The comparison shows the difference between surface-level detection and the kind of deep, workflow-aware analysis our framework measures.

Unscoped deleteMany deletes all WorkflowReminders

The deleteMany query in the booking cancellation handler lacks a proper WHERE clause, causing it to delete workflow reminders for ALL users instead of just the cancelling user's reminders.
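A minimal sketch of the defect against an in-memory stand-in for the reminders table. The record shape and `bookingUid` field are assumptions for illustration; in Prisma terms, the fix is the difference between a `deleteMany` with no `where` filter and one scoped to the cancelled booking:

```typescript
// Illustrative model; the real Cal.com schema may differ.
interface WorkflowReminder { id: number; bookingUid: string; }

// Buggy shape: the equivalent of deleteMany({}) wipes every row,
// for every user, on a single cancellation.
function cancelBookingUnscoped(_all: WorkflowReminder[]): WorkflowReminder[] {
  return [];
}

// Fixed shape: the equivalent of deleteMany({ where: { bookingUid } })
// removes only the cancelled booking's reminders.
function cancelBookingScoped(all: WorkflowReminder[], bookingUid: string): WorkflowReminder[] {
  return all.filter(r => r.bookingUid !== bookingUid);
}

const reminders: WorkflowReminder[] = [
  { id: 1, bookingUid: "booking-A" },
  { id: 2, bookingUid: "booking-B" },
];

const afterBuggy = cancelBookingUnscoped(reminders);          // [] -- all users affected
const afterFixed = cancelBookingScoped(reminders, "booking-A"); // booking-B survives
```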

CRITICAL
Booking cancellation flow · Reminder notification pipeline · Calendar sync for all users
SURFACE-LEVEL DETECTION

Found: deleteMany call in cancellation handler has no WHERE filter on userId. Generated test: 'Verify that cancelling booking A does not delete reminders for booking B.' Did not identify downstream impacts on notification service or calendar sync.

Spots the code defect
No workflow cascade analysis
No business rule connection
No historical pattern matching
DEEP WORKFLOW-AWARE ANALYSIS

CRITICAL: Unscoped deleteMany will cascade across all tenants. Identified 3 downstream workflow impacts: (1) notification pipeline will send false cancellations to ~all active users, (2) Google Calendar sync will remove events for unrelated bookings, (3) analytics aggregation will report incorrect cancellation metrics. Traced to user story US-142: 'Cancel booking without side effects.' Business rule BR-017 violated: 'Data mutations must be scoped to the acting user's tenant.'

Spots the code defect
Traces cascade across 3 downstream services
Connects to 3 business rules
Links to user story and acceptance criteria
CASCADE IMPACTS IDENTIFIED
1
Notification service sends cancellation emails to wrong users
2
Calendar integrations remove events for unrelated bookings
3
Analytics reports show inflated cancellation rates
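Cascade identification like the list above can be viewed as graph reachability: starting from the mutated table, walk a dependency graph to collect every affected downstream workflow. The graph here is hand-written for illustration; a real tool would derive it from code, queues, and service configuration:

```typescript
// Sketch: cascade tracing as reachability over a hypothetical
// dependency graph (edges point from a component to its consumers).
const downstream: Record<string, string[]> = {
  "WorkflowReminder": ["notification-pipeline", "calendar-sync", "analytics-aggregation"],
  "notification-pipeline": [],
  "calendar-sync": [],
  "analytics-aggregation": [],
};

// Depth-first walk collecting every transitively affected node.
function cascadeImpacts(start: string): string[] {
  const seen = new Set<string>();
  const stack = [...(downstream[start] ?? [])];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (seen.has(node)) continue;
    seen.add(node);
    stack.push(...(downstream[node] ?? []));
  }
  return [...seen].sort();
}

console.log(cascadeImpacts("WorkflowReminder"));
// → [ "analytics-aggregation", "calendar-sync", "notification-pipeline" ]
```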