Most QA benchmarks count bugs found. We measure whether a tool understands why a bug matters: its business impact, its workflow cascade, and its domain context.
A tool that flags a missing WHERE clause is useful. A tool that also tells you that missing clause will delete reminders for every user in your system, break three downstream services, and violate your data isolation policy belongs to a different category of intelligence.
Workflow Bench is our framework for measuring this deeper understanding. It evaluates QA tools not just on what they catch, but on how much they understand about the system they are testing.
Each bug is evaluated through three lenses: what category it belongs to, how deeply a tool understands it, and what business outcomes that understanding enables.
QEFix Taxonomy
A vertical-specific classification of the top 10 bug categories for each software domain. Instead of generic labels, bugs are mapped to categories like "Scheduling Logic Failures" or "Token Lifecycle Issues."
Functional QA Intelligence Index
A composite score measuring four dimensions of QA intelligence: requirements traceability, workflow awareness, domain knowledge, and learning from history.
Quality Outcomes
The practical impact: production bug escape rate, mean time to detection, workflow coverage breadth, and false positive ratio.
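As a rough sketch of how these outcome metrics could be computed from an evaluation run, assuming hypothetical aggregate counts (the field and function names below are illustrative, not part of the framework):

```typescript
// Hypothetical aggregate counts collected during one evaluation run.
interface RunStats {
  bugsEscapedToProd: number; // known bugs that shipped despite QA
  bugsKnown: number;         // total known bugs in the run
  flagsRaised: number;       // issues the tool reported
  flagsConfirmed: number;    // reports that turned out to be real bugs
  workflowsCovered: number;  // end-to-end workflows exercised
  workflowsTotal: number;    // end-to-end workflows in scope
}

// Production bug escape rate: share of known bugs the tool failed to stop.
const escapeRate = (s: RunStats): number =>
  s.bugsEscapedToProd / s.bugsKnown;

// False positive ratio: share of the tool's reports that were noise.
const falsePositiveRatio = (s: RunStats): number =>
  (s.flagsRaised - s.flagsConfirmed) / s.flagsRaised;

// Workflow coverage breadth.
const workflowCoverage = (s: RunStats): number =>
  s.workflowsCovered / s.workflowsTotal;
```

Mean time to detection would be measured separately, as elapsed time from the introducing commit to the first flag.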
The Functional QA Intelligence Index measures how deeply a tool understands a code change: not just whether it spots a bug, but whether it grasps the requirements, workflows, domain rules, and historical patterns involved.
FQI = 0.25 × RI + 0.30 × WI + 0.25 × DI + 0.20 × LI

Workflow Intelligence (WI) is weighted highest (30%) because cross-service cascade detection is the hardest capability and the most valuable in production.
Requirements Intelligence (RI): Does the tool connect code changes back to user stories and acceptance criteria?
Workflow Intelligence (WI): Does the tool identify which end-to-end user workflows are affected?
Domain Intelligence (DI): Does the tool understand the business rules and domain constraints at risk?
Learning Intelligence (LI): Does the tool learn from past incidents and prioritize by real business impact?
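The composite score is a straight weighted sum. A minimal sketch, assuming each dimension is scored on a 0-to-1 scale (the interface and function names are hypothetical):

```typescript
// Per-dimension scores on a 0-1 scale.
interface Dimensions {
  ri: number; // Requirements Intelligence
  wi: number; // Workflow Intelligence
  di: number; // Domain Intelligence
  li: number; // Learning Intelligence
}

// Weights from the FQI formula above.
const FQI_WEIGHTS = { ri: 0.25, wi: 0.3, di: 0.25, li: 0.2 };

// FQI = 0.25 × RI + 0.30 × WI + 0.25 × DI + 0.20 × LI
function fqiScore(d: Dimensions): number {
  return (
    FQI_WEIGHTS.ri * d.ri +
    FQI_WEIGHTS.wi * d.wi +
    FQI_WEIGHTS.di * d.di +
    FQI_WEIGHTS.li * d.li
  );
}
```

Because the weights sum to 1, a tool that scores perfectly on every dimension gets an FQI of 1.0, and a tool that only traces requirements caps out at 0.25.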
The benchmark uses 15 real production bugs from 5 open-source repositories. Each bug was traced from a fix commit back to the PR that introduced it, the same methodology used by Greptile and Entelligence.
Repositories span five languages and five software verticals, ensuring the framework is tested against diverse domain logic, from scheduling workflows to authentication flows to monitoring pipelines.
Open source scheduling infrastructure
Error tracking and performance monitoring
Monitoring and observability platform
Identity and access management
Community discussion platform
Here is a real bug from Cal.com. The example below shows the difference between surface-level detection and the kind of deep, workflow-aware analysis our framework measures.
The deleteMany query in the booking cancellation handler lacks a proper WHERE clause, causing it to delete workflow reminders for ALL users instead of just the cancelling user's reminders.
Found: deleteMany call in cancellation handler has no WHERE filter on userId. Generated test: 'Verify that cancelling booking A does not delete reminders for booking B.' Did not identify downstream impacts on notification service or calendar sync.
CRITICAL: Unscoped deleteMany will cascade across all tenants. Identified 3 downstream workflow impacts: (1) notification pipeline will send false cancellations to ~all active users, (2) Google Calendar sync will remove events for unrelated bookings, (3) analytics aggregation will report incorrect cancellation metrics. Traced to user story US-142: 'Cancel booking without side effects.' Business rule BR-017 violated: 'Data mutations must be scoped to the acting user's tenant.'
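To make the bug concrete, here is a minimal in-memory sketch of the unscoped versus scoped deletion. The real code uses a Prisma deleteMany call; the types and function names below are hypothetical stand-ins, not Cal.com's actual code:

```typescript
interface Reminder {
  id: number;
  userId: number;
  bookingId: number;
}

// Buggy shape: no filter, so every reminder in the table is deleted.
// Analogous to deleteMany() with no WHERE clause.
function cancelBookingUnscoped(reminders: Reminder[]): Reminder[] {
  return [];
}

// Fixed shape: deletion scoped to the acting user's booking.
// Analogous to deleteMany({ where: { userId, bookingId } }).
function cancelBookingScoped(
  reminders: Reminder[],
  userId: number,
  bookingId: number
): Reminder[] {
  return reminders.filter(
    (r) => !(r.userId === userId && r.bookingId === bookingId)
  );
}
```

The scoped version leaves other users' reminders untouched, which is exactly the property the surface-level tool's generated test checks and the business rule BR-017 demands.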
We are running the full Workflow Bench evaluation against leading QA tools. Results will include actual tool outputs, scores, and verifiable PR links, just like the benchmarks from Greptile and Entelligence.