Workflow Bench™ Documentation

The Three-Layer Quality Framework

Workflow Bench is built on a three-layer framework called QEFix (Quality Engineering Functional Issues eXplorer). It measures functional QA intelligence, not just code-level bug detection.

This page explains each layer, how they connect, and what data artifacts are needed to run benchmarks.

Layer 1

What to Test: Vertical-Specific Quality Categories

Like OWASP Top 10 for security, QEFix defines the Top 10 functional quality issues for each SaaS vertical. A scheduling app has fundamentally different failure modes than an identity management platform or a monitoring tool.

Each vertical's categories are derived from analysis of real production incidents, bug reports, and post-mortems in that domain. The categories are specific enough to be actionable but general enough to apply across implementations.

Cal.com (TypeScript)

Project Management / Scheduling SaaS

PM-01: Scheduling Logic Failures
PM-02: Timezone & Localization Errors
PM-03: Data Integrity & Accuracy
PM-04: Integration Sync Failures
PM-05: Notification & Reminder Failures

+ 5 more categories

Sentry (Python)

Developer Tools / Monitoring SaaS

DT-01: Event Ingestion Pipeline Failures
DT-02: Alerting & Notification Logic Errors
DT-03: Data Retention & Cleanup Failures
DT-04: Integration & Webhook Failures
DT-05: Query & Search Performance

+ 5 more categories

Discourse (Ruby)

Community / Communication SaaS

CC-01: Content Moderation Failures
CC-02: User Permission & Role Errors
CC-03: Notification & Email Delivery
CC-04: Search & Discovery Failures
CC-05: Plugin & Extension Conflicts

+ 5 more categories

Keycloak (Java)

Identity & Access Management SaaS

IA-01: Authentication Flow Failures
IA-02: Authorization & RBAC Errors
IA-03: Token Lifecycle Issues
IA-04: Federation & SSO Failures
IA-05: Session Management Bugs

+ 5 more categories

Grafana (Go)

Analytics / Observability SaaS

AO-01: Dashboard Rendering Failures
AO-02: Data Source Query Errors
AO-03: Alerting Rule Evaluation
AO-04: Plugin & Extension Failures
AO-05: Time Range & Timezone Handling

+ 5 more categories

Layer 2

How Well You Test: Functional QA Intelligence

The Functional QA Intelligence index measures testing capability across four dimensions, each with three sub-metrics. Together, they answer: "How deeply does this QA approach understand your system?"

Each sub-metric is scored 0 to 100 based on specific rubrics. Dimension scores are averaged from their three sub-metrics, then combined using weighted averaging into a single composite score.

Requirements Intelligence

25% of overall score

RI1: Story-to-Test Traceability

Can the tool connect a code change back to the user story or requirement it implements?

RI2: Acceptance Criteria Coverage

Does the tool generate tests that cover the acceptance criteria, not just the code diff?

RI3: Negative Scenario Generation

Does the tool generate tests for what should NOT happen?

Workflow Intelligence

30% of overall score

WI1: E2E Workflow Coverage

Does the tool identify which complete user workflows are affected by the change?

WI2: Cross-Service Cascade Detection

Does the tool predict downstream failures in connected services?

WI3: State Transition Awareness

Does the tool understand how the change affects state machines and lifecycle transitions?

Domain Intelligence

25% of overall score

DI1: Business Rule Awareness

Does the tool identify business rules that are violated or at risk?

DI2: Domain Constraint Validation

Does the tool validate domain-specific constraints?

DI3: Regulatory/Compliance Awareness

Does the tool flag potential compliance issues?

Learning Intelligence

20% of overall score

LI1: Historical Pattern Recognition

Does the tool connect the current change to similar past incidents?

LI2: Test Impact Analysis

Does the tool identify which existing tests are invalidated or need updating?

LI3: Risk Prioritization Accuracy

Does the tool rank findings by actual business impact?

How Scores Will Be Generated

Evaluation Process

  1. Each tool receives the same PR diff, user story, and context artifacts.
  2. Tool output is anonymized and evaluated by an LLM-as-judge panel.
  3. Each of the 12 sub-metrics is scored 0 to 100 against specific rubrics.
  4. Dimension scores are averaged from their 3 sub-metrics.
  5. The composite intelligence score is the weighted sum of the 4 dimension scores.
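Steps 4 and 5 can be made concrete with a short sketch. The dimension weights are the ones stated above (Requirements 25%, Workflow 30%, Domain 25%, Learning 20%); the function name and the example scores are illustrative, not part of the framework.

```python
# Dimension weights as defined by the framework.
DIMENSION_WEIGHTS = {"RI": 0.25, "WI": 0.30, "DI": 0.25, "LI": 0.20}

def composite_intelligence(sub_metrics: dict[str, float]) -> float:
    """Average each dimension's 3 sub-metrics, then take the weighted sum.

    sub_metrics maps sub-metric IDs ('RI1' .. 'LI3') to 0-100 scores.
    """
    total = 0.0
    for dim, weight in DIMENSION_WEIGHTS.items():
        scores = [s for sid, s in sub_metrics.items() if sid.startswith(dim)]
        total += weight * (sum(scores) / len(scores))
    return total

# Illustrative scores for one tool (not real benchmark data):
scores = {
    "RI1": 80, "RI2": 70, "RI3": 60,  # Requirements Intelligence, avg 70
    "WI1": 90, "WI2": 50, "WI3": 70,  # Workflow Intelligence,     avg 70
    "DI1": 60, "DI2": 60, "DI3": 30,  # Domain Intelligence,       avg 50
    "LI1": 40, "LI2": 80, "LI3": 60,  # Learning Intelligence,     avg 60
}
print(round(composite_intelligence(scores), 1))  # 63.0
```

Because the weights sum to 1, the composite stays on the same 0-100 scale as the sub-metrics.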

Example Rubric: Story-to-Test Traceability

90-100: Explicitly links test to user story with acceptance criteria mapping
70-89: References user story but does not map all acceptance criteria
40-69: Implicitly related to requirements but no explicit traceability
0-39: No connection to user stories or requirements

Layer 3

What It Means: Quality Outcome Metrics

The outcome metrics translate testing intelligence into business-meaningful results. They answer the question: "What does better QA intelligence actually deliver?"

These metrics are designed to be tracked over time as part of a continuous quality improvement program. Workflow Bench will project expected outcomes based on intelligence scores.

Quality Defect Rate

Percentage of defects caught before production. Measures the effectiveness of the QA process at preventing customer-facing bugs.

(Defects caught pre-prod / Total defects) x 100

Target: > 95%

Quality Gate Pass Rate

Percentage of pull requests that pass all quality gates on first attempt. Indicates how well the development process prevents quality issues upstream.

(PRs passing first time / Total PRs) x 100

Target: > 80%

Early Defect Detection Index

Ratio of defects found during development vs. post-deployment. Higher values indicate earlier detection, and defects caught in development are substantially cheaper to fix than those found in production.

Defects found in dev / Defects found in prod

Target: > 10:1

Pull Request Quality Index

Composite score measuring the quality of individual PRs across code coverage, test coverage, documentation, and review thoroughness.

w1*Coverage + w2*Tests + w3*Docs + w4*Review

Target: > 75/100

Quality Response Time

Time from defect introduction (commit) to defect detection. Measures how quickly the QA system identifies issues in the development pipeline.

Detection timestamp - Commit timestamp

Target: < 4 hours
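The five formulas above translate directly into code. The sketch below is a straightforward reading of those formulas; the equal weights in the PR Quality Index are an assumption, since the framework leaves w1..w4 open.

```python
from datetime import datetime, timedelta

def quality_defect_rate(caught_pre_prod: int, total_defects: int) -> float:
    """(Defects caught pre-prod / Total defects) x 100 -- target > 95%."""
    return caught_pre_prod / total_defects * 100

def quality_gate_pass_rate(first_pass_prs: int, total_prs: int) -> float:
    """(PRs passing first time / Total PRs) x 100 -- target > 80%."""
    return first_pass_prs / total_prs * 100

def early_defect_detection_index(found_in_dev: int, found_in_prod: int) -> float:
    """Defects found in dev / defects found in prod -- target > 10:1."""
    return found_in_dev / found_in_prod

def pr_quality_index(coverage: float, tests: float, docs: float, review: float,
                     weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """w1*Coverage + w2*Tests + w3*Docs + w4*Review -- target > 75/100.

    Equal weights are an illustrative assumption.
    """
    w1, w2, w3, w4 = weights
    return w1 * coverage + w2 * tests + w3 * docs + w4 * review

def quality_response_time(committed: datetime, detected: datetime) -> timedelta:
    """Detection timestamp - commit timestamp -- target < 4 hours."""
    return detected - committed
```

For example, 96 of 100 defects caught pre-production gives a Quality Defect Rate of 96%, just above the 95% target.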

How the Three Layers Connect

Layer 1

What to Test

Defines quality categories for each vertical

Layer 2

How Well You Test

Measures intelligence across 4 dimensions

Layer 3

What It Means

Translates capability into business value

Running Benchmarks

Required Data Artifacts

To run a Workflow Bench evaluation, you provide specific data artifacts. The more context you provide, the more dimensions of the intelligence index can be measured. Here is what each artifact enables.


PR Diffs / Code Changes

The actual code changes being evaluated. This is the minimum input for any benchmark run.

Required. Enables: basic code-level detection, partial workflow tracing.

User Stories / Requirements

The requirements or user stories that the code changes implement. Connects code to intent.

Optional. Enables: story-to-test traceability, acceptance criteria coverage.

Business Rules / Domain Constraints

Domain-specific rules that must be upheld. Enables the system to validate business logic.

Optional. Enables: business rule awareness, domain constraint validation.

Architecture / Service Map

Service dependencies, data flow diagrams, integration points. Maps the system topology.

Optional. Enables: end-to-end workflow coverage, cross-service cascade detection.

Historical Bugs / Incidents

Past bugs, incidents, and their resolutions. Enables pattern matching across time.

Optional. Enables: historical pattern recognition.

Existing Test Suite

Current test cases and their coverage. Enables test impact analysis.

Optional. Enables: test impact analysis, coverage gap detection.

Production Logs / Metrics

Error rates, usage patterns, SLA data. Enables risk-based prioritization.

Optional. Enables: risk prioritization accuracy.

Compliance / Regulatory Requirements

GDPR, SOC-2, HIPAA, or industry-specific compliance requirements.

Optional. Enables: regulatory and compliance awareness.
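As a sketch, a run's inputs could be gathered into a simple manifest like the one below. The keys, file paths, and the artifact-to-sub-metric mapping are illustrative assumptions based on the list above, not an official Workflow Bench input format.

```python
# Hypothetical mapping from artifact type to the sub-metrics it unlocks,
# loosely following the artifact list above (illustrative only).
ARTIFACT_ENABLES = {
    "pr_diff":          ["basic detection"],  # required baseline input
    "user_stories":     ["RI1", "RI2"],       # traceability, AC coverage
    "business_rules":   ["DI1", "DI2"],       # rule awareness, constraints
    "service_map":      ["WI1", "WI2"],       # E2E coverage, cascades
    "incident_history": ["LI1"],              # historical patterns
    "test_suite":       ["LI2"],              # test impact analysis
    "prod_metrics":     ["LI3"],              # risk prioritization
    "compliance_docs":  ["DI3"],              # compliance awareness
}

def measurable(manifest: dict) -> list[str]:
    """List what can be scored, given which artifact paths are provided."""
    if not manifest.get("pr_diff"):
        raise ValueError("pr_diff is the minimum required artifact")
    found = []
    for key, path in manifest.items():
        if path is not None:
            found.extend(ARTIFACT_ENABLES.get(key, []))
    return found

manifest = {
    "pr_diff": "artifacts/pr-1234.diff",
    "user_stories": "artifacts/stories.md",
    "prod_metrics": None,  # not provided, so LI3 cannot be scored
}
```

With this manifest, RI1 and RI2 become measurable alongside basic detection, while LI3 stays out of scope until production metrics are supplied.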

How a Benchmark Run Works

The Workflow Bench evaluation process follows a structured pipeline to ensure fair, reproducible comparisons.

1

Collect Artifacts

Gather PR diffs, user stories, business rules, and other context from your repository. More context enables more intelligence dimensions.

2

Run Each Tool

Each QA approach receives identical inputs. Tools are given the same PR diff, context, and time budget. Outputs are collected and anonymized.

3

Score with LLM-as-Judge

Anonymized outputs are evaluated by an LLM panel against the 12 sub-metric rubrics. Multiple judges reduce individual bias.

4

Generate Report

Intelligence scores are aggregated, outcome metrics are projected, and a comprehensive comparison report is generated with full transparency.
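The four steps above can be sketched as a single orchestration loop. Every interface here (`tool.run`, `judge.score`, and so on) is a hypothetical stand-in, not the real Workflow Bench API.

```python
import statistics

def run_benchmark(tools, artifacts, judges, rubrics):
    """Run every tool on identical inputs, judge anonymized outputs,
    and return per-tool sub-metric scores for the final report."""
    results = {}
    for tool in tools:
        raw = tool.run(artifacts)            # step 2: identical inputs per tool
        anonymized = {"output": raw}         # tool identity stripped before judging
        per_metric = {}
        for metric_id, rubric in rubrics.items():  # the 12 sub-metric rubrics
            votes = [j.score(anonymized, rubric) for j in judges]
            per_metric[metric_id] = statistics.mean(votes)  # panel mean reduces bias
        results[tool.name] = per_metric      # step 4: aggregated per tool
    return results
```

Averaging across several judges is one simple way to dampen any single judge's bias; a real panel might also discard outlier votes or require agreement thresholds.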

Framework Preview

Real Benchmark Results Coming Soon

We are running the full Workflow Bench evaluation against leading QA approaches. Results will include actual tool outputs, scored by independent LLM-as-judge panels, with full transparency on scoring criteria and limitations.