Workflow Bench™ Documentation

The Three-Layer Quality Framework

Workflow Bench is built on a three-layer framework called QEFix (Quality Engineering Functional Issues eXplorer). It measures functional QA intelligence, not just code-level bug detection.

This page explains each layer, how they connect, and what data artifacts are needed to run benchmarks.

Layer 1

What to Test: Vertical-Specific Quality Categories

Like OWASP Top 10 for security, QEFix defines the Top 10 functional quality issues for each SaaS vertical. A scheduling app has fundamentally different failure modes than an identity management platform or a monitoring tool.

Each vertical's categories are derived from analysis of real production incidents, bug reports, and post-mortems in that domain. The categories are specific enough to be actionable but general enough to apply across implementations.

Cal.com (TypeScript)

Project Management / Scheduling SaaS

PM-01: Scheduling Logic Failures
PM-02: Timezone & Localization Errors
PM-03: Data Integrity & Accuracy
PM-04: Integration Sync Failures
PM-05: Notification & Reminder Failures

+ 5 more categories

Sentry (Python)

Developer Tools / Monitoring SaaS

DT-01: Event Ingestion Pipeline Failures
DT-02: Alerting & Notification Logic Errors
DT-03: Data Retention & Cleanup Failures
DT-04: Integration & Webhook Failures
DT-05: Query & Search Performance

+ 5 more categories

Discourse (Ruby)

Community / Communication SaaS

CC-01: Content Moderation Failures
CC-02: User Permission & Role Errors
CC-03: Notification & Email Delivery
CC-04: Search & Discovery Failures
CC-05: Plugin & Extension Conflicts

+ 5 more categories

Keycloak (Java)

Identity & Access Management SaaS

IA-01: Authentication Flow Failures
IA-02: Authorization & RBAC Errors
IA-03: Token Lifecycle Issues
IA-04: Federation & SSO Failures
IA-05: Session Management Bugs

+ 5 more categories

Grafana (Go)

Analytics / Observability SaaS

AO-01: Dashboard Rendering Failures
AO-02: Data Source Query Errors
AO-03: Alerting Rule Evaluation
AO-04: Plugin & Extension Failures
AO-05: Time Range & Timezone Handling

+ 5 more categories

Layer 2

How Well You Test: Functional QA Intelligence

The Functional QA Intelligence index measures testing capability across four dimensions, each with three sub-metrics. Together, they answer: "How deeply does this QA approach understand your system?"

Each sub-metric is scored 0 to 100 based on specific rubrics. Dimension scores are averaged from their three sub-metrics, then combined using weighted averaging into a single composite score.

Requirements Intelligence

25% of overall score

RI1: Story-to-Test Traceability

Can the tool connect a code change back to the user story or requirement it implements?

RI2: Acceptance Criteria Coverage

Does the tool generate tests that cover the acceptance criteria, not just the code diff?

RI3: Negative Scenario Generation

Does the tool generate tests for what should NOT happen?

Workflow Intelligence

30% of overall score

WI1: E2E Workflow Coverage

Does the tool identify which complete user workflows are affected by the change?

WI2: Cross-Service Cascade Detection

Does the tool predict downstream failures in connected services?

WI3: State Transition Awareness

Does the tool understand how the change affects state machines and lifecycle transitions?

Domain Intelligence

25% of overall score

DI1: Business Rule Awareness

Does the tool identify business rules that are violated or at risk?

DI2: Domain Constraint Validation

Does the tool validate domain-specific constraints?

DI3: Regulatory/Compliance Awareness

Does the tool flag potential compliance issues?

Learning Intelligence

20% of overall score

LI1: Historical Pattern Recognition

Does the tool connect the current change to similar past incidents?

LI2: Test Impact Analysis

Does the tool identify which existing tests are invalidated or need updating?

LI3: Risk Prioritization Accuracy

Does the tool rank findings by actual business impact?

How Scores Will Be Generated

Evaluation Process

  1. Each tool receives the same PR diff, user story, and context artifacts.
  2. Tool output is anonymized and evaluated by an LLM-as-judge panel.
  3. Each of the 12 sub-metrics is scored 0 to 100 against specific rubrics.
  4. Dimension scores are averaged from their 3 sub-metrics.
  5. The composite intelligence score is the weighted sum of the 4 dimension scores.
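Steps 4 and 5 can be made concrete with a short sketch. The dimension weights are the ones stated above (Requirements 25%, Workflow 30%, Domain 25%, Learning 20%); the function name and the example scores are illustrative, not part of the framework.

```python
# Dimension weights as defined by the framework.
DIMENSION_WEIGHTS = {"RI": 0.25, "WI": 0.30, "DI": 0.25, "LI": 0.20}

def composite_intelligence(sub_metrics: dict[str, float]) -> float:
    """Average each dimension's 3 sub-metrics, then take the weighted sum.

    sub_metrics maps sub-metric IDs ('RI1' .. 'LI3') to 0-100 scores.
    """
    total = 0.0
    for dim, weight in DIMENSION_WEIGHTS.items():
        scores = [s for sid, s in sub_metrics.items() if sid.startswith(dim)]
        total += weight * (sum(scores) / len(scores))
    return total

# Illustrative scores for one tool (not real benchmark data):
scores = {
    "RI1": 80, "RI2": 70, "RI3": 60,  # Requirements Intelligence, avg 70
    "WI1": 90, "WI2": 50, "WI3": 70,  # Workflow Intelligence,     avg 70
    "DI1": 60, "DI2": 60, "DI3": 30,  # Domain Intelligence,       avg 50
    "LI1": 40, "LI2": 80, "LI3": 60,  # Learning Intelligence,     avg 60
}
print(round(composite_intelligence(scores), 1))  # 63.0
```

Because the weights sum to 1, the composite stays on the same 0-100 scale as the sub-metrics.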

Example Rubric: Story-to-Test Traceability

90-100: Explicitly links test to user story with acceptance criteria mapping
70-89: References user story but does not map all acceptance criteria
40-69: Implicitly related to requirements but no explicit traceability
0-39: No connection to user stories or requirements

Layer 3

What It Means: Quality Outcome Metrics

The outcome metrics translate testing intelligence into business-meaningful results. They answer the question: "What does better QA intelligence actually deliver?"

These metrics are designed to be tracked over time as part of a continuous quality improvement program. Workflow Bench will project expected outcomes based on intelligence scores.

Quality Defect Rate

Percentage of defects caught before production. Measures the effectiveness of the QA process at preventing customer-facing bugs.

(Defects caught pre-prod / Total defects) x 100

Target: > 95%

Quality Gate Pass Rate

Percentage of pull requests that pass all quality gates on first attempt. Indicates how well the development process prevents quality issues upstream.

(PRs passing first time / Total PRs) x 100

Target: > 80%

Early Defect Detection Index

Ratio of defects found during development vs. post-deployment. Higher values indicate earlier detection, and defects caught in development are substantially cheaper to fix than those found in production.

Defects found in dev / Defects found in prod

Target: > 10:1

Pull Request Quality Index

Composite score measuring the quality of individual PRs across code coverage, test coverage, documentation, and review thoroughness.

w1*Coverage + w2*Tests + w3*Docs + w4*Review

Target: > 75/100

Quality Response Time

Time from defect introduction (commit) to defect detection. Measures how quickly the QA system identifies issues in the development pipeline.

Detection timestamp - Commit timestamp

Target: < 4 hours
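The five formulas above translate directly into code. The sketch below is a straightforward reading of those formulas; the equal weights in the PR Quality Index are an assumption, since the framework leaves w1..w4 open.

```python
from datetime import datetime, timedelta

def quality_defect_rate(caught_pre_prod: int, total_defects: int) -> float:
    """(Defects caught pre-prod / Total defects) x 100 -- target > 95%."""
    return caught_pre_prod / total_defects * 100

def quality_gate_pass_rate(first_pass_prs: int, total_prs: int) -> float:
    """(PRs passing first time / Total PRs) x 100 -- target > 80%."""
    return first_pass_prs / total_prs * 100

def early_defect_detection_index(found_in_dev: int, found_in_prod: int) -> float:
    """Defects found in dev / defects found in prod -- target > 10:1."""
    return found_in_dev / found_in_prod

def pr_quality_index(coverage: float, tests: float, docs: float, review: float,
                     weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """w1*Coverage + w2*Tests + w3*Docs + w4*Review -- target > 75/100.

    Equal weights are an illustrative assumption.
    """
    w1, w2, w3, w4 = weights
    return w1 * coverage + w2 * tests + w3 * docs + w4 * review

def quality_response_time(committed: datetime, detected: datetime) -> timedelta:
    """Detection timestamp - commit timestamp -- target < 4 hours."""
    return detected - committed
```

For example, 96 of 100 defects caught pre-production gives a Quality Defect Rate of 96%, just above the 95% target.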

How the Three Layers Connect

Layer 1

What to Test

Defines quality categories for each vertical

Layer 2

How Well You Test

Measures intelligence across 4 dimensions

Layer 3

What It Means

Translates capability into business value

Running Benchmarks

Required Data Artifacts

To run a Workflow Bench evaluation, you provide specific data artifacts. The more context you provide, the more dimensions of the intelligence index can be measured. Here is what each artifact enables.


PR Diffs / Code Changes

The actual code changes being evaluated. This is the minimum input for any benchmark run.

Required. Enables: basic code-level detection, partial workflow tracing.

User Stories / Requirements

The requirements or user stories that the code changes implement. Connects code to intent.

Optional. Enables: story-to-test traceability, acceptance criteria coverage.

Business Rules / Domain Constraints

Domain-specific rules that must be upheld. Enables the system to validate business logic.

Optional. Enables: business rule awareness, domain constraint validation.

Architecture / Service Map

Service dependencies, data flow diagrams, integration points. Maps the system topology.

Optional. Enables: end-to-end workflow coverage, cross-service cascade detection.

Historical Bugs / Incidents

Past bugs, incidents, and their resolutions. Enables pattern matching across time.

Optional. Enables: historical pattern recognition.

Existing Test Suite

Current test cases and their coverage. Enables test impact analysis.

Optional. Enables: test impact analysis, coverage gap detection.

Production Logs / Metrics

Error rates, usage patterns, SLA data. Enables risk-based prioritization.

Optional. Enables: risk prioritization accuracy.

Compliance / Regulatory Requirements

GDPR, SOC-2, HIPAA, or industry-specific compliance requirements.

Optional. Enables: regulatory and compliance awareness.
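As a sketch, a run's inputs could be gathered into a simple manifest like the one below. The keys, file paths, and the artifact-to-sub-metric mapping are illustrative assumptions based on the list above, not an official Workflow Bench input format.

```python
# Hypothetical mapping from artifact type to the sub-metrics it unlocks,
# loosely following the artifact list above (illustrative only).
ARTIFACT_ENABLES = {
    "pr_diff":          ["basic detection"],  # required baseline input
    "user_stories":     ["RI1", "RI2"],       # traceability, AC coverage
    "business_rules":   ["DI1", "DI2"],       # rule awareness, constraints
    "service_map":      ["WI1", "WI2"],       # E2E coverage, cascades
    "incident_history": ["LI1"],              # historical patterns
    "test_suite":       ["LI2"],              # test impact analysis
    "prod_metrics":     ["LI3"],              # risk prioritization
    "compliance_docs":  ["DI3"],              # compliance awareness
}

def measurable(manifest: dict) -> list[str]:
    """List what can be scored, given which artifact paths are provided."""
    if not manifest.get("pr_diff"):
        raise ValueError("pr_diff is the minimum required artifact")
    found = []
    for key, path in manifest.items():
        if path is not None:
            found.extend(ARTIFACT_ENABLES.get(key, []))
    return found

manifest = {
    "pr_diff": "artifacts/pr-1234.diff",
    "user_stories": "artifacts/stories.md",
    "prod_metrics": None,  # not provided, so LI3 cannot be scored
}
```

With this manifest, RI1 and RI2 become measurable alongside basic detection, while LI3 stays out of scope until production metrics are supplied.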

How a Benchmark Run Works

The Workflow Bench evaluation process follows a structured pipeline to ensure fair, reproducible comparisons.

1

Collect Artifacts

Gather PR diffs, user stories, business rules, and other context from your repository. More context enables more intelligence dimensions.

2

Run Each Tool

Each QA approach receives identical inputs. Tools are given the same PR diff, context, and time budget. Outputs are collected and anonymized.

3

Score with LLM-as-Judge

Anonymized outputs are evaluated by an LLM panel against the 12 sub-metric rubrics. Multiple judges reduce individual bias.

4

Generate Report

Intelligence scores are aggregated, outcome metrics are projected, and a comprehensive comparison report is generated with full transparency.
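The four steps above can be sketched as a single orchestration loop. Every interface here (`tool.run`, `judge.score`, and so on) is a hypothetical stand-in, not the real Workflow Bench API.

```python
import statistics

def run_benchmark(tools, artifacts, judges, rubrics):
    """Run every tool on identical inputs, judge anonymized outputs,
    and return per-tool sub-metric scores for the final report."""
    results = {}
    for tool in tools:
        raw = tool.run(artifacts)            # step 2: identical inputs per tool
        anonymized = {"output": raw}         # tool identity stripped before judging
        per_metric = {}
        for metric_id, rubric in rubrics.items():  # the 12 sub-metric rubrics
            votes = [j.score(anonymized, rubric) for j in judges]
            per_metric[metric_id] = statistics.mean(votes)  # panel mean reduces bias
        results[tool.name] = per_metric      # step 4: aggregated per tool
    return results
```

Averaging across several judges is one simple way to dampen any single judge's bias; a real panel might also discard outlier votes or require agreement thresholds.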

Framework Preview

Real Benchmark Results Coming Soon

We are running the full Workflow Bench evaluation against leading QA approaches. Results will include actual tool outputs, scored by independent LLM-as-judge panels, with full transparency on scoring criteria and limitations.