Workflow Bench is built on a three-layer framework called QEFix (Quality Engineering Functional Issues eXplorer). It measures functional QA intelligence, not just code-level bug detection.
This page explains each layer, how they connect, and what data artifacts are needed to run benchmarks.
Like OWASP Top 10 for security, QEFix defines the Top 10 functional quality issues for each SaaS vertical. A scheduling app has fundamentally different failure modes than an identity management platform or a monitoring tool.
Each vertical's categories are derived from analysis of real production incidents, bug reports, and post-mortems in that domain. The categories are specific enough to be actionable but general enough to apply across implementations.
The five verticals currently covered, each with its own Top 10:

- Project Management / Scheduling SaaS
- Developer Tools / Monitoring SaaS
- Community / Communication SaaS
- Identity & Access Management SaaS
- Analytics / Observability SaaS
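As a sketch, the per-vertical taxonomy can be modeled as a simple mapping from vertical to its Top 10 list. The category names below are invented placeholders for illustration, not QEFix's actual categories:

```python
# Hypothetical data model for the QEFix taxonomy: each vertical maps to
# its Top 10 functional quality categories. The category names here are
# invented placeholders; the real lists are derived from production
# incidents, bug reports, and post-mortems in each domain.
QEFIX_TOP10 = {
    "Project Management / Scheduling SaaS": [
        "Timezone and recurrence handling",       # placeholder example
        "Double-booking on concurrent edits",     # placeholder example
        # ... 8 more domain-specific categories
    ],
    "Identity & Access Management SaaS": [
        "Stale permissions after role change",    # placeholder example
        "Session invalidation on deprovisioning", # placeholder example
        # ... 8 more domain-specific categories
    ],
}

print(len(QEFIX_TOP10))  # 2 verticals sketched here
```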
The Functional QA Intelligence index measures testing capability across four dimensions, each with three sub-metrics. Together, they answer: "How deeply does this QA approach understand your system?"
Each sub-metric is scored 0 to 100 against a specific rubric. Each dimension's score is the average of its three sub-metrics, and the dimension scores are then combined into a single composite score via a weighted average.
**Dimension 1 (25% of overall score)**

- Can the tool connect a code change back to the user story or requirement it implements?
- Does the tool generate tests that cover the acceptance criteria, not just the code diff?
- Does the tool generate tests for what should NOT happen?

**Dimension 2 (30% of overall score)**

- Does the tool identify which complete user workflows are affected by the change?
- Does the tool predict downstream failures in connected services?
- Does the tool understand how the change affects state machines and lifecycle transitions?

**Dimension 3 (25% of overall score)**

- Does the tool identify business rules that are violated or at risk?
- Does the tool validate domain-specific constraints?
- Does the tool flag potential compliance issues?

**Dimension 4 (20% of overall score)**

- Does the tool connect the current change to similar past incidents?
- Does the tool identify which existing tests are invalidated or need updating?
- Does the tool rank findings by actual business impact?
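The aggregation described above (sub-metrics scored 0 to 100, averaged per dimension, then weighted 25/30/25/20 into a composite) can be sketched as follows. The dimension keys are placeholder names, since the source specifies only the weights:

```python
# Illustrative sketch of the QEFix score aggregation. Dimension names
# are placeholders; only the weights come from the framework description.
DIMENSION_WEIGHTS = {
    "requirements_traceability": 0.25,  # 25% of overall score
    "workflow_coverage": 0.30,          # 30%
    "business_rule_awareness": 0.25,    # 25%
    "historical_and_impact": 0.20,      # 20%
}

def dimension_score(sub_metrics):
    """Average the three sub-metric scores (each 0-100) for one dimension."""
    assert len(sub_metrics) == 3
    return sum(sub_metrics) / len(sub_metrics)

def composite_score(scores_by_dimension):
    """Weighted average of dimension scores into a single 0-100 composite."""
    return sum(
        DIMENSION_WEIGHTS[name] * dimension_score(subs)
        for name, subs in scores_by_dimension.items()
    )

# Example: a tool's twelve sub-metric scores, three per dimension.
scores = {
    "requirements_traceability": [80, 70, 90],
    "workflow_coverage": [60, 65, 70],
    "business_rule_awareness": [75, 80, 85],
    "historical_and_impact": [50, 55, 60],
}
print(composite_score(scores))  # 70.5
```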
The outcome metrics translate testing intelligence into business-meaningful results. They answer the question: "What does better QA intelligence actually deliver?"
These metrics are designed to be tracked over time as part of a continuous quality improvement program. Workflow Bench will project expected outcomes based on intelligence scores.
- Percentage of defects caught before production. Measures the effectiveness of the QA process at preventing customer-facing bugs.
  `(Defects caught pre-prod / Total defects) × 100`
- Percentage of pull requests that pass all quality gates on the first attempt. Indicates how well the development process prevents quality issues upstream.
  `(PRs passing first time / Total PRs) × 100`
- Ratio of defects found during development vs. post-deployment. Higher values indicate earlier detection, which sharply reduces the cost of each fix.
  `Defects found in dev / Defects found in prod`
- Composite score measuring the quality of individual PRs across code coverage, test coverage, documentation, and review thoroughness.
  `w1*Coverage + w2*Tests + w3*Docs + w4*Review`
- Time from defect introduction (commit) to defect detection. Measures how quickly the QA system identifies issues in the development pipeline.
  `Detection timestamp - Commit timestamp`
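The formulas above are straightforward to compute. A minimal sketch follows; the function names and the `pr_quality` weights are illustrative assumptions, not values fixed by Workflow Bench:

```python
# Sketches of the outcome-metric formulas. Names and the example weights
# in pr_quality are assumptions for illustration.

def defect_catch_rate(caught_pre_prod, total_defects):
    """(Defects caught pre-prod / Total defects) x 100"""
    return caught_pre_prod / total_defects * 100

def first_pass_rate(prs_passing_first_time, total_prs):
    """(PRs passing first time / Total PRs) x 100"""
    return prs_passing_first_time / total_prs * 100

def dev_to_prod_ratio(found_in_dev, found_in_prod):
    """Defects found in dev / Defects found in prod"""
    return found_in_dev / found_in_prod

def pr_quality(coverage, tests, docs, review, w=(0.4, 0.3, 0.15, 0.15)):
    """w1*Coverage + w2*Tests + w3*Docs + w4*Review (weights assumed)."""
    return w[0] * coverage + w[1] * tests + w[2] * docs + w[3] * review

def detection_latency(detected_at, committed_at):
    """Detection timestamp - Commit timestamp (same unit, e.g. hours)."""
    return detected_at - committed_at

print(defect_catch_rate(45, 50))  # 90.0
print(dev_to_prod_ratio(8, 2))    # 4.0
```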
- **Layer 1: What to Test.** Defines quality categories for each vertical.
- **Layer 2: How Well You Test.** Measures intelligence across four dimensions.
- **Layer 3: What It Means.** Translates capability into business value.
To run a Workflow Bench evaluation, you provide specific data artifacts. The more context you provide, the more dimensions of the intelligence index can be measured. Here is what each artifact enables.
| Artifact | Required | What It Enables |
|---|---|---|
| **PR Diffs / Code Changes.** The actual code changes being evaluated; the minimum input for any benchmark run. | Required | Basic code-level detection, partial workflow tracing |
| **User Stories / Requirements.** The requirements or user stories that the code changes implement; connects code to intent. | Optional | Story-to-test traceability, acceptance criteria coverage |
| **Business Rules / Domain Constraints.** Domain-specific rules that must be upheld; enables the system to validate business logic. | Optional | Business rule awareness, domain constraint validation |
| **Architecture / Service Map.** Service dependencies, data flow diagrams, integration points; maps the system topology. | Optional | End-to-end workflow coverage, cross-service cascade detection |
| **Historical Bugs / Incidents.** Past bugs, incidents, and their resolutions; enables pattern matching across time. | Optional | Historical pattern recognition |
| **Existing Test Suite.** Current test cases and their coverage; enables test impact analysis. | Optional | Test impact analysis, coverage gap detection |
| **Production Logs / Metrics.** Error rates, usage patterns, SLA data; enables risk-based prioritization. | Optional | Risk prioritization accuracy |
| **Compliance / Regulatory Requirements.** GDPR, SOC-2, HIPAA, or industry-specific compliance requirements. | Optional | Regulatory and compliance awareness |
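One way to picture the inputs is as a manifest with one key per artifact, where only the PR diffs are required. The keys and file paths below are invented for illustration:

```python
# Hypothetical benchmark-input manifest mirroring the artifact table.
# Keys and paths are illustrative, not a Workflow Bench file format.
REQUIRED_ARTIFACTS = {"pr_diffs"}

manifest = {
    "pr_diffs": ["diffs/PR-1234.patch"],          # required: minimum input
    "user_stories": ["stories/checkout.md"],      # story-to-test traceability
    "business_rules": ["rules/billing.yaml"],     # business rule awareness
    "service_map": "architecture/services.json",  # cross-service cascades
    "historical_bugs": "incidents/export.csv",    # pattern recognition
    "test_suite": "tests/",                       # test impact analysis
    "prod_metrics": "telemetry/slo.json",         # risk prioritization
    "compliance": ["GDPR", "SOC-2"],              # compliance awareness
}

def missing_required(manifest):
    """List the required artifacts absent from a manifest."""
    return sorted(REQUIRED_ARTIFACTS - set(manifest))

print(missing_required(manifest))  # []
print(missing_required({}))        # ['pr_diffs']
```

Each optional key that is present unlocks the corresponding intelligence dimensions listed in the table.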
The Workflow Bench evaluation process follows a structured pipeline to ensure fair, reproducible comparisons.
1. Gather PR diffs, user stories, business rules, and other context from your repository. More context enables more intelligence dimensions.
2. Each QA approach receives identical inputs: the same PR diff, context, and time budget. Outputs are collected and anonymized.
3. Anonymized outputs are evaluated by an LLM judge panel against the 12 sub-metric rubrics. Multiple judges reduce individual bias.
4. Intelligence scores are aggregated, outcome metrics are projected, and a comprehensive comparison report is generated with full transparency.
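The four stages above can be sketched as a single driver function. The names and signatures here are placeholders for illustration, not the actual Workflow Bench API:

```python
# Minimal sketch of the four-stage evaluation pipeline.
def run_benchmark(tools, pr_diff, context, time_budget, judges):
    # Stage 1: collect context (passed in here as `context`).
    inputs = {"diff": pr_diff, "context": context, "budget": time_budget}

    # Stage 2: every tool receives identical inputs; outputs are
    # anonymized behind neutral labels before judging.
    outputs = {f"tool_{i}": tool(inputs) for i, tool in enumerate(tools)}

    # Stage 3: an LLM judge panel scores each anonymized output against
    # the 12 sub-metric rubrics; averaging judges reduces individual bias.
    scores = {
        label: [judge(output) for judge in judges]
        for label, output in outputs.items()
    }

    # Stage 4: aggregate per-judge scores into the comparison report.
    return {label: sum(s) / len(s) for label, s in scores.items()}
```

With dummy tools and judges, the driver produces one averaged score per anonymized tool label, which is the shape of the final comparison report.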
We are running the full Workflow Bench evaluation against leading QA approaches. Results will include actual tool outputs, scored by independent LLM-as-judge panels, with full transparency on scoring criteria and limitations.