WorkflowBench™ is built on a three-layer framework called QEFix™ (Quality Engineering Functional Issues eXplorer). It measures functional QA intelligence, not just code-level bug detection.
This page explains each layer, how they connect, and what data artifacts are needed to run benchmarks.
The three layers are:

- Layer 1: Vertical-specific quality categories
- Layer 2: Functional QA Intelligence (FQI™) index
- Layer 3: Quality outcome metrics
Like the OWASP Top 10 for security, QEFix™ defines the top 10 functional quality issues for each SaaS vertical. A scheduling app has fundamentally different failure modes than an identity management platform or a monitoring tool.
Each vertical's categories are derived from analysis of real production incidents, bug reports, and post-mortems in that domain.
- Project Management / Scheduling SaaS (+ 5 more categories)
- Developer Tools / Monitoring SaaS (+ 5 more categories)
- Community / Communication SaaS (+ 5 more categories)
- Identity & Access Management SaaS (+ 5 more categories)
- Analytics / Observability SaaS (+ 5 more categories)
The FQI™ (Functional QA Intelligence) index measures testing capability across four dimensions, each with three sub-metrics. Together, they answer one question: "How deeply does this QA approach understand your system?"
Each sub-metric is scored from 0 to 100 against a specific rubric. A dimension's score is the average of its three sub-metric scores, and the four dimension scores are then combined into a single composite score via a weighted average.
Dimension 1 (25% of overall score):
- Can the tool connect a code change back to the user story or requirement it implements?
- Does the tool generate tests that cover the acceptance criteria, not just the code diff?
- Does the tool generate tests for what should NOT happen?

Dimension 2 (30% of overall score):
- Does the tool identify which complete user workflows are affected by the change?
- Does the tool predict downstream failures in connected services?
- Does the tool understand how the change affects state machines and lifecycle transitions?

Dimension 3 (25% of overall score):
- Does the tool identify business rules that are violated or at risk?
- Does the tool validate domain-specific constraints?
- Does the tool flag potential compliance issues?

Dimension 4 (20% of overall score):
- Does the tool connect the current change to similar past incidents?
- Does the tool identify which existing tests are invalidated or need updating?
- Does the tool rank findings by actual business impact?
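The composite arithmetic is simple enough to sketch in a few lines. In the sketch below, the dimension labels and all sub-metric scores are hypothetical placeholders; only the weights (25/30/25/20) and the average-then-weight procedure come from this page.

```python
# Sketch of the FQI composite calculation: each sub-metric is scored
# 0-100, a dimension score is the average of its three sub-metrics, and
# the composite is a weighted average of the four dimension scores.
# Dimension keys and sub-metric values below are made-up placeholders.
dimensions = {
    "dimension_1": {"weight": 0.25, "sub_metrics": [80, 70, 60]},
    "dimension_2": {"weight": 0.30, "sub_metrics": [75, 65, 85]},
    "dimension_3": {"weight": 0.25, "sub_metrics": [90, 55, 70]},
    "dimension_4": {"weight": 0.20, "sub_metrics": [60, 80, 50]},
}

def composite_fqi(dims: dict) -> float:
    """Weighted average of per-dimension averages, rounded to one decimal."""
    score = 0.0
    for dim in dims.values():
        dim_score = sum(dim["sub_metrics"]) / len(dim["sub_metrics"])
        score += dim["weight"] * dim_score
    return round(score, 1)

print(composite_fqi(dimensions))  # 70.6 with the placeholder scores above
```

Note that because the weights sum to 1.0, a tool scoring 100 on every sub-metric gets a composite of exactly 100.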
The outcome metrics translate testing intelligence into business-meaningful results. They answer: "What does better QA intelligence actually deliver?"
These metrics are designed to be tracked over time. WorkflowBench™ projects expected outcomes based on intelligence scores.
- Percentage of bugs caught before they reach production users. Formula: caught pre-prod / total bugs.
- How often pull requests pass QA on the first attempt. Formula: PRs passing first time / total PRs.
- Ratio of bugs found in development vs. production. Formula: dev defects / prod defects.
- Composite score measuring overall pull request health. Components: coverage + tests + docs + review.
- How quickly a bug is flagged after the code is committed. Formula: detection timestamp - commit timestamp.
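These formulas are plain ratios and time deltas. A minimal sketch, using entirely made-up counts and timestamps:

```python
from datetime import datetime

# Illustrative calculation of the outcome metrics above.
# All counts and timestamps are hypothetical example data.
bugs_caught_preprod, bugs_total = 42, 50
prs_first_pass, prs_total = 30, 40
dev_defects, prod_defects = 42, 8

catch_rate = bugs_caught_preprod / bugs_total   # caught pre-prod / total
first_pass_rate = prs_first_pass / prs_total    # PRs passing first time / total
dev_prod_ratio = dev_defects / prod_defects     # dev defects / prod defects

committed = datetime(2024, 5, 1, 9, 0)
detected = datetime(2024, 5, 1, 13, 30)
time_to_detect = detected - committed           # detection - commit timestamp

print(f"catch rate: {catch_rate:.0%}")            # catch rate: 84%
print(f"first-pass rate: {first_pass_rate:.0%}")  # first-pass rate: 75%
print(f"dev:prod ratio: {dev_prod_ratio:.2f}")    # dev:prod ratio: 5.25
print(f"time to detect: {time_to_detect}")        # time to detect: 4:30:00
```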
- Layer 1 (What to Test): defines quality categories for each vertical.
- Layer 2 (How Well You Test): measures intelligence across four dimensions.
- Layer 3 (What It Means): translates capability into business value.
To run a WorkflowBench evaluation, you provide specific data artifacts. The more context you provide, the more dimensions of the intelligence index can be measured.
| Artifact | Required | What It Enables |
|---|---|---|
| PR Diffs / Code Changes: the actual code changes being evaluated. Minimum input for any benchmark run. | Required | Basic code-level detection, partial workflow tracing |
| User Stories / Requirements: the requirements or user stories that the code changes implement. | Optional | Story-to-test traceability, acceptance criteria coverage |
| Business Rules / Domain Constraints: domain-specific rules that must be upheld (e.g., data isolation, SLA guarantees). | Optional | Business rule awareness, domain constraint validation |
| Architecture / Service Map: service dependencies, data flow diagrams, integration points. | Optional | End-to-end workflow coverage, cross-service cascade detection |
| Historical Bugs / Incidents: past bugs, incidents, and their resolutions. Enables pattern matching across time. | Optional | Historical pattern recognition |
| Existing Test Suite: current test cases and their coverage. | Optional | Test impact analysis, coverage gap detection |
| Production Logs / Metrics: error rates, usage patterns, SLA data. | Optional | Risk prioritization accuracy |
| Compliance / Regulatory Requirements: GDPR, SOC-2, HIPAA, or industry-specific compliance requirements. | Optional | Regulatory and compliance awareness |
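This page does not define an input format for these artifacts, so purely as an illustration, the bundle for one run might be described as a simple manifest. Every field name and path below is a hypothetical assumption, not a real WorkflowBench™ schema; the only constraint taken from the table is that PR diffs are the required minimum.

```python
# Hypothetical artifact manifest for a benchmark run. Field names and
# paths are illustrative assumptions, not a real WorkflowBench input
# format. Per the table above, only "pr_diffs" is required.
manifest = {
    "pr_diffs": ["diffs/pr-101.patch", "diffs/pr-102.patch"],  # required
    "user_stories": ["stories/PROJ-17.md"],                    # optional
    "business_rules": ["rules/data-isolation.md"],             # optional
    "architecture": ["docs/service-map.yaml"],                 # optional
    "historical_bugs": [],                                     # optional
    "existing_tests": ["tests/"],                              # optional
    "production_logs": [],                                     # optional
    "compliance": ["docs/soc2-controls.md"],                   # optional
}

def validate(m: dict) -> None:
    """Enforce the one hard requirement from the table: PR diffs must exist."""
    if not m.get("pr_diffs"):
        raise ValueError("pr_diffs is the minimum input for any benchmark run")

validate(manifest)
print("optional artifact types supplied:",
      sum(1 for k, v in manifest.items() if k != "pr_diffs" and v))
```

The more of the optional fields are populated, the more of the twelve sub-metrics the run can score.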
The WorkflowBench™ evaluation follows a structured pipeline to ensure fair, reproducible comparisons:

1. Gather PR diffs, user stories, business rules, and other context from your repository.
2. Each QA tool receives identical inputs. Outputs are collected and anonymized.
3. Anonymized outputs are evaluated against the 12 sub-metric rubrics by an LLM panel.
4. Intelligence scores are aggregated and a comparison report is generated with full transparency.
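The four steps can be sketched as a small pipeline. Everything in this sketch is an assumption, since the page defines no API: the function names, the hash-based anonymization scheme, and especially the stub judge, which in a real run is an LLM panel applying the 12 sub-metric rubrics.

```python
import hashlib

# Hypothetical sketch of the four-step pipeline above. Names and logic
# are illustrative assumptions, not the real WorkflowBench internals.

def anonymize(tool_name: str) -> str:
    """Step 2: replace the tool's name with a stable opaque label."""
    return "tool-" + hashlib.sha256(tool_name.encode()).hexdigest()[:8]

def judge(output: str, rubric: str) -> float:
    """Step 3 stand-in: toy keyword check where a real run asks an LLM panel."""
    return 100.0 if rubric.lower() in output.lower() else 0.0

def run_benchmark(context: str, tools: dict, rubrics: list) -> dict:
    # Step 1: the caller has gathered context (PR diffs, stories, rules).
    # Step 2: every tool sees identical inputs; outputs are anonymized.
    outputs = {anonymize(name): tool(context) for name, tool in tools.items()}
    # Step 3: score each anonymized output against every rubric.
    scores = {name: [judge(out, r) for r in rubrics]
              for name, out in outputs.items()}
    # Step 4: aggregate into a per-tool comparison report.
    return {name: sum(s) / len(s) for name, s in scores.items()}
```

Anonymizing outputs before judging is what keeps the panel from favoring a known brand name, which is the point of doing step 2 before step 3.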
We are running the full WorkflowBench™ evaluation against leading QA approaches. Results will include actual tool outputs, scored by independent LLM-as-judge panels, with full transparency on scoring criteria and limitations.