StackEval

Beyond Metrics: Designing a Structured LLM Evaluation Tool

Institution: University of Michigan, School of Information

Course: SI 699: UX Capstone

Timeline: Winter 2026

Team: 5 Members

Client: Backboard.io

Developers don't evaluate LLMs with spreadsheets. They use intuition, experience, and task-specific reasoning, none of which existing tools support.

StackEval is a decision-support platform built for Backboard.io — an AI infrastructure company — that transforms informal, "vibes-based" model evaluation into a structured, repeatable workflow. Over a semester-long capstone, our team of five UX researchers and designers conducted 5 semi-structured interviews, preference testing with 28 participants, and moderated usability testing with 6 practitioners to understand how AI teams actually choose models in production.

The result: a four-phase evaluation workflow — Define Task, Define Metrics, Model Selection, and Run Evaluation — that reduces cognitive load at every step while preserving the flexibility expert users demand.

5 Expert Interviews

28 Preference Test Participants

6 Usability Test Sessions

4 Phase Workflow

Project Goals

Design an evaluation dashboard that makes comparing LLMs faster, more rigorous, and more accessible to both technical and non-technical stakeholders.

Design a structured, repeatable evaluation workflow.

Build a decision-support environment that supports repeatable comparison, interpretable trade-offs, and calibrated trust within real development workflows.

Ground the design in real practitioner workflows.

Research two core personas, a senior developer and a founding engineer, to surface deep, actionable insights rather than surface-level usability fixes.

Bridge expert intuition and systematic evaluation.

Translate how developers actually evaluate models under real constraints into concrete UX and product recommendations for StackEval.

Phase 1

Exploratory Research

Before shaping any design direction, we examined how LLM evaluation works in practice, where StackEval fell short, and how practitioners reason about model choices under real constraints.

Desk Research

Literature review across academic and industry sources on LLM evaluation
Competitive analysis of LangSmith, LangChain, and benchmark frameworks to identify differentiation opportunities
Exploratory calls with the Backboard.io team to understand product priorities and platform constraints

Heuristic Evaluation

Evaluated the existing StackEval platform against Nielsen's 10 usability heuristics
Each team member conducted an independent pass before consolidating findings
Surfaced two high-friction areas: the output comparison view and the evaluation history view

Screener Survey

Filtered for participants who had systematically compared multiple models and defined reference outputs
Assessed evaluation approach (structured vs. intuition-based) and priorities (cost, tokens, output quality)
Selected five participants with the depth of hands-on experience we needed

User Interviews

Five semi-structured interviews with practitioners who actively use LLMs in professional settings, designed to surface goals, decision-making processes, pain points, and the informal evaluation frameworks developers rely on in practice.

Protocol & Analysis

Each interview lasted ~45 minutes, conducted remotely with recording and notetaker support. We covered five areas: background and AI usage, a recent model evaluation walkthrough, the role of cost and latency, golden output usage, and how memory and context factor into model selection.

After each interview, team members independently extracted single-insight observations onto a shared Miro board, then conducted collaborative affinity mapping. Clusters were split, merged, and renamed until each grouping was tied to direct evidence from multiple participants.

Recruitment & Screening

Participants were recruited through professional networks using a screener that filtered for hands-on experience systematically comparing multiple LLMs. We selected five practitioners spanning software engineering, ML research, and founding roles across different organizational contexts and relationships to model evaluation.

Affinity mapping across five participant interviews

User Interview Findings

Three core themes emerged from affinity mapping across five practitioner interviews.

Developers have built a real evaluation system, but it lives entirely in their heads.

Across every interview, participants demonstrated sophisticated evaluation instincts, yet almost none of it was written down, shared, or repeatable. Traditional metrics — BLEU, ROUGE, and F1 — were consistently dismissed as artifacts of a previous era, when model tasks were narrow and outputs could be compared to a correct label. Once work becomes generative, there is no single correct string to measure against.

Golden outputs carried the same problem. In real development, codebases evolve, requirements shift, and what counts as good depends on context. What replaced both was a task-based mental model, where participants matched specific models to specific jobs based on accumulated experience. The consequence: this knowledge does not accumulate. When a developer leaves a team or a model updates, the mental model disappears with them.
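
The limitation participants described with reference-based scoring can be illustrated with a small sketch. It uses a simplified token-overlap F1, which is an assumption about how such a score might be computed and not a metric taken from StackEval, and two invented sentences that mean roughly the same thing.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Simplified token-overlap F1 between a candidate and one reference string."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented example strings: same intent, almost no shared wording.
REFERENCE = "restart the service to apply the new configuration"
PARAPHRASE = "changes take effect once you reboot the server"

exact_score = token_f1(REFERENCE, REFERENCE)
paraphrase_score = token_f1(PARAPHRASE, REFERENCE)
print(exact_score, paraphrase_score)
```

Under these assumptions the paraphrase shares a single token with the reference, so it scores far below an exact copy even though a human reviewer might accept either answer.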

"Each model gives off its own style or feel. Through a five or six hour work session, I can really tell the difference."

P5 · CTO

"Performance evaluation was very vibes-based but at a high level of understanding the trade-offs."

P2 · ML Researcher

"We don't have a way — everything is so customizable you don't know if the system sucks or if the person's input sucks."

P3 · ML Engineer

"Model evaluation has honestly become trial and error rather than automated benchmarking."

P4 · Founding Engineer

A model's output is only as good as its ability to demonstrate contextual match.

Models are judged not only on the quality of their output but on whether they work in the correct context — whether they understand the codebase, follow the user's reasoning style, and iterate without getting stuck. A model that produces logically correct answers but misinterprets context is a worse fit than one that is slightly less verbose but maintains and works from the correct information.

Trust in evaluation depended more on how a model behaves throughout the process than on final scores. Developers care about transparency into the model's reasoning and contextual indicators — that is what builds confidence in what the model produces.

"Claude Code had fantastic domain knowledge on the actual codebase because it's directly integrated and it has all the docs."

P5 · CTO

"If you end up losing track of the code, you can end up with a codebase you don't understand."

P3 · ML Engineer

"Accuracy and minimizing hallucinations are the primary criteria for model selection."

P4 · Founding Engineer

"Claude tends to follow instructions and doesn't get stuck."

P2 · ML Researcher

Defaulting to familiar models limits exposure to better alternatives.

Model selection is driven by brand loyalty and community consensus, not systematic comparison. Developers build detailed mental models of trade-offs, but only within a small set of models they already know. The cost of evaluating something unfamiliar — in time, money, and cognitive load — makes sticking with what's trusted the rational default.

With over 17,000 models in Backboard.io's catalog, this default is expensive. Better-fit models never get considered, and the same handful of providers get chosen repeatedly regardless of whether they're the best match for the task. Participants described quickly trying new models when they launched, but true comparative evaluation was rare.

"For now most of our time is spent building features rather than evaluating which model to use in the backend of each feature."

P1 · Software Engineer

"It's largely trial and error."

P4 · Founding Engineer

"I tend to use expensive models for the planning features. If there is a very good plan, it can be handled by most models."

P1 · Software Engineer

"Performance evaluation was very vibes-based but at a high level of understanding the trade-offs."

P3 · ML Engineer

Phase 2

Generative Design Requirements

Our exploratory findings translated into a clear set of design requirements organized around four areas.

Task and Evaluation Definition

Users want to add their own metrics so that evaluation is tailored to their project and needs. The structure should be systematic enough to be repeated across the project and as organizational needs change.

Model Recommendation

The system should encourage model discovery through personalized and tailored model recommendations that fit the users' goals, project needs, and success criteria.

Model Selection

Users need the ability to search and select the models that they need from the 17,000+ model inventory.

Output Comparison

Results need to be scannable within and between models with surfaced metrics.

Priority Matrix

We mapped these requirements onto a prioritization matrix across user value and implementation effort, identifying quick wins like multi-model selection and scannable output comparison as MVP priorities, and strategic priorities like custom metric input and personalized model suggestions as high-value features requiring deeper design investment.

Design Variations

From here, we brainstormed designs for a system that fit users' goals, alleviated their pain points, and fit seamlessly into their workflows, then used preference testing to choose between the variations.

Initial Sketches

Set Up Variations

Sequential Layout

Pop-out Modals

Participants valued the sequential layout's step-by-step structure, reduced clutter, and simplified navigation of complex information. The sidebar layout tended to feel overwhelming, with important interactions getting lost in the interface. One nuance emerged: a subset of participants identified Layout A as easier to understand but still chose Layout B as the one they would prefer to use, which suggests that experienced users valued the efficiency of having everything on one page despite the clarity advantage of sequential steps. This feedback led us to add default settings to each step so expert users could pass through quickly, and it opened broader conversations about saving configurations and creating presets.

Output Variations

Sequential Layout

Row-Focused Layout

Synchronized Grid

Participants found the side-by-side comparison of Layouts B and C more intuitive than vertical scanning, and specifically valued Layout B's filters and expandable rows for managing information density. The emphasis on clarity stemmed from the use of gestalt grouping principles and left-to-right readability.

Phase 2 · Evaluation

Usability Testing

With the sequential setup and row-focused output layouts selected, we built a high-fidelity interactive prototype and conducted moderated usability testing with six participants. All had established experience with LLM evaluation, including use of industry tools like LangChain and custom evaluation frameworks.

Protocol and Analysis

Two core tasks mirroring preference testing: evaluation set-up and evaluation output. Participants were first given a scenario to set up the context behind their task.
Participants were asked to set up their evaluation and select 3 models to compare.
Follow-up questions after both tasks gauged confidence levels, expectations, frustrations, and what was most useful about the platform.
Live note-taking completed through a rainbow spreadsheet to capture overall actions and patterns, with direct quotations and observations logged separately.
The rainbow spreadsheet acted as the foundation for qualitative thematic analysis, visualizing patterns across participants and informing top-down analysis of the affinity map.

Recruitment and Screening

Participants were recruited through professional networks and the University of Michigan PhD candidate network.
All participants had established, hands-on experience with LLM evaluation tools and workflows.

Usability Testing Findings

Four core themes emerged from moderated sessions with six practitioners across technical skill levels.

Evaluation configuration needs more context and customization.

Users wanted more than preset metrics. Most participants with LLM evaluation experience identified LLM-as-a-judge as the most important metric, while others dismissed irrelevant options like BLEU. Many also called for a system prompt field to give models sufficient task context.

Design Direction

Added a system prompt field with AI-assisted draft generation, moved custom evaluators above preset metrics, and integrated an AI copilot for metric selection support.

Criteria definition was disconnected from model recommendations.

Ranking criteria on one page and surfacing recommendations on the next made the cause-and-effect relationship invisible. Five of six participants were confused by the criteria section, and four never noticed the recommendations at all.

Design Direction

Merged criteria and recommendations onto one page, with live-updating model suggestions driven by filters in a left panel.

The challenger model confused users in selection but proved valuable in results.

The challenger card went unnoticed during model selection, but every participant engaged with it during output evaluation, with one genuinely reconsidering their choice after seeing its performance. Users won't add unfamiliar models voluntarily, but will seriously consider them when results speak for themselves.

Design Direction

Moved the challenger to the evaluation summary page as a dismissible toggle, with a plain-language explanation and a note that it runs at no cost. Toggling it off grays it out rather than removing it.

The output view supported confident decisions across skill levels.

Expert users used metrics to validate decisions already formed through manual review. Less technical users relied on the model leaders section and felt confident citing the system's recommendation. Both groups saw the dashboard as a tool for communicating model decisions to stakeholders.

Design Direction

Added an export feature in CSV, JSON, and text report formats, with refined color and CTA placement to surface the most decision-relevant information.

Final Design

The Final StackEval Design

The final design implements a four-phase sequential evaluation workflow, validated through preference testing and refined through usability testing. Each phase addresses a distinct decision in the evaluation process, distributing cognitive load across stages while preserving navigability through a persistent step indicator. The prototype is accessible at stackevalt.lovable.app.

Define Task is the entry point for configuring an evaluation run. Users select from four task types (Standard QA, Summarization, RAG QA, and LoCoMo), each with a description and input formatting examples that inform model recommendation logic in later steps. An optional system prompt field lets users provide additional context; it was added after usability testing, where power users identified its absence as a barrier. A "Generate Draft" button powered by the AI copilot helps users unfamiliar with prompt engineering get started.

Define Evaluation Metric determines how outputs will be scored and compared. Custom evaluators appear first, above preset metrics, reflecting usability test findings in which the most experienced participants identified custom evaluation as the most important capability. Ten preset metrics are available with defaults pre-selected, each with tags and a description so users can critically assess relevance. This page can be skipped, in which case the evaluation defaults to the three standard platform metrics.

Model Selection is a three-panel layout that consolidates criteria definition, model recommendations, and selection into one view. The left panel lets users set accuracy, cost, speed, and context window priorities, which produce live updates to a "Recommended for You" list in the center, so the cause-and-effect relationship that was invisible in earlier iterations is immediately clear. Additional filters help users navigate 17,000+ models, and selected models populate a persistent right-side panel.

Run Evaluation combines input configuration, execution, and output comparison in a single view. A challenger model is introduced with a plain-language explanation and a toggleable option to include it at no cost. Once evaluation completes, a model leaders section surfaces the top performer per metric, followed by a row-by-row collapsible output view. An export button supports CSV, JSON, and text report formats so evaluation knowledge can move beyond a single session and across teams.
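
As a rough illustration of the model leaders section and the export step, the sketch below picks the top performer per metric and serializes the same results to CSV and JSON. The result values, metric names, and function names are hypothetical placeholders, not StackEval's data model.

```python
from __future__ import annotations

import csv
import io
import json

# Hypothetical evaluation results: model name -> metric name -> score.
RESULTS = {
    "model-a": {"accuracy": 0.91, "latency_s": 2.4, "cost_usd": 0.012},
    "model-b": {"accuracy": 0.84, "latency_s": 0.9, "cost_usd": 0.003},
}
# Assumption: for these metrics a smaller value is the better result.
LOWER_IS_BETTER = frozenset({"latency_s", "cost_usd"})


def model_leaders(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return the best-scoring model for each metric."""
    metrics = list(next(iter(results.values())))
    leaders = {}
    for metric in metrics:
        pick = min if metric in LOWER_IS_BETTER else max
        leaders[metric] = pick(results, key=lambda name: results[name][metric])
    return leaders


def to_csv(results: dict[str, dict[str, float]]) -> str:
    """Serialize results as CSV with one row per model."""
    metrics = list(next(iter(results.values())))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["model", *metrics])
    for name, scores in results.items():
        writer.writerow([name, *(scores[m] for m in metrics)])
    return buffer.getvalue()


def to_json(results: dict[str, dict[str, float]]) -> str:
    """Serialize results together with the per-metric leaders."""
    return json.dumps({"results": results, "leaders": model_leaders(results)}, indent=2)


print(model_leaders(RESULTS))
print(to_csv(RESULTS))
```

Keeping the leaders in the JSON export is one possible choice; it lets a stakeholder who only reads the file see the same summary the dashboard shows.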

Design System

Next Steps

Realization Plan

Prototype to Product

Our plan is for Backboard.io to launch the StackEval redesign as a beta and track its performance. Key metrics to monitor would include task completion rates across the four-phase flow, custom evaluator adoption, challenger model engagement, model diversity in evaluations, and export usage. These signals indicate whether the new version is delivering on its core intent: externalizing evaluation workflows, surfacing less-familiar models, and making results shareable.
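
A minimal sketch of how a few of these signals could be computed from beta session logs follows. The session structure, field names, and phase identifiers are assumptions made for illustration; Backboard.io's actual telemetry may look quite different.

```python
from __future__ import annotations

# Hypothetical identifiers for the four phases of the flow.
PHASES = ("define_task", "define_metrics", "model_selection", "run_evaluation")

# Invented sample sessions: phases reached, models evaluated, export used.
SESSIONS = [
    {"phases": set(PHASES), "models": ["model-a", "model-b"], "exported": True},
    {"phases": {"define_task", "define_metrics"}, "models": [], "exported": False},
    {"phases": set(PHASES[:3]), "models": ["model-a", "model-c"], "exported": False},
    {"phases": set(PHASES), "models": ["model-a"], "exported": True},
]


def beta_signals(sessions: list[dict]) -> dict:
    """Summarize phase reach, export usage, and model diversity across sessions."""
    total = len(sessions)
    if total == 0:
        raise ValueError("no sessions to summarize")
    # Share of sessions that reached each phase of the flow.
    phase_reach = {p: sum(p in s["phases"] for s in sessions) / total for p in PHASES}
    distinct_models = {m for s in sessions for m in s["models"]}
    return {
        "phase_reach": phase_reach,
        "export_rate": sum(s["exported"] for s in sessions) / total,
        "distinct_models": len(distinct_models),
    }


SIGNALS = beta_signals(SESSIONS)
print(SIGNALS)
```

Custom evaluator adoption and challenger engagement could be tracked the same way, as additional boolean fields per session.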

Further Research

Our scope forced us to narrow our focus to technical users, but StackEval's real user base also includes non-technical stakeholders such as product managers and people in other departments. With more time, we would extend the research in the following directions:

Stakeholder interviews

Talk to non-technical users to understand how they currently receive evaluation results, what decisions those results inform, and where the handoff breaks down.

Usability testing

A focused round in which non-technical participants interpret results, make a recommendation, and run an evaluation based on metrics, to surface which elements of the dashboard land and which need reframing.

Technical Next Steps

The first step is a design handoff sync with Backboard.io's product designer to align components, states, and interactions. Beyond handoff, several features require further implementation research to evaluate technical feasibility before engineering build:

Recommendation engine

The criteria-driven "Recommended for You" panel is currently simulated. Production requires scoping a real scoring pipeline that ranks Backboard.io's 17,000+ model catalog against user-defined priorities like accuracy, cost, speed, and context window, with live updates as criteria shift.
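
One simple way to think about such a pipeline is a weighted sum over normalized per-model scores. The sketch below is an assumption made for illustration, not the scoring logic Backboard.io would ship: the model names and scores are invented, and every criterion is assumed to be pre-normalized to the range 0 to 1 with higher meaning better (so a cheap model has a high cost score).

```python
from __future__ import annotations

CRITERIA = ("accuracy", "cost", "speed", "context_window")

# Hypothetical catalog entries with normalized scores (higher is better).
MODELS = {
    "model-a": {"accuracy": 0.9, "cost": 0.2, "speed": 0.4, "context_window": 0.8},
    "model-b": {"accuracy": 0.7, "cost": 0.9, "speed": 0.8, "context_window": 0.5},
    "model-c": {"accuracy": 0.6, "cost": 0.6, "speed": 0.9, "context_window": 0.3},
}


def rank_models(
    models: dict[str, dict[str, float]],
    priorities: dict[str, float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Rank models by a weighted sum of criterion scores."""
    total = sum(priorities.get(c, 0.0) for c in CRITERIA)
    if total <= 0:
        raise ValueError("at least one priority must be positive")
    # Normalize priorities so the weights sum to 1.
    weights = {c: priorities.get(c, 0.0) / total for c in CRITERIA}
    scored = [
        (name, sum(weights[c] * scores[c] for c in CRITERIA))
        for name, scores in models.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


accuracy_first = rank_models(MODELS, {"accuracy": 5, "cost": 1, "speed": 1, "context_window": 1})
cost_first = rank_models(MODELS, {"accuracy": 1, "cost": 5, "speed": 1, "context_window": 1})
print(accuracy_first[0][0], cost_first[0][0])
```

Re-running `rank_models` whenever a priority changes mirrors the live-update behavior in the prototype. At the scale of 17,000+ models a production version would likely need precomputed scores or filtering before ranking, which is part of what still has to be scoped.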

Custom evaluators

LLM-as-Judge, Code/Regex, and Embedding Similarity evaluators require backend infrastructure to execute user-defined logic at scale. This includes model routing for LLM-as-Judge, sandboxed execution for Code/Regex, and pricing and rate-limiting models across all three.
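
To make the evaluator types more concrete, here is a minimal sketch of two of them behind a shared interface: each evaluator is a function from a model output to a score. The interface, the function names, and the `toy_embed` stand-in are all hypothetical; a real embedding evaluator would call an embedding model, and LLM-as-Judge is left out because it depends on model routing that has not been scoped yet.

```python
from __future__ import annotations

import math
import re
from typing import Callable, Sequence

# Assumed shared interface: an evaluator maps an output string to a score.
Evaluator = Callable[[str], float]


def regex_evaluator(pattern: str) -> Evaluator:
    """Score 1.0 when the output contains a match for the pattern, else 0.0."""
    compiled = re.compile(pattern)

    def evaluate(output: str) -> float:
        return 1.0 if compiled.search(output) else 0.0

    return evaluate


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_evaluator(reference: str, embed: Callable[[str], Sequence[float]]) -> Evaluator:
    """Score an output by its embedding similarity to a reference text."""
    reference_vector = embed(reference)

    def evaluate(output: str) -> float:
        return cosine_similarity(embed(output), reference_vector)

    return evaluate


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding (letter frequencies) used only to keep the sketch runnable."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


has_iso_date = regex_evaluator(r"\d{4}-\d{2}-\d{2}")
similar_to_hello = embedding_evaluator("hello", toy_embed)
print(has_iso_date("Due 2026-04-01"), similar_to_hello("hello"))
```

User-supplied patterns and code are the reason the text above calls for sandboxed execution; this sketch runs them in-process only because the inputs are fixed.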

Challenger model and AI Copilot

Both features rely on logic that is deliberately left open in the current design. The challenger model requires Backboard.io to define selection criteria — staff picks, trending models, newest releases, or a hybrid — while the Copilot needs deeper scoping around its interaction model, level of agency, and integration with live user state across the flow.

Looking Back

Reflection

Working on StackEval gave us a clearer sense of how much an evaluation tool's value depends on fitting into the workflows users already have, rather than imposing a new structure on top of them. Across every stage of the project, the strongest insights came from watching and learning how developers actually think about model comparison, including the informal habits and mental shortcuts that rarely surface in product requirements. Our role was less about inventing a new evaluation framework and more about making the one that already lives in users' heads visible, shareable, and repeatable. That shift in framing shaped both the final design and the way we approached the work as a team.

What we would do differently

Involve non-technical users earlier

The strongest cross-functional signals came from our two least technical participants, and we only surfaced that in usability testing. Including both audiences from the start would have given us more to work with.

Increase testing volume

Our sample sizes were small by necessity, but more rounds with more participants would have strengthened the confidence behind our design decisions.

Co-design with key users

Rather than engaging participants only for testing and evaluation, involving them earlier in brainstorming and ideation would have surfaced ideas shaped directly by their expertise.

Lessons learned

Designing within a technical domain

LLM evaluation is a domain where even experienced practitioners disagree, and the landscape shifts faster than most product teams can keep up with. A lot of our initial work was simply learning the domain well enough to ask useful questions, and then making design decisions based on informed assumptions.

Working at pace with real autonomy

The timeline was short and the client was active, which meant we were often making calls before we had complete information. We learned to separate the decisions that needed more research from the ones that just needed a choice, and to keep moving.

Turning limited research into a scalable solution

Five participants give you depth, but depth is easy to mistake for consensus. A lot of our work was pulling back from individual quotes to ask whether a pattern was actually there, or whether we were seeing what we wanted to see.

User Interview Findings

Developers have built a real evaluation system, but it lives entirely in their heads.

A model's output is only as good as its ability to demonstrate contextual match.

Defaulting to familiar models limits exposure to better alternatives.

Usability Testing Findings

Evaluation configuration needs more context and customization.

Criteria definition was disconnected from model recommendations.

The challenger model confused users in selection but proved valuable in results.

The output view supported confident decisions across skill levels.
