Production notes
How this exhibit was made
Response Option Fit Lab is a static editorial exhibit about a narrow problem: how survey response options can fail before analysis begins. It uses worked specimens, drawn from public cognitive-testing materials and clearly labeled editorial vignettes, to make six recurring failure patterns visible. It is not a guide to validated replacement wording, a model of respondent distributions, or a substitute for cognitive testing, expert review, or field evidence.
For people who write survey questions and answer choices and need a sharper prefielding check.
For researchers, editors, and product teams who inherit instruments and need language for diagnosing option-level problems.
For students and instructors who want worked examples in response process, not a general survey-methods textbook.
A large share of survey criticism arrives too late. By the time a draft is challenged, the conversation is often already about estimates, models, or dashboards, even though many failures begin earlier, when a respondent has to map lived experience onto an answer set that is ambiguous, overbroad, overlapping, or more precise than the instrument has earned. Public cognitive-testing work repeatedly shows that apparently small category and wording decisions change how people interpret what is being asked. This exhibit exists to make that pre-analytic layer visible enough to inspect, discuss, and revise before the error hardens into data.
Response Option Fit Lab was written, edited, and verified by Ben Lei.
Initial scaffolding and early editorial drafts were developed with help from Claude Code by Anthropic and Codex by OpenAI. Those tools were used to sketch structure, propose code patterns, and generate candidate prose. Source verification against the cited public documents, final wording, inclusion and exclusion decisions, pedagogy choices, and all claims on this site were made by the author.
AI helped scaffold and draft. Sources, claims, and teaching choices are mine.
Most specimens begin with public working papers and linked PDFs from the U.S. Census Bureau, especially cognitive-testing materials for ACS, AHS, CPS, and NTIA Internet Use Survey instruments. Because Census working papers are discussion documents rather than final agency publications, each specimen is checked manually against the cited PDF before publication and quoted only where wording and attribution are stable. Cross-country reinforcement specimens come from the UK Office for National Statistics where they document the same pattern with measured before/after evidence.
When a publishable transcript or verbatim respondent or interviewer exchange is available, the exhibit labels it Direct quote and preserves the wording except for light truncation marked with ellipses. When the source reports a pattern, summarizes a response, or describes an issue without a usable transcript, the exhibit labels the reconstruction Editorial illustration in place. Editorial illustrations are evidence-shaped teaching devices built from the source description; they are not presented as verbatim records.
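To make that labeling hard to drop at render time, the provenance distinction can travel with the specimen data itself. The sketch below is illustrative only; the type and field names are assumptions, not the exhibit's actual data model.

```typescript
// Illustrative sketch only: type and field names are assumptions, not the
// exhibit's actual data model.
type Provenance =
  | {
      kind: "direct-quote";
      // Verbatim wording from the cited PDF; ellipses mark light truncation.
      text: string;
      citation: string;
    }
  | {
      kind: "editorial-illustration";
      // Reconstruction built from the source's description, not a transcript.
      text: string;
      basedOn: string; // the source passage the reconstruction works from
    };

interface Specimen {
  id: string;
  pattern: string; // one of the six failure-pattern labels
  provenance: Provenance;
}

// Rendering goes through one function, so the label can never be omitted.
function provenanceLabel(p: Provenance): string {
  return p.kind === "direct-quote" ? "Direct quote" : "Editorial illustration";
}
```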
Counterexamples are chosen for scope control, not for victory. Their job is to show a nearby form that does not obviously trigger the same diagnosis, so the six patterns do not turn into a hammer that hits every hard question on the page. A counterexample here is not a validated recommendation and should not be read as one. This is closer to The Pudding's method-note practice than to an optimization claim: show the sourcing, show the manual judgment, and keep the claim smaller than the temptation.
This exhibit uses a small, declared set of learning ideas rather than a generic "science of learning" halo. The Workbench asks for a prediction before an explanation because prequestions and pretests can increase later learning when learners then study the answer; the pattern is promising, but the size and generality of the benefit vary by procedure and assessment. Steven Pan and Shana Carpenter are the main grounding here. Their 2023 review supports pre-instruction questioning as a useful design move, while later synthesis also makes clear that benefits are often stronger for the prompted material than for broad untargeted transfer.
The six-pattern sequence stays fixed while cases vary because interleaving helps people learn discriminations between confusable categories better than long blocked runs often do. That is the reason to revisit the same label set across different specimens instead of exhausting one pattern before moving to the next. The relevant grounding is Matthias Brunmair and Tobias Richter's meta-analysis: encouraging overall, but clearly moderated by the type of material.
The loop is segmented because dense multimedia can overload attention. The five-beat shell gives the learner a stable path, and the pattern names provide advance vocabulary before the harder casework begins. That follows the segmenting and pre-training principles associated with Richard Mayer and colleagues, including the Mayer and Pilegard handbook chapter on managing essential processing.
The exhibit also accepts a mild amount of productive friction. Prediction usually feels worse than rereading, and learners often misread fluent performance as durable learning. That is the useful warning from Nicholas Soderstrom and Robert Bjork, and it is one reason the Workbench does not skip straight to diagnosis.
For learning category-like distinctions, adjacent comparison matters. That is why the micro-cases should stay short and contrastive. The most direct support is Nate Kornell and Robert Bjork's work on spaced and interleaved induction. The retrieval rationale for asking learners to commit before the reveal comes from Henry Roediger and Jeffrey Karpicke.
None of this should be oversold. Prequestioning research still contains live debates about mechanism and about how far benefits generalize beyond the specific prompted material, and retrieval-based transfer gains appear stronger under some conditions than others. A good colophon should say that plainly.
Accessibility is treated here as release work, not as a certification claim. The page is built to common WCAG 2.2 AA expectations, but coverage is bounded by what is actually tested rather than asserted in the abstract.
Tested in CI on every build: forced-colors.
Known gaps and pending work. Manual screen-reader walkthroughs in NVDA, JAWS, and VoiceOver are pending. Manual zoom audits at 200% and 400% are pending. Cross-browser audits in Firefox and Safari are pending. The exhibit is not yet certified against WCAG 2.2 AA; what is published here is the test surface a build cannot ship without.
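For the forced-colors check, a Playwright test can emulate the media feature before asserting on the page. This is a minimal sketch, not the exhibit's actual test file; the route and the asserted landmark are assumptions.

```typescript
// Minimal sketch, not the exhibit's actual test file; the route and the
// asserted landmark are assumptions.
import { test, expect } from "@playwright/test";

test("core content survives forced-colors and reduced motion", async ({ page }) => {
  // Emulate a high-contrast, reduced-motion environment before loading.
  await page.emulateMedia({ forcedColors: "active", reducedMotion: "reduce" });
  await page.goto("/"); // assumes baseURL is set in playwright.config

  // The check is that the main content is still reachable and visible,
  // not that any particular colors are applied.
  await expect(page.getByRole("main")).toBeVisible();
});
```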
Accessibility issues are taken seriously. Report bugs at contact@benlei.org or open an issue at github.com/leibenjamin/response-option-fit/issues, and please include browser, device, assistive technology, and reproduction steps.
This exhibit has no analytics, no ad tech, no cookies, and no third-party runtime requests. The only persistent storage is optional settings data saved to localStorage after an explicit opt-in. Workbench progress is not saved in this release. Stored data stays in the browser for this origin until you export it, replace it, or clear it. Export writes a local file; import reads one back on-device; Clear removes the stored data for this site.
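A minimal sketch of that storage contract follows, assuming an invented key name and settings shape; only the behavior described above (opt-in before any write, export to a local file, on-device import, clear) comes from these notes.

```typescript
// Sketch under assumptions: the storage key and settings shape are invented
// for illustration; the behavior mirrors the production notes above.
const STORAGE_KEY = "rofl:settings"; // assumed key name

interface StoredSettings {
  optedIn: boolean;
  reducedMotion?: boolean;
}

function saveSettings(settings: StoredSettings): void {
  // Nothing is written unless the user has explicitly opted in.
  if (!settings.optedIn) return;
  localStorage.setItem(STORAGE_KEY, JSON.stringify(settings));
}

function exportSettings(): void {
  // Export writes a local file; no network request is involved.
  const raw = localStorage.getItem(STORAGE_KEY) ?? "{}";
  const blob = new Blob([raw], { type: "application/json" });
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = "settings.json";
  a.click();
  URL.revokeObjectURL(url);
}

async function importSettings(file: File): Promise<void> {
  // Import reads a file back on-device and replaces the stored value.
  const parsed = JSON.parse(await file.text()) as StoredSettings;
  saveSettings({ ...parsed, optedIn: true });
}

function clearSettings(): void {
  // Clear removes the stored data for this origin.
  localStorage.removeItem(STORAGE_KEY);
}
```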
The performance budget is a release-review target, not a hard law: no network calls after initial load, JavaScript at or below 100 KB gzip, CSS at or below 35 KB gzip. Current release: JS 80.15 KB gzip, CSS 9.11 KB gzip, measured 2026-05-02. Regressions are editorially visible changes, not private engineering trivia; if a release exceeds the target, the colophon says why the added weight is worth it. If the reader's system asks for reduced motion, the exhibit keeps motion reduced. If forced-colors is active, the user's contrast choices win.
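One way to keep that budget visible at release review is a small script that gzips the built assets and compares the totals against the targets. The sketch below assumes a Vite-style dist layout and is not the project's actual tooling.

```typescript
// Rough sketch of a release-review size check; the paths and dist layout are
// assumptions, not the project's actual tooling. Targets come from the text.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";
import { gzipSync } from "node:zlib";

const DIST = "dist/assets";             // assumed Vite output directory
const BUDGET_KB = { js: 100, css: 35 }; // targets stated in the colophon

function gzipKb(file: string): number {
  return gzipSync(readFileSync(file)).length / 1024;
}

let jsKb = 0;
let cssKb = 0;
for (const name of readdirSync(DIST)) {
  const size = gzipKb(join(DIST, name));
  if (name.endsWith(".js")) jsKb += size;
  if (name.endsWith(".css")) cssKb += size;
}

console.log(`JS ${jsKb.toFixed(2)} KB gzip, CSS ${cssKb.toFixed(2)} KB gzip`);
// A budget miss fails the release review rather than shipping silently.
if (jsKb > BUDGET_KB.js || cssKb > BUDGET_KB.css) process.exit(1);
```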
Context: the exhibit needs a taxonomy small enough to learn and large enough to feel real. Decision: keep the scope to six recurrent option-level failures that can be named quickly and contrasted across specimens. Consequences: the page gets a reusable vocabulary and stronger transfer; it also excludes many real survey problems that live elsewhere, including sampling, routing, mode, translation, and estimation issues. Alternatives considered: a broader survey-defects atlas, or a looser scrapbook of examples without named categories. The chosen approach is narrower, but teachably narrow.
Context: once a taxonomy exists, there is pressure to turn it into a checklist or score. Decision: keep the diagnoses hand-authored and static. Consequences: every diagnosis remains inspectable, arguable, and tied to a source, but the exhibit scales more slowly and cannot pretend to automate review. Alternatives considered: heuristic scoring, LLM-generated diagnostics, or interactive rubrics. Those options would add speed and false authority at the same time. This exhibit should prefer explicit judgment over synthetic precision.
Context: the exhibit wants a transparent local settings surface without collecting accounts or behavioral data. Decision: save only the settings opt-in state in this release, and save it locally in the browser after explicit consent. Workbench progress is deliberately not persisted yet. Consequences: the privacy posture stays simple, there is no account surface to secure, and export/import remains possible for stored local data. Alternatives considered: automatic local saving, progress persistence, cloud sync, account-based persistence. Those would buy convenience at the cost of a very different data story.
Context: bespoke pages are tempting because each failure pattern invites its own visual treatment. Decision: keep one five-beat shell — Frame, Predict, Diagnose, Probe, Reveal — across the exhibit. Consequences: cognitive overhead drops, transfer improves, accessibility testing is more tractable, and the reader learns the shell once. The tradeoff is less local flourish. Alternatives considered: one custom layout per pattern or one long essay per pattern. Both would likely feel richer in isolation and weaker across the set.
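A type-level sketch of how a fixed shell might be enforced in the React codebase; the beat names come from the decision above, while the types themselves are assumptions.

```typescript
// Illustrative only: the beat names are from the decision above; the types
// are an assumption about how the shell might be enforced in code.
import type { ReactNode } from "react";

const BEATS = ["Frame", "Predict", "Diagnose", "Probe", "Reveal"] as const;
type Beat = (typeof BEATS)[number];

// Each specimen page supplies content for all five beats; the shell renders
// them in the fixed order, so no page can skip or reorder a beat.
type ShellContent = Record<Beat, ReactNode>;
```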
Context: early taxonomies often drift toward every familiar survey pathology, whether or not it belongs at the response-option layer. Decision: remove scale anchoring from the core six and replace it with forced precision. Consequences: the taxonomy gets sharper about response-option fit and less diffuse about general scale design. Alternatives considered: keeping both, merging them, or leaving scale anchoring in as a "classic." Forced precision earns the slot because it is easier to show in worked option sets and closer to the exhibit's central claim about failure before analysis.
This exhibit does not: recommend validated replacement wording, model respondent distributions, or stand in for cognitive testing, expert review, or field evidence.
The label 'forced_precision' is editorially synthesized; the underlying mechanisms are documented separately in QAS-99 §4b and Bradburn, Rips & Shevell (1987).
Body copy is set in Iowan Old Style with Charter, Source Serif Pro, and Georgia as fallbacks. Eyebrows, receipts, and utility text use JetBrains Mono with IBM Plex Mono and the platform monospace as fallbacks. Navigation, controls, and short labels use Inter with the platform sans-serif as a fallback. All faces are loaded from the user's system; no web fonts are fetched at runtime.
Built as a static React 18 + TypeScript site with Vite and tested with Playwright on a fixed Chromium build. No backend services. No third-party runtime origin.
Repository: github.com/leibenjamin/response-option-fit
License: MIT for source code; CC BY 4.0 for exhibit text, diagrams, and editorial illustrations; quoted source material remains under its original terms (U.S. Census Bureau working papers are public-domain U.S. Government works; UK Office for National Statistics material is published under the Open Government Licence v3.0).