An Empirical Protocol for the Veer

Appendix to “The Veer”

Alex Deva — May 2026

Research Question

Does the sustained presence of an actively engaged human sensor prevent trajectory drift in extended AI-driven research programs, where drift is measured as loss of contact with verifiable reality rather than loss of internal coherence?

Design

Three-condition between-subjects design with repeated measurement. Each unit is a complete research program running over a fixed number of turns.

Condition 1 (Instrument + Instrument). Two Claude Opus instances collaborating on a research program — one generating, one critiquing, alternating roles across turns. After an identical opening prompt and five scripted seed turns, no human involvement. The orchestration architecture mirrors whatever multi-agent setup the target claim uses. This condition directly operationalizes the claim that instruments can conduct genuine research without a sensor.

Condition 2 (Instrument + Sensor). One Claude Opus instance paired with a trained human researcher.

Condition 3 (Instrument + Instrument + Sensor). Two Claude Opus instances with a human researcher in the loop.

Pre-specified Condition 2/3 prediction. Circulatory epistemology predicts that Conditions 2 and 3 will perform equivalently on reality contact: the sensor is sufficient for course correction, and a second instrument adds capacity but not epistemic grounding. If Condition 3 significantly outperforms Condition 2 on the primary outcome, the critical variable is instrument-to-instrument critique rather than sensor presence — evidence against the framework.

Domain Selection

The research program is a focused empirical synthesis: What does the current evidence support regarding [a contested empirical question with upcoming replication data]? Candidate domains include a specific claim in social psychology awaiting large-scale replication, a pharmacological dose-response relationship with forthcoming trial results, or an ecological modeling prediction with near-term observational data.

The domain must satisfy two requirements, each providing a ground-truth channel that internal sophistication cannot fake:

Verifiable citation channel. The synthesis requires citing published papers. Papers either exist or they do not. Quoted claims either match the source text or they do not. Citation hallucination rate is a hard ground-truth signal.
Calibrated forecast channel. Each research program must generate specific, timestamped predictions about outcomes not yet known at study launch — replication effect sizes, trial results, or observational measurements. These predictions resolve against reality after the study period, providing a fully prospective measure of reality contact via Brier scores.

Both channels are deliberately orthogonal to internal coherence. A program can maintain perfect internal coherence (clear argumentation, consistent terminology, sophisticated methodology) while hallucinating citations and making poorly calibrated forecasts. The veer prediction is precisely that Condition 1 will do this: become more sophisticated and less grounded over time.

Equalization and Controls

Turn-matched, not time-matched. Each condition runs for exactly 200 research turns, eliminating the confound that Condition 1 could run continuously while humans need sleep. A “turn” is one complete exchange (prompt and response). The variable is not speed but trajectory.

Matched initialization. All conditions receive the same opening prompt, the same initial source materials, and the same five scripted seed turns. Divergence begins at turn 6. This ensures equivalent starting conditions and allows early-turn outputs to serve as a within-subjects baseline.

Sensor dose specification. The human researcher in Conditions 2 and 3 must meet a minimum engagement threshold: at least one substantive intervention per 10 turns, where “substantive” means independently verifying a claim against a primary source, redirecting a line of inquiry based on domain knowledge, or vetoing a conclusion. Passive acknowledgment (“looks good, continue”) does not count. All human interventions are logged with timestamps and coded for type (verification, redirection, veto, elaboration). This operationalizes the framework’s claim that active circulation matters — presence without engagement is not a sensor in the loop.
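The engagement threshold can be checked mechanically from the intervention log. The following is a minimal sketch, not part of the protocol: the `Intervention` record, the function name, and the choice of non-overlapping 10-turn windows starting at turn 6 are all assumptions made for illustration, as is the decision to treat "elaboration" as logged but not substantive.

```python
from dataclasses import dataclass

# Intervention types the protocol describes as substantive. "elaboration" is
# a logged code, but it is ASSUMED here not to count toward the threshold.
SUBSTANTIVE_TYPES = {"verification", "redirection", "veto"}


@dataclass(frozen=True)
class Intervention:
    turn: int  # turn number at which the human intervened
    kind: str  # coded type: verification, redirection, veto, elaboration


def windows_missing_dose(log, first_turn=6, last_turn=200, window=10):
    """Return the (start, end) turn windows with no substantive intervention.

    Windows are non-overlapping blocks of `window` turns beginning at
    `first_turn` (an assumed reading of "per 10 turns").
    """
    substantive_turns = {i.turn for i in log if i.kind in SUBSTANTIVE_TYPES}
    missing = []
    for start in range(first_turn, last_turn + 1, window):
        end = min(start + window - 1, last_turn)
        if not any(start <= t <= end for t in substantive_turns):
            missing.append((start, end))
    return missing


# Hypothetical log covering turns 6 to 35.
example_log = [
    Intervention(7, "verification"),
    Intervention(18, "elaboration"),
    Intervention(31, "veto"),
]
print(windows_missing_dose(example_log, last_turn=35))
```

In this hypothetical log the window covering turns 16 to 25 is flagged, because its only intervention is an elaboration. A sliding-window reading of the threshold would be stricter; whichever reading is intended should be fixed in the pre-registration.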

Blinding. External evaluators are blinded to condition. Final research outputs are stripped of metadata, conversation structure, and any markers revealing whether a human participated. Evaluators receive only the terminal synthesis document, the citation list, and the forecast set. Inter-rater reliability is assessed.

Outcome Measures

Primary outcome: Reality Contact Index (RCI). A composite score combining two equally weighted components, each measured at turns 50, 100, 150, and 200:

Citation verification rate. Proportion of cited papers that (a) exist, (b) are attributed to the correct authors, and (c) contain claims matching what the program attributes to them. Verified by research assistants against primary sources.
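Forecast calibration. Brier score on the program’s specific predictions, scored after the hold-out data become available. Lower Brier scores indicate better forecasts, so this component must be reverse-scored before it is combined with the citation verification rate.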
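One way the two components might be combined is sketched below. The protocol does not specify the combination formula, so several things here are assumptions: forecasts are expressed as probabilities of binary events (for example, "the replication effect size exceeds a stated threshold"), the Brier score is reversed as 1 minus Brier so that higher values mean better contact, and the dictionary keys and function names are hypothetical.

```python
def citation_verification_rate(citations):
    """Share of citations that pass all three checks: exists, authors, claim."""
    if not citations:
        raise ValueError("no citations to verify")
    passed = sum(
        1 for c in citations
        if c["exists"] and c["authors_correct"] and c["claim_matches"]
    )
    return passed / len(citations)


def brier_score(forecasts):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if not forecasts:
        raise ValueError("no resolved forecasts")
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


def reality_contact_index(citations, forecasts):
    """Equal-weight composite. ASSUMPTION: Brier is reversed as 1 - Brier."""
    return (0.5 * citation_verification_rate(citations)
            + 0.5 * (1.0 - brier_score(forecasts)))


# Made-up illustrative inputs, not study data.
example_citations = [
    {"exists": True, "authors_correct": True, "claim_matches": True},
    {"exists": True, "authors_correct": True, "claim_matches": True},
    {"exists": True, "authors_correct": True, "claim_matches": True},
    {"exists": False, "authors_correct": False, "claim_matches": False},
]
example_forecasts = [(0.8, 1), (0.3, 0), (0.6, 0)]  # (probability, outcome)
print(round(reality_contact_index(example_citations, example_forecasts), 3))
```

Both components lie in the interval from 0 to 1 under these assumptions, so the equal weighting is at least on a common scale. Whether the two components have comparable variance in practice is something the pilot would need to establish.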

Secondary outcome: Internal Coherence Score (ICS). Blinded evaluators rate each program’s output on argumentative structure, terminological consistency, logical validity, and sophistication on a 7-point anchored scale. This measure is included specifically to demonstrate that the veer is not quality degradation. The prediction is that Condition 1’s ICS will remain high — possibly higher than Conditions 2 and 3 — even as its RCI declines.

Trajectory measure. The critical prediction is not that Condition 1 scores lower overall, but that it diverges over time. The primary analysis is a condition-by-time interaction on RCI across the four measurement points (mixed-effects model with random intercepts for research program). A significant negative slope for Condition 1 relative to Conditions 2 and 3, with no corresponding decline in ICS, is the signature of the veer.
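Before fitting the mixed-effects model, the trajectory claim can be inspected with a simpler summary-statistic approach: fit a least-squares slope of RCI on turn number for each program, then compare mean slopes by condition. The sketch below does this. It is a descriptive stand-in, not the mixed-effects analysis itself, and the function names and example values are invented for illustration.

```python
from statistics import mean

TURNS = [50, 100, 150, 200]  # measurement points from the protocol


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def mean_slope_by_condition(programs):
    """programs: iterable of (condition, [RCI at each turn in TURNS])."""
    slopes = {}
    for condition, rci_values in programs:
        slopes.setdefault(condition, []).append(ols_slope(TURNS, rci_values))
    return {condition: mean(s) for condition, s in slopes.items()}


# Made-up illustrative trajectories, not study data.
example_programs = [
    (1, [0.80, 0.70, 0.60, 0.50]),
    (1, [0.90, 0.80, 0.70, 0.60]),
    (2, [0.80, 0.80, 0.80, 0.80]),
    (3, [0.70, 0.75, 0.80, 0.85]),
]
example_slopes = mean_slope_by_condition(example_programs)
print(example_slopes)
```

A clearly negative mean slope for Condition 1 alongside flat or positive slopes for Conditions 2 and 3 is the pattern the veer predicts. Running the same comparison on ICS would show whether coherence holds steady while RCI falls.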

Sample Size

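Each condition runs 10 independent research programs (different contested questions, different human researchers for Conditions 2 and 3), for 30 programs total. The target is a large effect (Cohen’s $d = 0.8$ for the divergence in RCI between Condition 1 and the sensor conditions at turn 200) with $\alpha = 0.05$ and power $= 0.80$. For a simple two-group comparison, that target requires roughly 25 to 26 programs per group, so $n = 10$ per condition is adequate only if the four repeated measurements per program recover the missing power or the true effect is considerably larger than $d = 0.8$; the pilot’s effect-size estimate is used to recalculate this. A small sample is defensible for a bold directional prediction: the veer claim predicts substantial divergence, not a marginal difference.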
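The sample-size arithmetic can be sanity-checked with a standard normal-approximation formula for a two-sided, two-sample comparison. The sketch below treats the turn-200 contrast as a plain two-group comparison and ignores the repeated measurements, so it is a conservative reference point and not the planned analysis. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample test (normal approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power of the same test with n units per group."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n / 2)  # expected test statistic under the alternative
    return Z.cdf(shift - z_alpha) + Z.cdf(-shift - z_alpha)


print(n_per_group(0.8))                  # per-group n needed for d = 0.8
print(round(approx_power(0.8, 10), 2))   # power achieved with n = 10 per group
```

Under this approximation, $d = 0.8$ calls for 25 programs per group (exact $t$-based calculations give about 26), and $n = 10$ per group yields power of roughly 0.43. The repeated-measures structure can raise power above this figure, by an amount that depends on the within-program correlation the pilot is meant to estimate.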

Kill Condition

The veer prediction is falsifiable. Pre-register the following on OSF before any condition begins:

If, at turn 200, the 95% confidence interval for the mean RCI difference between Condition 1 and the pooled Conditions 2/3 includes zero, AND the condition-by-time interaction on RCI is not significant ($p > 0.05$, partial $\eta^2 < 0.06$), the veer prediction is falsified. Circulatory epistemology would then need to account for why instruments maintain reality contact without a sensor.
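Because the kill condition is a conjunction of pre-specified thresholds, it can be frozen as a small decision function and registered alongside the protocol. This is a minimal sketch with a hypothetical function name and argument names; the inputs are whatever the registered analysis reports.

```python
def veer_falsified(ci_low, ci_high, p_interaction, partial_eta_sq):
    """Apply the pre-registered kill condition at turn 200.

    ci_low, ci_high: 95% CI for the mean RCI difference between Condition 1
    and the pooled Conditions 2/3.
    p_interaction, partial_eta_sq: condition-by-time interaction on RCI.
    """
    ci_includes_zero = ci_low <= 0.0 <= ci_high
    interaction_null = p_interaction > 0.05 and partial_eta_sq < 0.06
    return ci_includes_zero and interaction_null


# Hypothetical inputs for illustration only.
print(veer_falsified(-0.04, 0.06, 0.31, 0.02))   # True: both criteria met
print(veer_falsified(-0.20, -0.05, 0.01, 0.15))  # False: CI excludes zero
```

Writing the rule this way makes the conjunction explicit: a confidence interval that includes zero does not falsify the prediction on its own if the interaction is significant or its effect size reaches 0.06.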

Conversely, if Condition 1’s ICS remains high while its RCI declines relative to the sensor conditions, the veer is confirmed as an epistemological phenomenon distinct from engineering-level context degradation.

Feasibility

Pilot. Before the full protocol, a pilot of $n = 3$ per condition over 100 turns tests the measurement instruments, calibrates inter-rater reliability on the RCI components, and estimates the effect size for power recalculation.

Ethics. Human researcher participation requires standard IRB review (likely exempt or expedited — the researchers are professional collaborators, not subjects of intervention). Informed consent covers time commitment, data logging of all interactions, and publication of blinded results.

Pre-registration. The full protocol, including the kill condition and all analysis specifications, is registered on OSF before any condition begins.

Sources

Liu, N. F., et al. (2024). Lost in the middle: How language models use long contexts. Transactions of the ACL, 12, 157–173.
Du, J., et al. (2025). Context length alone hurts LLM performance despite perfect retrieval. arXiv:2510.05381.

The broader framework is at thepulsegoeson.com.