Emerging Paradigms in Systematic Review Automation
Contemporary systematic review automation has converged on two structurally different paradigms for accelerating title and abstract screening: active learning (AL) tools that rank records for human review, and generative AI platforms that autonomously screen records with calibrated confidence outputs. The methodological choice between them has often been treated as a question of comfort with AI involvement. The published evidence reframes it as a question of scientific reliability, reproducibility, and risk transparency.
The conventional positioning of AL as the "lower-risk" automation strategy rests on the assumption that a human screens every record the model surfaces. In practice, this safeguard is only as strong as the stopping rule that determines when to terminate pre-screening. The Kempny et al. (2026) evidence shows that stopping rules in current use cannot reliably tell a researcher whether they have reached the last relevant record or stopped well short of it. The AL system itself provides no calibrated signal at the point of decision.
Active-learning approaches rest on the premise that researchers can stop manually pre-labelling records before reaching the end of a dataset without missing relevant studies. The Kempny et al. (2026) simulation, comprising 35,000 automated screenings across five SYNERGY datasets, demonstrates that this premise has not survived empirical validation and fails in three independent ways.
Key Methodological Limitations of Active-Learning Screening:
Absence of a Validated Stopping Criterion: None of the three proposed stop criteria (sample-based extrapolation, consecutive irrelevants, fixed percentage) reliably identified all relevant studies across datasets. Criterion 1 was reliable in 2% of simulations; criterion 2 required thresholds ranging from 2% to 61% depending on dataset; criterion 3 admitted no universal value.
Unpredictable Workload Reduction: The proportion of records that had to be pre-labelled to identify all relevant studies varied from 2.9% to 76.9% on average across the five datasets, with worst-case runs requiring almost 95% of records to be screened on smaller corpora (Donners 2022, 975 records).
Per-Run Variability: Because the active-learning model is re-initialised at the start of each run, identical inputs do not yield identical screening trajectories, undermining the reproducibility that systematic review methodology requires.
Kempny et al. (2026) conclude that current AL stopping rules "should be applied cautiously," that published practice (stopping after 100 consecutive irrelevants, or 5% of the dataset) is "highly questionable," and that a single simulation under one of the tested rules missed 84 relevant studies. The paper provides the first systematic empirical test of these rules. Each one fails.
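To make the stopping problem concrete, the two simpler rules discussed above (consecutive irrelevants and fixed percentage) can be sketched in a few lines. This is an illustrative sketch only, not the procedure used by Kempny et al. (2026) or by any AL tool: the function names and the toy ranking are assumptions, and the sample-based extrapolation rule is omitted because its details are not given here.

```python
import math
from typing import Optional, Sequence


def stop_after_consecutive_irrelevant(labels: Sequence[int], n: int) -> Optional[int]:
    """Return how many records were screened when n irrelevant records
    (label 0) have been seen in a row, or None if the rule never fires."""
    run = 0
    for screened, label in enumerate(labels, start=1):
        run = 0 if label == 1 else run + 1
        if run >= n:
            return screened
    return None


def stop_after_fixed_fraction(labels: Sequence[int], fraction: float) -> int:
    """Return how many records are screened under a fixed-percentage rule."""
    return min(len(labels), math.ceil(fraction * len(labels)))


def missed_relevant(labels: Sequence[int], stop_at: Optional[int]) -> int:
    """Count relevant records (label 1) ranked after the stopping point."""
    if stop_at is None:  # rule never fired, so every record was screened
        return 0
    return sum(labels[stop_at:])


# Hypothetical ranked screening order: 1 = relevant, 0 = irrelevant.
ranked = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

stop = stop_after_consecutive_irrelevant(ranked, n=3)
print(stop, missed_relevant(ranked, stop))  # 7 1
```

In this toy ranking the rule fires after seven records and leaves one relevant record unscreened. Nothing in the sequence seen up to that point tells the reviewer that this has happened, which is the gap the simulation results describe.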
Calibrated Confidence as a Methodological Alternative
The Loon Lens™ platform, validated in Janoudi et al. (2025) across 8 systematic reviews and 3,796 citations from Canada's Drug Agency, addresses the same screening task through a structurally different mechanism. The Loon AI® engine produces an autonomous Include or Exclude decision for every record, accompanied by a categorical confidence rating calibrated against observed error rates.
Key Analytical Innovations:
Autonomous Decisioning with No Stop-Criterion Dependency: Every record receives a labelled decision, eliminating the methodologically unresolved question of when to stop screening.
Conditional Error Calibration: Error probability is explicitly stratified by confidence and decision. Very-High-confidence Excludes carry under 0.1% error probability, enabling defensible triage of records for human review.
Validated Performance with Confidence-Routed Review: Janoudi et al. (2025) report baseline performance of 98.9% sensitivity (95% CI 97.6-100), 95.5% accuracy (94.8-96.1), 95.2% specificity (94.5-95.9), 99.9% negative predictive value (99.8-100), and 63.0% precision (58.4-67.3). Routing lower-confidence outputs (~5% of all records) to human review lifts sensitivity to 99.0%, accuracy to 99.0%, specificity to 99.1%, and precision to 89.9%.
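For readers less familiar with the diagnostic-accuracy measures listed above, the following sketch shows how each is derived from a 2x2 confusion matrix. The counts are hypothetical and chosen for readability; they are not the data from Janoudi et al. (2025), and the function name is an assumption.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard diagnostic-accuracy measures from confusion-matrix counts.

    tp: relevant records screened as Include
    fp: irrelevant records screened as Include
    tn: irrelevant records screened as Exclude
    fn: relevant records screened as Exclude
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for a 1,000-record review with 10% prevalence.
metrics = screening_metrics(tp=90, fp=50, tn=850, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, precision (about 0.643) sits well below negative predictive value (about 0.988). That pattern is typical when relevant records are a small minority of the corpus, because even a modest false-positive rate produces many false Includes relative to true ones.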
Quantified Risk and Methodological Reproducibility
Two methodological tests determine whether a screening approach can stand up to scientific scrutiny: the risk of error at each decision must be quantified in a way the reviewer can act on, and identical inputs must produce identical outputs across reviewers and runs. Active learning fails both tests.
The calibrated-confidence design addresses the first test by construction. Every Loon AI® decision carries a quantified error probability that the reviewer can act on. Lower-confidence decisions are routed to human review, and the conditional error matrix tells the reviewer what proportion of errors that targeting strategy is expected to absorb. No equivalent diagnostic exists in active-learning ranking: beyond its rank, a record sitting at position 800 of 1,000 carries no calibrated risk signal distinguishing it from one at position 50.
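The arithmetic behind a conditional error matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the confidence labels other than "very_high", the stratum counts, and the names `Stratum` and `routing_summary` are all hypothetical and do not reproduce the platform's actual categories or the published results.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Stratum:
    decision: str    # "include" or "exclude"
    confidence: str  # hypothetical labels: "low", "high", "very_high"
    n_records: int
    n_errors: int    # decisions in this stratum that disagreed with the reference

    @property
    def error_probability(self) -> float:
        return self.n_errors / self.n_records


def routing_summary(strata: Iterable[Stratum], routed: Set[str]) -> dict:
    """Share of records sent to human review, and share of all errors
    that fall inside the routed confidence levels."""
    strata = list(strata)
    total_records = sum(s.n_records for s in strata)
    total_errors = sum(s.n_errors for s in strata)
    routed_strata = [s for s in strata if s.confidence in routed]
    routed_records = sum(s.n_records for s in routed_strata)
    routed_errors = sum(s.n_errors for s in routed_strata)
    return {
        "share_of_records_routed": routed_records / total_records,
        "share_of_errors_absorbed": routed_errors / total_errors if total_errors else 0.0,
    }


# Hypothetical validation counts, for illustration only.
matrix = [
    Stratum("exclude", "very_high", 800, 1),
    Stratum("exclude", "high", 100, 4),
    Stratum("include", "high", 50, 5),
    Stratum("include", "low", 30, 12),
    Stratum("exclude", "low", 20, 8),
]

print(routing_summary(matrix, routed={"low"}))
```

In this toy matrix, routing only the low-confidence strata sends 5% of records to a human and covers two thirds of the observed errors. A protocol could use the same calculation to compare candidate thresholds before screening begins.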
On the second test, reproducibility, the comparison becomes one-sided. Systematic review methodology requires that independent researchers, given identical inputs, arrive at identical screening decisions. An AL workflow that produces different rankings on each reviewer run, and depends on a stopping rule that no two reviewers will apply identically, cannot meet this standard. The problem compounds beyond the run-to-run variability itself: AL provides no reliable way to tell the researcher which trajectory is better, which ranking surfaces relevant studies more completely, or which records may have been mis-ranked. The researcher has neither a deterministic output to defend nor a calibrated signal with which to interrogate the one they have. A platform that produces a deterministic, calibrated decision for every record can meet that standard.
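One simple way to check the reproducibility requirement in practice is to fingerprint the full set of screening decisions from each run and compare the results. The sketch below is a generic approach using only the Python standard library; it is an assumption for illustration, not a feature described for any specific platform, and `decision_fingerprint` is a hypothetical name.

```python
import hashlib
import json
from typing import Mapping


def decision_fingerprint(decisions: Mapping[str, str]) -> str:
    """SHA-256 digest of record-id/decision pairs in a canonical order,
    so that two runs match only if every decision is identical."""
    canonical = json.dumps(sorted(decisions.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical decisions from two runs over the same three records.
run_a = {"rec-001": "include", "rec-002": "exclude", "rec-003": "exclude"}
run_b = {"rec-003": "exclude", "rec-001": "include", "rec-002": "exclude"}
run_c = {"rec-001": "include", "rec-002": "include", "rec-003": "exclude"}

print(decision_fingerprint(run_a) == decision_fingerprint(run_b))  # True
print(decision_fingerprint(run_a) == decision_fingerprint(run_c))  # False
```

Sorting the pairs before hashing makes the fingerprint independent of the order in which records were processed, so only a change in an actual decision alters the digest.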
A third consideration belongs alongside these two tests: the validation gap in pre-screening stopping criteria. The Loon Lens™ confidence calibration has been validated in a peer-reviewed diagnostic-accuracy study reported in line with the STARD statement (Janoudi et al., 2025). The active-learning stop criteria evaluated in Kempny et al. (2026) had not previously been subjected to systematic empirical test. The result of the first such test is that all three fail. Beyond a calibration question, this is a foundational scientific result with direct implications for any HTA submission, regulatory dossier, or Cochrane-aligned review that relies on AL-assisted screening today.
Implementation Considerations for Reproducible Evidence Synthesis
Adoption of calibrated-confidence screening at the scale of contemporary evidence generation requires:
Pre-specified confidence thresholds for human review, declared in the protocol alongside the search strategy
Routine reporting of confidence-stratified error probabilities in PRISMA-compatible documentation
Bootstrapped performance estimates for each review, not platform-level averages applied uniformly
Auditable decision logs sufficient for regulatory and HTA submissions
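The bootstrapped per-review estimates listed above can be produced with a percentile bootstrap over the review's own records. The sketch below is one common way to do this and is not necessarily the method used in Janoudi et al. (2025); the function name, the default of 2,000 resamples, and the toy data are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_sensitivity_ci(
    pairs: Sequence[Tuple[int, int]],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for sensitivity.

    pairs: (reference_label, screening_decision) per record, 1 = include.
    A fixed seed keeps the interval reproducible across runs.
    """
    rng = random.Random(seed)
    estimates: List[float] = []
    for _ in range(n_boot):
        sample = rng.choices(pairs, k=len(pairs))  # resample with replacement
        tp = sum(1 for truth, pred in sample if truth == 1 and pred == 1)
        fn = sum(1 for truth, pred in sample if truth == 1 and pred == 0)
        if tp + fn == 0:
            continue  # resample drew no relevant records; sensitivity undefined
        estimates.append(tp / (tp + fn))
    if not estimates:
        raise ValueError("no resample contained a relevant record")
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lower, upper


# Hypothetical review: 20 relevant records (18 found, 2 missed), 80 irrelevant.
review = [(1, 1)] * 18 + [(1, 0)] * 2 + [(0, 0)] * 80
print(bootstrap_sensitivity_ci(review))
```

Because the interval is computed from a single review's records, it widens when that review contains few relevant studies, which is the information a platform-level average would hide.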
The methodological case for moving beyond active learning is no longer a question of appetite for AI involvement. It is a question of whether the screening method produces results that survive the basic scientific tests of reproducibility, calibrated risk, and validated performance. The published evidence places the Loon Lens™ platform and its Loon AI® engine on the side of scientifically rigorous testing.
References
Adamse et al. (2024). The risk and protective factors for entering organized crime groups and their association with different entering mechanisms: A systematic review using ASReview. European Journal of Criminology, 21(6):859–886. https://doi.org/10.1177/14773708241250278
Janoudi et al. (2025). Validating Loon Lens 1.0 for Autonomous Abstract Screening and Confidence-Guided Human-in-the-Loop Workflows in Systematic Reviews. Value in Health. https://doi.org/10.1016/j.jval.2025.09.008
Kempny et al. (2026). When to stop reviewing: validation of stop criteria in ASReview. BMC Medical Research Methodology, 26(1):109. https://doi.org/10.1186/s12874-026-02866-5
Warren et al. (2023). Functional magnetic resonance imaging, deep learning, and Alzheimer's disease: A systematic review. Journal of Neuroimaging, 33(1):5–18. https://doi.org/10.1111/jon.13063