What is the difference between active learning and calibrated-confidence screening?

Active learning (AL) tools rank records for a human to review in order, whereas calibrated-confidence platforms autonomously issue an Include or Exclude decision for every record together with a confidence rating calibrated against observed error rates. The article frames the choice between them as one of scientific reliability, reproducibility, and risk transparency — not comfort with AI involvement.

Why are active-learning stopping rules a methodological concern?

AL’s safeguard assumes a human screens every surfaced record, so it is only as reliable as the rule that decides when to stop pre-labelling. The Kempny et al. (2026) simulation of 35,000 automated screenings across five SYNERGY datasets found that current stopping rules cannot reliably tell a researcher whether they have reached the last relevant record, and the system provides no calibrated signal at the point of decision.

How does calibrated confidence quantify the risk of a screening error?

Every Loon AI® decision carries an error probability stratified by confidence and decision type. In Janoudi et al. (2025), Very-High-confidence Excludes carry an error probability under 0.1%, and the lower-confidence outputs (~5% of all records) are routed to human review, so reviewers know what proportion of potential errors that targeting strategy is expected to absorb.

What performance has Loon Lens™ demonstrated?

Validated across 8 systematic reviews and 3,796 citations from Canada’s Drug Agency (Janoudi et al. 2025), Loon Lens™ reports 98.9% sensitivity (95% CI 97.6–100), 95.5% accuracy, 95.2% specificity, 99.9% negative predictive value, and 63.0% precision at baseline. Routing the ~5% of lower-confidence outputs to confidence-guided human review lifts precision from 63.0% to 89.9%, with sensitivity at 99.0%, accuracy at 99.0%, and specificity at 99.1%. The difference between the two sets of figures is the gain the confidence-guided workflow delivers over the baseline autonomous pass.

Why does reproducibility favour calibrated confidence over active learning?

Systematic review methodology requires a high level of inter-rater agreement. An AL workflow that yields different rankings on each run, and depends on a stopping rule that no two reviewers apply identically, cannot meet that standard. The stopping rule itself is arbitrary: there is no established evidence on what constitutes a good stopping point or whether a given rule reflects a sound screening approach. Calibrated-confidence decisions, by contrast, are transparent, can be pre-defined in the methods, and carry an auditable, quantified error signal at every record.

Calibrated Confidence over Active Learning: Methodological Foundations for Reliable Abstract Screening

Emerging Paradigms in Systematic Review Automation

Contemporary systematic review automation has converged on two structurally different paradigms for accelerating title and abstract screening: active learning (AL) tools that rank records for human review, and generative AI platforms that autonomously screen records with calibrated confidence outputs. The methodological choice between them has often been treated as a question of comfort with AI involvement. The published evidence reframes it as a question of scientific reliability, reproducibility, and risk transparency.

The conventional positioning of AL as the “lower-risk” automation strategy rests on the assumption that a human screens every record the model surfaces. In practice, this safeguard is only as strong as the stopping rule that determines when to terminate pre-screening. The Kempny et al. (2026) evidence shows that stopping rules in current use cannot reliably tell a researcher whether they have reached the last relevant record or stopped well short of it. The AL system itself provides no calibrated signal at the point of decision.

Active-learning approaches are predicated on the premise that researchers can stop manually pre-labelling records before reaching the end of a dataset without missing relevant studies. The Kempny et al. (2026) simulation, comprising 35,000 automated screenings across five SYNERGY datasets, shows that this premise does not survive empirical testing and fails in three independent ways.

Key Methodological Limitations of Active-Learning Screening:

Absence of a Validated Stopping Criterion: None of the three proposed stop criteria (sample-based extrapolation, consecutive irrelevants, fixed percentage) reliably identified all relevant studies across datasets. Criterion 1 was reliable in 2% of simulations; criterion 2 required thresholds ranging from 2% to 61% depending on dataset; criterion 3 admitted no universal value.
Unpredictable Workload Reduction: The proportion of records that had to be pre-labelled to identify all relevant studies varied from 2.9% to 76.9% on average across the five datasets, with worst-case runs requiring almost 95% of records to be screened on smaller corpora (Donners 2022, 975 records).
Per-Run Variability: Because the active-learning model is re-initialised at the start of each run, identical inputs do not yield identical screening trajectories, undermining the reproducibility that systematic review methodology requires.
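
The stopping-rule problem described above can be sketched with a toy example. The following is a minimal illustration, assuming a hypothetical ranked list of labels and a hypothetical helper named screen_until_consecutive_irrelevant; it is not the ASReview implementation or the simulation code from Kempny et al. (2026). It shows how a "consecutive irrelevants" rule can stop before the last relevant record is reached.

```python
def screen_until_consecutive_irrelevant(ranked_labels, threshold):
    """Screen a ranked list in order and stop once `threshold` irrelevant
    records have been seen in a row.

    Returns (records_screened, relevant_found).
    """
    consecutive = 0
    found = 0
    screened = 0
    for is_relevant in ranked_labels:
        screened += 1
        if is_relevant:
            found += 1
            consecutive = 0  # a relevant record resets the counter
        else:
            consecutive += 1
            if consecutive >= threshold:
                break  # the stopping rule fires here
    return screened, found


# Hypothetical ranking (1 = relevant, 0 = irrelevant). Two relevant records
# sit behind a run of five irrelevant ones.
ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
total_relevant = sum(ranking)

screened, found = screen_until_consecutive_irrelevant(ranking, threshold=5)
missed = total_relevant - found
print(f"screened {screened} of {len(ranking)}, missed {missed} of {total_relevant} relevant")
```

In this toy ranking a threshold of 5 stops after 9 records and misses 2 of 5 relevant records, while a threshold of 6 screens all 15 and finds every one. The reviewer cannot know in advance which threshold is sufficient, which is the dataset dependence the simulation reports.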

Kempny et al. (2026) conclude that current AL stopping rules “should be applied cautiously” and that published practice (stopping after 100 consecutive irrelevants, or 5% of the dataset) is “highly questionable”; they also report that a single simulation under one of the tested rules missed 84 relevant studies. The paper provides the first systematic empirical test of these rules. Each one fails.

Calibrated Confidence as a Methodological Alternative

The Loon Lens™ platform, validated in Janoudi et al. (2025) across 8 systematic reviews and 3,796 citations from Canada’s Drug Agency, addresses the same screening task through a structurally different mechanism. The Loon AI® engine produces an autonomous Include or Exclude decision for every record, accompanied by a categorical confidence rating calibrated against observed error rates.

Key Analytical Innovations:

Autonomous Decisioning with No Stop-Criterion Dependency: Every record receives a labelled decision, eliminating the methodologically unresolved question of when to stop screening.
Conditional Error Calibration: Error probability is explicitly stratified by confidence and decision. Very-High-confidence Excludes carry under 0.1% error probability, enabling defensible triage of records for human review.
Validated Performance with Confidence-Routed Review: Janoudi et al. (2025) reports baseline performance of 98.9% sensitivity (95% CI 97.6–100), 95.5% accuracy (94.8–96.1), 95.2% specificity (94.5–95.9), 99.9% negative predictive value (99.8–100), and 63.0% precision (58.4–67.3). Routing lower-confidence outputs (~5% of all records) to human review lifts sensitivity to 99.0%, accuracy to 99.0%, specificity to 99.1%, and precision to 89.9%.
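
For readers less familiar with the diagnostic-accuracy measures quoted above, the sketch below computes them from a confusion matrix using their standard definitions. The counts are hypothetical and chosen only for illustration; they are not the counts from Janoudi et al. (2025), and the function name screening_metrics is an assumption.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard diagnostic-accuracy measures for Include/Exclude screening.

    tp: relevant records correctly included
    fp: irrelevant records incorrectly included
    tn: irrelevant records correctly excluded
    fn: relevant records incorrectly excluded
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),  # share of relevant records included
        "specificity": tn / (tn + fp),  # share of irrelevant records excluded
        "precision": tp / (tp + fp),    # share of Includes that are relevant
        "npv": tn / (tn + fn),          # share of Excludes that are irrelevant
        "accuracy": (tp + tn) / total,
    }


# Hypothetical counts for a low-prevalence review (100 relevant in 1,000).
metrics = screening_metrics(tp=90, fp=30, tn=870, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When relevant records are rare, a small number of false Includes is enough to pull precision well below sensitivity and negative predictive value. That is why precision is the measure with the most room to improve when borderline decisions are sent to human review.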

Quantified Risk and Methodological Reproducibility

Two methodological tests determine whether a screening approach can stand up to scientific scrutiny: the risk of error at each decision must be quantified in a way the reviewer can act on, and identical inputs must produce identical outputs across reviewers and runs. Active learning fails both tests.

The calibrated-confidence design addresses the first test by construction. Every Loon AI® decision carries a quantified error probability that can be acted on. Lower-confidence decisions are routed to human review, and the conditional error matrix tells the reviewer what proportion of expected errors that targeting strategy absorbs. No equivalent diagnostic exists in active-learning ranking: a record sitting at position 800 of 1,000 carries no calibrated risk signal distinguishing it from one at position 50, beyond its rank.
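
The routing logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the confidence labels other than Very High, every error probability in ERROR_PROB, and the names triage and expected_errors are hypothetical placeholders, not values or interfaces from Loon Lens™ or Janoudi et al. (2025).

```python
# Hypothetical conditional error matrix: P(error | decision, confidence).
# Placeholder values for illustration only, not published figures.
ERROR_PROB = {
    ("Exclude", "Very High"): 0.001,
    ("Exclude", "High"): 0.01,
    ("Exclude", "Low"): 0.20,
    ("Include", "Very High"): 0.10,
    ("Include", "High"): 0.30,
    ("Include", "Low"): 0.50,
}

# Confidence levels that the protocol (assumed) sends to human review.
REVIEW_LEVELS = {"Low"}


def triage(decisions):
    """Split (record_id, decision, confidence) tuples into routed and automatic."""
    routed, automatic = [], []
    for record in decisions:
        confidence = record[2]
        (routed if confidence in REVIEW_LEVELS else automatic).append(record)
    return routed, automatic


def expected_errors(records):
    """Expected number of errors among records, from the conditional error matrix."""
    return sum(ERROR_PROB[(decision, confidence)] for _, decision, confidence in records)


decisions = [
    (1, "Exclude", "Very High"),
    (2, "Exclude", "Very High"),
    (3, "Include", "High"),
    (4, "Exclude", "Low"),
    (5, "Include", "Low"),
]

routed, automatic = triage(decisions)
absorbed = expected_errors(routed) / expected_errors(decisions)
print(f"routed {len(routed)} of {len(decisions)} records, "
      f"covering {absorbed:.1%} of expected errors")
```

The point of the sketch is the final ratio: with a conditional error matrix, the share of expected errors covered by human review can be stated before the review is carried out.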

On the second test, reproducibility, the comparison becomes one-sided. Systematic review methodology requires that independent researchers, given identical inputs, arrive at identical screening decisions. An AL workflow that produces different rankings on each run, and depends on a stopping rule that no two reviewers will apply identically, cannot meet this standard. The problem compounds beyond the run-to-run variability itself: AL provides no reliable way to tell the researcher which trajectory is better, which ranking surfaces relevant studies more completely, or which records may have been mis-ranked. The researcher has neither a deterministic output to defend nor a calibrated signal with which to interrogate the one they have. A platform that produces a deterministic, calibrated decision for every record can meet that standard.
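
One simple way to check the reproducibility requirement in practice is to compare a fingerprint of the decisions from two runs. The sketch below is an illustration only; the record identifiers and the function name decision_digest are hypothetical, and this is not a feature of any named platform.

```python
import hashlib
import json


def decision_digest(decisions):
    """Return a SHA-256 digest of a run's decisions, independent of record order.

    decisions: dict mapping record ID to "Include" or "Exclude".
    """
    canonical = json.dumps(sorted(decisions.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical runs over the same three records.
run_a = {"rec-001": "Include", "rec-002": "Exclude", "rec-003": "Exclude"}
run_b = {"rec-003": "Exclude", "rec-001": "Include", "rec-002": "Exclude"}
run_c = {"rec-001": "Include", "rec-002": "Include", "rec-003": "Exclude"}

print(decision_digest(run_a) == decision_digest(run_b))  # True: same decisions
print(decision_digest(run_a) == decision_digest(run_c))  # False: one decision differs
```

Recording such a digest alongside the protocol gives a second team a direct test of whether they reproduced the screening output.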

A third consideration belongs alongside these two tests: the validation gap of pre-screening stopping criteria. The Loon Lens™ confidence calibration has been validated in a peer-reviewed diagnostic-accuracy study reported in line with the STARD statement (Janoudi et al., 2025). The active-learning stop criteria evaluated in Kempny et al. (2026) had not previously been subjected to systematic empirical test. The result of the first such test is that all three fail. This is more than a calibration question: it is a foundational scientific result with direct implications for any HTA submission, regulatory dossier, or Cochrane-aligned review that relies on AL-assisted screening today.

Implementation Considerations for Reproducible Evidence Synthesis

Adoption of calibrated-confidence screening at the scale of contemporary evidence generation requires:

Pre-specified confidence thresholds for human review, declared in the protocol alongside the search strategy
Routine reporting of confidence-stratified error probabilities in PRISMA-compatible documentation
Bootstrapped performance estimates for each review, not platform-level averages applied uniformly
Auditable decision logs sufficient for regulatory and HTA submissions
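
The per-review bootstrapped estimate listed above can be sketched as follows. This is a minimal percentile-bootstrap illustration using only the standard library; the choice of the percentile method, the function names, and the counts are assumptions for illustration and do not describe the procedure used in Janoudi et al. (2025).

```python
import random


def sensitivity(pairs):
    """Sensitivity from (is_relevant, predicted_include) pairs."""
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def bootstrap_ci(pairs, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over the records."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(pairs)
    estimates = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        value = stat(sample)
        if value == value:  # drop NaN from resamples with no relevant records
            estimates.append(value)
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lower, upper


# Hypothetical review: 40 relevant records (38 included, 2 missed) and
# 160 irrelevant records (150 excluded, 10 included).
pairs = ([(True, True)] * 38 + [(True, False)] * 2
         + [(False, False)] * 150 + [(False, True)] * 10)

point = sensitivity(pairs)
lower, upper = bootstrap_ci(pairs, sensitivity)
print(f"sensitivity {point:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
```

Because resampling draws only from the review's own records, the interval reflects that review's size and prevalence rather than a platform-level average.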

The methodological case for moving beyond active learning is no longer a question of appetite for AI involvement. It is a question of whether the screening method produces results that survive the basic scientific tests of reproducibility, calibrated risk, and validated performance. The published evidence places the Loon Lens™ platform and its Loon AI® engine on the side of scientifically rigorous testing.

References

Adamse et al. (2024). The risk and protective factors for entering organized crime groups and their association with different entering mechanisms: A systematic review using ASReview. European Journal of Criminology, 21(6):859–886. doi.org/10.1177/14773708241250278

Janoudi et al. (2025). Validating Loon Lens 1.0 for Autonomous Abstract Screening and Confidence-Guided Human-in-the-Loop Workflows in Systematic Reviews. Value in Health. doi.org/10.1016/j.jval.2025.09.008

Kempny et al. (2026). When to stop reviewing: validation of stop criteria in ASReview. BMC Medical Research Methodology, 26(1):109. doi.org/10.1186/s12874-026-02866-5

Warren et al. (2023). Functional magnetic resonance imaging, deep learning, and Alzheimer’s disease: A systematic review. Journal of Neuroimaging, 33(1):5–18. doi.org/10.1111/jon.13063
