Open-Science · Reproducible · Version 4

Gap Scanner Methodology

How LOOM identifies genuine CRISPR research gaps, and why we stopped trusting single-query PubMed searches. The v4 pipeline runs three independent search strategies (including NCBI ontology-derived synonym expansion) and cross-validates the results against Europe PMC. Only claims that survive all four steps are marked Confirmed gap in the search interface.

Version 4 · March 2026 · 6 confirmed gene gaps · 51 false claims retracted

Why Single-Query Analysis Fails

Earlier versions (v1 and v2) of this tool used a single PubMed query per gene (e.g., "HIV gag CRISPR") and treated a result count of zero as evidence of a research gap. This produced 43 claimed gaps — nearly all false.

Root cause: vocabulary mismatch

CRISPR literature uses many interchangeable terms (guide RNA, sgRNA, Cas9, Cas12a, SHERLOCK, DETECTR) and gene synonyms (env ≡ gp160, pol ≡ reverse transcriptase). A narrow query missing even one synonym returns zero results, falsely signaling novelty. HIV gag, for example, has 669 CRISPR publications — only discoverable with broad synonym expansion.

v4 fixes this with four independent steps: three search strategies (including NCBI ontology synonym expansion from 8,283 gene annotations) and organism-level cross-validation. Any gene that returns zero evidence at all four steps is assigned Confirmed gap confidence. Genes returning mixed signals are marked Uncertain. Genes with clear coverage are marked False claim and suppressed.

Four-Strategy Scan

Each pathogen–gene combination is evaluated independently in four steps: three search strategies followed by Europe PMC cross-validation. A gap is confirmed only when all three strategies return zero evidence and Europe PMC also returns none.

Strategy 1

PubMed Narrow Query

High-precision search: "[gene] CRISPR [pathogen]" with official gene symbol + common aliases. MeSH terms applied where available. Returns a count and sample PMIDs for manual review.
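A minimal sketch of how such a narrow query could be sent to PubMed, assuming the NCBI E-utilities ESearch endpoint with JSON output. The helper names (`narrow_query`, `esearch_url`, `parse_esearch`) are hypothetical, not LOOM's actual code, and MeSH handling is left out for brevity.

```python
import json
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (public API).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def narrow_query(gene_terms, pathogen):
    """Build a high-precision query: (gene OR alias ...) AND CRISPR AND pathogen.

    gene_terms is the official symbol plus common aliases (hypothetical input).
    """
    genes = " OR ".join(f'"{term}"[Title/Abstract]' for term in gene_terms)
    return f'({genes}) AND CRISPR[Title/Abstract] AND "{pathogen}"[Title/Abstract]'


def esearch_url(term, retmax=5):
    """Return the ESearch URL; retmax caps how many sample PMIDs come back."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_esearch(payload):
    """Extract (count, sample PMIDs) from an ESearch JSON response body."""
    result = json.loads(payload)["esearchresult"]
    # ESearch reports the count as a string, so convert it explicitly.
    return int(result["count"]), list(result["idlist"])
```

Fetching the URL is deliberately left to the caller (for example with `urllib.request.urlopen`), so rate limiting and timeouts can be handled in one place. A timed-out request should be recorded as "not executed" rather than as a zero count.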

Strategy 2

PubMed Broad Query

Synonym-expanded query using NCBI Taxonomy, Disease Ontology, and MONDO synonym sets. Adds organism names, all gene aliases, and CRISPR family terms (sgRNA, gRNA, Cas9, Cas12a, SHERLOCK, DETECTR). A zero here is strong evidence of a gap.
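The expansion can be sketched as three OR-groups joined by AND. This is an illustration under stated assumptions: the synonym lists are hard-coded here, whereas the pipeline described above derives them from ontology synonym sets, and `or_group` and `broad_query` are hypothetical names.

```python
# CRISPR family terms named in this document, plus "CRISPR" itself.
CRISPR_TERMS = ["CRISPR", "sgRNA", "gRNA", "guide RNA",
                "Cas9", "Cas12a", "SHERLOCK", "DETECTR"]


def or_group(terms):
    """Quote each term and join with OR, dropping case-insensitive duplicates."""
    seen, unique = set(), []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        raise ValueError("each synonym group needs at least one term")
    return "(" + " OR ".join(f'"{t}"' for t in unique) + ")"


def broad_query(gene_synonyms, organism_synonyms, crispr_terms=CRISPR_TERMS):
    """AND together the gene, organism, and CRISPR synonym groups."""
    groups = (gene_synonyms, organism_synonyms, crispr_terms)
    return " AND ".join(or_group(group) for group in groups)


# Example using the env / gp160 synonym pair mentioned earlier.
query = broad_query(["env", "gp160"], ["HIV", "human immunodeficiency virus"])
```

Because every group is an OR over all known synonyms, a paper only has to use one term from each group to match, which is why a zero from this query carries more weight than a zero from the narrow one.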

Strategy 3

Corpus Co-occurrence

Searches the local LOOM text corpus (PubMed abstracts + preprints) for co-occurrence of pathogen name, gene synonym, and any CRISPR application term. Catches papers that PubMed has not yet indexed.
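A co-occurrence check of this kind might look like the sketch below, assuming the corpus is available as a mapping from document ID to abstract text. The function names and the toy corpus are illustrative only.

```python
import re


def _pattern(terms):
    """Compile one case-insensitive pattern matching any term as a whole token."""
    # Longest terms first so multi-word synonyms win over their prefixes.
    alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    # Lookarounds stop short symbols such as "env" matching inside "environment".
    return re.compile(rf"(?<![A-Za-z0-9])(?:{alternatives})(?![A-Za-z0-9])", re.IGNORECASE)


def cooccurring_docs(corpus, pathogen_names, gene_synonyms, crispr_terms):
    """Return IDs of documents that mention all three term groups."""
    patterns = [_pattern(group) for group in (pathogen_names, gene_synonyms, crispr_terms)]
    return [doc_id for doc_id, text in corpus.items()
            if all(p.search(text) for p in patterns)]


# Toy corpus with made-up abstracts, for illustration only.
toy_corpus = {
    "doc-a": "CRISPR-Cas12a detection assay targeting the mpox B5R gene",
    "doc-b": "Antibody response to mpox B5R after vaccination",
}
hits = cooccurring_docs(toy_corpus, ["mpox", "monkeypox"], ["B5R"],
                        ["CRISPR", "Cas9", "Cas12a"])
```

Only `doc-a` mentions a pathogen name, a gene synonym, and a CRISPR term together, so it is the only hit. Token-boundary matching matters here because many gene symbols are short enough to appear inside unrelated words.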

Cross-validation

Europe PMC

All candidates from strategies 1–3 are independently queried against the Europe PMC full-text index. Europe PMC ingests preprints and non-MEDLINE journals that PubMed misses.
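As a rough sketch, the cross-validation step could use the Europe PMC REST search service, whose JSON responses carry a `hitCount` field. The helper names and the `disagrees` rule are assumptions made for illustration, not LOOM's implementation.

```python
import json
from urllib.parse import urlencode

# Europe PMC RESTful search endpoint (public API).
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def europe_pmc_url(query, page_size=25):
    """Build a search URL that asks for lightweight JSON results."""
    params = {"query": query, "format": "json",
              "resultType": "lite", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def parse_hit_count(payload):
    """Read the total number of matching records from a response body."""
    return int(json.loads(payload)["hitCount"])


def disagrees(pubmed_count, europe_pmc_count):
    """True when exactly one source reports zero: a signal for manual review."""
    return (pubmed_count == 0) != (europe_pmc_count == 0)
```

A full-text hit in Europe PMC may be no more than a passing co-mention in a review article, so a disagreement is better treated as a reason for manual inspection than as proof that the gene has been studied.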

Confidence Levels

Every pathogen–gene combination receives one of three labels after the four-step scan:

| Label | Criteria | Action in LOOM |
| --- | --- | --- |
| Confirmed gap | All 3 strategies + Europe PMC return zero publications. At least 2 strategies must have been executed (not timed out). | Gene appears in unstudied_genes. "Confirmed gap" filter in the search UI shows these targets. Research-gap badge on targets. |
| Uncertain | Mixed signals: one strategy finds papers but another does not, or Europe PMC disagrees with PubMed. Manual review required. | Not marked as gap. Not marked as studied. No badge rendered. |
| False claim | Any strategy returns ≥1 publication with direct evidence of CRISPR work on that gene/pathogen. | Gene appears in studied_genes. "Published" filter in the search UI shows these targets. |
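The labelling rules above can be sketched as a small decision function. It assumes that a timed-out step is recorded as `None`, that Europe PMC must itself have run and returned zero for a gap to be confirmed, and that "direct evidence" is a flag set during manual review. The `ScanResult` type and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanResult:
    """Publication counts per step; None means the step timed out."""
    pubmed_narrow: Optional[int]
    pubmed_broad: Optional[int]
    corpus: Optional[int]
    europe_pmc: Optional[int]
    direct_evidence: bool = False  # assumed to be set by a human reviewer


def classify(result: ScanResult) -> str:
    """Map one pathogen-gene scan to a confidence label."""
    if result.direct_evidence:
        return "False claim"
    strategies = [result.pubmed_narrow, result.pubmed_broad, result.corpus]
    executed = [count for count in strategies if count is not None]
    all_zero = all(count == 0 for count in executed)
    # Require at least two executed strategies, all zero, plus a zero from Europe PMC.
    if len(executed) >= 2 and all_zero and result.europe_pmc == 0:
        return "Confirmed gap"
    return "Uncertain"


label = classify(ScanResult(pubmed_narrow=0, pubmed_broad=0, corpus=0, europe_pmc=240))
```

With zero PubMed and corpus hits but a non-zero Europe PMC count, the function returns "Uncertain", matching the mixed-signals row of the table. Defaulting every remaining case to "Uncertain" means a gap is never confirmed by accident.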

Confirmed Research Gaps (8 genes, 2 pathogens)

These are the only gene–pathogen combinations LOOM marks as Confirmed gap in the search interface as of March 2026. All claims below survived the full four-strategy pipeline with NCBI ontology synonym expansion. The eight gene names listed include two alias pairs (J2R/thymidine kinase and A56R/hemagglutinin), so they correspond to six unique genes. RSV SH protein, previously confirmed under v3, was reclassified to UNCERTAIN after ontology-derived synonyms surfaced 278 Europe PMC entries.

8 Confirmed gaps
51 Retracted false claims
2 Pathogens with gaps
| Pathogen | Gene(s) | Evidence of absence |
| --- | --- | --- |
| Mpox | A33R, B5R, E8L, J2R, A56R, thymidine kinase, hemagglutinin | All 3 PubMed strategies + Europe PMC returned zero CRISPR-diagnostic publications for each gene. 7 independent confirmations. |
| Cholera | hapA (hemagglutinin protease) | All strategies zero for hapA specifically. ctxA/ctxB, ompU, and other cholera genes were confirmed false claims and retracted. |
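Two of the Mpox names in this table are alternate names for genes already listed (J2R/thymidine kinase and A56R/hemagglutinin, as noted in the retraction summary). A minimal sketch of that deduplication, with a hypothetical alias map keyed by pathogen and name:

```python
# Alias map: (pathogen, alternate name) -> canonical gene symbol.
# Both pairs are the ones this document identifies as duplicates.
ALIASES = {
    ("Mpox", "thymidine kinase"): "J2R",
    ("Mpox", "hemagglutinin"): "A56R",
}

# The eight pathogen-gene names listed in the table above.
LISTED = [
    ("Mpox", "A33R"), ("Mpox", "B5R"), ("Mpox", "E8L"), ("Mpox", "J2R"),
    ("Mpox", "A56R"), ("Mpox", "thymidine kinase"), ("Mpox", "hemagglutinin"),
    ("Cholera", "hapA"),
]


def unique_genes(pairs, aliases):
    """Collapse alternate names onto canonical symbols and drop duplicates."""
    canonical = {(pathogen, aliases.get((pathogen, gene), gene))
                 for pathogen, gene in pairs}
    return sorted(canonical)


genes = unique_genes(LISTED, ALIASES)
assert len(LISTED) == 8 and len(genes) == 6
```

Keying the alias map by pathogen as well as name keeps a generic label such as "hemagglutinin" from being merged across organisms.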

SARS-CoV-2 Gene Gaps: Full Four-Strategy Scan Results

The full four-strategy v4 PubMed + Europe PMC scan with ontology-derived synonym expansion was completed (March 2026) across all 18 SARS-CoV-2 genes that have novel FM-index candidates. No gene reached CONFIRMED status.

| Gene | v4 Classification | PubMed broad | Notes |
| --- | --- | --- | --- |
| nsp13 (Helicase) | FALSE | 6 | CRISPR work exists but all therapeutic/mechanistic; 0 diagnostic-specific papers |
| nsp14 (ExoN/MTase) | FALSE | 9 | Mixed; includes false query matches; 0 diagnostic-specific papers |
| nsp5 (Mpro / 3CLpro) | FALSE | 6 | Active drug target, well-studied |
| nsp15, nsp16 | UNCERTAIN | 1 each | Only 1 PubMed paper each; Europe PMC >300 (likely full-text co-mention) |
| nsp3, nsp8 | UNCERTAIN | 3 each | Low but non-zero; needs human paper inspection |
| nsp7, nsp9, nsp10 | UNCERTAIN | 0–1 | 0 PubMed narrow, Europe PMC 174–313 (co-mention) |
| ORF3a, ORF8 | UNCERTAIN | 1–2 | Very few papers; novel FM-index targets exist for these loci |
| ORF10 | UNCERTAIN | 0 | 0 PubMed narrow & broad; Europe PMC 240 (co-mention prevents CONFIRMED) |

Interpretation: SARS-CoV-2 has received extraordinary CRISPR research attention since 2020. Even low-prominence accessory proteins (ORF10, nsp7) appear in Europe PMC due to full-text co-mention in review articles, preventing the CONFIRMED verdict. The FM-index finding of 52 novel target regions with no overlap with published diagnostic guide sets remains valid. The most under-explored loci for CRISPR diagnostic guide design (0 dedicated diagnostic PubMed papers) are nsp7, nsp9, nsp10, and ORF10 — all UNCERTAIN pending manual inspection.

Impact on website: The "Confirmed gap (v4 verified)" filter does not apply to SARS-CoV-2 targets (no CONFIRMED genes). The 52 novel targets are visible without a gap badge. The PACMAN/CARVER 0% coverage benchmarking result is a separate fully-verified finding (real Aho-Corasick scan of 9.2M genomes) and is unaffected.

What Was Retracted (and Why)

v2 published 43 gap claims. The v3 scan retracted 50 claims (43 gaps + 7 application-level claims that survived the gap screen but failed the application-context check). The v4 scan reclassified 1 additional claim (RSV SH protein → UNCERTAIN via ontology synonyms) and deduplicated J2R/thymidine kinase and A56R/hemagglutinin as alternate names for the same Mpox genes, yielding 6 confirmed gene-level gaps (5 unique biological targets). Below are the most significant:

Full Audit Trail & Source Code

All data, code, and decision logs are open and versioned. The audit file contains per-gene evidence strings, PubMed query URLs, and evidence-of-absence notes for every confirmed gap.
