Gap Scanner Methodology
How LOOM identifies genuine CRISPR research gaps — and why we stopped trusting single-query PubMed searches. The v4 pipeline uses four independent strategies (including NCBI ontology-derived synonym expansion) and cross-validates against Europe PMC. Only claims surviving all strategies are marked Confirmed gap in the search interface.
Why Single-Query Analysis Fails
Earlier versions (v1 and v2) of this tool used a single PubMed query per gene
(e.g., "HIV gag CRISPR") and treated a result count of zero as
evidence of a research gap. This produced 43 claimed gaps — nearly all false.
Root cause: vocabulary mismatch
CRISPR literature uses many interchangeable terms (guide RNA, sgRNA, Cas9, Cas12a, SHERLOCK, DETECTR) and gene synonyms (env ≡ gp160, pol ≡ reverse transcriptase). A narrow query missing even one synonym returns zero results, falsely signaling novelty. HIV gag, for example, has 669 CRISPR publications — only discoverable with broad synonym expansion.
v4 fixes this with four independent search strategies (including NCBI ontology synonym expansion from 8,283 gene annotations) and organism-level cross-validation. A gene for which all four strategies return zero publications is labeled Confirmed gap. Genes returning mixed signals are labeled Uncertain. Genes with clear coverage are labeled False claim and suppressed.
Four-Strategy Scan
Each pathogen–gene combination is evaluated independently in four steps. A gap is confirmed only when strategies 1–3 return zero publications and the Europe PMC cross-validation also returns zero.
PubMed Narrow Query
High-precision search: "[gene] CRISPR [pathogen]" with
official gene symbol + common aliases. MeSH terms applied where
available. Returns a count and sample PMIDs for manual review.
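As a rough illustration of the narrow query, the sketch below builds a search term and a request URL for the public NCBI E-utilities ESearch endpoint, then reads the count and sample PMIDs from a JSON response. The function names and the exact term layout are assumptions made for this example, not LOOM's actual code, and MeSH handling is omitted.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def narrow_query(gene_aliases, pathogen):
    """Hypothetical term layout: (alias OR alias) AND CRISPR AND pathogen."""
    genes = " OR ".join(f'"{a}"[Title/Abstract]' for a in gene_aliases)
    return f'({genes}) AND CRISPR[Title/Abstract] AND "{pathogen}"[Title/Abstract]'

def esearch_url(term, retmax=20):
    """Build the request URL; retmax caps the number of sample PMIDs returned."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

def parse_esearch(payload):
    """Extract (count, sample PMIDs) from a decoded ESearch JSON payload."""
    result = payload["esearchresult"]
    return int(result["count"]), list(result["idlist"])

term = narrow_query(["hapA", "hemagglutinin protease"], "Vibrio cholerae")
url = esearch_url(term)
```

The URL is only constructed here, not fetched, so the sketch runs offline. A zero count from this step alone is not treated as evidence of a gap.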
PubMed Broad Query
Synonym-expanded query using NCBI Taxonomy, Disease Ontology, and MONDO synonym sets. Adds organism names, all gene aliases, and CRISPR family terms (sgRNA, gRNA, Cas9, Cas12a, SHERLOCK, DETECTR). A zero here is strong evidence of a gap.
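A minimal sketch of how a synonym-expanded query might be assembled. The CRISPR family terms come from the list above; the helper names and the decision to AND three OR-groups together are assumptions for illustration, and the synonym lists would in practice come from the ontology sources rather than being typed by hand.

```python
CRISPR_TERMS = ["CRISPR", "sgRNA", "gRNA", "Cas9", "Cas12a", "SHERLOCK", "DETECTR"]

def or_group(terms):
    """Quote each term and join with OR, dropping case-insensitive duplicates."""
    seen, quoted = set(), []
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            quoted.append(f'"{term}"')
    return "(" + " OR ".join(quoted) + ")"

def broad_query(gene_synonyms, organism_names, crispr_terms=CRISPR_TERMS):
    """A paper must match one gene synonym, one organism name, and one CRISPR term."""
    groups = [or_group(gene_synonyms), or_group(organism_names), or_group(crispr_terms)]
    return " AND ".join(groups)

# env and gp160 are the synonym pair mentioned earlier on this page.
query = broad_query(["env", "gp160", "ENV"], ["HIV", "human immunodeficiency virus"])
```

Because every group is an OR over all known names, a paper that says "gp160" and "Cas12a" is matched even though it never uses the words "env" or "CRISPR".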
Corpus Co-occurrence
Searches the local LOOM text corpus (PubMed abstracts + preprints) for co-occurrence of a pathogen name, a gene synonym, and any CRISPR application term within the same document. Catches papers that PubMed has not yet indexed at scan time.
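The co-occurrence check can be sketched as a three-way filter over a document collection. The corpus below is a toy dictionary invented for this example, and the whole-word, case-insensitive matching rule is an assumption; the real corpus format and matching rules are not described here.

```python
import re

def mentions_any(text, terms):
    """True if any term appears as a whole word (case-insensitive)."""
    return any(
        re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text, re.IGNORECASE)
        for term in terms
    )

def cooccurrence_hits(corpus, pathogen_names, gene_synonyms, crispr_terms):
    """Return IDs of documents that mention all three term groups."""
    return [
        doc_id
        for doc_id, text in corpus.items()
        if mentions_any(text, pathogen_names)
        and mentions_any(text, gene_synonyms)
        and mentions_any(text, crispr_terms)
    ]

# Toy corpus, invented for illustration only.
toy_corpus = {
    "doc1": "We designed sgRNA libraries against the HIV gag region.",
    "doc2": "HIV gag structural analysis by cryo-EM.",
    "doc3": "Cas9 editing in zebrafish embryos.",
}
hits = cooccurrence_hits(toy_corpus, ["HIV"], ["gag"], ["CRISPR", "sgRNA", "Cas9"])
```

Only doc1 mentions a pathogen name, a gene synonym, and a CRISPR term together, so it is the only hit.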
Europe PMC
All candidates from strategies 1–3 are independently queried against Europe PMC full-text index. Europe PMC ingests preprints and non-MEDLINE journals that PubMed misses.
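A sketch of the cross-validation step, assuming the public Europe PMC REST search endpoint and its `hitCount` response field. The agreement rule in `cross_validate` is a simplification invented for this example; it only encodes the idea that a zero in PubMed alongside a non-zero count in Europe PMC is a mixed signal.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def europe_pmc_url(query, page_size=25):
    """Build the request URL for a JSON search response."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"

def hit_count(payload):
    """Read the total number of matches from a decoded search response."""
    return int(payload["hitCount"])

def cross_validate(pubmed_count, europe_pmc_count):
    """Hypothetical agreement check between the two indexes."""
    if pubmed_count == 0 and europe_pmc_count == 0:
        return "agree: zero"
    if pubmed_count == 0 or europe_pmc_count == 0:
        return "disagree"  # mixed signal, so the gene cannot be confirmed
    return "agree: nonzero"

# ORF10 figures from the SARS-CoV-2 results on this page: PubMed 0, Europe PMC 240.
orf10_status = cross_validate(0, hit_count({"hitCount": 240}))
```

Full-text indexing is the reason the two sources can disagree: a gene named once in a review article counts as a Europe PMC hit even when no abstract mentions it.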
Confidence Levels
Every pathogen–gene combination receives one of three labels after the four-step scan:
| Label | Criteria | Action in LOOM |
|---|---|---|
| Confirmed gap | All 3 strategies + Europe PMC return zero publications. At least 2 strategies must have been executed (not timed out). | Gene appears in unstudied_genes. "Confirmed gap" filter in the search UI shows these targets. Research-gap badge on targets. |
| Uncertain | Mixed signals: one strategy finds papers but another does not, or Europe PMC disagrees with PubMed. Manual review required. | Not marked as gap. Not marked as studied. No badge rendered. |
| False claim | Any strategy returns ≥1 publication with direct evidence of CRISPR work on that gene/pathogen. | Gene appears in studied_genes. "Published" filter in the search UI shows these targets. |
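The labeling rules in the table can be sketched as a small decision function. This is one possible reading, not the pipeline's actual code: it assumes Europe PMC is treated as one of four results, that a timed-out strategy is recorded as `None` and ignored, and that "direct evidence" is a flag set during review. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class StrategyResult:
    name: str
    count: Optional[int]           # None means the strategy timed out or did not run
    direct_evidence: bool = False  # set on review when a hit is real CRISPR work on the target

def classify(results: Sequence[StrategyResult]) -> str:
    executed = [r for r in results if r.count is not None]
    # Any reviewed hit with direct evidence settles the question.
    if any(r.count >= 1 and r.direct_evidence for r in executed):
        return "False claim"
    # Zero everywhere, with at least two strategies actually executed.
    if len(executed) >= 2 and all(r.count == 0 for r in executed):
        return "Confirmed gap"
    # Everything else is a mixed or insufficient signal.
    return "Uncertain"

# ORF10 counts as reported on this page; no corpus figure is given there,
# so that strategy is treated as not run for this illustration.
orf10 = [
    StrategyResult("pubmed_narrow", 0),
    StrategyResult("pubmed_broad", 0),
    StrategyResult("corpus", None),
    StrategyResult("europe_pmc", 240),
]
label = classify(orf10)
```

Under these assumptions ORF10 comes out as Uncertain: the 240 Europe PMC hits have not been reviewed as direct evidence, but they prevent an all-zero result.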
Confirmed Research Gaps (8 genes, 2 pathogens)
These are the only gene–pathogen combinations LOOM marks as Confirmed gap in the search interface as of March 2026. All claims below survived the full four-strategy pipeline with NCBI ontology synonym expansion. RSV SH protein, previously confirmed under v3, was reclassified to UNCERTAIN after ontology-derived synonyms surfaced 278 Europe PMC entries.
| Pathogen | Gene(s) | Evidence of absence |
|---|---|---|
| Mpox | A33R, B5R, E8L, J2R, A56R, thymidine kinase, hemagglutinin | All three search strategies + Europe PMC returned zero CRISPR-diagnostic publications for each listed name. 7 independent confirmations; J2R/thymidine kinase and A56R/hemagglutinin are alternate names for the same genes. |
| Cholera | hapA (hemagglutinin protease) | All strategies zero for hapA specifically. ctxA/ctxB, ompU, and other cholera genes were confirmed false claims and retracted. |
SARS-CoV-2 Gene Gaps: Full Four-Strategy Scan Results
The full four-strategy v4 PubMed + Europe PMC scan with ontology-derived synonym expansion was completed (March 2026) across all 18 SARS-CoV-2 genes that have novel FM-index candidates. No gene reached CONFIRMED status.
| Gene | v4 Classification | PubMed broad | Notes |
|---|---|---|---|
| nsp13 (Helicase) | FALSE | 6 | CRISPR work exists but all therapeutic/mechanistic; 0 diagnostic-specific papers |
| nsp14 (ExoN/MTase) | FALSE | 9 | Mixed — includes false query matches; 0 diagnostic-specific papers |
| nsp5 (Mpro / 3CLpro) | FALSE | 6 | Active drug target, well-studied |
| nsp15, nsp16 | UNCERTAIN | 1 each | Only 1 PubMed paper each; Europe PMC >300 (likely full-text co-mention) |
| nsp3, nsp8 | UNCERTAIN | 3 each | Low but non-zero; needs human paper inspection |
| nsp7, nsp9, nsp10 | UNCERTAIN | 0–1 | 0 PubMed narrow; Europe PMC 174–313 (co-mention) |
| ORF3a, ORF8 | UNCERTAIN | 1–2 | Very few papers; novel FM-index targets exist for these loci |
| ORF10 | UNCERTAIN | 0 | 0 PubMed narrow & broad; Europe PMC 240 (co-mention prevents CONFIRMED) |
Interpretation: SARS-CoV-2 has received extraordinary CRISPR research attention since 2020. Even low-prominence accessory proteins (ORF10, nsp7) appear in Europe PMC due to full-text co-mention in review articles, preventing the CONFIRMED verdict. The FM-index finding of 52 novel target regions with no overlap with published diagnostic guide sets remains valid. The most under-explored loci for CRISPR diagnostic guide design (0 dedicated diagnostic PubMed papers) are nsp7, nsp9, nsp10, and ORF10 — all UNCERTAIN pending manual inspection.
Impact on website: The "Confirmed gap (v4 verified)" filter does not apply to SARS-CoV-2 targets (no CONFIRMED genes). The 52 novel targets are visible without a gap badge. The PACMAN/CARVER 0% coverage benchmarking result is a separate fully-verified finding (real Aho-Corasick scan of 9.2M genomes) and is unaffected.
What Was Retracted (and Why)
v2 published 43 gap claims. The v3 scan retracted 50 claims (43 gaps + 7 application-level claims that survived the gap screen but failed the application-context check). The v4 scan reclassified 1 additional claim (RSV SH protein → UNCERTAIN via ontology synonyms) and deduplicated J2R/thymidine kinase and A56R/hemagglutinin as alternate names for the same Mpox genes, yielding 6 confirmed gene-level gaps (5 unique biological targets). Below are the most significant:
- HIV: all genes (gag, pol, env, tat, rev, nef, vif, integrase, gp120, LTR) were false claims. HIV gag alone has 669 CRISPR publications; the original narrow queries missed all of them.
- Influenza: HA, PB1, PB2, PA, NP, M2 have hundreds of papers each.
- Ebola: NP, VP40, VP30, VP24, L are well-studied.
- Zika: NS2A, NS2B, NS4A have published CRISPR diagnostic work.
- TB: pncA, CFP-10 are confirmed in the literature.
- Cholera: ctxA, ctxB, ompU, ompT, hlyA, rtxA were confirmed as false claims; only hapA survived.
- All 14 application-level gaps (diagnostics/therapeutics per pathogen) were application-context false positives and have been fully retracted.
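The alias deduplication described in this section can be sketched as a canonicalization step before counting. The alias pairs (J2R/thymidine kinase, A56R/hemagglutinin) are the ones named above; the lookup table and function name are hypothetical and only illustrate how eight listed names reduce to six gene-level gaps.

```python
# Alternate names that refer to the same Mpox gene, keyed by (pathogen, name).
ALIASES = {
    ("Mpox", "thymidine kinase"): "J2R",
    ("Mpox", "hemagglutinin"): "A56R",
}

def canonical(pathogen, gene):
    """Map an alternate gene name to its canonical symbol, if one is known."""
    return pathogen, ALIASES.get((pathogen, gene), gene)

mpox_names = ["A33R", "B5R", "E8L", "J2R", "A56R", "thymidine kinase", "hemagglutinin"]
listed = [("Mpox", name) for name in mpox_names] + [("Cholera", "hapA")]

# Collapse aliases, then count distinct pathogen/gene pairs.
unique_gaps = sorted({canonical(pathogen, gene) for pathogen, gene in listed})
```

Keying the alias table by pathogen keeps a name such as "hemagglutinin" from being merged across unrelated organisms.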
Full Audit Trail & Source Code
All data, code, and decision logs are open and versioned. The audit file contains per-gene evidence strings, PubMed query URLs, and evidence-of-absence notes for every confirmed gap.