Plain-Language Guide: Pangenomic CRISPR Discovery

1

The Problem: CRISPR Diagnostics Have Been Looking at the Wrong Part of the Virus

Why do tests fail after mutations?

From the paper — Introduction

CRISPR-based nucleic acid detection (SHERLOCK, DETECTR, PACMAN) has demonstrated rapid pathogen detection, but guide RNA design has followed historical RT-PCR conventions: structural and surface genes (Spike, Nucleocapsid, Envelope) predominate. Two problems follow… immune-exposed structural genes accumulate mutations under selection, degrading diagnostic sensitivity — as seen when D614G swept the SARS-CoV-2 Spike and invalidated SHERLOCK_S.

What this means in plain English: When a new virus appears, scientists design diagnostic tests by building guide RNAs: short molecular "address labels" that steer CRISPR's molecular "scissors" (Cas enzymes) to specific parts of the virus's genetic sequence. The problem is that everyone gravitates to the same familiar parts (the Spike protein, the Nucleocapsid) because those were studied first.

But those surface parts of the virus are exactly what the immune system attacks. Under immune pressure, the virus mutates rapidly there to escape. The famous D614G mutation in the Spike protein is a real example: it became so dominant that a widely-used CRISPR diagnostic (SHERLOCK_S) stopped working because it targeted a version of the Spike that had mostly disappeared from circulation.

🔑

Analogy: Imagine a burglar alarm that only sounds when the burglar wears a red jacket. Once the burglar learns to wear blue, the alarm stops working. CRISPR diagnostics targeting the Spike protein face the same problem — the virus keeps changing its jacket.

The fix? Target the burglar's fingerprints: internal machinery that can't change much without the virus losing its ability to replicate. That's what the replication enzymes are: functionally essential machinery under such strong evolutionary constraint that most mutations there cripple the virus.

This paper identifies 52 target locations in that machinery: places that no published CRISPR diagnostic has targeted, and that stayed stable across 9.2 million virus sequences worldwide.

2

The Retraction We Made to Ourselves

An earlier version of this work was wrong — and we caught it

From the paper — Note on methodology self-correction

An earlier version of this work claimed 43 literature gaps by running a single narrow PubMed query per gene. A v3 re-verification using three evidence strategies retracted 34 of those 43; 9 survived. A subsequent v4 scanner — powered by NCBI ontology-derived synonym expansion (8,283 genes, 51 auto-synonyms) — reclassified RSV SH protein from CONFIRMED to UNCERTAIN and deduplicated J2R/thymidine kinase and A56R/hemagglutinin as alternate names for the same Mpox genes. 6 confirmed gene-level gaps remain (5 unique biological targets).

Why we're telling you this

Science is supposed to do this — find its own errors and correct them — but it usually happens very slowly through peer review of published papers, often years later. This correction happened in the same working session, before any public claim was made. We went from 43 candidate gaps to 6 confirmed ones (5 unique biological targets) by applying progressively stricter verification.

What went wrong in the first version: The original analysis searched PubMed with one simple query per gene: "[gene name] AND CRISPR". This seems reasonable but falls apart when the same gene has multiple names. Cholera's ctxA gene, for example, also appears in papers as "cholera toxin subunit A". Search for "ctxA AND CRISPR" and you find nothing. Search for its synonyms and you find 57 papers. That's not a real gap; it's a search failure.

The v4 scanner loads 8,283 gene annotations from public databases and automatically adds all known synonyms. The RSV SH protein example is particularly striking: when only the name "SH protein" was used, zero CRISPR papers were found, which looked like a confirmed gap. When the ontology added synonyms like "small hydrophobic protein," 278 papers appeared, and the gene was downgraded from CONFIRMED to UNCERTAIN. The apparent gap was a search artifact.
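The effect of synonym expansion on a search string can be sketched in a few lines. This is a minimal illustration, not the v4 scanner itself: the function name `build_query` and the small `SYNONYMS` table are hypothetical, and no request is sent to PubMed.

```python
def build_query(gene: str, synonyms: dict[str, list[str]], topic: str = "CRISPR") -> str:
    """Return a boolean search string covering a gene and its known aliases."""
    # Deduplicate while keeping order: the gene's own name first, then aliases.
    names = list(dict.fromkeys([gene, *synonyms.get(gene, [])]))
    # Quote each name so multi-word aliases are searched as phrases.
    alternatives = " OR ".join(f'"{name}"' for name in names)
    return f"({alternatives}) AND {topic}"


# Hypothetical synonym table; the guide says the real aliases come from NCBI annotations.
SYNONYMS = {"SH protein": ["small hydrophobic protein"]}

narrow = build_query("SH protein", {})
expanded = build_query("SH protein", SYNONYMS)
print(narrow)    # ("SH protein") AND CRISPR
print(expanded)  # ("SH protein" OR "small hydrophobic protein") AND CRISPR
```

The narrow query can only match papers that use one exact name, which is how a well-studied gene can look like a gap.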

📄 web/data/pubmed-scan-v4.json 📄 web/data/ontology-enrichment.json 📄 web/data/gap-audit-v4.txt

3

How It Was Done: Search Engines, Not Wet Lab

Pure computational analysis — all code and data are public

From the paper — Methods (§2.1)

9,193,298 SARS-CoV-2 sequences (262 GB)… every overlapping 20-mer was evaluated for panviral conservation (19,019 RefSeq viral genomes) and pangenome conservation… Filter for PAM compatibility (SpCas9 NGG or Cas12a TTTV). Exclude candidates overlapping any published CRISPR diagnostic guide region.

What a 20-mer is: SARS-CoV-2 has a genetic code 29,903 letters long (using the letters A, T, G, C). A "20-mer" is any 20-letter stretch within that code. Sliding a 20-letter window one position at a time from the beginning yields at most 29,884 overlapping windows (29,903 − 20 + 1); the paper evaluated 29,640 of them. Each one is a potential CRISPR target address.
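The sliding-window idea is easy to see in code. The sketch below is illustrative only: the function names are made up for this guide and the toy sequence is not real viral data. A sequence of length L contains L − 20 + 1 overlapping 20-mers, so an unfiltered slide over 29,903 letters gives 29,884 windows; 29,640 is the count the paper reports evaluating.

```python
def kmer_windows(sequence: str, k: int = 20):
    """Yield (start, kmer) for every overlapping k-letter window."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


def window_count(length: int, k: int = 20) -> int:
    """Number of overlapping k-letter windows in a sequence of the given length."""
    return max(length - k + 1, 0)


toy = "ATGCATGCATGCATGCATGCATGC"  # 24-letter toy sequence, not real viral data
windows = list(kmer_windows(toy))
print(len(windows))          # 5
print(window_count(29_903))  # 29884
```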

Two filters applied:

1. Cross-species conservation: Does this 20-mer appear in other viruses too? If yes, a CRISPR guide designed against it could work across many viruses, not just one strain. 1,503 out of 29,640 windows passed this test.

2. Pangenome stability: Is this 20-mer present, exactly as written, across the 9.2 million known SARS-CoV-2 genome sequences? At 99%+ conservation, 99 in every 100 virus genomes carry the exact sequence the guide is designed to recognize.
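Both filters come down to the same question: in what fraction of a genome collection does this exact string occur? A minimal sketch follows, assuming the genomes fit in memory as plain strings. That assumption does not hold for the 262 GB corpus, which the paper searches with an FM-index; the function name and the toy data are hypothetical.

```python
def conservation(target: str, genomes: list[str]) -> float:
    """Fraction of genomes that contain the target as an exact substring."""
    if not genomes:
        raise ValueError("need at least one genome")
    hits = sum(1 for genome in genomes if target in genome)
    return hits / len(genomes)


# Toy data: a 7-letter target and four short made-up "genomes".
toy_target = "GATTACA"
toy_genomes = ["CCGATTACAGG", "GATTACATTT", "AAGATTACA", "CCGATTGCAGG"]
score = conservation(toy_target, toy_genomes)
print(f"{score:.0%}")  # 75%
```

The last toy genome differs from the target by a single letter, which is enough for an exact-match test to count it as a miss.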

What "PAM compatibility" means: CRISPR scissors (Cas9, Cas12a) need a short landing pad called a PAM sequence to bind. Targets without a PAM site can't be used — filtering these out left 89 distinct target regions.
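A PAM check can be sketched as a pattern test on the letters next to each 20-mer. The code below is a simplified illustration, not the paper's pipeline: it looks only at the forward strand, it assumes the usual convention that SpCas9 reads NGG directly after the target and Cas12a reads TTTV (V = A, C, or G) directly before it, and the function name is hypothetical.

```python
import re


def pam_sites(sequence: str, k: int = 20) -> list[tuple[int, str]]:
    """Return (start, enzyme) for forward-strand k-mers with an adjacent PAM."""
    sites = []
    for start in range(len(sequence) - k + 1):
        end = start + k
        downstream = sequence[end:end + 3]            # letters right after the target
        upstream = sequence[max(start - 4, 0):start]  # up to 4 letters right before it
        if re.fullmatch(r"[ACGT]GG", downstream):     # SpCas9: NGG after the target
            sites.append((start, "SpCas9"))
        if re.fullmatch(r"TTT[ACG]", upstream):       # Cas12a: TTTV before the target
            sites.append((start, "Cas12a"))
    return sites


# Toy sequence: TTTA + a 20-letter target + AGG, so one window satisfies both enzymes.
toy = "TTTA" + "ACGTACGTACGTACGTACGT" + "AGG"
found = pam_sites(toy)
print(found)  # [(4, 'SpCas9'), (4, 'Cas12a')]
```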

What "overlapping published guides" means: If a target region is in the same part of the genome where an existing diagnostic already works (N gene, E gene, Spike, RdRp), it's not "novel" — it's just a variant of existing work. Removing these left 52 genuinely new regions.
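The novelty filter is an interval-overlap test: a candidate is dropped if its genome coordinates share even one position with a published guide region. A minimal sketch, with made-up coordinates and hypothetical function names:

```python
Interval = tuple[int, int]  # half-open [start, end) genome coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def novel_only(candidates: list[Interval], published: list[Interval]) -> list[Interval]:
    """Keep candidates that overlap no published guide region."""
    return [c for c in candidates if not any(overlaps(c, p) for p in published)]


# Made-up coordinates for illustration; not taken from the paper's data files.
candidates = [(100, 120), (500, 520), (1000, 1020)]
published = [(110, 130), (1019, 1040)]
kept = novel_only(candidates, published)
print(kept)  # [(500, 520)]
```

The third candidate is removed even though it shares only its final position with a published region, which matches the strict "any overlap" wording in the methods.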

29,640	Windows evaluated
1,503	Cross-virus conserved
89	PAM-compatible regions
52	Genuinely novel
0	Human genome hits

4

The 52 Targets: What Makes Them Special

Replication machinery that can't mutate without killing the virus

From the paper — Results (§3.1)

52 CRISPR-targetable regions show no overlap with any published diagnostic guide; 8 top candidates spanning ORF1a, nsp13 (Helicase), and nsp14 (ExoN) show 98.9–99.2% conservation across all 9.2 million genomes and zero exact matches in the human reference genome (GRCh38, 3.2 billion bp).

What ORF1a/b, nsp13, nsp14 are: These are the virus's internal factories. Nsp13 is a helicase — a molecular motor that unzips the genetic code so it can be copied. Nsp14 is an error-proofreading enzyme (ExoN = exonuclease) that checks and corrects replication errors. Without either of these, the virus cannot reproduce.

⚙️

Analogy: If the Spike protein is the lock on the virus's front door (which the body's immune system keeps trying to pick), the replication machinery is the engine under the hood. You can repaint the doors, change the locks, even replace the chassis. But if the engine breaks, the car doesn't move. That engine is what we're targeting.

98.9–99.2% conservation means: if you collected all 9.2 million known SARS-CoV-2 genomes and checked for this exact 20-letter sequence, you would find it, letter-perfect, in roughly 99 of every 100. The remaining 0.8–1.1% is still about 74,000 to 101,000 genomes at this scale. Compare this to the Spike protein target used by SHERLOCK_S (63.28%, meaning the target sequence has effectively disappeared from over a third of contemporary virus genomes).

Zero human genome hits: Every candidate was checked against the entire 3.2-billion-letter human reference genome (GRCh38). Not one candidate sequence appeared as an exact match in any human chromosome. This makes it unlikely that a test using these targets would accidentally trigger on human DNA, a critical safety requirement. Near matches (1 to 3 mismatches) were screened separately and are covered under the limitations.

📦 novel_targets_52.json

5

How Do They Compare to Existing Tests?

Within 1 percentage point of the best published guide (DETECTR_E)

From the paper — Results (§3.2)

Our replication-machinery candidates (98.9–99.2%) are within 1 percentage point of the best published diagnostic guide (DETECTR_E: 98.34%) on an identical 9.2-million-genome corpus. Conservation alone does not predict guide performance — all candidates require experimental validation.

The paper scanned the 52 novel targets and the published guides against their respective genome sets. The two sets differ slightly in size (~2.4%), a caveat the paper notes explicitly, so the comparison is close to like-for-like rather than exact. The conceptual picture is still clear: targets in mutation-constrained, functionally essential genes hold up; targets in immune-exposed surface proteins degrade.

Guide	Type	Conservation	Verdict
Our top targets (nsp13/14)	Novel / this paper	98.9–99.2%	Robust ✓
DETECTR_E (Broughton 2020)	Diagnostic	98.34%	Robust ✓
SHERLOCK_Orf1ab (Patchsung 2020)	Diagnostic	97.55%	Robust ✓
SHERLOCK_S (Patchsung 2020)	Historical context*	63.28%	Spike drift ✗
CARVER_CoV_con1 (Freije 2019)	Antiviral	<0.01%	Complete failure ✗

*SHERLOCK_S targeted a Spike region that later drifted. Comparing against it would inflate apparent advantage unfairly — it is shown here to illustrate why targeting conserved regions matters, not as a benchmark.

Note: PACMAN and CARVER are antiviral systems (designed to destroy viral RNA inside cells), not diagnostics. Their conservation numbers reflect antiviral durability, not diagnostic accuracy.

6

The 6 Confirmed Literature Gaps: Places Nobody Has Published

Mpox and cholera genes with zero CRISPR diagnostic papers

From the paper — Abstract

Four-strategy literature verification confirmed 6 genuine gene-level gaps (5 unique biological targets): Mpox (5 genes: A33R, B5R, E8L, J2R/thymidine kinase, A56R/hemagglutinin) and Vibrio cholerae (hapA). Each Mpox gene was verified under both its legacy name and protein function name; all query variants returned zero PubMed results.

What "literature gap" means: A gene where zero published CRISPR papers describe diagnostic guide design targeting that specific gene — confirmed by four independent search methods (PubMed narrow, PubMed with synonyms, Europe PMC, local paper corpus scan). These are white spaces on the scientific map.

Why Mpox genes have multiple names: Mpox (monkeypox) belongs to the poxvirus family. Poxvirus genes were named after the vaccinia virus (VACV) decades ago, and different scientific communities use different naming systems. A33R (the legacy name) is the same gene as what some papers call OPG130. J2R and "thymidine kinase" refer to the same gene (OPG101). The paper verified all naming variants returned zero CRISPR papers — the gap is real under every known name.

Why these are interesting targets: A33R, B5R, and E8L are mpox genes involved in how the virus spreads between cells and evades immune responses. HapA is Vibrio cholerae's hemagglutinin protease, involved in the bacterium's ability to colonize the intestine. The scan found no published paper that studies any of them as a CRISPR detection target.

🗺️

Think of it like unexplored territory on a map: The CRISPR diagnostic field has mapped most major roads (Spike, Nucleocapsid, RdRp). These 6 gene targets are equivalent to roads that exist on the terrain but haven't been charted yet. This paper says: here is where they are, here are the coordinates, here is proof nobody else has been there.

📄 web/data/pubmed-scan-v4.json → confirmed_gene_gaps

7

What the Literature Says About SARS-CoV-2: No Confirmed Gaps

Replication genes have papers — just no diagnostic-specific work

From the paper — Results (§3.1)

A full four-strategy v4 PubMed + Europe PMC scan found 6 non-diagnostic CRISPR papers for nsp13 (all therapeutic or mechanistic) and 9 for nsp14 (mixed, 0 diagnostic-specific), classifying both as FALSE — no CONFIRMED literature gaps were identified for any SARS-CoV-2 gene. Replication-associated proteins nsp7 and nsp10, along with accessory protein ORF10, carry UNCERTAIN status (0 PubMed, high Europe PMC co-mention) and represent the least-explored diagnostic targets.

This is an important nuance. The 52 novel CRISPR target sequences are genuinely untouched — no diagnostic has used those specific genomic addresses. But in the scientific literature, papers about nsp13 and nsp14 do exist; they're just about using CRISPR as a molecular biology tool to study these enzymes, not as a diagnostic test.

So there are two types of novelty here:

Sequence novelty: The specific 20-letter DNA addresses (52 of them) have not been used as CRISPR diagnostic targets. This is fully confirmed.

Literature novelty (CRISPR diagnostics only): For SARS-CoV-2, the literature scanner cannot confirm a "zero-paper" gap for replication genes because mechanistic CRISPR papers exist. The gap is in the diagnostic application, not in any paper mentioning the gene.

nsp7, nsp10, and ORF10 are the closest to true terra incognita: 0 PubMed results in any CRISPR context (though Europe PMC full-text co-mentions exist at high volume, keeping them UNCERTAIN rather than CONFIRMED gaps).

8

What This Study Cannot Tell You

Honest accounting of what the data does and doesn't prove

From the paper — Limitations (§4.4)

1. No experimental validation. All candidates are computational predictions. Conservation at pangenomic scale strongly suggests utility, but Tm, secondary structure, off-target in complex biological matrices, and guide efficiency require wet-lab validation.

2. Mismatch-tolerant off-target analysis completed (GRCh38, ≤3 mismatches). 0 exact matches, 58 one-mismatch positions across 22 guides (30/52 clean at ≤1 mm). Of the 8 highest-conservation candidates, 5 are completely clean at ≤1 mismatch.

3. Literature claims are time- and query-dependent.

What "computational prediction" means: Everything in this paper was done by computer — searching databases, counting matches, running scripts. No virus was grown in a lab, no CRISPR guide was synthesized, no diagnostic test was physically run. Computational predictions can be wrong.

Updated off-target results (March 2026): The paper originally flagged that only exact-match screening had been done. We subsequently ran a full genome-wide scan allowing up to 3 mismatches against the entire 3.2-billion-letter human genome. Results for the 52 final guides:

0	Exact matches (0mm)
58	1-mismatch positions (22 guides)
30/52	Guides clean at ≤1mm
5/8	Top candidates fully clean

In CRISPR practice, a single mismatch between a guide and an off-target site often reduces cutting activity, most strongly when the mismatch falls in the "seed region" (roughly the 12 bases nearest the PAM site). Mismatches far from the PAM are tolerated more readily. Off-target sites with 2 or more mismatches are generally considered low risk. The 3 top-8 candidates with a 1-mismatch hit need those positions checked before synthesis, which is standard, routine practice.
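Checking where a mismatch falls is a letter-by-letter comparison. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the 12-base seed length follows the figure used above, and the seed is taken from the end of the guide for a PAM that follows the target (SpCas9) or from the start for a PAM that precedes it (Cas12a).

```python
def mismatch_positions(guide: str, site: str) -> list[int]:
    """0-based positions where a guide and an equal-length genomic site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return [i for i, (g, s) in enumerate(zip(guide, site)) if g != s]


def seed_mismatches(guide: str, site: str, seed_len: int = 12,
                    pam_side: str = "3prime") -> list[int]:
    """Mismatch positions that fall within the seed_len bases nearest the PAM."""
    positions = mismatch_positions(guide, site)
    if pam_side == "3prime":  # PAM follows the target (SpCas9): seed is the tail
        return [p for p in positions if p >= len(guide) - seed_len]
    if pam_side == "5prime":  # PAM precedes the target (Cas12a): seed is the head
        return [p for p in positions if p < seed_len]
    raise ValueError("pam_side must be '3prime' or '5prime'")


# Made-up guide and near-match site, differing at positions 2 and 18.
guide = "ACGTACGTACGTACGTACGT"
site = "ACTTACGTACGTACGTACCT"
print(mismatch_positions(guide, site))                  # [2, 18]
print(seed_mismatches(guide, site))                     # [18]
print(seed_mismatches(guide, site, pam_side="5prime"))  # [2]
```

A near-match whose only mismatch sits outside the seed is the kind of position that deserves a closer look before synthesis.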

What still needs wet-lab validation: RNA secondary structure, guide efficiency, off-target activity in complex biological samples (saliva, blood, sputum) — none of these can be predicted computationally with confidence. The paper says: these are leads. Very promising leads, but leads.

9

The Data: Everything Is Public and Verifiable

Every number in this paper links back to a downloadable file

From the paper — Data availability

All data, code, and indexes are publicly available at the LOOM project repository. The 130,000-target database, literature scan results, and the FM-index WASM binary for in-browser search are at the project URL.

What this means for trust: Every single number in this paper — every conservation percentage, every PubMed query result, every on/off-target count — comes from a machine-readable file that you can download and check yourself. The paper includes a Reviewer Verification Protocol (Table S2) that lists every claim with the exact artifact file and how to verify it.

The LOOM search engine in your browser runs on a 335 KB WebAssembly binary that performs the same FM-index search described in the paper — not a demo, the actual code. Type any sequence into the search tool and it searches across the same indexed genomes in milliseconds, with no server.
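For readers curious about what an FM-index search does, here is a toy version in Python. It is not the LOOM code or its WebAssembly binary: it builds the index by sorting suffixes directly, which is only practical for short strings, and the function names are chosen for this sketch. What it does show is the core idea, backward search, where the pattern is matched from its last letter to its first and each step narrows a range in the sorted suffix list.

```python
def build_fm_index(text: str):
    """Build a toy FM-index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in suffix_array]
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in the first i characters of the BWT.
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return suffix_array, first, occ


def find(pattern: str, index) -> list[int]:
    """Return sorted start positions of exact matches, via backward search."""
    suffix_array, first, occ = index
    if not pattern or "$" in pattern:
        return []
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find("GATTACA", index))  # [0, 7]
print(find("TTT", index))      # []
```

The search loop runs once per pattern letter regardless of how long the indexed text is, which is why this family of indexes suits very large genome collections.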

12	Pathogens scanned
130K	Target database entries
72	UNCERTAIN gaps (need review)
335 KB	WASM search binary

🔬 Live CRISPR Search Tool 📦 web/data/pubmed-scan-v4.json 📦 data/crispr_guides/novel_targets_52.json 📄 publications/papers/pangenomic-crispr-targets-merged.md

Preprint status: This paper has not yet been peer-reviewed by an independent journal. The data and methods are transparent — you can verify them — but the scientific community has not yet formally evaluated the claims. This is not clinical guidance. If you are working on diagnostics and want to build on this work, please contact the author and plan for full experimental validation before any clinical application.

52 Untouched Targets in the COVID-19 Virus: A New Way to Think About CRISPR Diagnostics