A plain-English walkthrough of the preprint "Pangenomic CRISPR Diagnostic Target Discovery" — explaining every key claim, what the numbers mean, and why this matters to you. No biology PhD required.
What this means in plain English: When a new virus appears, scientists design CRISPR diagnostic tests by building short guide RNAs, molecular "address labels" that steer the CRISPR "scissors" to specific parts of the virus's genetic sequence. The problem is that everyone gravitates to the same familiar parts (the Spike protein, the Nucleocapsid) because those were studied first.
But those surface parts of the virus are exactly what the immune system attacks. Under immune pressure, the virus mutates rapidly there to escape. The famous D614G mutation in the Spike protein is a real example: it became so dominant that a widely-used CRISPR diagnostic (SHERLOCK_S) stopped working because it targeted a version of the Spike that had mostly disappeared from circulation.
This paper identifies 52 target locations in the virus's internal replication machinery: places the CRISPR field has never studied for diagnostics, and that stayed stable across 9.2 million virus sequences worldwide.
Science is supposed to do this — find its own errors and correct them — but it usually happens very slowly through peer review of published papers, often years later. This correction happened in the same working session, before any public claim was made. We went from 43 candidate gaps to 6 confirmed ones (5 unique biological targets) by applying progressively stricter verification.
What went wrong in the first version: The original analysis searched PubMed with one simple query per gene: "[gene name] AND CRISPR". This seems reasonable but falls apart when the same gene has multiple names. Cholera's ctxA gene has also been called "hemagglutinin protease" in different papers. Search for "ctxA AND CRISPR" and you find nothing. Search for its synonyms and you find 57 papers. That's not a real gap — that's a search failure.
The v4 scanner loads 8,283 gene annotations from public databases and automatically adds all known synonyms. The RSV SH protein example is particularly striking: when only the name "SH protein" was used, zero CRISPR papers were found — a confirmed gap. When the ontology added synonyms like "small hydrophobic protein," 278 papers appeared. The gap was an illusion.
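The difference between the narrow search and the synonym-expanded search can be sketched in a few lines. This is a minimal illustration, not the v4 scanner's code: the `build_query` helper and the tiny `SYNONYMS` table are hypothetical, and only the "[gene name] AND CRISPR" query shape and the RSV SH protein example come from the text above.

```python
def build_query(names, topic="CRISPR"):
    """Combine every known name for a gene into one boolean query string."""
    # Quote each name so multi-word synonyms are searched as exact phrases.
    quoted = [f'"{name}"' for name in names]
    return f"({' OR '.join(quoted)}) AND {topic}"

# Hypothetical synonym table with a single entry, for illustration only.
SYNONYMS = {
    "SH protein": ["SH protein", "small hydrophobic protein"],
}

narrow = build_query(["SH protein"])
expanded = build_query(SYNONYMS["SH protein"])
print(narrow)
print(expanded)
```

A gene only counts as a gap if the expanded query also comes back empty; the narrow query alone cannot tell a real gap from a naming mismatch.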
What a 20-mer is: SARS-CoV-2 has a genetic code 29,903 letters long (using the letters A, T, G, C). A "20-mer" is any 20-letter stretch within that code. Starting from the beginning, you can slide a 20-letter window one position at a time and extract 29,640 overlapping windows. Each one is a potential CRISPR target address.
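The sliding window described above is easy to sketch. The `kmers` function and the toy sequence are illustrative assumptions, not the paper's code. Note that a plain window over 29,903 letters gives 29,903 − 20 + 1 = 29,884 windows, so the 29,640 figure presumably reflects some additional filtering that is not described here.

```python
def kmers(sequence, k=20):
    """Yield (start, kmer) for every k-letter window, sliding one position at a time."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]

# Made-up sequence, far shorter than a real genome.
toy_genome = "ATGCGTACGTTAGCCGATAGCTAGGCTA"
windows = list(kmers(toy_genome))

# The number of windows is always (length - k + 1).
print(len(windows), len(toy_genome) - 20 + 1)
print(windows[0])
```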
Two filters applied:
1. Cross-species conservation: Does this 20-mer appear in other viruses too? If yes, a CRISPR scissors designed against it could work across many viruses, not just one strain. 1,503 out of 29,640 windows passed this test.
2. Pangenome stability: Is this 20-mer present, exactly as written, across the 9.2 million known SARS-CoV-2 genome sequences? At 99%+ conservation, at least 99 in every 100 virus samples would trigger the CRISPR detection.
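The two filters above can be sketched as follows. This is a simplified illustration on made-up data: the function names, the 8-letter toy target, and the 0.99 threshold as a hard cutoff are assumptions, and a plain substring scan like this would be far too slow for 9.2 million real genomes.

```python
def conservation(kmer, genomes):
    """Fraction of genomes that contain the k-mer as an exact substring."""
    hits = sum(1 for genome in genomes if kmer in genome)
    return hits / len(genomes)

def passes_both_filters(kmer, other_viruses, pangenome, threshold=0.99):
    """Filter 1: seen in at least one other virus. Filter 2: conserved in the pangenome."""
    cross_species = any(kmer in genome for genome in other_viruses)
    return cross_species and conservation(kmer, pangenome) >= threshold

# Toy data: 99 genomes carry the target exactly, 1 carries a mutated copy.
target = "ACGTTGCA"
pangenome = ["GG" + target + "TT"] * 99 + ["GGACGTAGCATT"]
other_viruses = ["CC" + target + "AA"]

print(conservation(target, pangenome))
print(passes_both_filters(target, other_viruses, pangenome))
```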
What "PAM compatibility" means: CRISPR scissors (Cas9, Cas12a) need a short landing pad called a PAM sequence to bind. Targets without a PAM site can't be used — filtering these out left 89 distinct target regions.
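A PAM check might look like the sketch below. It assumes the commonly used motifs, NGG immediately after the target for SpCas9 and TTTV (V = A, C or G) immediately before the target for Cas12a, and it only looks at one strand. The text above does not say which PAM definitions the paper used, so treat these rules and function names as assumptions.

```python
import re

def has_cas9_pam(genome, start, k=20):
    """True if an NGG motif sits immediately after the k-letter window."""
    pam = genome[start + k:start + k + 3]
    return re.fullmatch(r"[ACGT]GG", pam) is not None

def has_cas12a_pam(genome, start, k=20):
    """True if a TTTV motif sits immediately before the k-letter window."""
    if start < 4:
        return False  # not enough sequence upstream for a 4-letter PAM
    pam = genome[start - 4:start]
    return re.fullmatch(r"TTT[ACG]", pam) is not None

# Toy genome: TTTA, then a 20-letter target, then TGG.
toy = "TTTA" + "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
print(has_cas9_pam(toy, 4), has_cas12a_pam(toy, 4))
print(has_cas9_pam(toy, 5), has_cas12a_pam(toy, 5))
```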
What "overlapping published guides" means: If a target region is in the same part of the genome where an existing diagnostic already works (N gene, E gene, Spike, RdRp), it's not "novel" — it's just a variant of existing work. Removing these left 52 genuinely new regions.
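The overlap filter boils down to an interval test. In this sketch the coordinates in `PUBLISHED` are placeholders, not real genome positions of any published guide, and the function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one position."""
    return a_start < b_end and b_start < a_end

# Placeholder regions standing in for areas already covered by published guides.
PUBLISHED = {
    "published_region_A": (100, 400),
    "published_region_B": (900, 1200),
}

def is_novel(start, k=20, published=PUBLISHED):
    """A k-letter target is novel if it touches none of the published regions."""
    end = start + k
    return not any(overlaps(start, end, s, e) for s, e in published.values())

print(is_novel(50))   # ends before region A begins
print(is_novel(390))  # runs into region A
```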
What ORF1a/b, nsp13, nsp14 are: These are the virus's internal factories. Nsp13 is a helicase — a molecular motor that unzips the genetic code so it can be copied. Nsp14 is an error-proofreading enzyme (ExoN = exonuclease) that checks and corrects replication errors. Without either of these, the virus cannot reproduce.
98.9–99.2% conservation means: if you collected all 9.2 million known SARS-CoV-2 genomes and checked this exact 20-letter sequence, it's present, letter-perfect, in about 99 of every 100 (the remaining 0.8–1.1% is roughly 74,000 to 101,000 genomes). Compare this to the Spike protein target used by SHERLOCK_S: 63.28%, meaning the target sequence has effectively disappeared from over a third of contemporary virus genomes.
Zero human genome hits: Every candidate was checked against the entire 3.2-billion-letter human reference genome. Not one 20-letter sequence appeared as an exact match in any human chromosome. This makes it unlikely that a test using these targets would accidentally trigger on human DNA, which is a critical safety requirement.
The paper conservatively scanned both the 52 novel targets and the published guides against their respective genome sets. There's a small (~2.4%) difference in corpus sizes between the two comparisons — the paper notes this caveat explicitly. The conceptual picture is clear: targets in mutation-constrained, functionally essential genes hold up; targets in immune-exposed surface proteins degrade.
| Guide | Type | Conservation | Verdict |
|---|---|---|---|
| Our top targets (nsp13/14) | Novel / this paper | 98.9–99.2% | Robust ✓ |
| DETECTR_E (Broughton 2020) | Diagnostic | 98.34% | Robust ✓ |
| SHERLOCK_Orf1ab (Patchsung 2020) | Diagnostic | 97.55% | Robust ✓ |
| SHERLOCK_S (Patchsung 2020) | Historical context* | 63.28% | Spike drift ✗ |
| CARVER_CoV_con1 (Freije 2019) | Antiviral | <0.01% | Complete failure ✗ |
*SHERLOCK_S targeted a Spike region that later drifted. Comparing against it would inflate apparent advantage unfairly — it is shown here to illustrate why targeting conserved regions matters, not as a benchmark.
Note: PACMAN and CARVER are antiviral systems (designed to destroy viral RNA inside cells), not diagnostics. Their conservation numbers reflect antiviral durability, not diagnostic accuracy.
What "literature gap" means: A gene where zero published CRISPR papers describe diagnostic guide design targeting that specific gene — confirmed by four independent search methods (PubMed narrow, PubMed with synonyms, Europe PMC, local paper corpus scan). These are white spaces on the scientific map.
Why Mpox genes have multiple names: Mpox (monkeypox) belongs to the poxvirus family. Poxvirus genes were named after the vaccinia virus (VACV) decades ago, and different scientific communities use different naming systems. A33R (the legacy name) is the same gene as what some papers call OPG130. J2R and "thymidine kinase" refer to the same gene (OPG101). The paper verified all naming variants returned zero CRISPR papers — the gap is real under every known name.
Why these are interesting targets: A33R, B5R, and E8L are mpox genes involved in how the virus spreads between cells and evades immune responses. HapA is Vibrio cholerae's hemagglutinin protease — involved in the bacterium's ability to colonize the intestine. None of these has ever been studied as a CRISPR detection target.
This is an important nuance. The 52 novel CRISPR target sequences are genuinely untouched — no diagnostic has used those specific genomic addresses. But in the scientific literature, papers about nsp13 and nsp14 do exist; they're just about using CRISPR as a molecular biology tool to study these enzymes, not as a diagnostic test.
So there are two types of novelty here:
Sequence novelty: The specific 20-letter DNA addresses (52 of them) have not been used as CRISPR diagnostic targets. This is fully confirmed.
Literature novelty (CRISPR diagnostics only): For SARS-CoV-2, the literature scanner cannot confirm a "zero-paper" gap for replication genes because mechanistic CRISPR papers exist. The gap is in the diagnostic application, not in any paper mentioning the gene.
nsp7 and ORF10 are the closest to true terra incognita — literally 0 PubMed results in any CRISPR context (though Europe PMC full-text co-mentions exist at high volume, keeping them UNCERTAIN rather than CONFIRMED gaps).
What "computational prediction" means: Everything in this paper was done by computer — searching databases, counting matches, running scripts. No virus was grown in a lab, no CRISPR guide was synthesized, no diagnostic test was physically run. Computational predictions can be wrong.
Updated off-target results (March 2026): The paper originally flagged that only exact-match screening had been done. We subsequently ran a full genome-wide scan allowing up to 3 mismatches against the entire 3.2-billion-letter human genome. Results for the 52 final guides:
In CRISPR practice, a single mismatch between guide and target usually costs the guide much of its cutting activity, especially if the mismatch falls within the 12 bases nearest the PAM site (the "seed region"). A human site that differs from the guide at 2 or more positions is generally considered low risk. The 3 candidates among the top 8 that have a 1-mismatch hit need those positions checked before synthesis, but this is routine.
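Checking where a mismatch falls can be sketched as below. The function names are hypothetical, and which end of the guide counts as PAM-proximal is an assumption exposed through the `pam_side` parameter: it depends on the enzyme (the PAM lies before the target for Cas12a and after it for Cas9).

```python
def mismatch_positions(guide, site):
    """Return the 0-based positions where guide and candidate site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return [i for i, (g, s) in enumerate(zip(guide, site)) if g != s]

def seed_mismatches(guide, site, seed_len=12, pam_side="5prime"):
    """Count mismatches that fall in the seed_len bases closest to the PAM."""
    n = len(guide)
    seed = range(seed_len) if pam_side == "5prime" else range(n - seed_len, n)
    return sum(1 for i in mismatch_positions(guide, site) if i in seed)

guide = "ACGTACGTACGTACGTACGT"
site = "ACTTACGTACGTACGTACGT"  # differs from the guide at position 2 only

print(mismatch_positions(guide, site))
print(seed_mismatches(guide, site, pam_side="5prime"))
print(seed_mismatches(guide, site, pam_side="3prime"))
```

The same single mismatch counts as a seed mismatch under one PAM orientation and not under the other, which is why each flagged position has to be inspected individually.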
What still needs wet-lab validation: RNA secondary structure, guide efficiency, off-target activity in complex biological samples (saliva, blood, sputum) — none of these can be predicted computationally with confidence. The paper says: these are leads. Very promising leads, but leads.
What this means for trust: Every single number in this paper — every conservation percentage, every PubMed query result, every on/off-target count — comes from a machine-readable file that you can download and check yourself. The paper includes a Reviewer Verification Protocol (Table S2) that lists every claim with the exact artifact file and how to verify it.
The LOOM search engine in your browser runs on a 335 KB WebAssembly binary that performs the same FM-index search described in the paper — not a demo, the actual code. Type any sequence into the search tool and it searches across the same indexed genomes in milliseconds, with no server.
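For readers curious what an FM-index search does, here is a toy version. It is a teaching sketch, not LOOM's code: it builds the index naively by sorting every suffix, which only works for short strings, and the function names are illustrative. The idea it shows is real, though: once the index is built, finding a pattern takes one step per pattern letter, regardless of how long the indexed text is.

```python
from collections import Counter

def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table, and occurrence counts."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for short strings, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in the text that sort before c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of times c appears in bwt[:i].
    occ = {c: [0] for c in counts}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, C, occ

def find(pattern, index):
    """Return sorted start positions of exact matches, via backward search."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = build_fm_index("GATTACAGATTACA")
print(find("ATTA", index))
print(find("TTT", index))
```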