Reference Genomes

Score guides against indexed reference genomes for off-target specificity analysis.

Overview

Genome screening identifies potential off-target binding sites by searching a reference genome (e.g., the human genome or another host organism) for sequences similar to the guide RNA. This helps ensure that diagnostic guides are specific to the intended pathogen target and won't cross-react with host sequences.

SPACER uses a minimizer-based genome index for fast seed-and-extend querying. The index stores approximately 1/w of all k-mers (default: ~20% with w=5), which shrinks the index while preserving a sensitivity guarantee: with the defaults k=12 and w=5, any exact match of length ≥ k + w − 1 = 16 bp shares at least one seed with the reference.
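The seed selection described above can be illustrated with a minimal sketch. It assumes plain lexicographic ordering of k-mers with leftmost tie-breaking; SPACER's actual ordering or hashing scheme is not documented here, and the function names are hypothetical.

```rust
/// Return (position, k-mer) for each distinct window minimizer: within every
/// run of `w` consecutive k-mers, keep the smallest one.
/// Assumption: lexicographic order, leftmost k-mer wins ties.
fn minimizers(seq: &[u8], k: usize, w: usize) -> Vec<(usize, &[u8])> {
    let mut out: Vec<(usize, &[u8])> = Vec::new();
    if k == 0 || w == 0 || seq.len() < k + w - 1 {
        return out; // sequence too short to hold one full window
    }
    let n_kmers = seq.len() - k + 1;
    for start in 0..=(n_kmers - w) {
        // Find the smallest k-mer among the w k-mers starting at `start`.
        let mut best = start;
        for i in (start + 1)..(start + w) {
            if &seq[i..i + k] < &seq[best..best + k] {
                best = i;
            }
        }
        // Neighbouring windows often share a minimizer; store it once.
        if out.last().map(|&(p, _)| p) != Some(best) {
            out.push((best, &seq[best..best + k]));
        }
    }
    out
}

/// Shortest exact match that always spans a full window, and hence a seed.
fn min_guaranteed_match(k: usize, w: usize) -> usize {
    k + w - 1
}
```

Because every window of w consecutive k-mers contributes its minimizer, any exact match long enough to contain a full window (k + w − 1 bases) selects the same seed in both the guide and the reference.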

Info

Genome screening requires a pre-built genome index. Build one from any FASTA reference genome with GenomeIndex::build_from_fasta(), or load one from a memory-mapped flat file for server deployments that need to keep resident memory (RSS) low.

Index Configuration

Parameter	Default	Description
seed_length (k)	12	K-mer length for minimizer seeds. 4¹² ≈ 16.8M possible sequences, keeping collisions rare.
window_size (w)	5	Minimizer window size. Index stores ~genome_length/w seeds. A larger w gives a smaller index but raises the minimum match length (k + w − 1) at which a shared seed is guaranteed.
max_mismatches	4	Maximum Hamming distance for off-target hit reporting. Verification terminates early when exceeded.
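
The early-terminating verification controlled by max_mismatches can be sketched as a bounded Hamming distance. This is an illustrative function with a hypothetical name, working on ASCII bases rather than SPACER's packed representation.

```rust
/// Count mismatches between a guide and a candidate site of equal length,
/// giving up as soon as the count exceeds `max_mismatches`.
/// Returns None when the site is rejected or the lengths differ.
fn bounded_hamming(guide: &[u8], site: &[u8], max_mismatches: usize) -> Option<usize> {
    if guide.len() != site.len() {
        return None;
    }
    let mut mismatches = 0;
    for (g, s) in guide.iter().zip(site.iter()) {
        if g != s {
            mismatches += 1;
            if mismatches > max_mismatches {
                return None; // early termination: the rest need not be scanned
            }
        }
    }
    Some(mismatches)
}
```

With the default of 4, a candidate site is abandoned at its fifth mismatch, so poor seed hits cost only a few comparisons.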

The genome is stored in 2-bit encoding (A=00, C=01, G=10, T=11), reducing memory by 4× compared to ASCII. Two index variants are available: InMemory for small genomes or testing, and Mmap for large reference genomes where memory-mapped I/O keeps RSS low.
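
As a rough illustration of the 2-bit encoding, the sketch below packs four bases into each byte. The bit order within a byte (first base in the low bits) and the rejection of non-ACGT characters are assumptions made for the example, not documented SPACER behaviour.

```rust
/// Map a base to its 2-bit code (A=00, C=01, G=10, T=11).
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None, // N and other IUPAC codes would need separate handling
    }
}

/// Pack a sequence four bases per byte, first base in the low bits.
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        packed[i / 4] |= encode_base(b)? << ((i % 4) * 2);
    }
    Some(packed)
}

/// Read back the base stored at index `i`.
fn base_at(packed: &[u8], i: usize) -> u8 {
    let code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
    b"ACGT"[code as usize]
}
```

Under this layout the four bases ACGT occupy a single byte (0xE4) instead of four ASCII bytes, which is where the 4× saving comes from.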

Screening Result

Each guide produces a GenomeScreeningResult containing hit counts binned by mismatch tier and a continuous specificity score:

Field	Description
hits_by_mismatch	Array of 5 hit counts: [0mm, 1mm, 2mm, 3mm, 4mm]
total_hits	Total genome hits across all tiers
closest_hit	The hit with fewest mismatches (most concerning)
pam_checked	Whether PAM adjacency filtering was applied
max_off_target_activity	ML-predicted maximum off-target activity ratio (optional, requires ML re-validation)
ml_specificity_score	ML-informed specificity score: 1.0 − max_off_target_activity (optional)
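
To make the tiered fields concrete, here is a minimal sketch of how per-hit mismatch counts could be binned. The struct is a hypothetical, simplified mirror of the documented fields (the closest hit is reduced to its mismatch count, and the ML fields are omitted); the real GenomeScreeningResult may differ.

```rust
/// Hypothetical, simplified mirror of the documented result fields.
#[derive(Debug, Default, PartialEq)]
struct ScreeningSummary {
    hits_by_mismatch: [u32; 5],     // [0mm, 1mm, 2mm, 3mm, 4mm]
    total_hits: u32,                // sum over all tiers
    closest_mismatches: Option<u8>, // stand-in for closest_hit
}

/// Bin per-hit mismatch counts into tiers, ignoring anything above 4.
fn summarize(mismatch_counts: &[u8]) -> ScreeningSummary {
    let mut summary = ScreeningSummary::default();
    for &mm in mismatch_counts {
        if let Some(bin) = summary.hits_by_mismatch.get_mut(mm as usize) {
            *bin += 1;
            summary.total_hits += 1;
            summary.closest_mismatches = Some(match summary.closest_mismatches {
                Some(best) => best.min(mm),
                None => mm,
            });
        }
    }
    summary
}
```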

Genome Specificity Score

The specificity score is a value in [0, 1] derived from the closest off-target hit, that is, the hit with the fewest mismatches. Higher values mean the closest genome match is more distant (better specificity). The heuristic score is assigned by mismatch tier:

Closest Hit	Specificity Score	Interpretation
0 mm (exact match)	0.00	Guide is self-targeting — worst case
1 mm	0.15	Very likely off-target cleavage
2 mm	0.35	Moderate off-target risk
3 mm	0.60	Low off-target risk
4 mm	0.80	Minimal off-target risk
No hits	1.00	Clean guide — no matches within threshold
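
The tier table amounts to a simple lookup. A minimal sketch follows, with a hypothetical function name; the handling of mismatch counts above 4 is an assumption, since such hits fall outside the default reporting threshold.

```rust
/// Heuristic specificity from the fewest mismatches among genome hits,
/// using the tier values from the table. `None` means no hits were found.
fn tier_specificity(closest_mismatches: Option<u8>) -> f64 {
    match closest_mismatches {
        None => 1.00,
        Some(0) => 0.00,
        Some(1) => 0.15,
        Some(2) => 0.35,
        Some(3) => 0.60,
        Some(4) => 0.80,
        // Assumption: hits beyond the 4-mismatch threshold are not reported,
        // so they are treated like "no hits".
        Some(_) => 1.00,
    }
}
```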

When ML re-validation data is available, the specificity_score_ml_aware() method returns the ML-informed score (based on predicted off-target activity ratios) instead of the heuristic tier-based score.
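
The fallback behaviour might look like the sketch below. It is a hypothetical free function, not the signature of specificity_score_ml_aware(), and the clamp to [0, 1] is an added assumption.

```rust
/// Prefer the ML-informed score when a predicted activity ratio is available,
/// otherwise fall back to the tier-based heuristic score.
fn specificity_ml_aware(heuristic_score: f64, max_off_target_activity: Option<f64>) -> f64 {
    match max_off_target_activity {
        // ML-informed score: 1.0 minus the maximum predicted activity ratio.
        // Clamping is an assumption, guarding against ratios outside [0, 1].
        Some(ratio) => (1.0 - ratio).clamp(0.0, 1.0),
        None => heuristic_score,
    }
}
```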

Assay Score Integration

When genome specificity data is present, SPACER automatically activates the with_specificity() weight preset. This rebalances assay score weights to incorporate the genome_specificity component:

Component	Default Weight	with_specificity() Weight
ml_activity	0.30	0.25
heuristic_quality	0.10	0.08
spacer_structure	0.10	0.08
amplicon_fit	0.05	0.04
coverage	0.25	0.20
genome_specificity	0.00	0.08
msa_specificity	0.00	0.07

The specificity score feeds directly into the genome_specificity component of the Assay Score. Components with zero weight in the default preset (genome and MSA specificity) are only activated when the corresponding screening data is provided.
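
The two presets can be compared with a small sketch that copies the weights from the table. The function names are hypothetical, and treating each component's contribution as weight × score is an assumption about how the Assay Score is combined.

```rust
/// (component, default weight, with_specificity() weight), from the table.
const WEIGHTS: [(&str, f64, f64); 7] = [
    ("ml_activity", 0.30, 0.25),
    ("heuristic_quality", 0.10, 0.08),
    ("spacer_structure", 0.10, 0.08),
    ("amplicon_fit", 0.05, 0.04),
    ("coverage", 0.25, 0.20),
    ("genome_specificity", 0.00, 0.08),
    ("msa_specificity", 0.00, 0.07),
];

/// Look up one component's weight under the chosen preset (0.0 if unknown).
fn weight(component: &str, with_specificity: bool) -> f64 {
    WEIGHTS
        .iter()
        .find(|&&(name, _, _)| name == component)
        .map(|&(_, base, spec)| if with_specificity { spec } else { base })
        .unwrap_or(0.0)
}

/// Sum of the listed weights for one preset.
fn listed_total(with_specificity: bool) -> f64 {
    WEIGHTS
        .iter()
        .map(|&(_, base, spec)| if with_specificity { spec } else { base })
        .sum()
}

/// Assumed contribution of genome specificity: weight times score.
fn genome_contribution(specificity: f64, with_specificity: bool) -> f64 {
    weight("genome_specificity", with_specificity) * specificity
}
```

The listed weights sum to 0.80 in both presets, so the rebalancing moves 0.15 from the existing components to the two specificity components; the remaining 0.20 presumably belongs to components not shown in this table.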

PAM Adjacency Filtering

For PAM-dependent enzymes (Cas12 family), genome hits can optionally be filtered by PAM adjacency. A hit is only reported if a valid PAM sequence exists at the correct position relative to the match site. This increases biological relevance by excluding hits that would not be targetable by the enzyme.
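
As an illustration, the sketch below checks for the canonical Cas12a PAM, TTTV (V = A, C, or G), immediately 5' of a forward-strand match. The function name and coordinate convention are assumptions; other Cas12 enzymes recognize different PAMs, and a complete filter would also test reverse-strand hits against the reverse complement.

```rust
/// True if a TTTV PAM sits in the four bases immediately upstream (5') of the
/// match starting at `hit_start`. Forward strand only, uppercase ASCII bases.
fn has_tttv_pam(genome: &[u8], hit_start: usize) -> bool {
    if hit_start < 4 || hit_start > genome.len() {
        return false; // not enough upstream sequence to hold a PAM
    }
    let pam = &genome[hit_start - 4..hit_start];
    pam[..3].iter().all(|&b| b == b'T') && matches!(pam[3], b'A' | b'C' | b'G')
}
```

A hit that passes the mismatch threshold but fails this check would be dropped when PAM filtering is enabled.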

Common Use Cases

Scenario	Genome Index	Purpose
Pathogen detection assay	Human reference (GRCh38)	Ensure guides don't match human sequences
Agricultural diagnostics	Host plant genome	Avoid cross-reactivity with host plant sequences
Variant-specific detection	Related pathogen genome	Confirm guides don’t match closely related species