Reference Genomes
Score guides against indexed reference genomes for off-target specificity analysis.
Overview
Genome screening identifies potential off-target binding sites by searching a reference genome (e.g., human, host organism) for sequences similar to the guide RNA. This ensures that diagnostic guides are specific to the intended pathogen target and won't cross-react with host sequences.
SPACER uses a minimizer-based genome index for fast seed-and-extend querying. The index stores approximately 1/w of all k-mers (default: ~20% with w=5), which shrinks the index while preserving a sensitivity guarantee: with the defaults k=12 and w=5, any exact match of length ≥ k + w − 1 = 16 bp shares at least one seed with the reference.
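The window guarantee can be illustrated with a minimal sketch of minimizer selection. This is not SPACER's implementation: the function name `minimizers` is hypothetical, and plain lexicographic ordering of k-mers is an assumption made for readability (production indexes typically order k-mers by a hash).

```rust
/// Return (position, k-mer) minimizers of `seq`: for every window of `w`
/// consecutive k-mers, keep the lexicographically smallest one.
/// Hypothetical helper; lexicographic ordering is an assumption.
fn minimizers(seq: &[u8], k: usize, w: usize) -> Vec<(usize, &[u8])> {
    let mut out: Vec<(usize, &[u8])> = Vec::new();
    // A full window spans k + w - 1 bases; shorter input yields no seed.
    if k == 0 || w == 0 || seq.len() < k + w - 1 {
        return out;
    }
    let n_kmers = seq.len() - k + 1;
    for start in 0..=(n_kmers - w) {
        // Smallest k-mer in this window; ties resolve to the leftmost.
        let best = (start..start + w)
            .min_by_key(|&i| &seq[i..i + k])
            .unwrap();
        // Adjacent windows often select the same k-mer; store it once.
        if out.last().map(|&(p, _)| p) != Some(best) {
            out.push((best, &seq[best..best + k]));
        }
    }
    out
}
```

Because every window of w consecutive k-mers contributes its minimum, two sequences that share an exact stretch of k + w − 1 bases share a full window and therefore select the same seed from it.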
The index can be built with GenomeIndex::build_from_fasta(), or loaded from a memory-mapped flat file for server deployments with low RSS requirements.
Index Configuration
| Parameter | Default | Description |
|---|---|---|
| seed_length (k) | 12 | K-mer length for minimizer seeds. 4¹² = 16M possible sequences, keeping collisions rare. |
| window_size (w) | 5 | Minimizer window size. Index stores ~genome_length/w seeds. Larger w = smaller index but higher minimum alignment length. |
| max_mismatches | 4 | Maximum Hamming distance for off-target hit reporting. Verification terminates early when exceeded. |
The genome is stored in 2-bit encoding (A=00, C=01, G=10, T=11), reducing memory by 4× compared to ASCII. Two index variants are available: InMemory for small genomes or testing, and Mmap for large reference genomes where memory-mapped I/O keeps RSS low.
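The 2-bit encoding can be sketched as follows. The base-to-code mapping comes from the text above; the function names, the choice to place the first base in the low bits of each byte, and the rejection of non-ACGT symbols are assumptions for illustration only.

```rust
/// Map a base to its 2-bit code (A=00, C=01, G=10, T=11).
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None, // N and other ambiguity codes would need separate handling
    }
}

/// Pack a sequence four bases per byte, first base in the low bits
/// (bit order is an assumption of this sketch).
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        packed[i / 4] |= encode_base(b)? << ((i % 4) * 2);
    }
    Some(packed)
}

/// Read base `i` back out of a packed buffer as an ASCII letter.
fn base_at(packed: &[u8], i: usize) -> u8 {
    let code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
    b"ACGT"[code as usize]
}
```

Four bases fit in one byte instead of four, which is where the 4× reduction relative to ASCII comes from.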
Screening Result
Each guide produces a GenomeScreeningResult containing hit counts binned by mismatch tier and a continuous specificity score:
| Field | Description |
|---|---|
| hits_by_mismatch | Array of 5 hit counts: [0mm, 1mm, 2mm, 3mm, 4mm] |
| total_hits | Total genome hits across all tiers |
| closest_hit | The hit with fewest mismatches (most concerning) |
| pam_checked | Whether PAM adjacency filtering was applied |
| max_off_target_activity | ML-predicted maximum off-target activity ratio (optional, requires ML re-validation) |
| ml_specificity_score | ML-informed specificity score: 1.0 − max_activity_ratio (optional) |
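As a rough sketch of how hits_by_mismatch could be filled, the code below counts candidate sites per mismatch tier using a Hamming distance that stops as soon as the limit is exceeded. The function names and the idea of passing candidate sites as byte slices are hypothetical; only the five-tier layout and the early-termination behavior are taken from this page.

```rust
/// Hamming distance between two equal-length sequences, or None as soon
/// as the count exceeds `max_mismatches` (early termination).
fn bounded_hamming(a: &[u8], b: &[u8], max_mismatches: usize) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    let mut mm = 0;
    for (x, y) in a.iter().zip(b) {
        if x != y {
            mm += 1;
            if mm > max_mismatches {
                return None; // stop verifying this candidate
            }
        }
    }
    Some(mm)
}

/// Count candidate sites per tier: [0mm, 1mm, 2mm, 3mm, 4mm].
fn bin_hits(guide: &[u8], candidates: &[&[u8]]) -> [u32; 5] {
    let mut bins = [0u32; 5];
    for &site in candidates {
        if let Some(mm) = bounded_hamming(guide, site, 4) {
            bins[mm] += 1;
        }
    }
    bins
}
```

In this sketch total_hits would be the sum of the five bins, and the closest hit falls in the first non-empty bin.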
Genome Specificity Score
The heuristic specificity score is a value in [0, 1] determined by the mismatch count of the closest off-target hit. Higher values indicate that the nearest genome match is more distant (better specificity):
| Closest Hit | Specificity Score | Interpretation |
|---|---|---|
| 0 mm (exact match) | 0.00 | Guide is self-targeting — worst case |
| 1 mm | 0.15 | Very likely off-target cleavage |
| 2 mm | 0.35 | Moderate off-target risk |
| 3 mm | 0.60 | Low off-target risk |
| 4 mm | 0.80 | Minimal off-target risk |
| No hits | 1.00 | Clean guide — no matches within threshold |
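The tier table translates directly into a lookup. The sketch below uses hypothetical function names; the score values are the ones listed above, and treating a mismatch count above 4 as "no hit" is an assumption that follows from the default max_mismatches.

```rust
/// Heuristic specificity score from the mismatch count of the closest hit.
/// `None` means no hit within the mismatch threshold.
fn tier_specificity(closest_mismatches: Option<usize>) -> f64 {
    match closest_mismatches {
        None => 1.00,
        Some(0) => 0.00,
        Some(1) => 0.15,
        Some(2) => 0.35,
        Some(3) => 0.60,
        Some(4) => 0.80,
        // Assumption: hits beyond 4 mismatches are never reported,
        // so they are scored like a clean guide.
        Some(_) => 1.00,
    }
}

/// Index of the first non-empty tier in a hits_by_mismatch array, if any.
fn closest_tier(hits_by_mismatch: &[u32; 5]) -> Option<usize> {
    hits_by_mismatch.iter().position(|&n| n > 0)
}
```

For example, a guide with hit counts [0, 0, 3, 1, 7] has its closest hit at 2 mismatches and scores 0.35, regardless of how many weaker hits it has.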
When ML re-validation data is available, the specificity_score_ml_aware() method returns the ML-informed score (based on predicted off-target activity ratios) instead of the heuristic tier-based score.
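The fallback behavior described here can be sketched with an Option. This is not the body of specificity_score_ml_aware(); the function names are hypothetical, and clamping the result to [0, 1] is an assumption added so the sketch stays in the documented score range.

```rust
/// ML-informed score: 1.0 - max_activity_ratio.
/// Clamping to [0, 1] is an assumption of this sketch.
fn ml_specificity(max_off_target_activity: f64) -> f64 {
    (1.0 - max_off_target_activity).clamp(0.0, 1.0)
}

/// Prefer the ML-informed score when ML data is present;
/// otherwise fall back to the heuristic tier-based score.
fn specificity_ml_aware(heuristic: f64, max_off_target_activity: Option<f64>) -> f64 {
    max_off_target_activity
        .map(ml_specificity)
        .unwrap_or(heuristic)
}
```

With a heuristic score of 0.35 and a predicted maximum off-target activity ratio of 0.25, the sketch returns 0.75; without ML data it returns 0.35.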
Assay Score Integration
When genome specificity data is present, SPACER automatically activates the with_specificity() weight preset. This rebalances assay score weights to incorporate the genome_specificity component:
| Component | Default Weight | with_specificity() Weight |
|---|---|---|
| ml_activity | 0.30 | 0.25 |
| heuristic_quality | 0.10 | 0.08 |
| spacer_structure | 0.10 | 0.08 |
| amplicon_fit | 0.05 | 0.04 |
| coverage | 0.25 | 0.20 |
| genome_specificity | 0.00 | 0.08 |
| msa_specificity | 0.00 | 0.07 |
The specificity score feeds directly into the genome_specificity component of the Assay Score. Components with zero weight in the default preset (genome and MSA specificity) are only activated when the corresponding screening data is provided.
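The preset switch can be sketched as a pair of weight arrays. The constants below copy the table above; the function names are hypothetical, and combining components as a plain weighted sum is an assumption. Note that the listed weights add up to 0.80 in both columns, so any remaining weight presumably belongs to components not shown on this page.

```rust
/// Listed weights in table order: ml_activity, heuristic_quality,
/// spacer_structure, amplicon_fit, coverage, genome_specificity,
/// msa_specificity.
const DEFAULT_WEIGHTS: [f64; 7] = [0.30, 0.10, 0.10, 0.05, 0.25, 0.00, 0.00];
const WITH_SPECIFICITY_WEIGHTS: [f64; 7] = [0.25, 0.08, 0.08, 0.04, 0.20, 0.08, 0.07];

/// Pick the preset: specificity weights apply only when screening data exists.
fn select_weights(has_specificity_data: bool) -> [f64; 7] {
    if has_specificity_data {
        WITH_SPECIFICITY_WEIGHTS
    } else {
        DEFAULT_WEIGHTS
    }
}

/// Weighted sum over the listed components (scores assumed to lie in [0, 1]).
/// A plain weighted sum is an assumption of this sketch.
fn weighted_sum(weights: &[f64; 7], scores: &[f64; 7]) -> f64 {
    weights.iter().zip(scores).map(|(w, s)| w * s).sum()
}
```

Under the default preset the genome_specificity score has no effect on the result, because its weight is zero.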
PAM Adjacency Filtering
For PAM-dependent enzymes (Cas12 family), genome hits can optionally be filtered by PAM adjacency. A hit is only reported if a valid PAM sequence exists at the correct position relative to the match site. This increases biological relevance by excluding hits that would not be targetable by the enzyme.
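A minimal sketch of such a filter is shown below. It assumes the canonical Cas12a PAM, TTTV (V = A, C or G), located immediately 5' of the match on the forward strand, and uppercase input; the PAM definitions SPACER actually supports, its strand handling, and the function names used here are not taken from this page.

```rust
/// True if a TTTV PAM (V = A, C or G) sits immediately 5' of the hit
/// that starts at `hit_start` on the forward strand of `genome`.
/// Assumes uppercase bases; the reverse strand is not checked here.
fn has_tttv_pam(genome: &[u8], hit_start: usize) -> bool {
    if hit_start < 4 || hit_start > genome.len() {
        return false; // not enough upstream sequence for a 4-nt PAM
    }
    let pam = &genome[hit_start - 4..hit_start];
    pam[..3].iter().all(|&b| b == b'T') && matches!(pam[3], b'A' | b'C' | b'G')
}

/// Keep only hit positions that have an adjacent PAM.
fn filter_by_pam(genome: &[u8], hits: &[usize]) -> Vec<usize> {
    hits.iter()
        .copied()
        .filter(|&h| has_tttv_pam(genome, h))
        .collect()
}
```

A hit preceded by TTTT is dropped in this sketch, since T is not a valid base at the V position.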
Common Use Cases
| Scenario | Genome Index | Purpose |
|---|---|---|
| Pathogen detection assay | Human reference (GRCh38) | Ensure guides don’t match human transcriptome |
| Agricultural diagnostics | Host plant genome | Avoid cross-reactivity with plant RNA |
| Variant-specific detection | Related pathogen genome | Confirm guides don’t match closely related species |