
Reference Genomes

Score guides against indexed reference genomes for off-target specificity analysis.

Overview

Genome screening identifies potential off-target binding sites by searching a reference genome (e.g., human, host organism) for sequences similar to the guide RNA. This helps confirm that diagnostic guides are specific to the intended pathogen target and are unlikely to cross-react with host sequences.

SPACER uses a minimizer-based genome index for fast seed-and-extend querying. The index stores approximately 1/w of all k-mers (about 20% at the default w=5), which shrinks the index while preserving a sensitivity guarantee: with the default k=12, any exact match of length ≥ k + w − 1 = 16 bp shares at least one seed with the reference.
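
The seeding guarantee above can be illustrated with a minimal sketch of window minimizers. This is not SPACER's implementation: the function names are hypothetical, and the ordering used to pick the "smallest" k-mer (plain 2-bit lexicographic order, leftmost on ties) is an assumption; production indexes often use a hash ordering instead.

```rust
/// Encode a k-mer (A/C/G/T only, k <= 32) as a 2-bit packed integer.
/// Returns None if the k-mer contains any other symbol (for example N).
fn encode_kmer(kmer: &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for &base in kmer {
        let code = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        value = (value << 2) | code;
    }
    Some(value)
}

/// Return (position, k-mer value) for the minimizer of every window of
/// `w` consecutive k-mers. Adjacent windows that pick the same k-mer
/// are reported once.
fn minimizers(seq: &[u8], k: usize, w: usize) -> Vec<(usize, u64)> {
    let mut out: Vec<(usize, u64)> = Vec::new();
    if k == 0 || w == 0 || seq.len() < k + w - 1 {
        return out;
    }
    let kmers: Vec<Option<u64>> = seq.windows(k).map(encode_kmer).collect();
    for start in 0..=(kmers.len() - w) {
        // Smallest valid k-mer in the window; ties resolve to the leftmost.
        let best = kmers[start..start + w]
            .iter()
            .copied()
            .enumerate()
            .filter_map(|(i, v)| v.map(|val| (val, start + i)))
            .min();
        if let Some((val, pos)) = best {
            if out.last() != Some(&(pos, val)) {
                out.push((pos, val));
            }
        }
    }
    out
}
```

An exact match of k + w − 1 bases contains exactly w consecutive k-mers, that is, one full window. Both sequences therefore select the same minimizer from that window, which is why two sequences sharing a 16 bp exact stretch (k=12, w=5) always share at least one seed.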

Info
Genome screening requires a pre-built genome index. The index can be constructed from any FASTA reference genome using GenomeIndex::build_from_fasta(), or loaded from a memory-mapped flat file for server deployments where resident memory (RSS) must stay low.

Index Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| seed_length (k) | 12 | K-mer length for minimizer seeds. 4¹² ≈ 16.8M possible sequences, keeping collisions rare. |
| window_size (w) | 5 | Minimizer window size. The index stores ~genome_length/w seeds. A larger w gives a smaller index but a longer minimum exact-match length (k + w − 1) for guaranteed seeding. |
| max_mismatches | 4 | Maximum Hamming distance for off-target hit reporting. Verification terminates early when exceeded. |
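
The early-terminating verification described for max_mismatches can be sketched as a bounded Hamming distance. The function name and signature are hypothetical and not taken from the SPACER source; the sketch only shows how counting can stop as soon as the threshold is exceeded.

```rust
/// Count mismatches between a guide and an equally long candidate site.
/// Returns None as soon as the count exceeds `max_mismatches`, so
/// hopeless candidates are abandoned without scanning the full length.
fn bounded_hamming(guide: &[u8], site: &[u8], max_mismatches: usize) -> Option<usize> {
    if guide.len() != site.len() {
        return None;
    }
    let mut mismatches = 0;
    for (g, s) in guide.iter().zip(site) {
        if g != s {
            mismatches += 1;
            if mismatches > max_mismatches {
                return None; // early termination
            }
        }
    }
    Some(mismatches)
}
```

With the default threshold of 4, a candidate site is rejected at its fifth mismatch, and any site that survives carries the mismatch count needed for tier binning.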

The genome is stored in 2-bit encoding (A=00, C=01, G=10, T=11), reducing memory by 4× compared to ASCII. Two index variants are available: InMemory for small genomes or testing, and Mmap for large reference genomes where memory-mapped I/O keeps RSS low.
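
As a rough illustration of the 2-bit encoding, the sketch below packs four bases into each byte using the codes listed above. The function names and the bit order within a byte (first base in the lowest two bits) are assumptions, and ambiguous bases such as N are simply rejected here because the text does not say how SPACER handles them.

```rust
/// Pack an A/C/G/T sequence into 2 bits per base (A=00, C=01, G=10, T=11).
/// Returns None if the sequence contains any other symbol.
fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &base) in seq.iter().enumerate() {
        let code: u8 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        // Base i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i / 4.
        packed[i / 4] |= code << (2 * (i % 4));
    }
    Some(packed)
}

/// Read back the base at index `i` from a packed sequence.
fn base_at(packed: &[u8], i: usize) -> u8 {
    let code = (packed[i / 4] >> (2 * (i % 4))) & 0b11;
    b"ACGT"[code as usize]
}
```

Eight bases fit in two bytes instead of eight, which is the 4× reduction relative to one ASCII byte per base.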

Screening Result

Each guide produces a GenomeScreeningResult containing hit counts binned by mismatch tier and a continuous specificity score:

| Field | Description |
| --- | --- |
| hits_by_mismatch | Array of 5 hit counts: [0mm, 1mm, 2mm, 3mm, 4mm] |
| total_hits | Total genome hits across all tiers |
| closest_hit | The hit with fewest mismatches (most concerning) |
| pam_checked | Whether PAM adjacency filtering was applied |
| max_off_target_activity | ML-predicted maximum off-target activity ratio (optional, requires ML re-validation) |
| ml_specificity_score | ML-informed specificity score: 1.0 − max_activity_ratio (optional) |
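
To make the tier binning concrete, here is a simplified sketch that tallies per-hit mismatch counts into the first three fields. ScreeningSummary and summarize are hypothetical names, not the real GenomeScreeningResult; in particular, closest_hit is reduced to a mismatch count here, whereas the actual field holds the hit itself.

```rust
/// Simplified stand-in for the screening result (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct ScreeningSummary {
    hits_by_mismatch: [u32; 5], // counts for 0, 1, 2, 3 and 4 mismatches
    total_hits: u32,
    closest_hit: Option<usize>, // mismatch count of the closest hit
}

/// Bin verified hits by mismatch count, ignoring anything beyond 4 mismatches.
fn summarize(hit_mismatches: &[usize]) -> ScreeningSummary {
    let mut summary = ScreeningSummary::default();
    for &mm in hit_mismatches {
        if mm < summary.hits_by_mismatch.len() {
            summary.hits_by_mismatch[mm] += 1;
            summary.total_hits += 1;
            summary.closest_hit = Some(summary.closest_hit.map_or(mm, |c| c.min(mm)));
        }
    }
    summary
}
```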

Genome Specificity Score

The heuristic specificity score is a value in [0, 1] determined by the mismatch count of the closest off-target hit. Higher values mean the closest genome match is more distant from the guide (better specificity):

| Closest Hit | Specificity Score | Interpretation |
| --- | --- | --- |
| 0 mm (exact match) | 0.00 | Guide matches the reference exactly (worst case) |
| 1 mm | 0.15 | Very likely off-target cleavage |
| 2 mm | 0.35 | Moderate off-target risk |
| 3 mm | 0.60 | Low off-target risk |
| 4 mm | 0.80 | Minimal off-target risk |
| No hits | 1.00 | Clean guide: no matches within threshold |
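
The tier table can be expressed as a small lookup over hits_by_mismatch. The function name is hypothetical; only the score values come from the table above.

```rust
/// Map the closest populated mismatch tier to the heuristic score.
/// `hits_by_mismatch[i]` is the number of genome hits with i mismatches.
fn tier_specificity_score(hits_by_mismatch: &[u32; 5]) -> f64 {
    const TIER_SCORES: [f64; 5] = [0.00, 0.15, 0.35, 0.60, 0.80];
    hits_by_mismatch
        .iter()
        .position(|&count| count > 0) // first non-empty tier = closest hit
        .map_or(1.0, |tier| TIER_SCORES[tier])
}
```

Because only the closest tier matters, a guide with hit counts [0, 0, 2, 5, 9] scores 0.35 regardless of how many 3 mm and 4 mm hits it has.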

When ML re-validation data is available, the specificity_score_ml_aware() method returns the ML-informed score (based on predicted off-target activity ratios) instead of the heuristic tier-based score.
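
A minimal sketch of that fallback logic, assuming the ML score is 1.0 − max_activity_ratio as listed in the result fields. The free function below is hypothetical (the real specificity_score_ml_aware() is a method whose signature is not shown here), and clamping the result to [0, 1] is an added assumption.

```rust
/// Prefer the ML-informed score when an off-target activity ratio is
/// available; otherwise fall back to the heuristic tier-based score.
fn ml_aware_score(heuristic_score: f64, max_off_target_activity: Option<f64>) -> f64 {
    match max_off_target_activity {
        // Clamping is an assumption, guarding against ratios outside [0, 1].
        Some(ratio) => (1.0 - ratio).clamp(0.0, 1.0),
        None => heuristic_score,
    }
}
```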

Assay Score Integration

When genome specificity data is present, SPACER automatically activates the with_specificity() weight preset. This rebalances assay score weights to incorporate the genome_specificity component:

| Component | Default Weight | with_specificity() Weight |
| --- | --- | --- |
| ml_activity | 0.30 | 0.25 |
| heuristic_quality | 0.10 | 0.08 |
| spacer_structure | 0.10 | 0.08 |
| amplicon_fit | 0.05 | 0.04 |
| coverage | 0.25 | 0.20 |
| genome_specificity | 0.00 | 0.08 |
| msa_specificity | 0.00 | 0.07 |

The specificity score feeds directly into the genome_specificity component of the Assay Score. Components with zero weight in the default preset (genome and MSA specificity) are only activated when the corresponding screening data is provided.
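
The effect of the two presets can be checked with a short sketch. It assumes the Assay Score is a plain weighted sum of component scores, which the text implies but does not state; the constant and function names are hypothetical. Note that the weights listed in the table sum to 0.80 in both presets, so the full score presumably includes components not shown on this page.

```rust
// Weights copied from the table above.
const DEFAULT_WEIGHTS: [(&str, f64); 7] = [
    ("ml_activity", 0.30),
    ("heuristic_quality", 0.10),
    ("spacer_structure", 0.10),
    ("amplicon_fit", 0.05),
    ("coverage", 0.25),
    ("genome_specificity", 0.00),
    ("msa_specificity", 0.00),
];

const WITH_SPECIFICITY_WEIGHTS: [(&str, f64); 7] = [
    ("ml_activity", 0.25),
    ("heuristic_quality", 0.08),
    ("spacer_structure", 0.08),
    ("amplicon_fit", 0.04),
    ("coverage", 0.20),
    ("genome_specificity", 0.08),
    ("msa_specificity", 0.07),
];

/// Sum of all weights in a preset.
fn weight_total(weights: &[(&str, f64)]) -> f64 {
    weights.iter().map(|&(_, w)| w).sum()
}

/// Weighted sum over the components present in `scores` (assumed formula).
/// Components missing from `scores` contribute nothing.
fn weighted_sum(weights: &[(&str, f64)], scores: &[(&str, f64)]) -> f64 {
    let mut total = 0.0;
    for &(name, weight) in weights {
        for &(score_name, score) in scores {
            if score_name == name {
                total += weight * score;
            }
        }
    }
    total
}
```

Under the default preset a genome_specificity score has no effect on the sum, while under with_specificity() it contributes 0.08 times its value.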

PAM Adjacency Filtering

For PAM-dependent enzymes (Cas12 family), genome hits can optionally be filtered by PAM adjacency. A hit is only reported if a valid PAM sequence exists at the correct position relative to the match site. This increases biological relevance by excluding hits that would not be targetable by the enzyme.
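
A PAM adjacency check can be sketched as follows, assuming a Cas12a-style PAM such as TTTV located immediately 5' of the match site on the same strand. The function names are hypothetical, only the IUPAC codes V and N are handled, and reverse-strand hits are left out for brevity; the PAM and its position for a given enzyme should be taken from the PAM Sequences page.

```rust
/// Does a genome base satisfy a PAM symbol? Supports A/C/G/T, V (not T) and N.
fn iupac_matches(code: u8, base: u8) -> bool {
    match code {
        b'A' | b'C' | b'G' | b'T' => code == base,
        b'V' => matches!(base, b'A' | b'C' | b'G'),
        b'N' => matches!(base, b'A' | b'C' | b'G' | b'T'),
        _ => false,
    }
}

/// True if the bases immediately upstream (5') of `hit_start` match `pam`.
fn has_upstream_pam(genome: &[u8], hit_start: usize, pam: &[u8]) -> bool {
    if hit_start < pam.len() || hit_start > genome.len() {
        return false; // not enough sequence upstream of the hit
    }
    let window = &genome[hit_start - pam.len()..hit_start];
    pam.iter().zip(window).all(|(&p, &b)| iupac_matches(p, b))
}
```

A hit that fails this check would be dropped before tier binning, since the enzyme could not target that site.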

Common Use Cases

| Scenario | Genome Index | Purpose |
| --- | --- | --- |
| Pathogen detection assay | Human reference (GRCh38) | Ensure guides don’t match human genomic sequence |
| Agricultural diagnostics | Host plant genome | Avoid cross-reactivity with plant RNA |
| Variant-specific detection | Related pathogen genome | Confirm guides don’t match closely related species |

SPACER

Open-source CRISPR guide RNA design and scoring for Cas12 and Cas13 diagnostic systems.

Developed at Fiocruz Parana, Instituto Carlos Chagas

© 2026 SPACER · v0.1.0
hwalflor (GitHub)