BADGERS Optimizer

An evolutionary algorithm that generates optimized, potentially novel Cas13a spacer sequences from multiple sequence alignments.

Overview

Unlike standard spacer finding, which identifies and scores subsequences already present in the input, the BADGERS optimizer generates novel spacer sequences that may not exist in any natural sequence. It uses the ADAPT CNN models as a frozen fitness oracle and applies an evolutionary algorithm to maximize spacer activity across target sequence diversity.

The algorithm is based on Mantena et al., Nature Biotechnology 2024. Input is a multiple sequence alignment (MSA) of pathogen variants in FASTA format. All spacer sequences are 28 nt (Cas13a).

Warning

The optimizer can generate synthetic sequences that do not occur in the input alignment. These spacers are computationally designed and should be experimentally validated before use in diagnostic assays.

Two Objective Modes

The optimizer supports two distinct objectives, each with its own fitness function and default hyperparameters.

Mode	Use Case	Fitness Objective
Multi-target detection	Detect all variants of a pathogen	Maximize frequency-weighted mean activity across all targets
Variant identification	Distinguish variant A from variant B	Maximize on-target activity while minimizing off-target activity (sigmoidal cost)

Workflow

The optimizer processes each eligible site in the MSA through a five-step pipeline.

Step	Operation	Output
1. Extract sites	Slide a 48 nt window (10 nt flanking + 28 nt spacer + 10 nt flanking) across the MSA. Keep positions where ≥80% of sequences have valid ACGT-only windows.	Vec<GenomicSite> — one per eligible position
2. Build fitness	Construct a MultiTargetFitness or VariantIdFitness evaluator wrapping the ADAPT predictor and target set for the site.	Fitness function for this site
3. Evolve	Initialize population via Boltzmann sampling from seed sequences. Each generation: sample parents, mutate, replace worst. Repeat until evaluation budget is exhausted. Optional local search around top spacers.	OptimizationResult with ranked population
4. Diversity filter	Greedily remove spacers within a Hamming distance threshold of a higher-fitness spacer.	Deduplicated spacer set
5. Score & return	Convert evolutionary fitness to ScoredSpacerCandidate with full quality flags, assay score (using for_optimizer() weights), and tier classification.	SiteOptimResult per site, aggregated into OptimizerOutput
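
Step 1 can be sketched as follows. This is a minimal illustration, not the optimizer's actual code: the names eligible_sites and is_valid_window are hypothetical, the alignment is assumed to be a slice of equal-length byte rows, and gap or ambiguity characters are assumed to make a window invalid.

```rust
/// Window layout from step 1: 10 nt flank + 28 nt spacer + 10 nt flank.
const FLANK: usize = 10;
const SPACER: usize = 28;
const WINDOW: usize = FLANK + SPACER + FLANK; // 48 nt

/// True if every base in the window is an unambiguous A, C, G, or T.
fn is_valid_window(window: &[u8]) -> bool {
    window.iter().all(|b| matches!(b, b'A' | b'C' | b'G' | b'T'))
}

/// Return the start columns whose 48 nt window is valid in at least
/// `min_valid_frac` of the aligned rows (0.8 in the workflow table).
/// Hypothetical helper: assumes all rows share the alignment length.
fn eligible_sites(msa: &[&[u8]], min_valid_frac: f64) -> Vec<usize> {
    let aln_len = match msa.first() {
        Some(row) if row.len() >= WINDOW => row.len(),
        _ => return Vec::new(),
    };
    (0..=aln_len - WINDOW)
        .filter(|&start| {
            // Count rows whose window at this column is ACGT-only.
            let valid = msa
                .iter()
                .filter(|row| row.get(start..start + WINDOW).map_or(false, is_valid_window))
                .count();
            valid as f64 / msa.len() as f64 >= min_valid_frac
        })
        .collect()
}
```

Each returned start column would correspond to one GenomicSite. A single ambiguous base in more than 20% of the rows is enough to exclude every window that overlaps it.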

Fitness Functions

Multi-Target Detection

Maximizes expected Cas13a activity across all sequence variants. The fitness of a spacer is the frequency-weighted average of its combined activity against all unique targets:

fitness(spacer) = Σ(freq_t × combined_activity(spacer, target_t))

Where combined_activity = classify_prob × (regression + 4.0) − 4.0. This joint classification-regression score is the ADAPT model's native output format.
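
The formula above can be written out as a short sketch. The Target struct and its field names are illustrative assumptions, and the classifier and regression outputs are passed in as plain numbers rather than computed by the ADAPT predictor.

```rust
/// One unique target and its predicted scores for a given spacer.
/// Illustrative type, not the library's own.
struct Target {
    freq: f64,          // fraction of sequences carrying this target
    classify_prob: f64, // classifier probability for (spacer, target)
    regression: f64,    // regression output for (spacer, target)
}

/// Joint classification-regression score:
/// classify_prob * (regression + 4.0) - 4.0
fn combined_activity(classify_prob: f64, regression: f64) -> f64 {
    classify_prob * (regression + 4.0) - 4.0
}

/// Frequency-weighted sum of combined activity over all unique targets.
/// Assumes the frequencies already sum to 1.
fn multi_target_fitness(targets: &[Target]) -> f64 {
    targets
        .iter()
        .map(|t| t.freq * combined_activity(t.classify_prob, t.regression))
        .sum()
}
```

Under this formula, a classifier probability of 0 yields the floor value of -4.0 whatever the regression output, and a probability of 1 returns the regression output unchanged.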

After evolution, the optimizer also computes perc_highly_active for each top-k spacer: the frequency-weighted fraction of targets where the spacer is classified as "highly active" (both classification probability and regression score above their respective thresholds).
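
A minimal sketch of the perc_highly_active calculation follows. The two threshold parameters are hypothetical because this page does not state the actual cutoffs, "above" is read as a strict comparison, and TargetPrediction is an illustrative type.

```rust
/// Predictor outputs for one (spacer, target) pair plus the target frequency.
/// Illustrative type, not the library's own.
struct TargetPrediction {
    freq: f64,
    classify_prob: f64,
    regression: f64,
}

/// Frequency-weighted fraction of targets that clear both thresholds.
/// `prob_threshold` and `reg_threshold` are hypothetical parameters;
/// the real cutoffs are not given in this document.
fn perc_highly_active(
    preds: &[TargetPrediction],
    prob_threshold: f64,
    reg_threshold: f64,
) -> f64 {
    let total: f64 = preds.iter().map(|p| p.freq).sum();
    if total == 0.0 {
        return 0.0; // no targets, so nothing can be highly active
    }
    let active: f64 = preds
        .iter()
        .filter(|p| p.classify_prob > prob_threshold && p.regression > reg_threshold)
        .map(|p| p.freq)
        .sum();
    active / total
}
```

A target that passes only one of the two thresholds contributes nothing, so a spacer can have a reasonable mean fitness and still score low on this metric.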

Variant Identification

Maximizes activity against an on-target partition (t1) while minimizing activity against an off-target partition (t2), using sigmoidal cost functions:

t2_cost = c / (1 + a × exp(k × (t2_activity − o)))
t1_cost = c − c / (1 + a × exp(k × (t1_activity − o)))
fitness = −(t2w × t2_cost + t1_cost)

Hyperparameter	Default	Role
c	1.0	Sigmoid amplitude
a	5.897	Sigmoid scale factor
k	−2.858	Sigmoid steepness
o	−2.511	Sigmoid midpoint offset
t2w	1.737	Off-target cost weight
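
The cost functions and defaults above can be combined into a short sketch. It reads t1 as the on-target partition and t2 as the off-target partition, which follows from t2w being the off-target cost weight. The struct and function names are illustrative, and the activities are passed in as plain numbers.

```rust
/// Sigmoid hyperparameters, with the defaults from the table above.
struct SigmoidParams {
    c: f64,
    a: f64,
    k: f64,
    o: f64,
    t2w: f64,
}

impl Default for SigmoidParams {
    fn default() -> Self {
        SigmoidParams { c: 1.0, a: 5.897, k: -2.858, o: -2.511, t2w: 1.737 }
    }
}

/// Shared term: c / (1 + a * exp(k * (activity - o))).
fn sigmoid_term(p: &SigmoidParams, activity: f64) -> f64 {
    p.c / (1.0 + p.a * (p.k * (activity - p.o)).exp())
}

/// fitness = -(t2w * t2_cost + t1_cost)
fn variant_id_fitness(p: &SigmoidParams, t1_activity: f64, t2_activity: f64) -> f64 {
    let t2_cost = sigmoid_term(p, t2_activity); // grows with off-target activity
    let t1_cost = p.c - sigmoid_term(p, t1_activity); // shrinks with on-target activity
    -(p.t2w * t2_cost + t1_cost)
}
```

Because k is negative, the shared term rises toward c as activity increases. The fitness is therefore always negative, approaches 0 for a spacer that is active on-target and inactive off-target, and is bounded below by -(t2w × c + c).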

Diversity Filter

After evolution, a greedy Hamming distance filter ensures sequence diversity in the output. Spacers are iterated in descending fitness order; each spacer is kept only if its Hamming distance to all previously kept spacers exceeds the threshold (default: 3).

Setting the threshold to 0 disables filtering entirely. After filtering, results are truncated to top_k_per_site (default: 5).
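
The greedy filter can be sketched as below. This is an illustration under stated assumptions: candidates are plain (sequence, fitness) pairs rather than the optimizer's own types, "exceeds" is implemented as a strict comparison, and a threshold of 0 is treated as a bypass, as the text describes.

```rust
/// Number of mismatched positions between two equal-length sequences.
fn hamming(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Greedy diversity filter over (sequence, fitness) pairs.
/// Hypothetical helper, not the optimizer's actual function.
fn diversity_filter(
    mut candidates: Vec<(String, f64)>,
    threshold: usize,
    top_k: usize,
) -> Vec<(String, f64)> {
    // Visit candidates in descending fitness order.
    candidates.sort_by(|x, y| y.1.total_cmp(&x.1));
    let mut kept: Vec<(String, f64)> = Vec::new();
    for cand in candidates {
        // Keep a candidate only if it is far enough from every kept spacer.
        let diverse = threshold == 0
            || kept
                .iter()
                .all(|k| hamming(k.0.as_bytes(), cand.0.as_bytes()) > threshold);
        if diverse {
            kept.push(cand);
        }
    }
    kept.truncate(top_k); // top_k_per_site
    kept
}
```

Since higher-fitness spacers are visited first, a near-duplicate can only ever be displaced by a better-scoring neighbor, never the other way around.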

Output

The optimizer produces an OptimizerOutput containing per-site results (SiteOptimResult). Each site result includes:

Field	Description
spacers	Optimized spacers as ScoredSpacerCandidates with full quality flags and tier
shannon_entropy	Average Shannon entropy across the spacer region at this site
consensus_fitness	Fitness of the consensus seed spacer (baseline for improvement)
num_targets / num_valid_seqs	Unique targets and total valid sequences at the site
mean_on/off_target_activity	Weighted combined activity against each partition (variant-id only)
site_targets	Per-target sequences with frequencies and partition labels

Cross-site convenience methods include best_spacer(), all_spacers_ranked(), and summary() for aggregated statistics (total spacers, novel count, best fitness, mean improvement over consensus).

Optimizer Weight Preset

Optimized spacers use the for_optimizer() assay score weight preset, which differs from the standard default in two component weights and in the ML activity range:

Setting	Default	Optimizer
ml_activity	0.30	0.35
heuristic_quality	0.10	0.05
ml_activity_range	(0.0, 4.0)	(2.0, 4.0)

The narrower ML activity range of (2.0, 4.0) is used because optimizer fitness values (shifted by +4.0) cluster in that band. The default (0.0, 4.0) range would compress their spread, making it hard to differentiate top candidates. See the Assay Score page for the full weight breakdown.
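
The effect of the narrower range can be illustrated with a small sketch. It assumes the range is applied as a clamped linear rescaling to [0, 1], which this page does not confirm, and normalize_activity is a hypothetical name.

```rust
/// Clamp `value` into `range` and rescale linearly to [0, 1].
/// Linear min-max scaling is an assumption about how the
/// ml_activity_range setting is applied.
fn normalize_activity(value: f64, range: (f64, f64)) -> f64 {
    let (lo, hi) = range;
    ((value - lo) / (hi - lo)).clamp(0.0, 1.0)
}
```

Under this assumption, two shifted fitness values of 3.0 and 3.5 map to 0.75 and 0.875 with the default range, a gap of 0.125. With the optimizer range they map to 0.5 and 0.75, doubling the gap to 0.25.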