Heuristic Quality
Sequence composition scoring combining GC content, homopolymer runs, and PFS preference into a single normalized component.
Overview
The heuristic quality component combines three sub-scores — GC content quality, homopolymer penalty, and PFS (Protospacer Flanking Sequence) preference — into a single [0, 1] value. It carries a default weight of 0.10 in the assay score and is always active (it requires no optional pipeline stages).
Enzyme-Specific Sub-Weights
The three sub-scores are combined with enzyme-specific weights:
| Enzyme | GC Weight | Homopolymer Weight | PFS Weight | Formula |
|---|---|---|---|---|
| Cas12 | 0.60 | 0.40 | — (not used) | 0.60 × gc + 0.40 × homo |
| Cas13 | 0.50 | 0.30 | 0.20 | 0.50 × gc + 0.30 × homo + 0.20 × pfs |
Cas12 does not use PFS because it relies on PAM recognition rather than protospacer flanking sequences. Cas13 variants may have PFS preferences that affect guide efficacy.
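The weighted combination in the table above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the names `SUB_WEIGHTS` and `heuristic_quality` are hypothetical, and raising an error when a required sub-score is missing is an assumption made for the example.

```python
# Sub-weights copied from the table above; the dict layout is an illustrative choice.
SUB_WEIGHTS = {
    "Cas12": {"gc": 0.60, "homo": 0.40},               # PFS not used
    "Cas13": {"gc": 0.50, "homo": 0.30, "pfs": 0.20},
}


def heuristic_quality(enzyme, gc, homo, pfs=None):
    """Combine sub-scores (each in [0, 1]) using the enzyme's sub-weights."""
    weights = SUB_WEIGHTS[enzyme]
    scores = {"gc": gc, "homo": homo, "pfs": pfs}
    total = 0.0
    for name, weight in weights.items():
        value = scores[name]
        if value is None:
            # Assumption: a missing required sub-score is treated as an error.
            raise ValueError(f"{enzyme} requires a {name} sub-score")
        total += weight * value
    return total
```

For example, with gc = 1.0 and homo = 0.5, Cas12 yields 0.60 × 1.0 + 0.40 × 0.5 = 0.80, while Cas13 with pfs = 0.0 yields 0.50 + 0.15 + 0.00 = 0.65.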
GC Content Quality
GC content is scored with a piecewise linear function. The optimal range is 40–60%, which scores 1.0. Outside this range, the score falls linearly, reaching 0.0 at 20% and 80%:
| GC Range | Score | Formula / Notes |
|---|---|---|
| 40–60% | 1.0 | Optimal — no penalty |
| 20–40% | 0.0 → 1.0 (linear ramp) | (gc − 0.20) / 0.20 |
| 60–80% | 1.0 → 0.0 (linear ramp) | (0.80 − gc) / 0.20 |
| ≤20% or ≥80% | 0.0 | Extreme GC — maximum penalty |
Quality flags are raised whenever GC content falls outside the optimal range: LowGc when GC < 40%, HighGc when GC > 60%.
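The piecewise function and the two flags can be sketched as below. This is an illustrative sketch: the function names are hypothetical, GC is expressed as a fraction in [0, 1], and rejecting an empty sequence is an assumption.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("sequence must not be empty")  # assumption for the sketch
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def gc_quality(gc):
    """Piecewise linear GC score following the table above."""
    if 0.40 <= gc <= 0.60:
        return 1.0                      # optimal range
    if 0.20 < gc < 0.40:
        return (gc - 0.20) / 0.20       # ramp up from 0.0 to 1.0
    if 0.60 < gc < 0.80:
        return (0.80 - gc) / 0.20       # ramp down from 1.0 to 0.0
    return 0.0                          # <= 20% or >= 80%


def gc_flags(gc):
    """Flag names taken from the text; the list return type is an assumption."""
    flags = []
    if gc < 0.40:
        flags.append("LowGc")
    if gc > 0.60:
        flags.append("HighGc")
    return flags
```

A spacer at 30% GC sits halfway up the lower ramp, so it scores about 0.5 and carries the LowGc flag.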
Homopolymer Quality
Homopolymer runs (consecutive identical nucleotides) are penalized based on the longest run in the spacer sequence:
| Max Run Length | Score | Notes |
|---|---|---|
| 0–3 | 1.0 | No penalty — short runs are acceptable |
| 4 | 0.75 | Mild penalty |
| 5 | 0.50 | Moderate penalty; flag raised at ≥5 |
| 6 | 0.25 | Severe penalty |
| ≥7 | 0.0 | Maximum penalty |
For runs of 4–6, the formula is: 1.0 - (max_run - 3.0) / 4.0. This produces a linear ramp from 1.0 (at 3) to 0.0 (at 7). Runs of 7 or more always score 0.0.
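As a rough sketch of the scoring above, the longest run can be found with `itertools.groupby` and then mapped through the ramp. The function names are hypothetical, and treating the input as case-insensitive is an assumption.

```python
from itertools import groupby


def max_homopolymer_run(spacer):
    """Length of the longest run of identical nucleotides (0 for an empty string)."""
    return max((sum(1 for _ in group) for _, group in groupby(spacer.upper())),
               default=0)


def homopolymer_quality(max_run):
    """Score from the table above: 1.0 up to a run of 3, 0.0 from 7, linear between."""
    if max_run <= 3:
        return 1.0
    if max_run >= 7:
        return 0.0
    return 1.0 - (max_run - 3.0) / 4.0
```

For a spacer such as `ACGTTTTTAC`, the longest run is 5, which maps to a score of 0.50 and meets the flagging threshold of 5 or more.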
Poly-T/U Stretches
Poly-T stretches (4+ consecutive T/U) are detected and flagged with Critical severity, but the flag itself does not contribute to any score component. A poly-T stretch acts as a transcription terminator for RNA Polymerase III (Pol III), which is used to express guide RNAs in vivo. A poly-T stretch in the spacer would therefore cause premature termination of the crRNA transcript, making the guide non-functional in Pol III expression systems.
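Detection of such stretches could look like the sketch below, which uses a regular expression. The names are hypothetical, and two details are assumptions rather than documented behavior: T and U are treated as interchangeable within one stretch, and results are reported as zero-based half-open (start, end) positions.

```python
import re

# Four or more consecutive T/U characters, case-insensitive.
POLY_T = re.compile(r"[TU]{4,}", re.IGNORECASE)


def find_poly_t(spacer):
    """Return (start, end) positions of every poly-T/U stretch in the spacer."""
    return [(m.start(), m.end()) for m in POLY_T.finditer(spacer)]


def has_poly_t(spacer):
    """True when the spacer would receive the Critical poly-T flag."""
    return POLY_T.search(spacer) is not None
```

Because the flag is separate from scoring, a caller would typically report the result alongside the heuristic quality score rather than fold it into the score.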
PFS Preference (Cas13 Only)
Some Cas13 variants have preferences for the nucleotide immediately flanking the protospacer. The PFS sub-score uses three levels:
| PFS Status | Score | Meaning |
|---|---|---|
| Favorable | 1.0 | Flanking nucleotide matches variant preference |
| Unknown | 0.5 | No flanking context available; neutral assumption |
| Unfavorable | 0.0 | Flanking nucleotide is disfavored by the variant |
Variant-Specific Rules
| Variant | Rule | Favorable | Unfavorable |
|---|---|---|---|
| LwaCas13a | Avoid G at 3′ of protospacer | A, C, T/U at 3′ | G at 3′ |
| LbuCas13a | Avoid G at 3′ of protospacer | A, C, T/U at 3′ | G at 3′ |
| PsmCas13b | Avoid C at 5′ of protospacer | A, G, T/U at 5′ | C at 5′ |
| Generic | No PFS requirement | Always favorable | — |
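When flanking context is not available (e.g., the spacer is at the edge of the input sequence), the PFS check defaults to favorable (passes). Note that the sub-score table above lists a neutral 0.5 for the Unknown status, so the pass/fail check and the sub-score may treat this case differently. For Cas12 enzymes, PFS is not evaluated and does not contribute to the heuristic quality score.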
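The variant rules can be sketched as a small lookup table. This is an illustration under stated assumptions: the names `PFS_RULES` and `pfs_score` are hypothetical, flanking bases are passed as single characters or `None`, and missing context is scored 0.5 following the Unknown row of the sub-score table rather than the "defaults to favorable" wording.

```python
# variant -> (side of the protospacer to inspect, disfavored base); None = no rule.
PFS_RULES = {
    "LwaCas13a": ("3prime", "G"),
    "LbuCas13a": ("3prime", "G"),
    "PsmCas13b": ("5prime", "C"),
    "Generic": None,
}


def pfs_score(variant, flank_5=None, flank_3=None):
    """Return 1.0 (favorable), 0.5 (unknown), or 0.0 (unfavorable)."""
    rule = PFS_RULES[variant]           # KeyError for variants not in the table
    if rule is None:
        return 1.0                      # Generic: always favorable
    side, disfavored = rule
    base = flank_3 if side == "3prime" else flank_5
    if base is None:
        return 0.5                      # assumption: no flanking context -> Unknown
    return 0.0 if base.upper() == disfavored else 1.0
```

For instance, `pfs_score("LwaCas13a", flank_3="G")` is unfavorable, while `pfs_score("PsmCas13b", flank_5="A")` is favorable.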