Homopolymer Scoring
How SPACER detects and penalizes runs of consecutive identical nucleotides in guide RNA spacers.
What Are Homopolymers?
A homopolymer is a run of consecutive identical nucleotides within a sequence — for example, AAAA (poly-A), CCCC (poly-C), GGGG (poly-G), or TTTT/UUUU (poly-T/U). Homopolymer runs are problematic in guide RNA design for several reasons:
- Synthesis errors: Oligo synthesis platforms have higher error rates in homopolymer regions, leading to insertions and deletions
- Reduced specificity: Repetitive sequences are more likely to have off-target matches in complex genomes
- Secondary structures: Poly-G runs can form G-quadruplexes; poly-C runs can form i-motifs
- Enzyme inhibition: Certain homopolymer configurations interfere with Cas enzyme loading and activity
Detection Algorithm
SPACER scans each spacer sequence and identifies the longest run of any single nucleotide. The detection considers all four bases independently and reports the maximum run length found:
Spacer: A U G C C C C U A G G G U U A A C G U C
              ↑ ↑ ↑ ↑     ↑ ↑ ↑
              poly-C (4)  poly-G (3)
Longest homopolymer: 4 (CCCC)
Scoring Function
The homopolymer score is based on the length of the longest run. Short runs (≤3 nt) are considered normal and receive no penalty. Penalties increase with run length:
| Longest Run | Score | Interpretation |
|---|---|---|
| 1–3 nt | 1.0 | Normal — no homopolymer concern |
| 4 nt | 0.5 | Moderate — synthesis risk, minor activity impact |
| 5 nt | 0.25 | Significant — likely synthesis issues, reduced activity |
| ≥6 nt | 0.0 | Severe — strong recommendation against use |
The penalty applies to the single longest run, not the sum of all runs. A spacer with two 3-nt runs (e.g., AAA and GGG) still scores 1.0, while a spacer with a single 4-nt run (AAAA) scores 0.5.
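The detection and scoring rules above can be sketched in a few lines of Python. This is an illustration only, not SPACER's actual implementation: the function names `longest_homopolymer` and `homopolymer_score` are hypothetical, and the sketch assumes that T and U are treated as the same base and that input case is ignored.

```python
from itertools import groupby
from typing import Tuple

# Scores for the two intermediate tiers in the table above.
# Runs of 1-3 nt and runs of 6+ nt are handled explicitly below.
_TIER_SCORES = {4: 0.5, 5: 0.25}


def longest_homopolymer(spacer: str) -> Tuple[int, str]:
    """Return (length, base) of the longest run of one repeated base.

    Assumption: the sequence is upper-cased and T is rewritten as U so
    that DNA and RNA spellings of a spacer give the same result.
    """
    seq = spacer.upper().replace("T", "U")
    best_len, best_base = 0, ""
    for base, group in groupby(seq):
        run = sum(1 for _ in group)
        if run > best_len:  # strict '>' keeps the first run on ties
            best_len, best_base = run, base
    return best_len, best_base


def homopolymer_score(spacer: str) -> float:
    """Map the longest run length to the score in the table above."""
    length, _ = longest_homopolymer(spacer)
    if length <= 3:
        return 1.0
    if length >= 6:
        return 0.0
    return _TIER_SCORES[length]
```

For the example spacer in the Detection Algorithm section, `longest_homopolymer("AUGCCCCUAGGGUUAACGUC")` returns `(4, "C")` and `homopolymer_score` returns `0.5`. Only the longest run is consulted, so the additional GGG run does not change the result.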
Quality Flag
When the longest homopolymer run is 4 or more nucleotides, SPACER raises the HOMOPOLYMER quality flag:
| Flag | Condition | Meaning |
|---|---|---|
| HOMOPOLYMER | Longest run ≥ 4 nt | Spacer contains a homopolymer that may affect synthesis or activity |
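The flag condition is a single threshold check on the longest run length. A minimal sketch, assuming the run length has already been computed; the function name `homopolymer_flags` and the constant name are hypothetical, not part of SPACER's API:

```python
HOMOPOLYMER_FLAG_THRESHOLD = 4  # nt, taken from the table above


def homopolymer_flags(longest_run: int) -> list:
    """Return ["HOMOPOLYMER"] when the longest run is 4 nt or more."""
    if longest_run >= HOMOPOLYMER_FLAG_THRESHOLD:
        return ["HOMOPOLYMER"]
    return []
```

Because the flag threshold matches the first penalized tier of the scoring table, every spacer with a homopolymer score below 1.0 also carries the flag.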
Base-Specific Risks
While all homopolymers are penalized equally in the scoring function, their biological risks differ:
| Run Type | Primary Risk |
|---|---|
| Poly-A | Weak binding (A-U and A-T pairs have only 2 hydrogen bonds); synthesis slippage |
| Poly-C | Can form i-motif structures at slightly acidic pH; synthesis challenges |
| Poly-G | G-quadruplex formation; strong secondary structures; highest synthesis error rate |
| Poly-T/U | Acts as a Pol III terminator signal; see Poly-T Scoring for dedicated handling |
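Since the score itself does not depend on the base, a caller who wants to apply the table above needs to know which base forms each run. The following sketch lists every run of 4 nt or more together with an abbreviated risk note. The function name `flagged_runs`, the `RISKS` mapping, and the T-to-U normalization are assumptions made for illustration; SPACER may report this information differently.

```python
import re

# Primary risk per base, abbreviated from the table above.
RISKS = {
    "A": "weak binding; synthesis slippage",
    "C": "i-motif formation; synthesis challenges",
    "G": "G-quadruplex formation; highest synthesis error rate",
    "U": "Pol III terminator signal",
}

# One base followed by at least three more copies of itself (4+ nt).
_RUN = re.compile(r"([ACGU])\1{3,}")


def flagged_runs(spacer: str) -> list:
    """List (start, base, length, risk) for every run of 4+ nt.

    Assumptions: T is normalized to U, and start is a 0-based index.
    """
    seq = spacer.upper().replace("T", "U")
    return [
        (m.start(), m.group(1), len(m.group(0)), RISKS[m.group(1)])
        for m in _RUN.finditer(seq)
    ]
```

Unlike the score, which looks only at the single longest run, this helper reports every qualifying run, so a spacer containing both a poly-U and a poly-G run surfaces both risks.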