Decoding Cancer's Blueprint

An Ensemble Method for Calling and Ranking Somatic Structural Variants Using Long and Short Reads

Walter Gallego Gomez, Elena Grassi, Andrea Bertotti, Gianvito Urgese
Politecnico di Torino & University of Torino

The Genomic Jigsaw Puzzle

Structural variants (SVs) are large-scale DNA changes such as deletions, insertions, and rearrangements. In cancer, they are not just errors; they are often key drivers of tumor growth. But finding them is like spotting a single incorrect piece in a million-piece puzzle. 🧩

Why Is This So Hard?

Somatic vs. Germline

We must distinguish new tumor-specific (somatic) changes from inherited (germline) variants.

Tumor Complexity

Tumors are a mix of normal and cancer cells, with variants often present at low frequencies.

Technical Limits

Traditional short-read sequencing struggles to see large SVs, while newer long-read tools can be imprecise.

Our Approach: The Power of Ensemble

We don't rely on a single source of truth. Our method combines the strengths of multiple tools and data types to build a high-confidence, unified result.

Long Reads + Short Reads + Multiple Callers = A Clearer Picture

The Ensemble Workflow

🧬

Input Data

Paired tumor and normal samples, each sequenced with both long (ONT) and short (Illumina) reads.

⬇️
📡

Long-Read SV Calling

Run three specialized callers in parallel: NanomonSV, SAVANA, and cuteSV.

⬇️
🤝

Ensemble Merge

Combine the results, identifying overlapping calls to increase confidence.
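One common way to decide that two callers report the same deletion is reciprocal overlap. The sketch below illustrates that idea only; the 80% threshold, the greedy first-match merge, and the names Deletion, reciprocal_overlap, and merge_calls are assumptions made for illustration, not the merge rule used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Deletion:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    callers: set = field(default_factory=set)


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions (overlap/len(a), overlap/len(b))."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def merge_calls(calls_by_caller, min_ro=0.8):
    """Greedily merge calls; min_ro is a hypothetical threshold."""
    merged = []
    for caller, calls in calls_by_caller.items():
        for chrom, start, end in calls:
            new = Deletion(chrom, start, end, {caller})
            for existing in merged:
                if reciprocal_overlap(existing, new) >= min_ro:
                    existing.callers.add(caller)  # same event, more support
                    break
            else:
                merged.append(new)  # no match: keep as a new candidate
    return merged


# Toy coordinates, invented for this example.
calls = {
    "nanomonsv": [("chr1", 1000, 2000), ("chr2", 500, 900)],
    "savana": [("chr1", 1010, 1990)],
    "cutesv": [("chr1", 990, 2005), ("chr3", 100, 400)],
}
merged = merge_calls(calls)
for d in merged:
    print(d.chrom, d.start, d.end, sorted(d.callers))
```

Calls supported by more callers carry more confidence into the later scoring step; calls seen by a single caller are kept rather than discarded so that ranking, not filtering, decides their fate.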

Validation & Ranking

🔬

Short-Read Validation

Use short-read data to find supporting evidence (gaps, soft-clipping, insert size) for each potential SV.
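The three evidence types can be read from standard SAM fields (POS, CIGAR, TLEN). The sketch below is a simplified illustration and not the authors' implementation: the function names, the 10 bp tolerance, and the 600 bp insert-size cutoff are all hypothetical, and reads are passed as plain tuples rather than loaded from a BAM file.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def read_evidence(pos, cigar, tlen, del_start, del_end, slop=10, max_insert=600):
    """Evidence types that one read gives for a deletion [del_start, del_end).

    pos is the 0-based leftmost aligned position, tlen the SAM TLEN field.
    slop and max_insert are hypothetical thresholds.
    """
    evidence = set()
    ref = pos
    seen_aligned = False
    for n, op in ((int(n), op) for n, op in CIGAR_RE.findall(cigar)):
        # Gap: a deletion inside the alignment matching both breakpoints.
        if op == "D" and abs(ref - del_start) <= slop and abs(ref + n - del_end) <= slop:
            evidence.add("gap")
        if op == "S":
            # A leading clip should sit at the right breakpoint,
            # a trailing clip at the left breakpoint.
            target = del_start if seen_aligned else del_end
            if abs(ref - target) <= slop:
                evidence.add("soft_clip")
        if op in REF_CONSUMING:
            ref += n
            seen_aligned = True
    # Insert size: the leftmost mate of a pair that spans the whole deletion
    # and looks much longer than the library's expected fragment size.
    if tlen > max_insert and pos < del_start and pos + tlen > del_end:
        evidence.add("insert_size")
    return evidence


def summarize(reads, del_start, del_end):
    """Count supporting reads per evidence type."""
    counts = Counter()
    for pos, cigar, tlen in reads:
        counts.update(read_evidence(pos, cigar, tlen, del_start, del_end))
    return counts


# Toy reads (pos, CIGAR, TLEN) around a deletion at 1000-1500.
reads = [
    (900, "100M500D50M", 0),  # gapped alignment
    (1500, "30S70M", 0),      # clipped at the right breakpoint
    (920, "80M20S", 0),       # clipped at the left breakpoint
    (800, "100M", 900),       # pair spanning the deletion
    (5000, "100M", 300),      # unrelated read
]
support = summarize(reads, 1000, 1500)
print(dict(support))
```

In a real pipeline the tuples would come from the aligned short-read BAM, for example through a library such as pysam, restricted to a window around each candidate.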

⬇️
📈

Scoring & Ranking

Calculate a final score for each SV based on all evidence from all sources.
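To make the idea of a combined score concrete, the sketch below adds up caller agreement and short-read support and then sorts candidates. It is a stand-in for illustration only: the poster does not give the actual scoring rules, so the weights, the minimum read count, and the names WEIGHTS, score_sv, and rank are all assumptions.

```python
# Hypothetical weights; the poster does not state the real scoring rules.
WEIGHTS = {"callers": 0.5, "gap": 0.2, "soft_clip": 0.2, "insert_size": 0.1}


def score_sv(n_callers, short_read_counts, total_callers=3, min_reads=2):
    """Score in [0, 1]: caller agreement plus short-read evidence types."""
    score = WEIGHTS["callers"] * n_callers / total_callers
    for kind in ("gap", "soft_clip", "insert_size"):
        # An evidence type counts only if enough reads show it.
        if short_read_counts.get(kind, 0) >= min_reads:
            score += WEIGHTS[kind]
    return round(score, 3)


def rank(candidates):
    """Sort candidates from highest to lowest score."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


# Toy candidates, invented for this example.
candidates = [
    {"id": "del_A", "score": score_sv(1, {})},
    {"id": "del_B", "score": score_sv(3, {"gap": 5, "soft_clip": 4, "insert_size": 3})},
    {"id": "del_C", "score": score_sv(2, {"gap": 3})},
]
for c in rank(candidates):
    print(c["id"], c["score"])
```

The point of scoring instead of hard filtering is that weakly supported calls stay in the list but sink to the bottom.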

⬇️
🏆

Prioritized Output

Produce a single, ranked list of high-confidence somatic deletions, ready for validation.

Precision Meets Recall

Tested on the Espejo Valle-Inclan benchmark (COLO829 cell line).

92%
Recall

Found 35 of 38 true somatic deletions.

92%
Precision

Only 3 high-scoring false positives.

-71%
Noise Reduction

Most of the 71 false positives received a low rank.
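The recall and precision figures follow from the counts shown above. The short check below assumes that precision is computed over high-scoring calls only (35 true positives and 3 false positives) and that the 3 missed deletions are the false negatives; the function name is illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# 35 of 38 true deletions found; 3 high-scoring false positives.
precision, recall = precision_recall(tp=35, fp=3, fn=38 - 35)
print(f"precision={precision:.1%} recall={recall:.1%}")
# prints: precision=92.1% recall=92.1%
```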

The Power of Ranking

Our final score successfully separates high-confidence true positives from low-confidence noise.

[Chart: final score of true positives vs. false positives]

A Symphony of Callers

Each component plays a crucial role in the final, accurate result.

NanomonSV

The Finder: High recall, finds almost everything, but with some noise.

SAVANA

The Confirmer: High precision, very stringent, misses some real events.

Short Reads

The Validator: Independent evidence, confirming events with orthogonal data.

From Data to Discovery

Our method provides a robust, prioritized list of somatic SVs, which means researchers can:

🎯 Focus Efforts

Prioritize experimental validation on the most promising candidates.

🧬 Uncover Drivers

More reliably identify cancer-driving SVs for downstream analysis.

⚙️ Build Standards

Move towards a reproducible, gold-standard pipeline for somatic SV detection.

Conclusion & Future Work

Conclusion

Our ensemble approach successfully leverages the strengths of long-read callers and short-read validation to produce a high-quality, ranked list of somatic deletions, significantly improving on individual tools.

Future Directions

  • Extend support to other SV types like insertions.
  • Test on more benchmarks and real-world data.
  • Refine the rule-based approach for analysis without a truth set.

Thank You

Questions?

Read the full paper:
doi.org/10.1145/3700666.3700694