An Ensemble Method for Calling and Ranking Somatic Structural Variants

Combining Long and Short DNA Reads for Unprecedented Accuracy in Cancer Genomics

Based on research by W. Gallego Gomez, E. Grassi, A. Bertotti, and G. Urgese

The Challenge in Cancer Genomics

Detecting the large-scale genetic alterations that drive cancer is a critical but notoriously difficult task.

The Problem with SVs

Somatic Structural Variants (SVs) are large DNA changes that can initiate or accelerate cancer. However, standard methods often miss them, because they struggle to distinguish true somatic variants from benign germline variants and technical artifacts.

Low Precision & Recall

Existing SV calling tools, especially for long-read sequencing, are often hampered by low precision (many false positives) or low recall (missing true variants). This uncertainty complicates research and clinical validation efforts.

Our Ensemble Solution: A Multi-Layered Pipeline

We developed a novel method that combines the strengths of multiple tools and data types to produce a single, high-confidence ranked list of somatic deletions.

Step 1: Data Input

The process begins with paired Tumor and Normal samples, each sequenced with two technologies.

Normal Sample

Long Reads (Nanopore) + Short Reads (Illumina)

Tumor Sample

Long Reads (Nanopore) + Short Reads (Illumina)

Step 2: Long-Read SV Calling

Three specialized long-read SV callers analyze the data independently to identify potential somatic variants.

NanomonSV, SAVANA, CuteSV

Step 3: Ensemble & Validation

The results are merged (Ensemble) and cross-validated using evidence from the high-precision short-read data, increasing confidence in each call.
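The merging step can be illustrated with a minimal sketch. The source does not state how calls from the three tools are matched, so everything here is an assumption made for illustration: two deletions on the same chromosome are treated as the same event when both breakpoints fall within a hypothetical `max_dist` of each other, and the coordinates are invented. Short-read cross-validation is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Deletion:
    chrom: str
    start: int
    end: int
    callers: set = field(default_factory=set)  # tools that reported this event


def same_event(a, b, max_dist=100):
    """Assumed matching rule: same chromosome, both breakpoints within max_dist bp."""
    return (
        a.chrom == b.chrom
        and abs(a.start - b.start) <= max_dist
        and abs(a.end - b.end) <= max_dist
    )


def merge_calls(calls_by_caller, max_dist=100):
    """Greedily collapse per-caller deletion calls into one ensemble list."""
    merged = []
    for caller, calls in calls_by_caller.items():
        for chrom, start, end in calls:
            candidate = Deletion(chrom, start, end, {caller})
            for existing in merged:
                if same_event(existing, candidate, max_dist):
                    existing.callers.add(caller)  # record extra support
                    break
            else:
                merged.append(candidate)  # first time this event is seen
    return merged


# Invented example calls, not data from the study.
calls = {
    "NanomonSV": [("chr1", 1000, 5000), ("chr2", 200, 900)],
    "SAVANA": [("chr1", 1020, 4990)],
    "CuteSV": [("chr1", 990, 5010), ("chr3", 50, 400)],
}
merged = merge_calls(calls)
for d in merged:
    print(d.chrom, d.start, d.end, sorted(d.callers))
```

In this toy input the chr1 deletion is reported by all three callers and collapses to a single entry, while the chr2 and chr3 deletions each keep a single supporting caller.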

Final Output: Ranked Somatic Deletions

The final result is a ranked list, prioritizing deletions with the strongest evidence from all sources, ready for downstream analysis and validation.
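One way to picture the ranking is the sketch below. The scoring formula is not given in the source, so the `score` function, its weights, and the candidate records are all hypothetical: each deletion earns credit for every long-read caller that reported it, plus a bonus when short-read evidence agrees.

```python
def score(n_callers, short_read_support, caller_weight=1.0, short_read_weight=2.0):
    """Hypothetical evidence score: caller agreement plus a short-read bonus."""
    bonus = short_read_weight if short_read_support else 0.0
    return caller_weight * n_callers + bonus


# Invented candidates, not data from the study.
candidates = [
    {"id": "DEL_A", "n_callers": 3, "short_read_support": True},
    {"id": "DEL_B", "n_callers": 1, "short_read_support": False},
    {"id": "DEL_C", "n_callers": 2, "short_read_support": True},
    {"id": "DEL_D", "n_callers": 2, "short_read_support": False},
]

# Strongest evidence first.
ranked = sorted(
    candidates,
    key=lambda c: score(c["n_callers"], c["short_read_support"]),
    reverse=True,
)
ranked_ids = [c["id"] for c in ranked]
print(ranked_ids)  # ['DEL_A', 'DEL_C', 'DEL_D', 'DEL_B']
```

With these assumed weights, a deletion seen by two callers and confirmed by short reads outranks one seen by two callers alone, which mirrors the idea of prioritizing calls supported by all sources.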

Precision & Power: The Benchmark Results

Our method was evaluated against the gold-standard Espejo Valle-Inclan benchmark, demonstrating a significant improvement in accuracy.

35/38
True Positives Detected

Successfully identified 92% of the curated somatic deletions in the truth set.
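The 92% figure is the recall implied by the two counts above, as this short check shows. Only the numbers 35 and 38 come from the source; the function name is illustrative. Precision is not computed because the total number of false positives is not stated here.

```python
def recall(true_positives, false_negatives):
    """Fraction of truth-set variants that were detected."""
    return true_positives / (true_positives + false_negatives)


truth_set_size = 38   # curated somatic deletions in the benchmark
true_positives = 35   # deletions recovered by the ensemble
false_negatives = truth_set_size - true_positives  # 3 missed deletions

r = recall(true_positives, false_negatives)
print(f"Recall: {r:.1%}")  # Recall: 92.1%
```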

3
High-Scoring False Positives

The ranking system effectively filtered out noise: only three false positives received high scores, and most of the others were ranked low.

Performance Comparison

[Chart: precision and recall of Our Ensemble, SAVANA, and NanomonSV]

The Impact: Accelerating Cancer Research

This ensemble method provides a more robust foundation for studying the role of structural variants in cancer.

Prioritize and Validate

The ranked output allows researchers to focus experimental validation efforts on the most promising SV candidates, saving significant time, effort, and resources.

Increase Confidence

By integrating multiple callers and data types, our approach mitigates the weaknesses of individual tools, producing a more reliable and trustworthy set of somatic SVs.

Enable Future Discoveries

A robust method for SV detection is instrumental for future single- and pan-cancer studies, helping to fully define the landscape of genomic instability in cancer.