# An Ensemble Method for Calling & Ranking Somatic SVs
## 1. Problem & Motivation
- **Challenges in Somatic SV Detection**
- Distinguishing somatic vs. germline
- Dealing with subclonal variants
- Tumor heterogeneity & contamination
- **Limitations of Sequencing Technologies**
- Short reads miss large events
- Long reads have lower precision/recall
- **Need for Improved Methods**
- Lack of a "gold standard" tool for SVs
- SVs are crucial cancer drivers (est. 55%)
## 2. Proposed Solution: An Ensemble Method
- **Core Concept**
- Combine multiple long-read callers' strengths
- Mitigate individual tool weaknesses
- **Key Features**
- Integrates long-read & short-read evidence
- Produces a ranked list of deletions
- Prioritizes events for validation
- **Primary Focus**
- Initially focused on deletions
## 3. Methodology & Pipeline Workflow
- **Input Data**
- Paired tumor-normal samples
- Long-read (Nanopore) & short-read (Illumina) data
- **Long-Read SV Calling**
- NanomonSV
- SAVANA (classified & filtered)
- CuteSV-sub (custom subtraction)
- **Ensemble Step**
- Merges calls across callers when they overlap by >=75%
- Union mode (NanomonSV + SAVANA)
- Validation mode (adds CuteSV-sub & SAVANA 'filtered' calls)
- **Short-Read Validation**
- Method 1: Gap & Soft-clipping
- Method 2: Insert-size variation
- **Ranking Algorithm**
- Calculates final score per deletion
- Tallies evidence for/against
- Normalizes all evidence sources
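
The ensemble and ranking steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Deletion` class, the caller labels, the choice of *reciprocal* overlap, and the score (fraction of evidence sources that support a deletion) are all assumptions made for the example. The outline only states a >=75% overlap threshold, a union mode, a validation mode, and a normalized evidence tally.

```python
from dataclasses import dataclass

# Hypothetical caller labels; the split into "discovery" and "reinforcement"
# sources is one reading of the union/validation modes in the outline.
DISCOVERY = {"nanomonsv", "savana_classified"}
REINFORCE = {"cutesv_sub", "savana_filtered"}


@dataclass(frozen=True)
class Deletion:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    caller: str


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Shared length divided by the longer call (0.0 if disjoint)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def build_ensemble(calls, min_overlap=0.75):
    """Greedy clustering against the first call of each cluster."""
    clusters = []
    # Union mode: discovery callers may open new clusters.
    discovery = sorted((c for c in calls if c.caller in DISCOVERY),
                       key=lambda c: (c.chrom, c.start))
    for call in discovery:
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    # Validation mode: reinforcement callers only add support, never new events.
    for call in (c for c in calls if c.caller in REINFORCE):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
    return clusters


def rank(clusters):
    """Placeholder score: distinct supporting sources / all sources."""
    n_sources = len(DISCOVERY | REINFORCE)
    scored = [(len({c.caller for c in cl}) / n_sources, cl[0]) for cl in clusters]
    return sorted(scored, key=lambda item: item[0], reverse=True)


calls = [
    Deletion("1", 1000, 2000, "nanomonsv"),
    Deletion("1", 1010, 1990, "savana_classified"),
    Deletion("1", 1005, 2005, "cutesv_sub"),
    Deletion("2", 500, 900, "nanomonsv"),
    Deletion("3", 100, 400, "cutesv_sub"),  # no discovery support: ignored
]
ranked = rank(build_ensemble(calls))
for score, rep in ranked:
    print(f"{rep.chrom}:{rep.start}-{rep.end}\tscore={score:.2f}")
```

In this toy input, the chromosome 1 deletion is supported by three of four sources and ranks first, while the chromosome 2 deletion has a single supporting caller. A fuller version would also fold in the short-read evidence and any evidence against an event, which the outline mentions but does not specify.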
## 4. Evaluation & Benchmark
- **Dataset Used**
- Espejo Valle-Inclan benchmark
- COLO829 melanoma cell line
- **Ground Truth**
- Curated truth set (38 somatic deletions)
- Multi-platform validation
- **Configuration**
- Min SV length: 30 bp
- Reference: GRCh37
## 5. Key Results & Performance
- **Overall Ensemble Performance**
- Found 35/38 true positives (Recall: 0.92)
- Only 3 high-scoring false positives (Precision: 0.92)
- Rule-based approach: Precision 1.00, Recall 0.89
- **Ranking Efficacy**
- Successfully prioritized true positives
- Most false positives (67/71) had low scores
- **Analysis of Discrepancies**
- False Positives: Likely true but small, or germline
- False Negatives: Complex, very large, or near germline variants
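
The headline ensemble figures can be reproduced from the counts above with the standard definitions of precision and recall. The short check below assumes that the reported precision counts only the 3 high-scoring false positives, which is how the outline pairs those numbers; the helper names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported calls that are true: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of truth-set events that were found: TP / (TP + FN)."""
    return tp / (tp + fn)


truth_size = 38          # curated somatic deletions in the truth set
tp = 35                  # true positives found by the ensemble
fn = truth_size - tp     # 3 missed deletions
fp_high_scoring = 3      # assumption: only high-scoring FPs enter precision

ensemble_precision = round(precision(tp, fp_high_scoring), 2)
ensemble_recall = round(recall(tp, fn), 2)
print(ensemble_precision, ensemble_recall)  # 0.92 0.92
```

By the same arithmetic, the rule-based recall of 0.89 is consistent with 34 of the 38 truth-set deletions being recovered (34/38 ≈ 0.89).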
## 6. Individual Tool Performance
- **NanomonSV**
- **Strength:** Highest recall
- **Weakness:** Most false positives
- **SAVANA**
- 'Classified': Too stringent (low recall)
- 'Filtered': Recovers TPs but adds FPs
- **CuteSV-sub**
- **Strength:** Good for reinforcement (high precision)
- **Weakness:** Too stringent for discovery (low recall)
- **Short-Read Validation**
- **Strength:** Highly complementary methods
- **Strength:** Very reliable and low FP rate
## 7. Discussion & Future Work
- **Main Contributions**
- Ensemble compensates for tool weaknesses
- Ranking aids event prioritization
- **Limitations**
- Scarcity of somatic SV benchmarks
- Focused only on deletions
- **Future Directions**
- Extend to other SV types (e.g., insertions)
- Refine filtering for real-world data
- Test with other technologies (e.g., PacBio)