# An Ensemble Method for Calling & Ranking Somatic SVs

## 1. Problem & Motivation

- **Challenges in Somatic SV Detection**
  - Distinguishing somatic vs. germline
  - Dealing with subclonal variants
  - Tumor heterogeneity & contamination
- **Limitations of Sequencing Technologies**
  - Short-reads miss large events
  - Long-reads have lower precision/recall
- **Need for Improved Methods**
  - Lack of a "gold standard" tool for SVs
  - SVs are crucial cancer drivers (est. 55%)

## 2. Proposed Solution: An Ensemble Method

- **Core Concept**
  - Combine multiple long-read callers' strengths
  - Mitigate individual tool weaknesses
- **Key Features**
  - Integrates long-read & short-read evidence
  - Produces a ranked list of deletions
  - Prioritizes events for validation
- **Primary Focus**
  - Initially focused on deletions

## 3. Methodology & Pipeline Workflow

- **Input Data**
  - Paired tumor-normal samples
  - Long-read (Nanopore) & short-read (Illumina) data
- **Long-Read SV Calling**
  - NanomonSV
  - SAVANA (classified & filtered)
  - CuteSV-sub (custom subtraction)
- **Ensemble Step**
  - Combines callers via overlap (>=75%)
  - Union mode (NanomonSV + SAVANA)
  - Validation mode (add CuteSV & SAVANA-filtered)
- **Short-Read Validation**
  - Method 1: Gap & Soft-clipping
  - Method 2: Insert-size variation
- **Ranking Algorithm**
  - Calculates final score per deletion
  - Tallies evidence for/against
  - Normalizes all evidence sources

## 4. Evaluation & Benchmark

- **Dataset Used**
  - Espejo Valle-Inclan benchmark
  - COLO829 melanoma cell line
- **Ground Truth**
  - Curated truth set (38 somatic deletions)
  - Multi-platform validation
- **Configuration**
  - Min SV length: 30 bp
  - Reference: GRCh37

## 5. Key Results & Performance

- **Overall Ensemble Performance**
  - Found 35/38 true positives (Recall: 0.92)
  - Only 3 high-scoring false positives (Precision: 0.92)
  - Rule-based approach: Precision 1.00, Recall 0.89
- **Ranking Efficacy**
  - Successfully prioritized true positives
  - Most false positives (67/71) had low scores
- **Analysis of Discrepancies**
  - False Positives: Likely true but small, or germline
  - False Negatives: Complex, very large, or near germline

## 6. Individual Tool Performance

- **NanomonSV**
  - **Strength:** Highest recall
  - **Weakness:** Most false positives
- **SAVANA**
  - 'Classified': Too stringent (low recall)
  - 'Filtered': Recovers TPs but adds FPs
- **CuteSV-sub**
  - **Strength:** Good for reinforcement (high precision)
  - **Weakness:** Too stringent for discovery (low recall)
- **Short-Read Validation**
  - **Strength:** Highly complementary methods
  - **Strength:** Very reliable and low FP rate

## 7. Discussion & Future Work

- **Main Contributions**
  - Ensemble compensates for tool weaknesses
  - Ranking aids event prioritization
- **Limitations**
  - Scarcity of somatic SV benchmarks
  - Focused only on deletions
- **Future Directions**
  - Extend to other SV types (insertions)
  - Refine filtering for real-world data
  - Test with other technologies (PacBio)
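
The ensemble step in Section 3 combines deletion calls from different callers when they overlap by at least 75%. The outline does not say how that overlap is measured or how calls are grouped, so the following is only a minimal sketch under two assumptions: overlap is taken as reciprocal overlap (the shared span divided by the longer of the two calls), and calls are grouped greedily against the first call of each cluster. The `Deletion` class and the functions `reciprocal_overlap`, `merge_calls`, and `supporting_callers` are hypothetical names used for illustration, not part of any caller's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deletion:
    """A deletion call; coordinates are half-open [start, end)."""
    chrom: str
    start: int
    end: int
    caller: str


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Smaller of the two overlap fractions (assumed overlap definition)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def merge_calls(calls, min_overlap=0.75):
    """Greedily cluster calls: a call joins the first cluster whose
    representative (first member) it overlaps by >= min_overlap."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([call])
    return clusters


def supporting_callers(cluster):
    """Set of caller names that contributed to a merged event."""
    return {c.caller for c in cluster}


# Made-up coordinates, for illustration only.
calls = [
    Deletion("1", 1000, 2000, "NanomonSV"),
    Deletion("1", 1050, 2010, "SAVANA"),
    Deletion("1", 5000, 5200, "NanomonSV"),
    Deletion("2", 1000, 2000, "SAVANA"),
]
clusters = merge_calls(calls)
for cluster in clusters:
    first = cluster[0]
    print(first.chrom, first.start, first.end, sorted(supporting_callers(cluster)))
```

In this sketch the number of distinct supporting callers per cluster is the kind of evidence a ranking step could tally; how the actual pipeline weights and normalizes that evidence is not specified in the outline.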