needLR: Long-read structural variant annotation with population-scale frequency estimation
Abstract
Summary: We present needLR, a structural variant (SV) annotation tool that can be used for filtering and prioritization of candidate pathogenic SVs from long-read sequencing data using population allele frequencies, annotations for genomic context, and gene-phenotype associations. When using population data from 500 presumably healthy individuals to evaluate nine test cases with known pathogenic SVs, needLR assigned allele frequencies to over 97.5% of all detected SVs and reduced the average number of novel genic SVs to 121 per case while retaining all known pathogenic variants. Availability and Implementation: needLR is implemented in bash with dependencies including Truvari v4.2.2, BEDTools v2.31.1, and BCFtools v1.19. Source code, documentation, and pre-computed population allele frequency data are freely available at https://github.com/jgust1/needLR under an MIT license.
Summary
The paper introduces needLR, a structural variant (SV) annotation tool designed for long-read sequencing (LRS) data. LRS detects more SVs than short-read sequencing (SRS), but population-level LRS data for annotating them has been limited, which makes it hard to filter and prioritize candidate pathogenic SVs. needLR addresses this by combining LRS-derived allele frequencies from a population of 500 individuals from the 1KGP-LRSC with genomic context annotation and gene-phenotype associations. It uses Truvari to merge SVs for allele frequency calculation and BEDTools for genomic context annotation, and it outputs both a tab-separated summary (TSV) file and a VCF file.
The authors validated needLR on nine positive control samples with known pathogenic SVs that were missed by standard clinical SRS but detected by LRS. needLR assigned allele frequencies to over 97.5% of detected SVs, so that variants already seen in controls (AF > 0) can be filtered out, and it reduced the average number of novel genic SVs to 121 per case while retaining every known pathogenic variant. Using the LRS-derived control database instead of gnomAD v4.1 (SRS-derived) as the reference reduced the number of candidate pathogenic SVs by almost an order of magnitude. The authors also compare needLR with other SV annotation tools (SVAFotate, STIX, SvAnna, and AnnotSV) and highlight its integration of LRS-specific allele frequencies with genomic context annotation, which they present as useful for both research and clinical applications.
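The filtering idea behind the "novel genic SVs" figure can be illustrated with a minimal Python sketch. This is not needLR's code (the tool itself is implemented in bash), and the record layout, field names, and gene symbols below are hypothetical. It also assumes that "novel" means an allele frequency of zero in the control population and that "genic" means overlapping at least one gene.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedSV:
    """Hypothetical annotated SV record; field names are illustrative only."""
    sv_id: str
    sv_type: str
    pop_af: float            # allele frequency in controls (0.0 if never observed)
    genes: Tuple[str, ...]   # gene symbols the SV overlaps (empty if intergenic)


def novel_genic(svs: List[AnnotatedSV]) -> List[AnnotatedSV]:
    """Keep SVs absent from controls (AF == 0) that overlap at least one gene."""
    return [sv for sv in svs if sv.pop_af == 0.0 and sv.genes]


# Toy input: one common genic SV, one novel intergenic SV, one novel genic SV.
svs = [
    AnnotatedSV("sv1", "DEL", 0.42, ("GENE_A",)),
    AnnotatedSV("sv2", "INS", 0.0, ()),
    AnnotatedSV("sv3", "DEL", 0.0, ("GENE_B",)),
]
candidates = novel_genic(svs)
```

In this toy input only `sv3` survives both filters, which mirrors how a large call set shrinks to a short candidate list.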
Key Insights
- needLR assigns allele frequencies to over 97.5% of all detected SVs using population data from 500 individuals.
- needLR reduced the average number of novel genic SVs to 121 per case while retaining all known pathogenic variants in the validation set.
- Using needLR with an LRS-derived control SV database reduced the number of candidate pathogenic SVs by almost an order of magnitude compared to using the SRS-derived control SV data from gnomAD v4.1.
- needLR utilizes customizable Truvari-based merging parameters, allowing users to adjust sequence similarity threshold, size similarity tolerance, and reference distance tolerance for optimal SV matching.
- needLR provides ancestry-specific allele frequencies across the five 1KGP superpopulations (African, American, East Asian, European, and South Asian), improving variant interpretation in diverse patient populations.
- The tool flags variants with unexpected genotype distributions using Hardy-Weinberg equilibrium testing for quality control.
- Current limitations include the exclusion of breakend variants and SVs >1 Mbp, as well as limited sex chromosome analysis due to Sniffles2's performance.
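To make the allele frequency and Hardy-Weinberg points above concrete, here is a minimal Python sketch. It is not needLR's implementation. The summary does not say which test statistic the tool uses, so the one-degree-of-freedom chi-square test here is an assumption, as are the function names and the example genotype counts.

```python
import math
from typing import Tuple


def allele_frequency(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """Alternate-allele frequency from diploid genotype counts."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotyped individuals")
    return (n_het + 2 * n_alt_hom) / (2 * n)


def hwe_chi_square(n_ref_hom: int, n_het: int, n_alt_hom: int) -> Tuple[float, float]:
    """Chi-square statistic and p-value (1 degree of freedom) for deviation
    from Hardy-Weinberg proportions. Assumed test; needLR may use another."""
    n = n_ref_hom + n_het + n_alt_hom
    q = allele_frequency(n_ref_hom, n_het, n_alt_hom)
    p = 1.0 - q
    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for one SV across 500 individuals.
af = allele_frequency(405, 90, 5)              # in Hardy-Weinberg proportions
chi2_ok, p_ok = hwe_chi_square(405, 90, 5)     # no deviation expected
chi2_bad, p_bad = hwe_chi_square(450, 0, 50)   # no heterozygotes: strong deviation
```

A variant like the second example, with many homozygous carriers and no heterozygotes, is the kind of genotype distribution such a quality-control flag is meant to catch. The same allele frequency calculation can be repeated on the subset of individuals in each superpopulation to obtain ancestry-specific frequencies.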
Practical Implications
- needLR can be used in clinical settings to improve the diagnostic yield of LRS by facilitating the filtering and prioritization of candidate pathogenic SVs.
- Researchers can use needLR to analyze LRS data from diverse cohorts, which may enable more accurate identification of disease-associated SVs.
- Clinical laboratories can integrate needLR into their standardized annotation workflows to make use of LRS while maintaining accuracy and efficiency.
- Future research directions include incorporating additional SV callers, adding trio analysis functionality for de novo variant detection, and expanding compatibility with new reference genomes and technology-specific population backends.
- needLR is open source, so the community can contribute to its further development.