The TAG array of a multiple sequence alignment
Episode

The TAG array of a multiple sequence alignment

Nov 24, 20258:59
q-bio.GN
No ratings yet

Abstract

Modern genomic analyses increasingly rely on pangenomes, that is, representations of the genome of entire populations. The simplest representation of a pangenome is a set of individual genome sequences. Compared to e.g. sequence graphs, this has the advantage that efficient exact search via indexes based on the Burrows-Wheeler Transform (BWT) is possible, that no chimeric sequences are created, and that the results are not influenced by heuristics. However, such an index may report a match in thousands of positions even if these all correspond to the same locus, making downstream analysis unnecessarily expensive. For sufficiently similar sequences (e.g. human chromosomes), a multiple sequence alignment (MSA) can be computed. Since an MSA tends to group similar strings in the same columns, it is likely that a string occurring thousands of times in the pangenome can be described by very few columns in the MSA. We describe a method to tag entries in the BWT with the corresponding column in the MSA and develop an index that can map matches in the BWT to columns in the MSA in time proportional to the output. As a by-product, we can efficiently project a match to a designated reference genome, a capability that current pangenome aligners based on the BWT lack.

Summary

This paper addresses the problem of efficiently mapping matches found in a Burrows-Wheeler Transform (BWT) index of a pangenome to specific loci within a multiple sequence alignment (MSA). The authors observe that while BWT-based indexes are efficient for exact search in pangenomes, they can report numerous matches corresponding to the same genomic location, increasing downstream analysis costs. To mitigate this, they introduce a method to "tag" entries in the BWT with corresponding columns in the MSA. This is achieved by constructing a TAG array, which associates each BWT position with its corresponding MSA column. They then develop a sampled index over the TAG array that allows mapping matches in the BWT to columns in the MSA in time proportional to the output size. A key contribution is an efficient algorithm for constructing the run-length encoded TAG array and a sampling strategy that guarantees a bounded number of steps to reach a sampled TAG value. As an application, they demonstrate the ability to project matches to a designated reference genome, a capability lacking in some current BWT-based pangenome aligners. The methodology involves several steps: first, computing the TAG array using the BWT and LF-mapping. Second, run-length encoding the TAG array to reduce its size. Third, strategically sampling the TAG array to balance space usage and query performance. Finally, constructing an index that allows efficient retrieval of distinct TAG values within a given BWT interval using techniques from the document listing problem. The authors present algorithms for each of these steps, including a greedy algorithm for selecting sampled TAG runs. The experimental results on human chromosome 19 haplotypes show that the proposed method enables efficient mapping of multiple exact matches (MEMs) to a linear reference sequence, achieving competitive performance compared to existing tools like ropebwt3.

Key Insights

  • Introduced the concept of a TAG array to link BWT positions to MSA columns, enabling efficient localization of matches within a pangenome context.
  • Developed a linear-time algorithm for constructing the run-length encoded TAG array, leveraging the LF-mapping of the BWT and contextual locality.
  • Proposed a novel sampling strategy for the TAG array, guaranteeing that any unsampled TAG value can be reached from a sampled value within a bounded number of LF steps (less than 's' steps for a sampling rate of 's').
  • Presented an optimal greedy algorithm for selecting the sampled TAG runs, minimizing the number of sampled runs while adhering to the distance constraint.
  • Implemented a space-efficient index for reporting distinct TAG values within a BWT interval, adapting techniques from the document listing problem and achieving output-sensitive time complexity.
  • Experimental results demonstrate that the method can map MEMs to MSA columns faster than ropebwt3 finds the MEMs themselves, with a sampling rate of 4 (76.0 s vs. 81.7 s).
  • The size of the dollar-EBWT used in the experiment had 91,081,437 runs, and the TAG index had 137,981,814 runs, with 97,979,057 having a TAG value.

Practical Implications

  • Enables efficient mapping of reads or other sequence queries to a designated reference genome within a pangenome, improving accuracy and reducing reference bias in genomic analyses.
  • Benefits researchers and clinicians working with pangenomes, particularly those analyzing closely related genomes like human populations.
  • Practitioners can use the developed algorithms and data structures to build indexes that facilitate faster and more accurate pangenome analysis. Specifically, the method allows for mapping BWT-based MEMs to columns in the MSA in time proportional to the output.
  • Opens up future research directions in improving chaining heuristics for seed-and-extend alignment algorithms using BWT-based indexes and exploring applications of TAG arrays in other genomic analysis tasks.
  • The developed techniques can be applied in metagenomic read classification by tagging BWT positions with metagenomic class identifiers.

Links & Resources

Authors