Auxiliary Gene Learning: Spatial Gene Expression Estimation by Auxiliary Gene Selection

Nov 23, 2025 | 9:34
Machine Learning | Computer Vision and Pattern Recognition | q-bio.GN

Abstract

Spatial transcriptomics (ST) is a novel technology that enables the observation of gene expression at the resolution of individual spots within pathological tissues. ST quantifies the expression of tens of thousands of genes in a tissue section; however, heavy observational noise is often introduced during measurement. In prior studies, to ensure meaningful assessment, both training and evaluation have been restricted to only a small subset of highly variable genes, and genes outside this subset have also been excluded from the training process. However, since there are likely co-expression relationships between genes, low-expression genes may still contribute to the estimation of the evaluation target. In this paper, we propose *Auxiliary Gene Learning* (AGL) that utilizes the benefit of the ignored genes by reformulating their expression estimation as auxiliary tasks and training them jointly with the primary tasks. To effectively leverage auxiliary genes, we must select a subset of auxiliary genes that positively influence the prediction of the target genes. However, this is a challenging optimization problem due to the vast number of possible combinations. To overcome this challenge, we propose Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB), a method that ranks genes by leveraging prior knowledge and relaxes the combinatorial selection problem into a differentiable top-$k$ selection problem. The experiments confirm the effectiveness of incorporating auxiliary genes and show that the proposed method outperforms conventional auxiliary task learning approaches.

Summary

This paper addresses the problem of noisy gene expression measurements in spatial transcriptomics (ST) data. Because of this noise, current methods train and evaluate on only a small subset of highly variable genes and discard the rest, many of them low-expression genes, which prevents the model from exploiting co-expression relationships between the discarded genes and the target genes. The authors propose Auxiliary Gene Learning (AGL), which treats the estimation of these previously ignored genes as auxiliary tasks to improve the prediction of target genes. A key challenge is selecting a subset of auxiliary genes that positively influence the prediction, which is a difficult combinatorial optimization problem given the large number of candidate genes.

To tackle this selection problem, the authors introduce Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB). DkGSB ranks genes by their expression variance (HVG score) as a proxy for signal quality and then learns a single scalar *k* that determines how many of the top-ranked genes are used as auxiliary tasks. The combinatorial selection problem is thereby relaxed into a differentiable top-*k* selection problem, enabling gradient-based optimization. A bi-level optimization scheme is used: the network weights are optimized to minimize the loss on both primary and auxiliary tasks, while *k* is optimized to minimize the validation loss on the primary tasks.

Experiments on public datasets demonstrate that AGL with DkGSB outperforms both conventional methods that discard low-expression genes and other auxiliary task learning approaches. This work matters to the field because it provides a way to leverage previously discarded information in ST data, improving the accuracy of gene expression estimation.
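
The ranking and relaxed top-*k* selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid-based relaxation, the temperature `tau`, the use of plain variance as the HVG score, and all function names are hypothetical choices made for this example.

```python
import math
from statistics import pvariance


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rank_by_variance(expr_by_gene):
    """Order candidate auxiliary genes by descending variance across spots.

    Assumption: plain variance stands in for the HVG score used as prior knowledge.
    """
    return sorted(expr_by_gene, key=lambda g: pvariance(expr_by_gene[g]), reverse=True)


def soft_topk_mask(n_genes, k, tau=0.2):
    """Soft version of the indicator [rank < k] for ranks 0..n_genes-1.

    Each weight is a smooth function of the scalar k, so k can receive gradients.
    The 0.5 offset puts the transition for rank r at k = r + 0.5.
    Assumption: this sigmoid relaxation is an illustrative stand-in.
    """
    return [sigmoid((k - (rank + 0.5)) / tau) for rank in range(n_genes)]


def joint_loss(primary_losses, aux_losses_ranked, k, tau=0.2):
    """Primary loss plus auxiliary losses weighted by the soft top-k mask."""
    mask = soft_topk_mask(len(aux_losses_ranked), k, tau)
    return sum(primary_losses) + sum(m * l for m, l in zip(mask, aux_losses_ranked))


# Toy usage with made-up expression values (three genes, four spots).
expr = {"g1": [0, 0, 1, 0], "g2": [1, 5, 2, 9], "g3": [2, 2, 4, 2]}
order = rank_by_variance(expr)
mask = soft_topk_mask(len(order), k=2.0)
```

With a small `tau` the mask approaches a hard cut after the top *k* genes, while a larger `tau` spreads the transition over neighbouring ranks and gives smoother gradients with respect to *k*.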

Key Insights

  • Novelty: The paper proposes a novel Auxiliary Gene Learning (AGL) framework that leverages previously discarded, low-expression genes as auxiliary tasks to improve spatial gene expression estimation.
  • Differentiable Top-k Selection: The Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection (DkGSB) module offers a differentiable relaxation of the combinatorial gene selection problem, allowing for end-to-end optimization using gradient descent.
  • Bi-level Optimization: The bi-level optimization scheme learns the number of auxiliary genes (*k*) by minimizing the validation loss on the primary tasks, while the network weights are trained on the joint loss over primary and auxiliary tasks.
  • Performance Improvement: Experiments show that AGL with DkGSB outperforms conventional methods, including "AGL + All" (using all auxiliary genes) and other auxiliary task learning techniques, across different tissue types. For example, in the intra-batch experiment, AGL+DkGSB achieves a higher Pearson correlation coefficient (PCC) than PGL (primary gene learning) on BOWEL A (0.551 vs 0.514), BOWEL B (0.440 vs 0.419), and OVARY (0.458 vs 0.448).
  • Prior Knowledge Integration: Using HVG scores as prior knowledge for ranking auxiliary genes proves effective, as demonstrated by the comparison with random gene selection. When the number of auxiliary genes is limited, HVG-score-based selection yields a significantly greater performance improvement than random selection.
  • Robustness: The method demonstrates robustness to batch effects, maintaining its effectiveness even in inter-batch experiments on the HEART dataset.
  • Limitations: The approach relies solely on HVG scores for ranking, which may not fully capture complex biological relationships between genes. In addition, the contribution of auxiliary genes may vary spatially within the tissue, which this work does not address.
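
The bi-level scheme from the list above can be illustrated on a deliberately tiny problem. The sketch below is a toy under stated assumptions, not the paper's training procedure: the model is a single scalar weight `w`, all targets are made-up numbers, the soft mask is a hypothetical sigmoid relaxation, and the outer gradient is approximated with a one-step lookahead and central finite differences rather than automatic differentiation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_topk_mask(n, k, tau=0.2):
    """Soft indicator of [rank < k]; an illustrative relaxation (assumption)."""
    return [sigmoid((k - (rank + 0.5)) / tau) for rank in range(n)]


# Hypothetical toy data: every loss has the form (w - target)^2.
TRAIN_TARGET = 0.6                     # noisy primary target seen in training
VAL_TARGET = 1.0                       # primary target on held-out validation data
AUX_TARGETS = [1.0, 1.0, -2.0, -2.0]   # ranked; first two help, last two mislead


def inner_step(w, k, lr=0.05):
    """One gradient step on the training loss: primary + masked auxiliary terms."""
    mask = soft_topk_mask(len(AUX_TARGETS), k)
    grad = 2.0 * (w - TRAIN_TARGET)
    grad += sum(m * 2.0 * (w - t) for m, t in zip(mask, AUX_TARGETS))
    return w - lr * grad


def val_loss(w):
    """Validation loss on the primary task only."""
    return (w - VAL_TARGET) ** 2


def bilevel_train(steps=2000, lr_k=0.5, eps=1e-4):
    w, k = 0.0, 1.0
    for _ in range(steps):
        # Outer level: how would the validation loss change if k were nudged
        # before taking the next weight update? (one-step lookahead)
        up = val_loss(inner_step(w, k + eps))
        down = val_loss(inner_step(w, k - eps))
        hypergrad = (up - down) / (2.0 * eps)
        k = min(max(k - lr_k * hypergrad, 0.0), float(len(AUX_TARGETS)))
        # Inner level: update the weight with the current soft selection.
        w = inner_step(w, k)
    return w, k


w, k = bilevel_train()
```

In this toy setting `k` drifts upward while the helpful auxiliary targets dominate the lookahead gradient and stops before the misleading ones receive much weight, so the validation loss ends below what training on the noisy primary target alone would give.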

Practical Implications

  • Improved Gene Expression Estimation: The AGL framework with DkGSB can be used to improve the accuracy of gene expression estimation from pathological images in spatial transcriptomics studies.
  • Application Areas: This research benefits researchers and practitioners in fields such as cancer biology, drug discovery, and developmental biology, where accurate gene expression profiling is crucial for understanding disease mechanisms and developing new therapies.
  • Tool for Practitioners: Practitioners and engineers can use the DkGSB module as a plug-in component for existing spatial transcriptomics models to leverage information from low-expression genes and improve prediction accuracy.
  • Future Research Directions: The paper opens up several future research directions, including incorporating biological functional relationships among genes beyond HVG scores, developing spatially adaptive mechanisms for weighting auxiliary genes based on their local relevance, and exploring alternative frameworks for data selection alongside gene selection.
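
To check on one's own data whether adding auxiliary genes improves estimation, predictions can be scored with the Pearson correlation coefficient used in the reported results. The helper below is a minimal sketch: the function names are hypothetical, averaging PCC over genes is an assumed aggregation, and returning 0.0 for constant inputs is a convention chosen for this example rather than anything stated in the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # assumed convention: correlation is undefined for constants
    return cov / math.sqrt(vx * vy)


def mean_gene_pcc(pred_by_gene, true_by_gene):
    """Average per-gene PCC across spots (assumed aggregation over target genes)."""
    genes = list(true_by_gene)
    return sum(pearson(pred_by_gene[g], true_by_gene[g]) for g in genes) / len(genes)


# Toy usage with made-up values: two target genes measured at three spots.
truth = {"geneA": [1.0, 2.0, 3.0], "geneB": [3.0, 1.0, 2.0]}
baseline = {"geneA": [1.0, 1.5, 1.2], "geneB": [2.0, 2.1, 1.9]}
with_aux = {"geneA": [1.1, 2.1, 2.9], "geneB": [2.8, 1.2, 2.1]}
gain = mean_gene_pcc(with_aux, truth) - mean_gene_pcc(baseline, truth)
```

Comparing `mean_gene_pcc` for a model trained with and without auxiliary genes on the same held-out spots gives a like-for-like view of the kind of gain the summary reports.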

Links & Resources

Authors