Sub-sampling graph neural networks for genomic prediction of quantitative phenotypes
Kihlman, Ragini; Launonen, Ilkka; Sillanpää, Mikko J; Waldmann, Patrik (2024-09-09)
Oxford University Press
09.09.2024
Ragini Kihlman, Ilkka Launonen, Mikko J Sillanpää, Patrik Waldmann, Sub-sampling graph neural networks for genomic prediction of quantitative phenotypes, G3 Genes|Genomes|Genetics, 2024, jkae216, https://doi.org/10.1093/g3journal/jkae216
https://creativecommons.org/licenses/by/4.0/
© The Author(s) 2024. Published by Oxford University Press on behalf of The Genetics Society of America. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202409105782
Abstract
In genomics, the use of deep learning (DL) is growing rapidly, and DL has demonstrated its ability to uncover complex relationships in large biological and biomedical data sets. With the development of high-throughput sequencing techniques, genomic markers can now be allocated to large sections of a genome. By analysing allele sharing between individuals, one can calculate realized genomic relationships from single nucleotide polymorphism (SNP) data rather than relying on known pedigree relationships under a polygenic model. Traditional approaches to genome-wide prediction (GWP) of quantitative phenotypes use genomic relationships in fixed global covariance modelling, possibly with a non-linear kernel mapping (for example, Gaussian processes). In contrast, the DL approaches proposed so far for GWP fail to take into account the non-Euclidean graph structure of relationships between individuals over several generations. In this paper, we propose one global graph convolutional network (GCN) and one local sub-sampling architecture (GCN-RS), both specifically designed to perform regression analysis based on genomic relationship information. A GCN is tailored to non-Euclidean spaces and consists of several layers of graph convolutions. Through these graph convolutional layers, the GCN maps input genomic markers to their quantitative phenotype values. The GCN-RS architecture is designed to further improve the GCN’s performance by sub-sampling the graph to reduce the dimensionality of the input data. The graphs are constructed using an iterative nearest neighbour approach. Comparisons show that the GCN-RS outperforms the popular Genomic Best Linear Unbiased Predictor (GBLUP) method on one simulated and three real data sets from wheat, mice and pigs, with a predictive improvement of 4.4% to 49.4% in terms of test mean squared error (MSE). This indicates that GCN-RS is a promising tool for genomic prediction in plants and animals. Furthermore, GCN-RS is computationally efficient, making it a viable option for large-scale applications.
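The pipeline summarised in the abstract (relationships from SNP data, a nearest-neighbour graph, graph convolutions, sub-sampling) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every specific choice is an assumption made for illustration: the VanRaden-style relationship matrix, a plain (non-iterative) k-nearest-neighbour graph, a Kipf-and-Welling-style normalised convolution, uniform random node sub-sampling, and all function names, sizes and the random toy genotypes.

```python
import numpy as np

def genomic_relationship(M):
    """Assumed VanRaden-style relationship matrix from (n, m) genotypes coded 0/1/2."""
    p = M.mean(axis=0) / 2.0                 # allele frequency per marker
    Z = M - 2.0 * p                          # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def knn_adjacency(G, k):
    """Symmetric 0/1 adjacency linking each individual to its k most related others."""
    n = G.shape[0]
    S = G.copy()
    np.fill_diagonal(S, -np.inf)             # an individual is not its own neighbour
    idx = np.argsort(-S, axis=1)[:, :k]      # k largest relationships per row
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)                # symmetrise

def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees (all >= 1)
    A_norm = A_hat / np.sqrt(np.outer(d, d)) # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(20, 50)).astype(float)   # toy genotypes, hypothetical
G = genomic_relationship(M)
A = knn_adjacency(G, k=3)
W = rng.normal(scale=0.1, size=(50, 8))               # untrained toy weights
H = gcn_layer(A, M, W)                                # (20, 8) node embeddings

# Assumed sub-sampling step: keep a random subset of individuals and the
# induced sub-graph, so each forward pass works on a smaller input.
keep = rng.choice(20, size=10, replace=False)
A_sub = A[np.ix_(keep, keep)]
H_sub = gcn_layer(A_sub, M[keep], W)                  # (10, 8)
```

In a full model, several such layers would be stacked and followed by a regression output trained against the phenotypes; the sketch only shows how relationship information enters as graph structure.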