Over the past decades, the emergence of high-density genome-wide single nucleotide polymorphism (SNP) arrays has revolutionized genetic analysis, particularly for complex traits. Linear mixed models that can integrate such large-scale genomic data are widely used for genomic prediction \citep{meuwissen2001prediction,vanraden2008efficient} and genome-wide association studies \citep{visscher201710}. Alongside these, Bayesian approaches have gained prominence for their ability to accommodate biologically meaningful prior information on the underlying genetic architecture in genomic prediction and to provide probabilistic interpretations of genetic associations.
Recent advances in high-throughput phenotyping platforms and multi-omics methodologies, coupled with progress in functional mapping, have paved the way for the generation of expansive phenomic, transcriptomic, and metabolomic profiles, along with detailed functional annotations of genomes. These new data sets open avenues for improving the accuracy of genomic prediction and the discovery of causal variants in genome-wide association studies.
In this thesis, three Bayesian models are developed to address three specific scenarios, each aimed at further improving the accuracy of genome-wide prediction and the power of genome-wide association analyses: 1) the analysis of large-scale longitudinal data; 2) the analysis of high-dimensional and highly correlated (molecular) phenotypic data; and 3) the analysis of multiple traits with comprehensive functional annotation of the genome.
The first part of this research presents the development and application of a Bayesian random regression model with mixture priors, named RR-BayesC, to analyze time-series measurements in genome-enabled analyses. Based on simulated and real rice (\textit{Oryza sativa} L.) data, we show that RR-BayesC shows promise in distinguishing quantitative trait loci (QTL) that are invariant to temporal covariates from QTL that interact with time, resulting in high prediction accuracy even when phenotypes at later time points must be forecast.
The second part of this research proposes a high-dimensional Bayesian multivariate regression model with mixture priors, named MegaBayesC, to simultaneously analyze genetic variants underlying thousands of traits. In genomic prediction, MegaBayesC effectively integrates reflectance data from 620 hyperspectral wavelengths to improve the accuracy of breeding value prediction for grain yield in a wheat dataset. For genome-wide association studies, we use simulations to show that MegaBayesC can accurately estimate the effect sizes of QTL across various genetic architectures and causes of correlations among traits. We also apply MegaBayesC to identify genetic associations with flowering time in \textit{Arabidopsis thaliana}, leveraging expression data from 20,843 genes.
The third part of this research proposes an annotation-assisted, summary statistics-based multi-trait BayesC model, named MT-SBayesC-func. This model is designed to explore the pleiotropic genetic structure underlying two traits by incorporating additional information from functional annotations. In simulation studies, MT-SBayesC-func accurately estimates annotation-specific genetic correlations and polygenic overlaps, which deepens our understanding of the shared genetic mechanisms underlying complex traits.