Messenger RNA 3' polyadenylation (poly(A)) is an essential post-transcriptional processing step for most eukaryotic genes, significantly impacting many aspects of mRNA metabolism. The majority of eukaryotic genes exhibit alternative polyadenylation (APA), through which the same gene can have multiple alternative 3' ends due to cleavage and polyadenylation at distinct sites. APA results in RNA transcripts with different 3'UTRs, which can influence transcript transport, localization, stability, and translation, or lead to different protein products. Many human diseases, including cancer, have been associated with abnormal poly(A) regulation, highlighting the importance of this process. However, the rules governing how poly(A) sites are selected and regulated, the so-called poly(A) code, are not well understood.
Recent advances in high-throughput technologies have provided a great opportunity to elucidate the rules underlying APA. High-throughput sequencing (HTS) experiments yield a wealth of data regarding APA. Consequently, there is a need to develop computational techniques to mine these data. In this thesis, we present four major contributions furthering our understanding of the poly(A) code. The algorithms and computational methods we developed have all shown improved predictive and analytical capabilities over competing methods. They are as follows:
1) HTS reads need to be efficiently mapped back to a reference genome for further downstream analysis. To address this need, we developed "Hobbes", a fast and accurate read-mapping package that identifies all mapping locations for each read. Hobbes outperforms most state-of-the-art "all-mapping" programs, including mrsFast and Razers2.
2) We developed a bioinformatics pipeline for identifying and profiling genes with significant APA switches between different biological or clinical conditions. The pipeline includes calling poly(A) sites, filtering artificial poly(A) sites, clustering heterogeneous poly(A) sites, and identifying and profiling genes with significant APA switches. This pipeline has already provided significant insights into many core polyadenylation factors.
3) The poly(A) code can be partially deciphered from the genome-wide modeling of tissue-specific APA. Consequently, we extended an existing Shannon entropy measure to assess the tissue specificity of each poly(A) site, and applied an outlier detection method to identify tissue-specific patterns. With the new mRNA features we explored, our ensemble predictive model successfully discriminated tissue-specific poly(A) sites from constitutive poly(A) sites, with a test accuracy of 84.5% (auROC 0.92), which surpassed the previous model by more than 10%. Through an in-depth analysis of the most important features, we proposed a mechanism that controls the selection and regulation of tissue-specific APA.
4) Aberrant mRNA 3' polyadenylation has been implicated in a wide variety of complex diseases. We developed a novel statistical method for identifying disease-related pathways from genome-wide association studies (GWAS). We proposed to optimally select a representative SNP (single nucleotide polymorphism) set for each gene using an adaptive truncated product statistic, and conducted enrichment analysis via the weighted Kolmogorov-Smirnov test to identify enriched pathways. By applying the method to schizophrenia GWAS SNPs, we showed that it identifies pathways highly associated with the disease. Moreover, the results are reproducible across large, genetically distinct samples. This method can be used for detecting pathways involved in diseases caused by APA, such as cancer.
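
As an illustration of the entropy-based tissue specificity mentioned in contribution 3, the following minimal sketch computes the plain Shannon entropy of one poly(A) site's usage across tissues. It shows only the standard measure, not the extended measure or the outlier detection step developed in the thesis. The function name `tissue_entropy` and the usage values are hypothetical, chosen for illustration.

```python
import math


def tissue_entropy(usage):
    """Shannon entropy (in bits) of one poly(A) site's usage across tissues.

    `usage` is a list of non-negative usage levels, one per tissue
    (hypothetical input format). Low entropy suggests tissue-specific
    usage; entropy near log2(number of tissues) suggests constitutive usage.
    """
    if any(u < 0 for u in usage):
        raise ValueError("usage levels must be non-negative")
    total = sum(usage)
    if total <= 0:
        raise ValueError("at least one usage level must be positive")
    # Normalize usage levels to a probability distribution over tissues.
    probs = [u / total for u in usage]
    # Terms with p == 0 contribute nothing to the entropy and are skipped.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# Illustrative values only: a site used equally in four tissues,
# and a site used in a single tissue.
constitutive = tissue_entropy([10, 10, 10, 10])
specific = tissue_entropy([40, 0, 0, 0])
```

Under this measure, the evenly used site reaches the maximum entropy for four tissues (2 bits), while the site used in a single tissue has zero entropy.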