Networks pervade many disciplines of science as a way of analyzing complex systems with interacting components. The problem of network modeling is often two-fold. First, the relationships between pairs of nodes, if not directly observed, have to be estimated from data. Second, based on the estimated (or given) network topology, various statistical and computational tools can be applied to extract interesting patterns such as the presence of communities. In this thesis we present studies that address both parts of the problem.
We first discuss two studies in the context of gene regulatory networks, where the goal is to infer gene interactions from expression data. With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to infer gene interactions has been a major challenge in systems biology. The two studies differ in what they assume about how genes behave across the given samples. The first method applies to large heterogeneous samples, where the patterns of gene association may change across samples or exist only in a subset of them. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks, which take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide an asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable, if not better, overall performance.
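To make the idea of counting local rank patterns concrete, the following is a minimal sketch, not the statistics defined in this thesis. It assumes a hypothetical function name, `local_rank_pattern_count`, a hypothetical window length `k`, and the simplifying choice of counting only consecutive windows in which two expression profiles share the same rank ordering, which is the kind of pattern one might look for in time-course data.

```python
import numpy as np


def local_rank_pattern_count(x, y, k=3):
    """Count consecutive windows of length k where x and y share a rank order.

    Illustrative only: the function name, the window length k, and the
    restriction to consecutive windows are assumptions, not the thesis's
    definitions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    n = x.size
    if k < 2 or k > n:
        raise ValueError("k must satisfy 2 <= k <= len(x)")

    count = 0
    for start in range(n - k + 1):
        # argsort gives the ordering of the window; two windows without ties
        # have the same rank pattern exactly when their orderings agree.
        order_x = np.argsort(x[start:start + k], kind="stable")
        order_y = np.argsort(y[start:start + k], kind="stable")
        if np.array_equal(order_x, order_y):
            count += 1
    return count


if __name__ == "__main__":
    # Two profiles that move together only over the first few time points.
    gene_a = [1.0, 2.0, 3.0, 9.0, 8.0, 7.0]
    gene_b = [4.0, 5.0, 6.0, 1.0, 2.0, 3.0]
    print(local_rank_pattern_count(gene_a, gene_b, k=3))
```

Because the count depends only on ranks within each window, a single extreme value can change at most the `k` windows that contain it, which gives some intuition for why rank-based counts tend to be robust against outliers.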
In comparison, the second study goes beyond pairwise gene relationships to higher-level group interactions, but requires similar gene behavior across all samples. We introduce a new method for estimating group interactions using sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning and subsampling of the gene expression dataset. By considering different subsets of genes and different ways of grouping them, our interaction measure can be viewed as an aggregated estimate of partial correlations of different orders. Our approach is unique in evaluating conditional dependencies when the correct dependent sets are unknown or only partially known. A gene network can then be constructed using the interaction measures as edge weights, and gene functional groups can be inferred as tightly connected communities in the network. Comparisons with several popular approaches on simulated and real data show that our procedure improves both the statistical significance and the biological interpretability of the results. In addition to achieving considerably lower false positive rates, our procedure performs better in detecting important biological pathways.
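The combination of repeated random partitioning, subsampling, and a sparse canonical correlation step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this thesis: the sparse step is a simple alternating soft-thresholded power iteration on the cross-correlation matrix (a diagonal-covariance simplification), and the function names, the threshold `lam`, the half-and-half split, and the scoring by absolute loading products are all hypothetical choices.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of a toward zero by lam."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_cca_pair(x, y, lam=0.3, n_iter=50):
    """One pair of sparse canonical vectors (simplified, assumed variant)."""
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)
    y = (y - y.mean(axis=0)) / (y.std(axis=0) + 1e-12)
    c = x.T @ y / x.shape[0]  # cross-correlation between the two gene groups
    # Initialise v with the leading right singular vector of c.
    v = np.linalg.svd(c, full_matrices=False)[2][0]
    u = np.zeros(c.shape[0])
    for _ in range(n_iter):
        u = soft_threshold(c @ v, lam)
        if not u.any():  # everything was thresholded away
            return np.zeros(c.shape[0]), np.zeros(c.shape[1])
        u /= np.linalg.norm(u)
        v = soft_threshold(c.T @ u, lam)
        if not v.any():
            return np.zeros(c.shape[0]), np.zeros(c.shape[1])
        v /= np.linalg.norm(v)
    return u, v


def aggregated_interaction(expr, n_rounds=20, sample_frac=0.8, lam=0.3, seed=0):
    """Average |u_i * v_j| over random gene partitions and sample subsets.

    expr is a (samples x genes) array. The scoring rule is a hypothetical
    stand-in for the interaction measure described in the text.
    """
    rng = np.random.default_rng(seed)
    n, p = expr.shape
    score = np.zeros((p, p))
    for _ in range(n_rounds):
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=False)
        perm = rng.permutation(p)
        g1, g2 = perm[: p // 2], perm[p // 2:]  # random split into two groups
        u, v = sparse_cca_pair(expr[np.ix_(rows, g1)], expr[np.ix_(rows, g2)], lam)
        w = np.abs(np.outer(u, v))
        score[np.ix_(g1, g2)] += w
        score[np.ix_(g2, g1)] += w.T
    return score / n_rounds
```

The resulting symmetric matrix could serve as edge weights for a gene network. Pairs that never fall on opposite sides of a partition receive no score in that round, which is one reason many rounds are needed.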
Moving on to general networks, we then discuss model selection for the stochastic block model (SBM), a popular tool for community detection. We consider an approach based on the log-likelihood ratio statistic and analyze its asymptotic properties under model misspecification. We show that the limiting distribution of the statistic is normal in the case of underfitting, and obtain its convergence rate in the case of overfitting. These conclusions remain valid in the regime where the average degree grows at a polylogarithmic rate. The results enable us to derive the correct order of the penalty term for model complexity and arrive at a likelihood-based model selection criterion that is asymptotically consistent. In practice, the likelihood function can be approximated by more computationally efficient variational methods, allowing the criterion to be applied to moderately large networks.
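The general shape of a penalized likelihood criterion for the SBM can be illustrated with a short sketch. It makes several assumptions that are not taken from this thesis: the network is undirected with Bernoulli edges, a fixed labelling is plugged in instead of summing over latent labels or using a variational approximation, and the penalty is left as a user-supplied amount per block-probability parameter (`penalty_per_param`, a hypothetical name) rather than the order derived in the thesis.

```python
import numpy as np


def sbm_profile_loglik(adj, labels):
    """Bernoulli SBM log-likelihood with block probabilities set to their MLEs.

    adj is a symmetric 0/1 adjacency matrix without self-loops; labels gives a
    fixed community assignment (an assumption made for this sketch).
    """
    adj = np.asarray(adj)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    loglik = 0.0
    for idx, a in enumerate(groups):
        for b in groups[idx:]:
            ia = np.flatnonzero(labels == a)
            ib = np.flatnonzero(labels == b)
            block = adj[np.ix_(ia, ib)]
            if a == b:
                # Within a community, count each unordered pair once.
                edges = np.triu(block, k=1).sum()
                pairs = ia.size * (ia.size - 1) / 2
            else:
                edges = block.sum()
                pairs = ia.size * ib.size
            if pairs == 0:
                continue
            p_hat = edges / pairs
            # Blocks with p_hat equal to 0 or 1 contribute zero log-likelihood.
            if 0 < p_hat < 1:
                loglik += edges * np.log(p_hat) + (pairs - edges) * np.log(1 - p_hat)
    return float(loglik)


def penalized_criterion(adj, labels, penalty_per_param):
    """Log-likelihood minus a complexity penalty on the K(K+1)/2 block parameters."""
    k = np.unique(labels).size
    n_params = k * (k + 1) / 2
    return sbm_profile_loglik(adj, labels) - penalty_per_param * n_params
```

To compare candidate numbers of communities, one would evaluate `penalized_criterion` on a fitted labelling for each candidate and keep the largest value. How fast the penalty must grow with network size for the choice to be consistent is exactly what the asymptotic analysis determines.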