Abstract
A probabilistic model for random hypergraphs is introduced to represent unary, binary and higher order interactions among objects in real-world problems. This model is an extension of the latent class analysis model that introduces two clustering structures for hyperedges and captures variation in the size of hyperedges. An expectation maximization algorithm with minorization maximization steps is developed to perform parameter estimation. Model selection using the Bayesian Information Criterion is proposed. The model is applied to simulated data and two real-world data sets, where interesting results are obtained.
1 Introduction
A large number of random graph models have been proposed (Nowicki and Snijders 2001; Hoff et al. 2002; Handcock et al. 2007; Latouche et al. 2011) to describe complex interactions among objects of interest. Pairwise relationships among objects can be naturally represented as a graph, in which the objects are represented by the vertices, and two vertices are joined by an edge if a certain relationship exists between them. While graphs are capable of representing pairwise interactions between objects, they are inadequate for representing unary and higher-order interactions that are typically observed in many real-world problems. Examples of data with unary and higher-order interactions include co-authorship on academic papers, co-appearance in movie scenes, and songs performed in a concert.
For example, the study of coauthorship networks of scientists has attracted significant interest in both the natural and social sciences (Newman 2001a, b, 2004; Moody 2004; Azondekon et al. 2018). Such networks are typically constructed by connecting two scientists if they have coauthored one or more papers together. However, as we will illustrate below, such a representation inevitably results in loss of information, while a hypergraph representation naturally preserves all information. A hypergraph is a generalization of a graph in which hyperedges are arbitrary sets of vertices and can therefore contain any number of vertices. As a result, hypergraphs are capable of representing relationships of any order.
We consider a simple example of a coauthorship network with 7 authors and 4 papers, in order to illustrate the benefits of hypergraph modelling. A hypergraph representation of the network is given in Fig. 1, where the vertices \(v_1, v_2, \ldots , v_7\) represent the authors while the hyperedges \(e_1, \ldots , e_4\) represent the papers. For example, the paper \(e_1\) is written by four authors \(v_1, v_2\), \(v_3\) and \(v_4\), the paper \(e_2\) is written by two authors \(v_2\) and \(v_3\), the paper \(e_3\) has \(v_3\), \(v_5\) and \(v_6\) as authors and the paper \(e_4\) has a single author \(v_4\).
On the other hand, a graph representation of this coauthorship network with edges between any two authors who have coauthored at least one paper results in the edge set \(\{ (v_1, v_2), (v_1, v_3), (v_1, v_4), (v_2, v_3), (v_2, v_4), (v_3, v_4), (v_3, v_5), (v_3, v_6), (v_5, v_6)\}\). It is evident that much information is lost with this representation. In particular, this representation removes information about the number of authors that co-authored a paper. For example, one can only deduce from this edge set that \(v_3\) has co-authored with \(v_1\) and \(v_2\), but cannot conclude that the co-authorship was for the same paper. Furthermore, the hyperedge \(e_4\), which contains the single vertex \(v_4\), is left out of the graph representation.
A number of random hypergraph models have been studied in the probability and combinatorics literature, where theoretical properties are investigated (Karoński and Łuczak 2002; Goldschmidt 2005; de Panafieu 2015; Dyer et al. 2015; Poole 2015). A novel parametrization of distributions on hypergraphs based on the geometry of points is proposed in Lunagómez et al. (2017), which is used to infer Markov structure for multivariate distributions. On the other hand, statistical modeling of random hypergraph data is less developed. Stasi et al. (2014) introduced the hypergraph beta model with three variants, which is a natural extension of the beta model for random graphs (Holland and Leinhardt 1981). In their model, the probability of a hyperedge e appearing in the hypergraph is parameterized by a vector \( \beta \in \mathbf {R^{N}} \), which represents the “attractiveness” of each vertex. However, their model does not capture clustering among objects, which is a typical real-world phenomenon. In addition, the assumption of an upper bound on the size of hyperedges is inconsistent with the structure of many real-world data sets.
One may equivalently represent a hypergraph using a bipartite network (also called two-mode network and affiliation network). Two-mode networks consist of two different kinds of vertices and edges can only be observed between the two types of vertices, but not between vertices of the same type. A hypergraph can be represented as a two-mode network by considering the hyperedges as a second type of vertices. For example, an equivalent bipartite representation of the hypergraph shown in Fig. 1 is provided in Fig. 2 where the hyperedges \(\{ e_1, \ldots , e_4\}\) are now replaced by the four green vertices.
Two-mode networks have been studied in various disciplines including computer science (Perugini et al. 2004), social sciences (Faust et al. 2002; Koskinen and Edling 2012; Friel et al. 2016) and physics (Lind et al. 2005). A number of approaches have been proposed to analyze and model two-mode network data (Borgatti and Everett 1997; Doreian and Batagelj 2004; Latapy et al. 2008; Wang et al. 2009; Snijders et al. 2013; Aitkin et al. 2014). In particular, models originally developed for binary networks were extended for two-mode networks.
Doreian and Batagelj (2004) develops a blockmodeling approach for two-mode network data which aims to simultaneously partition the two types of vertices into blocks. Skvoretz and Faust (1999) proposes exponential random graph models (ERGMs) for two-mode networks, which model the logit of the probability of an actor belonging to an event as a function of actor and event specific effects and other graph statistics. A clustering algorithm for two-mode networks is developed in Field et al. (2006) based on the modelling framework in Skvoretz and Faust (1999). Several extensions to the ERGMs for bipartite networks have been proposed (e.g. Wang et al. 2009, 2013). Snijders et al. (2013) proposes a methodology for studying the co-evolution of two-mode and one-mode networks. A network autocorrelation model for two-mode networks is introduced in Fujimoto et al. (2011). Aitkin et al. (2014) evaluates the identification of clustering structure in bipartite networks through latent class analysis and introduces a new Bayesian method for choosing the number of latent classes.
Representing network observations using two-mode networks has the benefit of modelling vertices of both types jointly. However, in analyzing a two-mode network, one type of vertices may attract most of the interest. For example, in co-authorship networks, the main interest may lie in the collaborations among authors rather than in the co-authored papers. When modeling the co-appearance of characters in the scenes of a movie, one is typically interested in the co-appearance of the characters rather than in the movie scenes. In such scenarios, a hypergraph representation is most natural, converting one type of vertex into hyperedges.
A related and popular research problem is hypergraph partitioning (Zhou et al. 2007; Leordeanu and Sminchisescu 2012; Purkait et al. 2017). Hypergraph partitioning aims to partition vertices in a hypergraph into clusters based on their higher order interactions, and is an important research problem in computer vision (Agarwal et al. 2006; Li et al. 2013), recommender systems (Bu et al. 2010) and other fields. In contrast, we propose a random hypergraph model which captures the clustering structure of the hyperedges. Since hyperedges are simply arbitrary sets of vertices, interpretable structure within the vertices can also be inferred from the clustering structure of the hyperedges. By adopting a probabilistic approach to hypergraph modeling, the proposed model is capable of quantifying the uncertainties in the clustering of hyperedges.
In this paper, we propose the Extended Latent Class Analysis (ELCA) model for random hypergraphs, which is a natural extension of the Latent Class Analysis (LCA) model (Lazarsfeld and Henry 1968; Goodman 1974; Celeux and Govaert 1991) and includes the LCA model as a special case. The ELCA can alternatively be interpreted as a constrained case of the LCA and it achieves significant reduction in model complexity. Furthermore, the model directly captures the variation in sizes of hyperedges which are typically observed in applications. For example, the number of authors per scientific publication varies widely across different disciplines. We develop an EM (Expectation Maximization) algorithm with MM (Minorization Maximization) steps to perform parameter estimation. To determine the number of latent classes, we employ the Bayesian Information Criterion (BIC). The model is applied to simulated data, and two applications: Star Wars movie scenes and Reuters news articles.
2 Model and motivation
2.1 Hypergraph
A hypergraph is represented by a pair \(H = (V,E)\), where \(V = \{v_{1},v_{2},\ldots ,v_{N}\}\) is the set of N vertices and \(E = \{e_{1}, e_{2}, \ldots , e_{M}\}\) is the set of M hyperedges. A hyperedge e is a subset of V, and we allow repetitions in the hyperedge set E. Thus, the hypergraph H can alternatively be represented with an \( N \times M \) matrix \( {\mathbf {X}} = (x_{ij})\) where \( x_{ij} = 1 \) if vertex \(v_{i}\) appears in hyperedge \(e_{j}\) and \( x_{ij}=0\) otherwise.
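To make the incidence matrix representation concrete, the following minimal Python sketch builds \({\mathbf {X}}\) from a list of hyperedges; the function and variable names are illustrative and not taken from the paper. The toy hyperedge list reproduces the coauthorship hypergraph of Fig. 1 with vertices \(v_1, \ldots, v_7\) mapped to indices 0 to 6.

```python
import numpy as np

def incidence_matrix(hyperedges, n_vertices):
    """Build the N x M binary incidence matrix of a hypergraph.

    hyperedges: list of M sets/lists of 0-based vertex indices.
    Repeated hyperedges are allowed and simply give identical columns.
    """
    X = np.zeros((n_vertices, len(hyperedges)), dtype=int)
    for j, e in enumerate(hyperedges):
        for i in e:
            X[i, j] = 1
    return X

# Toy coauthorship hypergraph of Fig. 1 (papers e1..e4, authors v1..v7).
edges = [{0, 1, 2, 3}, {1, 2}, {2, 4, 5}, {3}]
X = incidence_matrix(edges, n_vertices=7)
print(X.shape)  # (7, 4)
```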
2.2 Latent class analysis model for random hypergraphs
The binary latent class analysis (LCA) model (Lazarsfeld and Henry 1968; Goodman 1974) is a commonly used mixture model for high dimensional binary data. It assumes that each observation is a member of one and only one of the G latent classes, and conditional on the latent class membership, the manifest variables are mutually independent of each other. The LCA model appears to be a natural candidate to model random hypergraphs where hyperedges are partitioned into G latent classes, and the probability that a hyperedge \(e \in E\) contains a vertex \(v \in V\) depends only on its latent class assignment.
Let \({{\mathbf {X}}} = (x_{ij})\) be the matrix representation of the hypergraph H where \(x_{ij} = 1\) if vertex \(v_i\) is in hyperedge \(e_j\) and \(x_{ij}=0\) otherwise. Let \( \pi = (\pi _{1}, \ldots , \pi _{G} ) \) be the a priori latent class membership probabilities, where \(\pi _g\) is the probability that a hyperedge belongs to latent class g. We define the \(N \times G\) matrix \({\mathbf {P}}\), where \(p_{ig}\) is the probability that vertex \(v_{i}\) is contained in a hyperedge e with latent class membership g. The probability of the observed hyperedge \(e_j\), which is represented by \((x_{1j}, \ldots , x_{Nj})\), is thus
$$\begin{aligned} P(x_{1j}, \ldots , x_{Nj}) = \sum _{g=1}^{G} \pi _{g} \prod _{i=1}^{N} p_{ig}^{x_{ij}} (1-p_{ig})^{1-x_{ij}} . \end{aligned}$$
Thus, the likelihood function of \({\mathbf {P}}\) and \(\pi \) can be written as
$$\begin{aligned} L({\mathbf {P}}, \pi ) = \prod _{j=1}^{M} \sum _{g=1}^{G} \pi _{g} \prod _{i=1}^{N} p_{ig}^{x_{ij}} (1-p_{ig})^{1-x_{ij}} . \end{aligned}$$
Let \({\mathbf {Z}}^{(1)}\) be an \(M \times G\) latent class membership matrix, where \(z^{(1)}_{jg} = 1\) if hyperedge \(e_{j}\) has latent class label g and \(z^{(1)}_{jg}=0\) otherwise. The complete-data likelihood of \({\mathbf {P}}\) and \(\pi \) can be expressed as
$$\begin{aligned} L_{c}({\mathbf {P}}, \pi ) = \prod _{j=1}^{M} \prod _{g=1}^{G} \Big [ \pi _{g} \prod _{i=1}^{N} p_{ig}^{x_{ij}} (1-p_{ig})^{1-x_{ij}} \Big ]^{z^{(1)}_{jg}} . \end{aligned}$$ (1)
In comparison to the hypergraph beta models introduced in Stasi et al. (2014), the LCA model is capable of capturing the clustering and heterogeneity of hyperedges. For example, academic papers can be naturally labelled according to subject areas, and conditional on a paper being labelled mathematics, one would expect the probability that a mathematician co-authored the paper to be higher than that for a biologist. The LCA model does not assume an upper bound on the size of hyperedges and can model hyperedges of any size. Furthermore, an expectation maximization algorithm (Dempster et al. 1977) can be easily derived to perform parameter estimation.
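As an illustration of the generative process just described, the sketch below simulates hyperedges from a binary LCA model. It is a minimal illustration with made-up parameter values, not code from the paper.

```python
import numpy as np

def simulate_lca(M, pi, P, rng):
    """Simulate M hyperedges from a binary LCA model.

    pi: (G,) latent class probabilities; P: (N, G) class-conditional
    probabilities p_ig that vertex i appears in a class-g hyperedge.
    Returns the N x M incidence matrix and the latent class labels.
    """
    N, G = P.shape
    z = rng.choice(G, size=M, p=pi)                 # class label per hyperedge
    X = (rng.random((N, M)) < P[:, z]).astype(int)  # conditionally independent inclusions
    return X, z

rng = np.random.default_rng(1)
pi = np.array([0.6, 0.4])
P = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.6]])  # N=4 vertices, G=2 classes
X, z = simulate_lca(200, pi, P, rng)
```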
2.3 Extended latent class analysis for random hypergraphs
While the LCA model captures the clustering and heterogeneity of hyperedges in real world data sets, a large number of latent classes are typically required to achieve a good fit to the data. As a result, the number of parameters grows quickly with a moderate or large number of nodes. The complexity of the LCA model can be substantially reduced if we assume that some of the latent class conditional probabilities \((p_{ig})_{i=1}^{N}\) tend to be proportional to each other for different values of g. While assuming proportionality of latent class conditional probabilities may appear rather restrictive, it is a reasonable assumption in many hypergraph applications. We develop the Extended Latent Class Analysis (ELCA) model which builds on this proportionality assumption on the conditional probabilities.
Let \(a=(a_{1}, \ldots , a_{K})\) with \(0 \le a_k \le 1\) be a K-dimensional vector. The ELCA model assumes that the latent class conditional probabilities are of the form \( (\phi _{ig} a_k)_{i=1}^{N}\) for \(g=1, \ldots , G\) and \(k=1, \ldots , K\). In the context of hypergraph applications, the \(a_k\) parameters capture the variation in the size (number of vertices) of the hyperedges, whereas the \(\phi _{ig}\) values capture the probability that a node is contained in a hyperedge. The ELCA model can be considered as having two types of clustering structure, with the primary clustering structure defined by the \(\phi _{ig}\) parameters and an additional clustering structure captured by the \(a_k\) parameters. We note that the ELCA reduces to the standard LCA when \(K=1\).
Let \( \tau =(\tau _{1},\ldots ,\tau _{K})\) be the clustering assignment probabilities corresponding to the additional structure. The ELCA model assumes that the two clustering structures are a priori independent. Thus, the probability that a hyperedge has primary cluster label g and additional cluster label k is \(\pi _g \tau _k\), the probability that the vertex \(v_i\) is contained in a hyperedge from the cluster pair (g, k) is \(a_k \phi _{ig}\), and the probability that the vertex \(v_i\) is contained in a hyperedge from the primary cluster g is \(\phi _{ig} \sum _{k=1}^{K} a_k \tau _k \).
Under the ELCA model with G primary clusters and K additional clusters, the probability of observing a hyperedge \((x_{1j}, \ldots , x_{Nj})\) is given by
$$\begin{aligned} P(x_{1j}, \ldots , x_{Nj}) = \sum _{g=1}^{G} \sum _{k=1}^{K} \pi _{g} \tau _{k} \prod _{i=1}^{N} ( a_{k} \phi _{ig} )^{x_{ij}} ( 1 - a_{k} \phi _{ig} )^{1-x_{ij}} . \end{aligned}$$
Let \(\theta = (\pi , \tau , \phi , a)\) denote the model parameters. The likelihood function can then be written as
$$\begin{aligned} L(\theta ) = \prod _{j=1}^{M} \sum _{g=1}^{G} \sum _{k=1}^{K} \pi _{g} \tau _{k} \prod _{i=1}^{N} ( a_{k} \phi _{ig} )^{x_{ij}} ( 1 - a_{k} \phi _{ig} )^{1-x_{ij}} . \end{aligned}$$
The ELCA model is not identifiable if the parameters \((a_k)_{k=1}^{K}\) are not constrained. To see this, if \(0< a_k < 1\) for all k, then the likelihood function is invariant under the transformation \((a_k)_{k=1}^{K} \rightarrow (C a_k)_{k=1}^{K} \) and \((\phi _{ig})_{i=1,g=1}^{i=N,g=G} \rightarrow (C^{-1} \phi _{ig})_{i=1, g=1}^{i=N, g=G}\), where C is some positive constant such that \(\max _{k} \{C a_k \} \le 1\). Thus, to ensure the identifiability of the model, \((a_k)_{k=1}^{K}\) are ranked in increasing order with \(a_K=1\).
We define the \(M \times K\) additional cluster membership matrix \({\mathbf {Z}}^{(2)} = (z_{jk}^{(2)})\), where \(z_{jk}^{(2)} = 1\) if hyperedge \(e_{j}\) has additional cluster label k and \(z_{jk}^{(2)}=0\) otherwise. The complete data likelihood function of \({\mathbf {X}}\), \({\mathbf {Z}}^{(1)}\) and \({\mathbf {Z}}^{(2)}\) is given as
$$\begin{aligned} L_{c}(\theta ) = \prod _{j=1}^{M} \prod _{g=1}^{G} \prod _{k=1}^{K} \Big [ \pi _{g} \tau _{k} \prod _{i=1}^{N} ( a_{k} \phi _{ig} )^{x_{ij}} ( 1 - a_{k} \phi _{ig} )^{1-x_{ij}} \Big ]^{z^{(1)}_{jg} z^{(2)}_{jk}} . \end{aligned}$$ (2)
We note that any ELCA with G primary clusters and K additional clusters can be equivalently represented as a standard LCA with \(G \times K\) clusters. Under the standard LCA representation of the ELCA model, the \(G \times K\) vectors of latent class conditional probabilities \(\big \{ (p_{ig})_{i=1}^{N} \big \}_{g=1}^{G \times K}\) can be partitioned into G sets of equal size K, and \((p_{ig})_{i=1}^{N}\) are proportional to each other within each set with the constants of proportionality determined by \((a_k)_{k=1}^{K}\). Consider the ELCA with 2 primary clusters and 2 additional clusters, which is a special case of the 4-cluster LCA model. Since \(a_2 = 1\), the probabilities that vertex \(v_i\) is contained in a hyperedge from the cluster pairs (1, 1), (1, 2), (2, 1), (2, 2) are given by \(a_1 \phi _{i1}, \phi _{i1}, a_1 \phi _{i2}, \phi _{i2}\), respectively.
It is easy to see that under the proportionality assumption, the ELCA model achieves significant reduction in the number of parameters. For the ELCA model with G primary clusters and K additional clusters, the number of parameters is given by \(GN + 2(K-1) + (G-1)\) whereas the number of parameters for the LCA with \(G \times K\) clusters is \(G K N + (G K - 1) \).
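The two-layer structure and the parameter saving can be made concrete with a small sketch. The simulation below uses illustrative parameter values (with \(a_K = 1\) as required by the identifiability constraint); the function names are my own and not from the paper.

```python
import numpy as np

def simulate_elca(M, pi, tau, a, phi, rng):
    """Simulate M hyperedges from the ELCA model.

    pi: (G,) primary cluster probabilities; tau: (K,) additional cluster
    probabilities; a: (K,) size factors with a[-1] = 1; phi: (N, G).
    Returns the N x M incidence matrix and both sets of labels.
    """
    N, G = phi.shape
    g = rng.choice(G, size=M, p=pi)            # primary labels
    k = rng.choice(len(tau), size=M, p=tau)    # additional labels
    probs = phi[:, g] * a[k]                   # (N, M) inclusion probabilities
    X = (rng.random((N, M)) < probs).astype(int)
    return X, g, k

def n_params(N, G, K):
    """Parameter counts: ELCA versus the equivalent (G*K)-cluster LCA."""
    elca = G * N + 2 * (K - 1) + (G - 1)
    lca_equiv = G * K * N + (G * K - 1)
    return elca, lca_equiv

rng = np.random.default_rng(2)
pi, tau = np.array([0.5, 0.5]), np.array([0.8, 0.2])
a = np.array([0.3, 1.0])                       # a_K = 1 for identifiability
phi = rng.uniform(0.1, 0.9, size=(10, 2))
X, g, k = simulate_elca(500, pi, tau, a, phi, rng)
print(n_params(N=10, G=2, K=2))                # (23, 43)
```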
2.4 Theoretical properties
We analyze the distribution of the size of a random hyperedge under the proposed ELCA model. Proposition 1 below shows that the size of the hyperedges simulated from the ELCA model tend to have larger variance than those simulated from the LCA model.
Proposition 1
Suppose we are given the LCA model with parameters \( \{ \pi , p \}\) and the ELCA model with parameters \( \{ \pi , \tau , a, \phi \}\) and N vertices. Suppose the condition \(p_{ig} = \phi _{ig} \sum _{k=1}^{K} a_{k} \tau _{k} \) holds for \(i=1,\ldots ,N\) and \(g=1,\ldots , G\). This condition ensures that the latent class conditional probabilities of the primary clustering structure are the same for both models.
Let A denote the cardinality \(|e_1|\) of a random hyperedge \(e_1\) generated under the LCA model. Similarly, let B denote the cardinality \(|e_2|\) of a random hyperedge \(e_2\) generated under the ELCA model. We have the following results: \(E(A) = E(B)\) and \(\mathrm {Var}(A) \le \mathrm {Var}(B)\).
Proof
The proof is straightforward and is given in the Appendix. \(\square \)
We now let \(f_{N}(y)\) be the probability mass function of the size of a random hyperedge simulated from a G cluster LCA model. Similarly, we let \(h_{N}(y)\) be the probability mass function of the size of a random hyperedge simulated from the ELCA model with G primary clusters and K additional clusters. The following result can be derived.
Proposition 2
1. Under the specifications of an LCA model with parameters \(\pi =(\pi _{1},\ldots ,\pi _{G})\) and \( \{ p_{ig} \}_{i=1,\ldots ,N, g=1,\ldots ,G} \), suppose the following conditions hold for \(g=1,\ldots ,G\),
$$\begin{aligned} \lambda _{N}^{(g)} = \sum _{i=1}^{N} p_{ig} \rightarrow \lambda ^{(g)} > 0 , \qquad \sum _{i=1}^{N} p_{ig}^{2} \rightarrow 0 , \end{aligned}$$
as \(N \rightarrow \infty \). Then
$$\begin{aligned} f_{N}(y) \rightarrow \sum _{g=1}^{G} \pi _{g} \frac{ e^{ -\lambda ^{(g)} } (\lambda ^{(g)})^{y} }{y!} , \end{aligned}$$
that is, the distribution of the size of a random hyperedge converges to a mixture of Poisson distributions with G components.
2. Under the specification of an ELCA model with parameters \(\pi =(\pi _{1},\ldots ,\pi _{G})\), \( \tau =( \tau _{1}, \ldots , \tau _{K})\), \(a=(a_{1}, \ldots , a_{K})\), and \( \{ \phi _{ig} \}_{i=1,\ldots ,N, g=1,\ldots ,G} \), suppose the following conditions hold for \(g=1,\ldots ,G\) and \(k=1, \ldots , K\),
$$\begin{aligned} \lambda _{N}^{(g,k)} = \sum _{i=1}^{N} \phi _{ig} a_{k} \rightarrow \lambda ^{(g,k)} > 0 , \qquad \sum _{i=1}^{N} \phi _{ig}^{2} a_{k}^{2} \rightarrow 0 , \end{aligned}$$
as \(N \rightarrow \infty \). Then
$$\begin{aligned} h_{N}(y) \rightarrow \sum _{g=1}^{G} \sum _{k=1}^{K} \pi _{g} \tau _{k} \frac{ e^{ - \lambda ^{(g,k)} } ( \lambda ^{(g, k)})^{y} }{y!} , \end{aligned}$$
that is, the distribution of the size of a random hyperedge converges to a mixture of Poisson distributions with \(G \times K\) components.
Proof
Conditional on the event that a random hyperedge is generated from cluster g, (Wang 1993, Theorem 3) implies that the distribution of its size converges to a Poisson distribution with mean \(\lambda ^{(g)}\).
The result in part 1 follows by marginalizing over the G clusters. The second part of the proposition can be proved similarly. \(\square \)
Proposition 2 implies that, under mild conditions, the distribution of the size of hyperedges converges to a mixture of Poisson distributions with \(G \times K\) mixture components as the number of vertices increases. We note that the mixture components of the limiting Poisson mixture are subject to the same proportionality condition. Nevertheless, larger variations in the size of hyperedges tend to be obtained under the ELCA compared to those obtained under the standard LCA.
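A quick way to see Proposition 2 in action is to simulate hyperedge sizes and compare their empirical distribution with the limiting Poisson mixture. The sketch below is a rough numerical check only: it reuses the illustrative `simulate_elca` helper from the earlier sketch, uses scipy for the Poisson pmf, and the parameter values are made up so that the conditions of the proposition approximately hold.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)

N, M, G, K = 300, 10000, 2, 2
pi, tau = np.array([0.5, 0.5]), np.array([0.7, 0.3])
a = np.array([0.4, 1.0])
phi = rng.uniform(0.0, 0.02, size=(N, G))      # small probabilities so sum(phi^2 a^2) is small

X, g, k = simulate_elca(M, pi, tau, a, phi, rng)
sizes = X.sum(axis=0)                          # observed hyperedge sizes

lam = phi.sum(axis=0)[:, None] * a[None, :]    # (G, K) matrix of lambda^{(g,k)}
w = pi[:, None] * tau[None, :]                 # mixture weights pi_g * tau_k

y = np.arange(0, 20)
mix_pmf = sum(w[gg, kk] * poisson.pmf(y, lam[gg, kk])
              for gg in range(G) for kk in range(K))
emp_pmf = np.bincount(sizes, minlength=len(y))[:len(y)] / M
print(np.round(np.abs(mix_pmf - emp_pmf).max(), 3))  # should be small
```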
3 Estimation and model selection
3.1 EM algorithm
We estimate the parameters \(\theta = (\pi , \tau , \phi , a) \) of the ELCA model using an EM algorithm (Dempster et al. 1977) which is a popular method in fitting mixture models. The E-step of the EM algorithm involves computing the expected value of the complete data log-likelihood (2) with respect to the distribution of the unobserved \({\mathbf {Z}}^{(1)}\) and \({\mathbf {Z}}^{(2)}\) given the current estimates. The M-step involves maximizing the expected complete data log-likelihood.
Taking the logarithm of the complete data likelihood in (2), we obtain the complete data log-likelihood function below:
$$\begin{aligned} \log L_{c}(\theta ) = \sum _{j=1}^{M} \sum _{g=1}^{G} \sum _{k=1}^{K} z^{(1)}_{jg} z^{(2)}_{jk} \Big [ \log \pi _{g} + \log \tau _{k} + \sum _{i=1}^{N} \big ( x_{ij} \log ( a_{k} \phi _{ig} ) + (1-x_{ij}) \log ( 1 - a_{k} \phi _{ig} ) \big ) \Big ] . \end{aligned}$$ (3)
3.1.1 E-step
For the E-step, we need to evaluate the expected complete data log-likelihood, which is the expectation of (3) conditional on the data \({\mathbf {X}}\) and the current parameter estimates \(\theta ^{(t)}\). The expected complete data log-likelihood is denoted as \(Q(\theta |\theta ^{(t)})\) and is defined as
$$\begin{aligned} Q(\theta |\theta ^{(t)}) = E \big [ \log L_{c}(\theta ) \mid {\mathbf {X}}, \theta ^{(t)} \big ] . \end{aligned}$$
Because the complete-data log-likelihood is linear in \(Z_{jg}^{(1)}Z_{jk}^{(2)}\), we only need to evaluate the expectation \( \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} := E(Z^{(1)}_{jg} Z^{(2)}_{jk}|{\mathbf {X}}, \theta ^{(t)})\). We have that
$$\begin{aligned} \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} = \frac{ \pi ^{(t)}_{g} \tau ^{(t)}_{k} \prod _{i=1}^{N} ( a^{(t)}_{k} \phi ^{(t)}_{ig} )^{x_{ij}} ( 1 - a^{(t)}_{k} \phi ^{(t)}_{ig} )^{1-x_{ij}} }{ \sum _{g'=1}^{G} \sum _{k'=1}^{K} \pi ^{(t)}_{g'} \tau ^{(t)}_{k'} \prod _{i=1}^{N} ( a^{(t)}_{k'} \phi ^{(t)}_{ig'} )^{x_{ij}} ( 1 - a^{(t)}_{k'} \phi ^{(t)}_{ig'} )^{1-x_{ij}} } . \end{aligned}$$
In particular, the E-step has a computational complexity of \({{\mathcal {O}}}(N)\) for each pair (g, k), and an overall complexity of \({{\mathcal {O}}}(NGK)\).
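A direct vectorized implementation of these responsibilities might look like the following. This is a sketch under the parameterization above, not the authors' code; the computation is done in log space for numerical stability and the function name is my own.

```python
import numpy as np

def e_step(X, pi, tau, a, phi):
    """Compute responsibilities r[j, g, k] = E[Z1_jg * Z2_jk | X, theta].

    X: (N, M) incidence matrix; pi: (G,); tau: (K,); a: (K,); phi: (N, G).
    """
    p = np.clip(phi[:, :, None] * a[None, None, :], 1e-12, 1 - 1e-12)  # (N, G, K)
    logp, log1m = np.log(p), np.log1p(-p)
    # Log-likelihood of each hyperedge under each (g, k) pair: shape (M, G, K).
    ll = np.einsum('im,igk->mgk', X, logp) + np.einsum('im,igk->mgk', 1 - X, log1m)
    ll += np.log(pi)[None, :, None] + np.log(tau)[None, None, :]
    ll -= ll.max(axis=(1, 2), keepdims=True)        # stabilize before exponentiating
    r = np.exp(ll)
    r /= r.sum(axis=(1, 2), keepdims=True)          # normalize over (g, k)
    return r
```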
3.1.2 M-step
While the E-step of the EM algorithm is straightforward, the M-step involves complicated maximization. For the M-step, we need to maximize \(Q(\theta |\theta ^{(t)})\) with respect to the model parameters \(\{ \phi _{ig} \}\), \(\{ a_{k} \}\), \(\{ \pi _{g} \}\) and \(\{ \tau _{k} \}\). Thus, we use the ECM algorithm (Meng and Rubin 1993), which replaces the complex M-step by a series of simpler conditional maximizations. The conditional maximizations with respect to the parameters \(\phi \) and a do not have closed form solutions. We utilize the MM algorithm (Lange et al. 2000; Hunter and Lange 2004), which works by lower bounding the objective function by a minorizing function and then maximizing the minorizing function. Since the M-step involves a series of conditional maximizations, the Q function is guaranteed to increase (Meng and Rubin 1993, Theorem 1).
Maximize w.r.t. \(\phi _{ig}\)
For fixed i and g, the objective function retaining terms involving \(\phi _{ig}\) can be written as
$$\begin{aligned} Q_{ig}(\phi _{ig}) = \sum _{j=1}^{M} \sum _{k=1}^{K} \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} \Big [ x_{ij} \log ( a_{k} \phi _{ig} ) + (1-x_{ij}) \log ( 1 - a_{k} \phi _{ig} ) \Big ] . \end{aligned}$$ (6)
An analytic expression for \( {{\,\mathrm{arg\,max}\,}}_{\phi _{ig}}\{Q\}\) does not exist due to the \( \log (1-a_{k}\phi _{ig})\) term and thus we apply the MM (Minorization Maximization) algorithm (Hunter and Lange 2004). We first apply a quadratic lower bound on the concave function \(\log (1-a_{k}\phi _{ig})\) for \(k < K\):
Hence, the objective function in (6) up to an additive constant can be minorized by \(Q_{lower}\):
To simplify the expression above, we define the quantities below:
Now, the lower bound (7) can be written as below.
Taking derivative with respect to \(\phi _{ig}\), we have
Let \(C=B_{1} - 2B_{2} \phi ^{(t)}_{ig}\), we have
Solving the cubic equation above results in the update for \(\phi _{ig}\).
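The cubic root derived above is specific to the paper's minorization; as a hedged alternative, the same conditional update can be carried out numerically. The sketch below maximizes the conditional objective (6) in \(\phi_{ig}\) with all other parameters held fixed, using scipy's bounded scalar optimizer. The function name and the choice of a numerical optimizer are my own, not the authors'.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def update_phi_ig(x_i, r_g, a):
    """Numerically maximize the conditional objective for phi_ig.

    x_i: (M,) row of X for vertex i; r_g: (M, K) responsibilities
    r[:, g, :] from the E-step; a: (K,) size parameters.
    Returns the updated phi_ig in (0, 1).
    """
    def neg_q(phi_ig):
        p = np.clip(a * phi_ig, 1e-12, 1 - 1e-12)                     # (K,)
        term = x_i[:, None] * np.log(p) + (1 - x_i[:, None]) * np.log1p(-p)
        return -np.sum(r_g * term)                                     # sum over j and k
    res = minimize_scalar(neg_q, bounds=(1e-6, 1 - 1e-6), method='bounded')
    return res.x
```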
Maximize w.r.t. \(a_{k}\)
For a fixed k, the objective function (3) retaining terms involving \(a_{k}\) can be expressed as
$$\begin{aligned} Q_{k}(a_{k}) = \sum _{j=1}^{M} \sum _{g=1}^{G} \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} \sum _{i=1}^{N} \Big [ x_{ij} \log ( a_{k} \phi _{ig} ) + (1-x_{ij}) \log ( 1 - a_{k} \phi _{ig} ) \Big ] . \end{aligned}$$ (9)
Since an analytic expression for \( {{\,\mathrm{arg\,max}\,}}_{a_{k}}\{Q\}\) does not exist due to the \( \log (1-a_{k}\phi _{ig})\) term, we apply the MM (Minorization Maximization) algorithm. We first apply a quadratic lower bound on the concave function \(\log (1-a_{k}\phi _{ig})\):
Hence, (9) up to an additive constant can be minorized by the function:
To simplify the expression above, we define the following quantities:
Taking derivative of (9) with respect to \(a_{k}\), we have
Let \(D = (\frac{B}{2C} - a_{k}^{(t)})\), \(E=-\frac{A}{2C}\), we have
Maximize w.r.t. \(\pi _{g}\) and \(\tau _{k}\)
We apply the method of Lagrange multipliers to derive the updates for \(\pi _g\) and \(\tau _k\). The objective function for \((\pi _h)_{h=1}^{G}\) is given by
$$\begin{aligned} \sum _{j=1}^{M} \sum _{g=1}^{G} \sum _{k=1}^{K} \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} \log \pi _{g} + \lambda \Big ( \sum _{g=1}^{G} \pi _{g} - 1 \Big ) , \end{aligned}$$
where \(\lambda \) is the Lagrange multiplier. Differentiating w.r.t. \(\pi _g\) and setting to 0 gives
$$\begin{aligned} \sum _{j=1}^{M} \sum _{k=1}^{K} \frac{ \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} }{\pi _{g}} + \lambda = 0 . \end{aligned}$$
Therefore, the update for \(\pi _g\) is given by
$$\begin{aligned} \pi ^{(t+1)}_{g} = \frac{1}{M} \sum _{j=1}^{M} \sum _{k=1}^{K} \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} . \end{aligned}$$
The update for \(\tau _k\) can be derived analogously and is given below:
$$\begin{aligned} \tau ^{(t+1)}_{k} = \frac{1}{M} \sum _{j=1}^{M} \sum _{g=1}^{G} \widehat{Z^{(1)}_{jg} Z^{(2)}_{jk}} . \end{aligned}$$
The EM algorithm is summarized in Algorithm 1, where line 4 corresponds to the expectation step and lines 5–18 are the conditional maximization steps. In particular, we note that the computational complexities for maximizing \( \phi _{ig} \) and \( a_{k} \) are \({{\mathcal {O}}}(N_{iter}MK)\) and \({{\mathcal {O}}}(N_{iter}MGN)\), respectively, where \(N_{iter}\) is the number of iterations required for the MM algorithm.
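Putting the pieces together, one iteration of the ECM scheme might be organized as below. This is only a sketch: it reuses the illustrative `e_step` and `update_phi_ig` helpers defined above, replaces the paper's closed-form MM updates with bounded numerical maximization, and omits initialization and convergence monitoring.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ecm_iteration(X, pi, tau, a, phi):
    """One ECM iteration for the ELCA model (illustrative sketch)."""
    N, M = X.shape
    G, K = len(pi), len(tau)
    r = e_step(X, pi, tau, a, phi)                    # (M, G, K) responsibilities

    # Closed-form CM-steps for pi and tau.
    pi_new = r.sum(axis=(0, 2)) / M
    tau_new = r.sum(axis=(0, 1)) / M

    # CM-step for phi, one coordinate (i, g) at a time.
    phi_new = phi.copy()
    for i in range(N):
        for g in range(G):
            phi_new[i, g] = update_phi_ig(X[i], r[:, g, :], a)

    # CM-step for a_1, ..., a_{K-1}; a_K stays fixed at 1 for identifiability.
    a_new = a.copy()
    for k in range(K - 1):
        def neg_q(ak, k=k):
            p = np.clip(phi_new * ak, 1e-12, 1 - 1e-12)               # (N, G)
            logp, log1m = np.log(p), np.log1p(-p)
            val = np.einsum('im,mg,ig->', X, r[:, :, k], logp) \
                + np.einsum('im,mg,ig->', 1 - X, r[:, :, k], log1m)
            return -val
        a_new[k] = minimize_scalar(neg_q, bounds=(1e-6, 1.0), method='bounded').x
    return pi_new, tau_new, a_new, phi_new
```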
3.2 Model selection
We use the Bayesian Information Criterion (BIC) (Schwarz 1978) to determine the optimal number of primary and additional clusters for the ELCA model. For the ELCA model, the BIC takes the following form:
$$\begin{aligned} \mathrm {BIC} = -2 \log L + \big ( G N + 2(K-1) + (G-1) \big ) \log M , \end{aligned}$$
where \(\log L\) is the log-likelihood evaluated at the estimated parameters, and \(G N + 2(K-1) + (G-1)\) is the number of parameters in the model. The model with the lowest BIC value is selected. The accuracy of the BIC as a model selection criterion requires M to be relatively large compared to N. For the standard latent class models, existing literature suggests that the BIC is a good indicator of the true number of classes (Collins et al. 1993) and extensive simulation studies were performed in Nylund et al. (2007) to validate this claim. The performance of BIC as a model selection criterion for the ELCA model is assessed using simulation studies in Sect. 4.
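Assuming the form above, with the number of hyperedges M playing the role of the sample size, the criterion is a one-liner. In the sketch below `log_lik` stands for the maximized log-likelihood returned by a fit, and the commented model-selection loop is purely illustrative.

```python
import numpy as np

def elca_bic(log_lik, N, M, G, K):
    """BIC for an ELCA fit with G primary and K additional clusters."""
    n_par = G * N + 2 * (K - 1) + (G - 1)
    return -2.0 * log_lik + n_par * np.log(M)

# Model selection sketch: fit a grid of (G, K) values and keep the smallest BIC,
# e.g. scores = {(G, K): elca_bic(fit(X, G, K), N, M, G, K) for G in Gs for K in Ks}.
```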
4 Simulation studies
We conduct simulation studies to examine the performance of the proposed EM algorithm for the ELCA model and the behavior of BIC as a model selection criterion. The results presented in Tables 1 and 2 are concerned with assessing the convergence behavior of the proposed EM algorithm with various latent class assignment probabilities for primary and additional clusters. Hyperedges are simulated from the ELCA model with two primary clusters and two additional clusters in Table 1 and from the ELCA model with three primary clusters and two additional clusters in Table 2. The specific model parameters used in the simulation are given in the Appendix.
For the model parameters \(\phi \), a, \(\pi \) and \(\tau \) of the ELCA model, the \(\ell _2\) distances between the true parameters and the estimated ones are presented in Tables 1 and 2. The misclassification rates for both the primary and additional clusters are also presented. We observe that the estimated parameters converge to the true values as the number of hyperedges increases. It is worth noting that the convergence tends to be faster in the case of two primary clusters compared to three primary clusters.
We examine the performance of BIC in choosing the optimal number of primary and additional clusters. The values in Tables 3 and 4 are computed by comparing the BIC across a range of models and identifying the model at which the lowest value occurs. The model parameters which generate the hyperedges are given in the Appendix. For example, with 10 vertices and 200 hyperedges, the lowest value of BIC occurred at the two primary and two additional cluster model (which is the true model) 67% of the time. Looking across the values in Tables 3 and 4, we notice that the BIC tends to be a less accurate model selection criterion when the number of hyperedges is small but improves significantly as the number of hyperedges M increases.
As a final simulation study, we simulate hyperedges from the LCA models with two and three clusters and note that they are special cases of the ELCA models with \(K=1\) additional cluster. The simulated data are then fitted with the ELCA models with \(K=2\) and \(K=3\) additional clusters. For various simulation settings, we simulate 100 sets of hyperedges and examine the proportion of times that the true model can be recovered. The true model is considered to be recovered if the estimated parameters satisfy \(\max \{\tau _k\} > 1 - \epsilon \) or \(\min \{a_k\} > 1 -\epsilon \) for some small positive number \(\epsilon \). Simulation results are shown in Table 5 with \(\epsilon \) set to 0.01 and 0.05. We see that using the less strict threshold \(\epsilon = 0.05\), the true model is recovered the majority of times across all simulation settings. We also observe that as the number of nodes N increases, the proportion of times that the true model is recovered increases considerably. On the other hand, there is no clear relationship between the number of hyperedges M and the proportion of successful recovery of the true model.
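The recovery criterion is easy to state in code. The small sketch below is illustrative only; `tau_hat` and `a_hat` denote the estimates from a fitted ELCA model.

```python
import numpy as np

def lca_recovered(tau_hat, a_hat, eps=0.05):
    """True if the fitted ELCA effectively collapses to an LCA:
    one additional cluster dominates, or all size factors are near 1."""
    return (np.max(tau_hat) > 1 - eps) or (np.min(a_hat) > 1 - eps)

print(lca_recovered(np.array([0.97, 0.03]), np.array([0.52, 1.0])))  # True
```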
5 Applications
5.1 Star Wars Movie Scenes
Our first application is modeling co-appearance of the main characters in the scenes of the movie “Star Wars: A New Hope”. We collected the script of the movie from the Internet Movie Script Database (see Notes) and constructed a hypergraph for the eight main characters, so that each character is a vertex in the hypergraph. Each scene in the movie defines a hyperedge, giving a total of 178 hyperedges, and a character is contained in a scene if he/she speaks in that scene.
We determine the optimal number of clusters and additional clusters using BIC where the results are provided in Table 6. The ELCA model with 3 clusters and 2 additional clusters has the lowest BIC value and is selected. It is worth noting that the standard LCA with 3 clusters is also competitive based on the BIC.
The results from fitting the ELCA model with \(G=3\) and \(K=2\) are provided in Tables 7 and 8. We can see the variation in the size of hyperedges from the parameter estimates \({\hat{a}}\) and \({\hat{\tau }}\) with the majority (81%) of hyperedges having size much smaller than the rest of the hyperedges. Thus, one can deduce that a small proportion of the movie scenes have far more characters.
The estimates \({\hat{\phi }}\) in Table 8 reveal interesting clustering structure for the 8 main characters in the movie. For example, the lead character “Luke” has a strong tendency to appear in the two largest clusters. On the other hand, it is extremely unlikely for “Obi-Wan” and “Han” to appear in the same scene.
The estimated primary cluster assignment probabilities from the EM algorithm for each movie scene in the Star Wars movie are shown in chronological order in Fig. 3. We can see from the plot that scenes in the early part of the movie are mainly associated with cluster 1, while cluster 2 contains most of the scenes from roughly scene 40 to scene 100. We can deduce from this, for example, that the character “Han” is very active in the middle part of the movie. On the other hand, there does not appear to be any obvious pattern for the third cluster. The clustering for many early and late movie scenes is relatively uncertain, as shown in the plot.
The uncertainties in primary clustering are also illustrated in a ternary plot in Fig. 4. Each dot in the plot represents a movie scene, and the three corners of the plot represent the three clusters. The closer a dot is to a corner, the higher the probability that the corresponding movie scene belongs to the corresponding cluster. The ternary plot in Fig. 4 shows significant uncertainties in clustering a number of movie scenes into the first two clusters. This is reasonable since for a number of characters, including the lead character “Luke”, the probabilities of scene appearance are similar for the first two clusters.
The estimated additional cluster assignment probabilities for each movie scene in the Star Wars movie are shown in chronological order in Fig. 5. We observe that the majority of the scenes are assigned to additional cluster 1, with only a small number of scenes between scenes 40 and 100 assigned to additional cluster 2; these scenes tend to have more characters.
As a comparison, the results from fitting the standard LCA model with 3 clusters are shown in Tables 9 and 10, and a contingency table comparing the primary clustering structure of the ELCA model with the clustering structure of the LCA model is given in Table 11. The contingency table shows a very different clustering structure obtained from fitting the standard LCA model versus the ELCA model. We show the estimated cluster assignment probabilities for each movie scene for the LCA model with 3 clusters in chronological order in Fig. 6. In comparing Fig. 3 with Fig. 6, we see that while primary clusters 2 and 3 of the fitted ELCA model are similar to clusters 2 and 3 of the fitted LCA model, there is a significant difference between primary cluster 1 in the ELCA model and cluster 1 in the LCA model.
The difference in the clustering structure between the ELCA model and the LCA model is expected as the ELCA model explicitly captures the variation in the size of hyperedges. In comparison, the LCA model cannot decouple the variation in the size of hyperedges from the primary clustering structure. This is a key advantage of the ELCA model where the underlying structure of the size of the hyperedges can be uncovered. Furthermore, as a constrained version of the LCA model with 6 clusters, the ELCA model with 3 primary clusters and 2 additional clusters is far more parsimonious.
5.2 Reuters News articles
As a second application of the ELCA model, we collected news articles published by Reuters in January 2020. We analyze the co-appearance relationships among the Group of Eight+Five (G8+5) countries. A hypergraph is constructed by defining each news article as a hyperedge and each country as a vertex. A vertex is contained in a hyperedge if the corresponding country is mentioned in the corresponding news article. News articles that do not mention any of the 13 countries were removed, and the resulting hypergraph contains 1828 hyperedges.
The model with 5 clusters and 2 additional clusters was chosen by the BIC and fitted to the data set. The BIC scores for a range of models are shown in Table 12. It is worth noting that according to the BIC scores the ELCA models with two additional clusters generally outperform the standard LCA models whereas the standard LCA performs better than the ELCA with three additional clusters.
The parameter estimates \({\hat{\pi }}, {\hat{\tau }}\) and \({\hat{a}}\) are given in Table 13. The estimate \({\hat{\pi }}\) shows that the hyperedges are relatively evenly distributed across the five clusters. We can deduce from \({\hat{a}}\) and \({\hat{\tau }}\) that there are a small number of articles mentioning many countries whereas the vast majority of the articles mention very few countries. Specifically, about 6% of articles mentioned a much larger number of countries compared to the rest of the articles. The incorporation of an additional clustering structure results in significant reduction in the number of parameters.
The clustering structure can be deduced from the estimate \({\hat{\phi }}\) given in Table 14. China, Russia and USA are among the most popular in articles in cluster 1 whereas China, France and Japan are the most commonly mentioned by articles in cluster 2. Canada, Britain and USA have the highest probability of appearing in articles in cluster 3 whereas Canada, Mexico and USA are the most likely to appear in news articles in cluster 4. Germany, France and Britain are most likely to be mentioned by news articles in cluster 5 (Table 13).
6 Conclusion
We have proposed the Extended Latent Class Analysis model as a generative model for random hypergraphs. Building on a proportionality assumption, the ELCA model introduces two clustering structures for hyperedges which capture variation in the size of hyperedges. The model achieves a significant reduction in model complexity compared to the standard Latent Class Analysis model. An EM algorithm has been developed for model fitting, where the M-step involves a series of conditional maximizations, and model selection is performed using BIC. The proposed model is fitted to two data sets and yields interesting and interpretable structure within the vertices and hyperedges.
Several extensions to the ELCA model are possible. Hyperedges typically have temporal information associated with them, which is the case for the two applications in this paper. Developing a hypergraph model to incorporate such temporal information is of interest. Furthermore, while the ELCA is developed in the context of hypergraph applications, the model could be useful in other applications where the proportionality assumption on latent class conditional probabilities is plausible.
Notes
Movie script data freely available at https://www.imsdb.com/.
References
Agarwal S, Branson K, Belongie S (2006) Higher order learning with graphs. In: Proceedings of the 23rd international conference on machine learning, ICML’06. Association for Computing Machinery, New York, NY, USA, pp 17–24
Aitkin M, Vu D, Francis B (2014) Statistical modelling of the group structure of social networks. Soc Netw 38:74–87
Azondekon R, Harper ZJ, Agossa FR, Welzig CM, McRoy S (2018) Scientific authorship and collaboration network analysis on malaria research in Benin: papers indexed in the Web of Science (1996–2016). Glob Health Res Policy 3:11
Borgatti SP, Everett MG (1997) Network analysis of 2-mode data. Soc Netw 19:243–269
Bu J, Tan S, Chen C, Wang C, Wu H, Zhang L, He X (2010) Music recommendation by unified hypergraph: combining social media information and music content. In: Proceedings of the 18th ACM international conference on multimedia, MM’10. Association for Computing Machinery, New York, NY, USA, pp 391–400
Celeux G, Govaert G (1991) Clustering criteria for discrete data and latent class models. J Classif 8:157–176
Collins LM, Fidler PL, Wugalter SE, Long JD (1993) Goodness-of-fit testing for latent class models. Multivar Behav Res 28:375–389
de Panafieu É (2015) Phase transition of random non-uniform hypergraphs. J Discrete Algorithms 31:26–39
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Doreian P, Batagelj V (2004) Generalized blockmodeling of two-mode network data. Soc Netw 6:29–53
Dyer M, Frieze A, Greenhill C (2015) On the chromatic number of a random hypergraph. J Combin Theory Ser B 113:68–122
Faust K, Willert K, Rowlee D, Skvoretz J (2002) Scaling and statistical models for affiliation networks: patterns of participation among Soviet politicians during the Brezhnev era. Soc Netw 24:231–259
Field S, Frank KA, Schiller K, Riegle-Crumb C, Muller C (2006) Identifying positions from affiliation networks: preserving the duality of people and events. Soc Netw 28:97–123
Friel N, Rastelli R, Wyse J, Raftery AE (2016) Interlocking directorates in Irish companies using a latent space model for bipartite networks. Proc Natl Acad Sci USA 113:6629–6634
Fujimoto K, Chou C-P, Valente TW (2011) The network autocorrelation model using two-mode data: affiliation exposure and potential bias in the autocorrelation parameter. Soc Netw 33:231–243
Goldschmidt C (2005) Critical random hypergraphs: the emergence of a giant set of identifiable vertices. Ann Probab 33:1573–1600
Goodman LA (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61:215–231
Handcock MS, Raftery AE, Tantrum JM (2007) Model-based clustering for social networks. J R Stat Soc Ser A 170:301–354
Hoff PD, Raftery AE, Handcock MS (2002) Latent space approaches to social network analysis. J Am Stat Assoc 97:1090–1098
Holland PW, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. J Am Stat Assoc 76:33–65
Hunter DR, Lange K (2004) A tutorial on MM algorithms. Am Stat 58:30–37
Karoński M, Łuczak T (2002) The phase transition in a random hypergraph. J Comput Appl Math 142:125–135
Koskinen J, Edling C (2012) Modelling the evolution of a bipartite network—peer referral in interlocking directorates. Soc Netw 34:309–322
Lange K, Hunter DR, Yang I (2000) Optimization transfer using surrogate objective functions. J Comput Graph Stat 9:1–59
Latapy M, Magnien C, Vecchio ND (2008) Basic notions for the analysis of large two-mode networks. Soc Netw 30:31–48
Latouche P, Birmelé E, Ambroise C (2011) Overlapping stochastic block models with application to the French political blogosphere. Ann Appl Stat 5:309–336
Lazarsfeld PF, Henry NW (1968) Latent structure analysis. Houghton Mifflin, Boston
Leordeanu M, Sminchisescu C (2012) Efficient hypergraph clustering. In: Lawrence ND, Girolami M (eds) Proceedings of the fifteenth international conference on artificial intelligence and statistics. PMLR, vol 22 of proceedings of machine learning research, La Palma, Canary Islands, pp 676–684
Li X, Li Y, Shen C, Dick A, Van Den Hengel A (2013) Contextual hypergraph modeling for salient object detection. In: The IEEE international conference on computer vision (ICCV)
Lind PG, González MC, Herrmann HJ (2005) Cycles and clustering in bipartite networks. Phys Rev E 72:66
Lunagómez S, Mukherjee S, Wolpert RL, Airoldi EM (2017) Geometric representations of random hypergraphs. J Am Stat Assoc 112:363–383
Meng X-L, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80:267–278
Moody J (2004) The structure of a social science collaboration network: disciplinary cohesion from 1963 to 1999. Am Sociol Rev 69:213–238
Newman MEJ (2001a) Scientific collaboration networks. I. Network construction and fundamental results. Phys Rev E 64:016131
Newman MEJ (2001b) Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys Rev E 64:016132
Newman ME (2004) Who is the best connected scientist? A study of scientific coauthorship networks. In: Ben-Naim E, Frauenfelder H, Toroczkai Z(eds) Complex networks. Springer, Berlin, pp 337–370
Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96:1077–1087
Nylund KL, Asparouhov T, Muthén BO (2007) Deciding on the number of classes in latent class analysis and growth mixture modeling: a Monte Carlo simulation study. Struct Equ Model 14:535–569
Perugini S, Gonçalves MA, Fox EA (2004) Recommender systems research: a connection-centric survey. J Intell Inf Syst 23:107–143
Poole D (2015) On the strength of connectedness of a random hypergraph. Electron J Combin 22, Paper 1.69, 16
Purkait P, Chin T, Sadri A, Suter D (2017) Clustering with hypergraphs: the case for large hyperedges. IEEE Trans Pattern Anal Mach Intell 39:1697–1711
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Skvoretz J, Faust K (1999) Logit models for affiliation networks. Sociol Methodol 29:253–280
Snijders TA, Lomi A, Torló VJ (2013) A model for the multiplex dynamics of two-mode and one-mode networks, with an application to employment preference, friendship, and advice. Soc Netw 35:265–276
Stasi D, Sadeghi K, Rinaldo A, Petrovic S, Fienberg S (2014) \(\beta \) models for random hypergraphs with a given degree sequence. In: Proceedings of COMPSTAT 2014—21st international conference on computational statistics, pp 593–600
Wang YH (1993) On the number of successes in independent trials. Stat Sin 3:295–312
Wang P, Sharpe K, Robins G, Pattison P (2009) Exponential random graph (p*) models for affiliation networks. Soc Netw 31:12–25
Wang P, Pattison P, Robins G (2013) Exponential random graph model specifications for bipartite networks—a dependence hierarchy. Soc Netw 35:211–222
Zhou D, Huang J, Schölkopf B (2007) Learning with hypergraphs: clustering, classification, and embedding. In: Schölkopf B, Platt JC, Hoffman T (eds) Advances in neural information processing systems 19. MIT Press, pp 1601–1608
Acknowledgements
Funding was provided by Science Foundation Ireland (Grant No. SFI/12/RC/2289-2).
Funding
Open Access funding provided by the IReL Consortium
Appendices
Appendix A: Proof of Proposition 1
Proof
We can write \(A = \sum _{i=1}^{N} A_{i}\) where \(A_{i}=1\) if node i appears in the hyperedge and \(A_{i}=0\) otherwise. Similarly, we write \(B=\sum _{i=1}^{N} B_{i}\). Let \(Z_{A}\) be the latent cluster assignment of \(X_{A}\), where \(Z_{A} = g\) if \(X_{A}\) is generated from cluster g. Let \(Z_{B}^{(1)}\) and \(Z_{B}^{(2)}\) be the latent cluster and additional cluster assignments of \(X_{B}\), where \(Z_{B}^{(1)}=g\) and \(Z_{B}^{(2)}=k\) if \(X_{B}\) is generated from cluster g and additional cluster k. We have
For the variance of the LCA model, we have that
where
Hence, we have that
Now,
We have
Now,
To show the quantity above is non-negative, we have to show that
which follows from Jensen’s inequality. \(\square \)
Appendix B: Simulation studies
Model parameters for Tables 1 and 3
Model parameters for Tables 2 and 4