1. Introduction
Multivariate clustering methods are now well-developed. A wide range of methods and techniques is available, including empirical methods such as K-means [
1,
2], parametric methods based on mixture models [
3,
4], and nonparametric methods based on density modes [
5,
6]. Questions regarding the choice of model complexity—such as the number of clusters, cluster shape, and bandwidth—are generally well-understood, with statistical software providing efficient implementations to automatically select these parameters. For instance, the R package
mclust for mixture-based clustering uses the BIC criterion to automatically and simultaneously determine the number of Gaussian mixture components and the shapes of these components, selected from 14 possible variance parameterizations that allow for differing degrees of flexibility in terms of volume, shape, and orientation [
7]. For a relatively recent (but still valid) overview of contemporary clustering techniques, see [
8]. Nonetheless, research on multivariate clustering continues at high intensity: A Google Scholar search on articles since 2015 featuring the terms ‘clustering’, ‘multivariate’, and ‘methodology’ gives, at the time of writing, 19,100 hits. Areas of research activity on clustering in the last 10 years include the development of methods for high-dimensional data, incorporating aspects of variable selection and sparsity [
9,
10], methods tailored to functional data [
11], variations in clustering methods for spatial statistics, including spatially constrained clustering [
12], algorithmic variants that allow for faster computation [
13,
14], and methods that link to soft computing techniques [
15], to name a few.
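As a concrete illustration of the automatic model selection described above, the following minimal R sketch lets mclust choose both the number of components and the covariance parameterization via BIC; the built-in faithful dataset is used here only as a generic stand-in for any numeric data matrix.

```r
# Minimal sketch of BIC-based model selection with mclust (illustrative only).
library(mclust)

X   <- faithful                 # any numeric data frame or matrix; 'faithful' is a stand-in
fit <- Mclust(X, G = 1:9)       # BIC selects both G and the covariance parameterization
summary(fit)                    # reports the selected number of components and model name
fit$G                           # selected number of mixture components
fit$modelName                   # selected covariance parameterization (e.g., "EEE", "VVV")
```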
Clustering is generally considered a process without a sense of direction: the set of detected clusters is entirely unordered. In existing clustering procedures, there is no sense of ‘ordering’ that can be sensibly assigned to the clusters (apart from, perhaps, by probability mass), and there is usually no way of putting the clusters in relation to each other based on their location within the data space. This is a possible drawback, especially in situations where the relative positioning of the cluster centers (and, hence, of the clusters) is meaningful in that it can be considered as being driven by an external variable, which may or may not be observed. Specifically, in many situations, one may argue that the cluster centers are just particular realizations of a latent variable spanning the data space, with the observations essentially constituting multivariate ‘noise’ around these centers. Such scenarios are particularly likely to occur when the relationship between the variables in question is governed by some physical equation or property. An example of such a situation is the measurement of speed and flow on highways, sometimes referred to as the ‘fundamental diagram’, where the underlying latent variable can be thought of as representing the traffic density. Although the next section is dedicated to detailing this particular application, the reader may wish to glance at
Figure 1 at this point to gain some intuition. When clustering this data, the order of the clusters along the ‘curved direction’ taken by the data cloud is clearly meaningful: there is much more information in the data than would be captured by an unordered list of cluster centers.
To define the terms more formally, assume the given data $x_i \in \mathbb{R}^m$, $i = 1, \ldots, n$. We consider clustering as an algorithmic process that takes the data as input and produces a set of cluster centers, say $c_1, \ldots, c_K$, with cluster labels $1, \ldots, K$ as output, where observations are then assigned according to a clustering rule with values in $\{1, \ldots, K\}$. It is important to realize that, in conventional clustering, it is generally not meaningful to establish an ordering relation on the cluster labels; that is, one cannot say that cluster 2 is ‘larger’ than cluster 1 in any meaningful way. The reason is that any such ordering would be based on the relative positioning of the centers in the $m$-variate space, but these will not allow for an obvious ordering unless $m = 1$. Since this paper is concerned with multivariate clustering, we are only interested in the case where $m \geq 2$.
We define a directed clustering algorithm as a clustering process in which the ordinal relationship of the resulting cluster labels is well-defined; that is, one can say that cluster $k'$ is larger (or smaller) than cluster $k$ in a meaningful way. In order to include directional information in the clustering process, one needs to connect the clustering problem with the concept of a latent variable. The idea is to assume the existence of such a latent variable, which parameterizes a curve through $m$-variate space, passing through the cluster centers $c_1, \ldots, c_K$. Since the latent variable is one-dimensional, it enables an ordering relation along it, which can then be inherited by the cluster centers ‘sitting’ on it and, hence, by their corresponding cluster labels. A suitable latent variable model that can be used to achieve this purpose was recently proposed by [
16], who suggested approximating highly correlated multivariate data through the use of a one-dimensional latent space, i.e., a straight line, which is parameterized by a single random effect. An estimation of this model was carried out through the nonparametric maximum likelihood approach [
17], which is based on a mixture model approximation that facilitates the clustering step. This model achieves clustering of the data and dimension reduction simultaneously. The work by [
16] did not introduce the notion of ‘directed’ clustering, so one novel contribution of the present manuscript is to introduce and implement this concept. However, importantly, it is clear that most data structures are not exactly linear (such as the ones in
Figure 1); hence, as a second novel contribution, in this manuscript, we also extend the approach in Ref. [
16] to quadratic latent variable models. We propose an algorithm to simultaneously select the number of clusters and choose between a linear and a quadratic latent variable model, and a second algorithm to carry out the actual directed clustering based on this model choice.
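Schematically, the joint selection of the number of clusters and of the linear versus quadratic latent variable model could be organized along the following lines. This is only a sketch of the idea: fit_latent_mixture() is a hypothetical placeholder for the estimation routine developed later in the paper, not an existing R function.

```r
# Schematic sketch of BIC-based selection of K and of the latent-curve degree.
# fit_latent_mixture() is hypothetical: it is assumed to fit the linear or
# quadratic latent variable model with K mass points via the EM algorithm
# and to return an object carrying a BIC value.
select_model <- function(X, K_max = 10) {
  grid <- expand.grid(K = 2:K_max, degree = c("linear", "quadratic"),
                      stringsAsFactors = FALSE)
  fits <- lapply(seq_len(nrow(grid)), function(i)
    fit_latent_mixture(X, K = grid$K[i], degree = grid$degree[i]))
  bics <- vapply(fits, function(f) f$BIC, numeric(1))
  list(best_fit = fits[[which.min(bics)]], choice = grid[which.min(bics), ])
}
```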
Apart from the mentioned example from traffic engineering, potential applications of this method include situations where ‘rankings’ or ‘league tables’ are constructed from multivariate data. Instances of such problems are school-effectiveness studies [
18] or large-scale international skill surveys [
16], where one is not only interested in identifying clusters of similarly performing schools or countries but also in their relative performance to each other, as this is of relevance for resource allocation and related policy decisions. Similar arguments can be made for summaries of economic indicators such as export/import activities or price indices [
19], where one may be interested in identifying clusters of the best- or worst-performing countries, in certain senses. Ample areas of potential application can be found in the sciences where relationships are driven by natural laws and processes; for instance, in single-cell RNA sequencing, a common task is the ‘cell ordering problem’ [
20,
21], where one attempts to identify clusters of gene co-expressions with the help of a latent variable referred to as ‘pseudotime’; in the case of reference [
20], this is achieved through a combination of the principal curve [
22] and K-means methodology.
This paper proceeds as follows. In
Section 2, we introduce the motivating traffic data application in more detail, explicitly making the case for the usefulness of the proposed methodology in this field. In
Section 3, we recall the linear latent variable model and then develop the quadratic one, including the estimation approach. Simulation studies that demonstrate the efficiency of the estimation methodology are presented in
Section 4. Equipped with the methodology, we can formulate the required algorithms in
Section 5. In
Section 6, we return to the speed-flow data, producing the directed clustering results for the considered datasets. In
Section 7, we draw some parallels between the presented approach and principal curves before the paper is concluded with a Discussion in
Section 8.
2. Motivating Application: Speed-Flow Data
Although most drivers typically do not give much thought to the physics of the traffic they are in, they instinctively understand the mechanisms that determine road conditions. Drivers are aware that, when traffic density [vehicles/distance] is low, their speed is essentially only restrained by the speed limit. As traffic becomes denser, this does not immediately impact their speed but will increase the flow [or throughput; vehicles/time] of vehicles passing each fixed location. However, when traffic is too dense, the speed is impacted, up to the point where vehicles are jammed and both speed and flow break down synchronously.
This is precisely the story told by the scatterplot depicted on the left side of
Figure 1. This plot depicts measurements of traffic speed and flow on the California freeway SR57-N, recorded from 9 July 2007, 9 am, to 10 July 2007, 10 pm, on Lane 5, with 444 observations. The data, which are available as part of a dataset
calspeedflow in the R package
LPCM [
23], were collected by a loop-detector at a fixed ‘vehicle detector station’. Each point in the scatterplot represents the number of vehicles passing over the loop detector within a 5-min interval (flow;
x-axis) and the average speed (in miles per hour) of those vehicles in that interval (
y-axis).
Direct statistical analyses of speed-flow data are not very widespread in the literature due to potential bimodalities in both coordinate directions, which preclude regression analysis in either orientation. Ref. [
24] fitted principal curves to speed-flow data patterns from the California PeMS database. Using data from the same source, but additionally involving a third variable (occupancy, the length of time a vehicle takes to drive over a loop), Ref. [
25] demonstrated through the use of an agglomerative clustering algorithm that five clusters can typically be identified. Ref. [
26] used K-means clustering to identify the transition point from uncongested to congested conditions in speed-flow data from the metropolitan area of São Paulo, Brazil.
The left panel in
Figure 1 shows the clustering of the
calspeedflow dataset through K-means with five clusters (as in [
25]) using the
kmeans function available in R. While the clusters appear nicely lined up along the data cloud, it is important to note that they are not
actually ordered—neither K-means nor any other commonly used clustering algorithm inherently provides a method for arranging the clusters in any ordered relationship with each other. In this paper, we follow the approaches of both [
24,
26], and analyze such data directly in speed-flow space, combining the notions of curvilinear approximation and clustering.
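For reference, a K-means clustering of this kind can be reproduced along the following lines. The column selection is an assumption and should be checked against names(calspeedflow) in the installed LPCM version.

```r
# Reproducing a K-means clustering of the speed-flow data (cf. Figure 1, left).
library(LPCM)
data(calspeedflow)
sf <- calspeedflow[, 3:4]                   # assumed: flow and speed columns; check names(calspeedflow)
km <- kmeans(sf, centers = 5, nstart = 25)  # five clusters, as in the cited analysis
plot(sf, col = km$cluster, pch = 16,
     xlab = "Flow [veh/5 min]", ylab = "Speed [mph]")
points(km$centers, pch = 8, cex = 2)
# Note: the integer labels 1..5 in km$cluster carry no ordering information.
```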
Indeed, considering
Figure 1 (left), it is intuitive to postulate the presence of a latent variable, starting in the top left corner of the data and proceeding, continuously and smoothly, to the bottom right of the data cloud. An obvious question is then: why would anyone be interested in such a latent variable? To answer this question, we need to appreciate that the observed pattern is not coincidental: beyond the intuitive arguments given in the first paragraph of this section, one can resort to the fundamental identity of traffic flow, which states that, under certain conditions on the stationarity and homogeneity of the traffic, speed $v$ and flow $q$ are related through $q = d \cdot v$, where $d$ is the traffic density.
As illustrated by [
24], traffic density is a monotone transformation of the described latent variable. A mathematical reason for this monotonicity is the following: note that the fundamental identity of traffic flow implies $d = q/v$, i.e., the traffic density is determined by the (inverse of the) ratio of speed and flow. Hence, the traffic density at each point is uniquely determined by the slope of the line connecting that point to the origin. For there to be a break in the monotonicity of traffic density along the latent variable describing the curved data shape, a line through the origin would need to intersect that curve twice. When considering
Figure 1 (left), it looks unlikely that this property is strongly violated here, and as we will see in
Section 5, it is certainly not violated for the sequence of ordered (directed) cluster centers identified by our methodology.
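This slope argument can be made concrete with a small numerical check. The sketch below is purely illustrative and reuses the km object and the assumed column ordering (flow, then speed) from the previous snippet.

```r
# Implied traffic density from the fundamental identity d = q / v:
# flow divided by speed, evaluated at each (unordered) K-means center.
d_centers <- km$centers[, 1] / km$centers[, 2]   # assumed: column 1 = flow q, column 2 = speed v
# Ordering the K-means centers by implied density recovers the direction along
# the data cloud that directed clustering makes explicit.
rank(d_centers)
```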
The data in
Figure 1 (right) show a pattern that differs from the typically reported behavior. This dataset was collected from 0:00 to 19:59 on the southbound California Freeway SR57-S, with 240 observations in total, as presented in [
24]. Just like the previous dataset, it was originally extracted from the PeMS 7.3 database. As explained in [
24], the unusual pattern is likely explained by the presence of heavy vehicles (vehicles with a larger-than-standard gross vehicle weight rating, such as trucks or buses) on a slow lane, where increased speed is counterproductive as it leads to larger stopping distances and, hence, reduces flow. Imagine a latent variable tracing a curvilinear path from the top left to the bottom right of this data cloud. By reasoning similar to the previous example, the assumption that traffic density increases monotonically with respect to the latent variable is plausible. The right panel in
Figure 1 also gives the
K-means clustering for this dataset.
To summarize, achieving a directed clustering of the data with respect to the latent variable simultaneously results in a directed clustering with respect to traffic density, which is immediately and intrinsically meaningful and interpretable. According to [
25], density is “the primary factor to determine the level of service on a freeway”. So, the algorithmic determination of cluster membership ranked by traffic density enables an automatic assessment of the road’s operating condition (congested, free-flow, etc.), with immediate relevance for traffic control strategies and intelligent transportation systems. In contrast, a plain establishment of cluster membership, such as through
K-means as in
Figure 1, does not allow for identifying the operating condition on the road without additional information about the meaning of these clusters.
Of course, in other applications, it may even be useful to carry out directed clustering along a latent variable that has no such physical meaning. But if it does, as in the present application, the case for such a method is particularly compelling.
We provide the required methodology for directed clustering along curvilinear data structures in the following sections and will return to this example in
Section 6.
4. Simulation
We conducted a simulation study to evaluate the accuracy of parameter estimation for the quadratic latent variable model (
2), based on an implementation of the EM algorithm according to
Section 3.3. We consider a simple scenario in which the data are 2-dimensional and generated according to model (4) from a fixed number of mixture components, under three increasing sample sizes. The true model parameters are displayed in the left column of Table 1.
Figure 2 shows the structure of the data used in this simulation study. We generate 300 replicates for each sample size. For each replicate, we run the EM algorithm 20 times and retain the estimates attaining the smallest BIC value; this guards against poor choices of starting values for the EM algorithm.
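Schematically, this multiple-start strategy can be sketched as follows; fit_quadratic_lvm() is a hypothetical placeholder for the EM implementation of Section 3.3, assumed only to return a fitted object carrying a BIC value.

```r
# Sketch of the multiple-start strategy: run the EM algorithm from several
# random initializations and keep the solution with the smallest BIC.
best_of_starts <- function(X, K, n_starts = 20) {
  fits <- lapply(seq_len(n_starts), function(s) fit_quadratic_lvm(X, K = K, seed = s))
  fits[[which.min(vapply(fits, function(f) f$BIC, numeric(1)))]]
}
```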
The averaged estimates are shown in
Table 1, and boxplots of the estimates for some of the parameters are provided in
Figure 3,
Figure 4,
Figure 5 and
Figure 6. Firstly, we observe that as the sample size increases, the boxplots squeeze towards the true values for each parameter, indicating empirically the consistency of the proposed estimation procedure. For the
, this is less clearly visible, since even at
, the interquartile range (IQR) is already very small.
We also observe that the averaged estimates in
Table 1 are generally very close to their true values, except for the
parameters. However, closer inspection of
Figure 5 shows that the estimates of
and
are correctly centered, but there is a considerable number of outliers, similar to those observed in the boxplots for the
parameters (see
Figure 6). We proceed with displaying the mean squared errors (MSEs) for the estimates of the parameters in
Table 2; note that, for
and
, due to the mentioned presence of outliers, we use the trimmed MSE [
32] with a trimming parameter of 0.1.
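The precise definition of the trimmed MSE is given in [32]; one plausible reading, sketched below under that assumption, is to discard the specified fraction of the largest squared errors before averaging.

```r
# One possible implementation of a trimmed MSE: drop the proportion 'trim'
# of the largest squared errors (here 10%) before averaging, limiting the
# influence of outlying estimates across simulation replicates.
trimmed_mse <- function(estimates, true_value, trim = 0.1) {
  se <- (estimates - true_value)^2
  keep <- se <= quantile(se, probs = 1 - trim)
  mean(se[keep])
}
```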
Overall, from these tables and boxplots, we find that the estimators give sensible estimates, with bias and MSE decreasing as the sample size increases.
7. Comparison with Principal Curves
The fitted curves shown in the previous sections suggest that this methodology could be seen as another approach to estimating principal curves [
22]. Both principal curves and the proposed methodology aim to approximate multivariate datasets with smooth curves parameterized by a one-dimensional latent variable.
It is clear that the model formulation (
3), which encompasses both the linear and the quadratic case, can be seen as a special case of a generative data model
$$x = g(z) + \varepsilon, \qquad (12)$$
where $g$ is a smooth curve. This model was formulated by Hastie and Stuetzle [22] as their base model for their approach to estimating principal curves, informally defined as ‘smooth curves passing through the middle of a dataset’. Their estimation approach is based on the self-consistency property, loosely meaning that each point on the curve is the mean of all data points that project onto that point. Under this approach,
g is not actually the principal curve for data generated from the model (
12). An example illustrating this point is provided by [
34]. However, conceptually, it is still appropriate to think of ‘principal curves’ as a concept aiming to estimate
g in (
12), and other principal curve definitions, such as by Ref. [
34], do not suffer from this conceptual issue. Some principal curve estimation approaches, including local principal curves [
35], do not even postulate an underlying model at all. In either case, the similarity between (
3) and (
12) justifies a brief comparative look at our approach in relation to principal curves.
Here, we restrict attention to the first of the two speed-flow datasets. We choose to fit the quadratic model with
, which leads to a minimum BIC value of 7359.93, for comparison with the principal curves. We also fit the data with a Hastie and Stuetzle principal curve, using the R package
princurve [
36], and a local principal curve (LPC), using R package
LPCM [
23]. The results are shown in
Figure 12. We observe that both the fitted HS principal curve (top left panel) and the curve arising from the quadratic latent variable model (bottom panel) fail to fully capture the curvature of the data. In contrast, the local principal curve (top right panel) manages to reach the endpoint associated with maximum traffic density. One can still argue that the latent variable model produces a sensible result: the methodology aims to locate the mixture centers, which it has done convincingly, and the other points around them are considered just noise (belonging to the respective clusters).
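For completeness, the two principal curve fits compared here can be obtained along the following lines; the column selection for the speed-flow data is again an assumption, and all tuning parameters are left at their package defaults.

```r
# Fitting a Hastie-Stuetzle principal curve and a local principal curve
# to the speed-flow data (cf. Figure 12), using default settings.
library(princurve)   # Hastie-Stuetzle principal curves
library(LPCM)        # local principal curves; also provides calspeedflow
data(calspeedflow)
sf <- as.matrix(calspeedflow[, 3:4])    # assumed flow/speed columns; check names(calspeedflow)

hs_fit  <- principal_curve(sf)          # HS principal curve (self-consistency based)
lpc_fit <- lpc(sf)                      # local principal curve with default settings

plot(sf, pch = 16, col = "grey", xlab = "Flow", ylab = "Speed")
lines(hs_fit, lwd = 2)                  # princurve supplies a lines() method for the fitted curve
plot(lpc_fit)                           # LPCM's own plot method for lpc objects (separate figure)
```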
The main difference between the approaches lies in the way the projections are carried out, as is evident in
Figure 12. While the HS and LPC frameworks employ orthogonal projections, the latent variable approach proposed in this paper does not, since the dashed lines
are driven toward the cluster centers.
Beyond this qualitative comparison of the fitted curves, a quantitative assessment, such as goodness of fit, would not provide a fair basis for comparison, given the different characteristics of the projections under these methods.
8. Discussion
We presented an approach for clustering multivariate data that enables the ordering of clusters along a curve parameterized by a latent variable, which is estimated alongside the clustering process. We illustrated this concept in the context of linear and quadratic latent variable models, showing how to choose between these models, how to implement the estimators, and how to interpret the results. Particular attention was devoted to an application in traffic engineering, where the latent variable had a meaningful interpretation, corresponding to the traffic density. The proposed methodology is not restricted to this field; it has potential applicability across a range of scientific disciplines. While the idea of a latent variable representing some ‘underlying governing process’ seems more natural in the physical sciences, the concept of latent variables is equally at home in many other disciplines, including education and economics, as alluded to in the introduction. However, it is not actually necessary for the researcher to identify or justify the existence of a governing process in order to use the presented approach. At a minimum, there should be a belief, grounded in subject-matter expertise, that the relative ordering of the clusters is informative. Adopting this belief amounts to postulating the existence of a latent variable spanning the cluster centers, which the methodology then estimates.
Our work contributes to the development of clustering algorithms that account for the governing process underlying the data at hand. As pointed out by a referee, mathematical clustering models are ineffective if they do not consider important driving factors that contribute to the generation of patterns. The proposed methodology allows the identification of a potentially existing latent variable driving the data-generating process and also allows for the further processing of this latent variable through its posterior random effects. This could include applications such as regression or correlation with actually measured physical variables. Conceptually, it would be even more desirable to directly include the knowledge of the governing process of the data in the clustering algorithm. To a certain extent, this could be achieved by adjusting the clustering process for the presence of influencing variables. For instance, in the context of the traffic examples, one may want to adjust for spatial or temporal factors. This would require the inclusion of covariates into the model (
3). In the case of the linear model relationship (
1), this was developed in [
16], and could be extended to the more general framework presented here.
It is apparent that the restriction to linear and quadratic latent variable models is a possible limitation of the presented methodology. To overcome this restriction, more complex shapes of latent variable models could be considered, such as
$$x_i = \alpha + \sum_{j=1}^{p} \beta_j B_j(z_i) + \varepsilon_i,$$
where $\alpha$ and $\varepsilon_i$ play the same roles as before, each $B_j$ can be regarded as a real-valued basis function, such as a B-spline or polynomial, and each $\beta_j$ is an $m$-variate parameter vector. These still fit into the notational framework defined by (3), but now with $g(z_i) = \alpha + \sum_{j=1}^{p} \beta_j B_j(z_i)$. The potential power of such an approach becomes evident when considering the LPC fit in
Figure 12, which shows that there is still considerable scope for improvement in fitting capability for sufficiently complex data patterns. However, given the difficulties of even estimating the quadratic model (
2), requiring a combination of iterative statistical algorithms (EM) and numerical root finders, it will be a considerable challenge to devise an algorithm to estimate this model in full generality.
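If one wished to experiment with such basis expansions, the latent curve could, for instance, be parameterized via a B-spline basis. The sketch below is purely illustrative: the latent values, basis dimension, and coefficients are invented for demonstration and are not part of the estimation machinery developed in this paper.

```r
# Illustrative only: constructing an m-variate curve in data space from a
# B-spline basis expansion of a scalar latent variable z (here m = 2).
library(splines)

z     <- seq(0, 1, length.out = 200)       # hypothetical latent variable values
B     <- bs(z, df = 5)                     # 200 x 5 B-spline basis matrix
alpha <- c(60, 10)                         # illustrative m-variate intercept
beta  <- matrix(rnorm(2 * 5), nrow = 2)    # one 2-variate coefficient vector per basis function
curve_xy <- t(alpha + beta %*% t(B))       # 200 x 2 matrix tracing the curve through data space
matplot(z, curve_xy, type = "l", lty = 1)  # curve coordinates as functions of z
```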
For the avoidance of doubt, we would like to highlight that we do not advocate this methodology as a superior or preferable clustering technique for generic clustering applications. While the added structure embedded into the model may, on some occasions, lend computational stability to the clustering process, there is no general advantage in using this approach if one is not interested in the directional aspects of the clusters. One reason for this is that the linear or quadratic models constrain the flexibility in positioning the cluster centers, which generally will not lead to more precise or better-fitting clusters than conventional approaches.
Apart from the directed clustering itself, the presented methodology gives rise to further applications, such as the construction of rankings or ‘league tables’. While the construction of a league table for the traffic dataset may be felt to be a slightly contrived exercise, this will not be the case for other applications, such as when considering sets of educational attainment indicators [16] or economic indicators such as import/export activity. We leave such considerations for interested researchers to ponder.