HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: mdwlist

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY 4.0
arXiv:2402.11773v2 [cs.LG] 22 Feb 2024

Dynamic Multi-Network Mining of Tensor Time Series

Kohei Obata SANKEN, Osaka University, Japan [email protected] Koki Kawabata SANKEN, Osaka University, Japan [email protected] Yasuko Matsubara SANKEN, Osaka University, Japan [email protected]  and  Yasushi Sakurai SANKEN, Osaka University, Japan [email protected]
(2024)
Abstract.

Subsequence clustering of time series is an essential task in data mining, and interpreting the resulting clusters is also crucial since we generally do not have prior knowledge of the data. Thus, given a large collection of tensor time series consisting of multiple modes, including timestamps, how can we achieve subsequence clustering for tensor time series and provide interpretable insights? In this paper, we propose a new method, Dynamic Multi-network Mining (DMM), that converts a tensor time series into a set of segment groups of various lengths (i.e., clusters) characterized by a dependency network constrained with 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm. Our method has the following properties. (a) Interpretable: it characterizes the cluster with multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode, and thus provides visible and interpretable insights into the key relationships. (b) Accurate: it discovers the clusters with distinct networks from tensor time series according to the minimum description length (MDL). (c) Scalable: it scales linearly in terms of the input data size when solving a non-convex problem to optimize the number of segments and clusters, and thus it is applicable to long-range and high-dimensional tensors. Extensive experiments with synthetic datasets confirm that our method outperforms the state-of-the-art methods in terms of clustering accuracy. We then use real datasets to demonstrate that DMM is useful for providing interpretable insights from tensor time series.

Tensor time series, Clustering, Network inference, Graphical lasso
journalyear: 2024copyright: acmlicensedconference: Proceedings of the ACM Web Conference 2024; May 13–17, 2024; Singapore, Singapore.booktitle: Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singaporeisbn: 979-8-4007-0171-9/24/05doi: 10.1145/3589334.3645461ccs: Information systems Data miningccs: Information systems Clustering

1. Introduction

The development of IoT has facilitated the collection of time series data, including data related to automobiles (Miyajima et al., 2007), medicine (Hirano and Tsumoto, 2006; Monti et al., 2014), and finance (Namaki et al., 2011; Ruiz et al., 2012), from multiple modes such as sensor type, locations and users, which we call tensor time series (TTS). An instance of such data is online activity data, which records search volumes in three modes {Query, Location, Timestamp}. These TTS can often be divided and grouped into subsequences that have similar traits (i.e., clusters). Time series subsequence clustering (Aghabozorgi et al., 2015; Zolhavarieh et al., 2014) is a useful unsupervised exploratory approach for recognizing dynamic changes and uncovering interesting patterns in time series. As well as clustering data, the interpretability of the results is also important since we rarely know what each cluster refers to (Plant and Böhm, 2011; Rudin, 2019). Modeling a cluster as a dependency network (Hallac et al., 2017b; Tozzo et al., 2021; Tan et al., 2015), where nodes are variables and an edge expresses a relationship between variables, gives a clear explanation of what the cluster refers to. Considering that a TTS consists of multiple modes (Madabhushi and Lee, 2016; Gatto et al., 2021; Batalo et al., 2022), a cluster should be modeled as multiple networks, where each is a dependency network of a corresponding non-temporal mode, to provide a good explanation. In the above example, a cluster can be modeled as query and location networks, where each explains the relationships among queries/locations. With these networks, we can understand why a particular cluster distinguishes itself from another and speculate about what happened during a period belonging to the cluster. Given such a TTS, how can we find clusters with interpretability contributing to a better understanding of the data?

Research on time series subsequence clustering has mainly focused on univariate or multivariate time series (UTS and MTS). TTS is a generalization of time series and includes UTS and MTS. Here, we mainly assume that TTS has three or more modes. Generally, UTS clustering methods use distance-based metrics such as dynamic time warping (Berndt and Clifford, 1994). These methods focus on matching raw values and do not consider relationships among variables, which is essential if we are to interpret the MTS and TTS clustering. MTS clustering methods usually employ model-based clustering, which assumes, for example, a Gaussian (Matsubara et al., 2014) or an ARMA (Xiong and Yeung, 2004) model and attempts to find clusters that recover the data from the model. The interpretability of the clustering results depends on the model they assume. As a technique for interpretable clustering, TICC (Hallac et al., 2017b) models an MTS with a dependency network and discovers interpretable clusters that previously developed methods cannot find. Nevertheless, TTS clustering is a more challenging problem and cannot simply employ MTS methods due to the complexity of TTS, stemming from multiple modes, which introduces intricate dependencies and a massive data size. To employ an MTS clustering method (e.g., TICC) for TTS, the TTS must be flattened to form a higher-order MTS. As a result, the method processes the higher-order MTS and mixes up all the relationships between variables, which may capture spurious relationships and unnecessarily exacerbate the interpretability. Moreover, its computational time increases greatly as the number of variables in a mode increases.

In this paper, we propose a new method for TTS subsequence clustering, which we call Dynamic Multi-network Mining (DMM). 111 Our source code and datasets are publicly available:
https://github.com/KoheiObata/DMM.
In our method, we define each cluster as multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode and thus can be seen as visual images that can help users quickly understand the data structure. Our algorithm scales linearly with the input data size while employing the divide-and-conquer method and is thus applicable to long-range and high-dimensional tensors. Furthermore, the clustering results and every user-defined parameter of our method can be determined by a single criterion based on the Minimum Description Length (MDL) principle (Grünwald, 2007). DMM is a useful tool for TTS subsequence clustering that enables multifaceted analysis and understanding of TTS.

1.1. Preview of our results

Fig. 1 shows the DMM results for clustering over Google Trends data, which consists of 10101010 years of daily web search counts for six queries related to COVID-19 across 10101010 countries, forming a 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order tensor. Fig. 1 (a) shows the cluster assignments of the TTS, where each color represents a cluster. DMM splits the tensor into four segments and groups them into four clusters, each of which can be interpreted as a distinct phase corresponding to the evolving social response to COVID-19; thus, we name these phases “Before Covid,” “Outbreak,” “Vaccine,” and “Adaptation.” It is worth noting that this result is obtained with no prior knowledge.

Fig. 1 (b) presents the networks of each cluster, i.e., a country network, which has nodes plotted on the world map, reflects dependencies between different countries, and a query network for query dependencies. These networks, also known as a Markov Random Field (MRF) (Rue and Held, 2005), illustrate how the node affects the other nodes. The thickness and color of the edges in the network indicate the strength of the partial correlation between the nodes, which denotes a stronger relationship compared with a simple correlation. We learn the networks by estimating a Gaussian inverse covariance matrix. Then, by definition, if there is an edge between two nodes, the nodes are directly dependent on each other. Otherwise, they are conditionally independent, given the rest of the nodes. Moreover, we impose an 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm penalty on the networks to promote sparsity, making it possible to obtain true networks and interpretability, as well as making the method noise-robust (Wytock and Kolter, 2013; Yuan and Lin, 2006). These networks provide visible and interpretable insights into the key relationships that characterize clusters.

Refer to caption (a) Cluster assignments on the original tensor time series
Refer to caption (b) Country and query networks change dynamically
Figure 1. Effectiveness of DMM on Google Trends (#4 Covid) dataset: (a) DMM can split the tensor time series into meaningful subsequence clusters shown by colors (i.e., #green\rightarrowBefore Covid”, #pink\rightarrowOutbreak”, #gray\rightarrowVaccine”, #blue\rightarrowAdaptation”), and (b) their important relationships between variables are summarized with country and query networks, where the nodes show individual variables, and the thickness and color of the edges are partial correlations showing the importance of its interaction.

We see that each of the four clusters exhibits unique networks that evolve with the different phases. In the “Before Covid” phase, the country network displays edges between English-speaking countries, indicating their interconnectedness. In the query network, the query “vaccine” correlates with “influenza.” However, during the “Outbreak” starting in 2020202020202020, many countries respond to the COVID-19 pandemic, leading to various edges in the country network. In the query network of this phase, new edges related to “coronavirus” appear, and “coronavirus” and “virus” have a particularly strong connection. In the “Vaccine” phase, as people become more concerned about protection from COVID-19, the query “vaccine” forms an edge with “covid.” Moreover, since flu infects fewer people than in the past, “influenza” loses its edges. Lastly, during the “Adaptation” phase, as the world becomes accustomed to the situation, the country network reduces the number of edges, and the edges related to “influenza” reappear, reflecting a return to the networks observed in the “Before Covid” phase.

1.2. Contributions

In summary, we propose DMM as a subsequence clustering method for TTS based on the MDL principle that enables each cluster to be characterized by multiple networks. The contributions of this paper can be summarized as follows.

  • Interpretable: DMM realizes the meaningful subsequence clustering of TTS, where each cluster is characterized by sparse dependency networks for each non-temporal mode, which facilitates the interpretation of the cluster from important relationships between variables.

  • Accurate: We define a criterion based on MDL to discover clusters with distinct networks. Thanks to the proposed criterion, any user-defined parameters can be determined, and DMM outperforms its state-of-the-art competitors in terms of clustering accuracy on synthetic data.

  • Scalable: The proposed clustering algorithm in DMM scales linearly as regards the input data size and is thus applicable to long-range and high-dimensional tensors.

Outline. The rest of the paper is organized as follows. After introducing related work in Section 2, we present our problem and basic background in Section 3. We then propose our model and algorithm in Sections 4 and 5, respectively. We report our experimental results in Sections 6 and 7.

2. Related work

We review previous studies that are closely related to our work.

Time series subsequence clustering. Subsequence clustering is an important task in time series data mining whose benefits are the extraction of interesting patterns and the provision of valuable information, and that can also be used as a subroutine of other tasks such as forecasting (Takahashi et al., 2017; Papadimitriou et al., 2005). Time series subsequence clustering methods can be roughly separated into a distance-based method and a model-based method. The distance-based method uses metrics such as dynamic time warping (Berndt and Clifford, 1994; Keogh, 2002; Alaee et al., 2021) and longest common subsequence (Vlachos et al., 2002) and finds clusters by focusing on matching raw values rather than structure in the data. The model-based method assumes a model for each cluster, and finds the best fit of data to the model. It covers a wide variety of models such as ARMA (Xiong and Yeung, 2004), Markov chain (Ramoni et al., 2000), and Gaussian (Matsubara et al., 2014). However, most previous work has focused on MTS and are not suitable for TTS. Few studies have focused on TTS clustering, for example, CubeScope (Nakamura et al., 2023) uses Dirichlet prior as a model to achieve online TTS clustering, but it only supports sparse categorical data. In summary, existing methods are not particularly well-suited to handling TTS and discovering interpretable clusters.

Tensor time series. TTS are ubiquitous and appear in a variety of applications, such as recommendation and demand prediction (Bai et al., 2019; Wu et al., 2019; Matsubara et al., 2016). To model a tensor, tensor/matrix decomposition, such as Tucker/CP decomposition (Kolda and Bader, 2009) and SVD, is a commonly used technique. Although it obtains a lower-dimensional representation that summarizes important patterns from a tensor, it struggles to capture temporal information (Liu et al., 2020). Therefore, it is often combined with dynamical systems to handle temporal information (Rogers et al., 2013; Cai et al., 2015; Jing et al., 2021). For example, SSMF (Kawabata et al., 2021), which is an online forecasting method that uses clustering as a subroutine, combines a dynamical system with non-negative matrix factorization (NMF) to capture seasonal patterns from a TTS. Each cluster in SSMF is characterized by a lower-dimensional representation of a TTS, however, understanding the representation is demanding. Thus, tensor/matrix decomposition is not suitable for an interpretable model.

Sparse network inference. Inferring a sparse inverse covariance matrix (i.e., network) from data helps us to understand the dependency of variables in a statistical way. Graphical lasso (Friedman et al., 2008), which maximizes the Gaussian log-likelihood imposing a 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm penalty, is one of the most commonly used techniques for estimating the sparse network from static data. However, time series data are normally non-stationary, and the network varies over time; thus, to infer time-varying networks, time similarity with the neighboring network is usually considered (Hallac et al., 2017a). The monitoring of such time-varying networks has been studied with the aim of analyzing economic data (Namaki et al., 2011) and biological signal data (Monti et al., 2014) because of the high interpretability of the network (Tomasi et al., 2021). Although the inference of time-varying networks is able to find change points by comparing the networks before and after a change, it cannot find clusters (Tomasi et al., 2018; Harutyunyan et al., 2019; Xuan and Murphy, 2007). TICC (Hallac et al., 2017b) and TAGM (Tozzo et al., 2021) use graphical lasso and find clusters from time series based on the network of each subsequence, providing the clusters with interpretability and allowing us to discover clusters that other traditional clustering methods cannot find. However, they cannot provide an interpretable insight when dealing with TTS. Consequently, past studies have yet to find networks for TTS and a way to cluster TTS based on the networks. Our method uses a graphical lasso-based model modified to provide interpretable clustering results from TTS.

3. Problem formulation

In this section, we describe the TTS we want to analyze, introduce some necessary background material, and define the formal problem of TTS clustering.

The main symbols employed in this paper are described in Appendix A. Consider an ((((N+1)th)^{th}) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS 𝒳D1××DN×T𝒳superscriptsubscript𝐷1subscript𝐷𝑁𝑇\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}caligraphic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT × italic_T end_POSTSUPERSCRIPT, where the mode-(N+1)𝑁1(N+1)( italic_N + 1 ) is the time and its dimension is T𝑇Titalic_T. We can also rewrite the TTS as a sequence of Nthsuperscript𝑁𝑡N^{th}italic_N start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order tensors 𝒳={𝒳1,𝒳2,,𝒳T}𝒳subscript𝒳1subscript𝒳2subscript𝒳𝑇\mathcal{X}=\{\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{T}\}caligraphic_X = { caligraphic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , caligraphic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , caligraphic_X start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT }, where each 𝒳t𝐑D1××DN(1tT)subscript𝒳𝑡superscript𝐑subscript𝐷1subscript𝐷𝑁1𝑡𝑇\mathcal{X}_{t}\in\mathbf{R}^{D_{1}\times\cdots\times D_{N}}(1\leq t\leq T)caligraphic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ bold_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 ≤ italic_t ≤ italic_T ) denotes the observed data at the tthsuperscript𝑡𝑡t^{th}italic_t start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT time step.

3.1. Tensor algebra

We briefly introduce some definitions in tensor algebra from tensor related literature (Kolda and Bader, 2009; Cai et al., 2015).

Definition 0 (Reorder).

Let the ordered sets P(1),,P(G)superscript𝑃1normal-…superscript𝑃𝐺P^{(1)},\dots,P^{(G)}italic_P start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_P start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT, where P(g)={p1(g),,png(g)}{1,2,,N}superscript𝑃𝑔subscriptsuperscript𝑝𝑔1normal-…subscriptsuperscript𝑝𝑔subscript𝑛𝑔12normal-…𝑁P^{(g)}=\{p^{(g)}_{1},\dots,p^{(g)}_{n_{g}}\}\subset\{1,2,\dots,N\}italic_P start_POSTSUPERSCRIPT ( italic_g ) end_POSTSUPERSCRIPT = { italic_p start_POSTSUPERSCRIPT ( italic_g ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_p start_POSTSUPERSCRIPT ( italic_g ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT end_POSTSUBSCRIPT } ⊂ { 1 , 2 , … , italic_N }, be a partitioning of the modes {1,2,,N}12normal-…𝑁\{1,2,\dots,N\}{ 1 , 2 , … , italic_N } s.t., gGng=Nsuperscriptsubscript𝑔𝐺subscript𝑛𝑔𝑁\sum_{g}^{G}n_{g}=N∑ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = italic_N. The reordering of an Nthsuperscript𝑁𝑡N^{th}italic_N start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order tensor 𝒳𝐑D1××DN𝒳superscript𝐑subscript𝐷1normal-⋯subscript𝐷𝑁\mathcal{X}\in\mathbf{R}^{D_{1}\times\cdots\times D_{N}}caligraphic_X ∈ bold_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUPERSCRIPT into ordered sets is defined as re(𝒳)(P(1),,P(G))𝐑J(1)××J(G)𝑟𝑒superscript𝒳superscript𝑃1normal-…superscript𝑃𝐺superscript𝐑superscript𝐽1normal-⋯superscript𝐽𝐺re(\mathcal{X})^{(P^{(1)},\dots,P^{(G)})}\in\mathbf{R}^{J^{(1)}\times\dots% \times J^{(G)}}italic_r italic_e ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_P start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_P start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT ∈ bold_R start_POSTSUPERSCRIPT italic_J start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT × ⋯ × italic_J start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT, where J(g)=nP(g)Dnsuperscript𝐽𝑔subscriptproduct𝑛superscript𝑃𝑔subscript𝐷𝑛J^{(g)}=\prod_{n\in P^{(g)}}D_{n}italic_J start_POSTSUPERSCRIPT ( italic_g ) end_POSTSUPERSCRIPT = ∏ start_POSTSUBSCRIPT italic_n ∈ italic_P start_POSTSUPERSCRIPT ( italic_g ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT.

Given a tensor 𝒳𝐑D1(1)××DN(1)×D1(2)××DN(G)𝒳superscript𝐑subscriptsuperscript𝐷11subscriptsuperscript𝐷1𝑁subscriptsuperscript𝐷21subscriptsuperscript𝐷𝐺𝑁\mathcal{X}\in\mathbf{R}^{D^{(1)}_{1}\times\cdots\times D^{(1)}_{N}\times D^{(% 2)}_{1}\times\cdots\times D^{(G)}_{N}}caligraphic_X ∈ bold_R start_POSTSUPERSCRIPT italic_D start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT × italic_D start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, we partition the modes into G𝐺Gitalic_G, P(g)={gN+1,,g(N+1)}superscript𝑃𝑔𝑔𝑁1𝑔𝑁1P^{(g)}=\{gN+1,\cdots,g(N+1)\}italic_P start_POSTSUPERSCRIPT ( italic_g ) end_POSTSUPERSCRIPT = { italic_g italic_N + 1 , ⋯ , italic_g ( italic_N + 1 ) }. The element is given by re(𝒳)i(1),,i(G)(P(1),,P(G))=𝒳d1(1),,dN(1),d1(2),,dN(G)𝑟𝑒subscriptsuperscript𝒳superscript𝑃1superscript𝑃𝐺superscript𝑖1superscript𝑖𝐺subscript𝒳subscriptsuperscript𝑑11subscriptsuperscript𝑑1𝑁subscriptsuperscript𝑑21subscriptsuperscript𝑑𝐺𝑁re(\mathcal{X})^{(P^{(1)},\dots,P^{(G)})}_{i^{(1)},\dots,i^{(G)}}=\mathcal{X}_% {d^{(1)}_{1},\dots,d^{(1)}_{N},d^{(2)}_{1},\dots,d^{(G)}_{N}}italic_r italic_e ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_P start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_P start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_i start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT = caligraphic_X start_POSTSUBSCRIPT italic_d start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_d start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_d start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_d start_POSTSUPERSCRIPT ( italic_G ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT, where i(1)=1+g=1N(dg(1)1)n=1g1Dn(1)superscript𝑖11superscriptsubscript𝑔1𝑁subscriptsuperscript𝑑1𝑔1superscriptsubscriptproduct𝑛1𝑔1subscriptsuperscript𝐷1𝑛i^{(1)}=1+\sum_{g=1}^{N}(d^{(1)}_{g}-1)\prod_{n=1}^{g-1}D^{(1)}_{n}italic_i start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT = 1 + ∑ start_POSTSUBSCRIPT italic_g = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_d start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT - 1 ) ∏ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g - 1 end_POSTSUPERSCRIPT italic_D start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT.

Special cases of reordering are vectorization and matricization. Vectorization happens when G=1𝐺1G=1italic_G = 1. vec(𝒳)=re(𝒳)({1})𝐑D𝑣𝑒𝑐𝒳𝑟𝑒superscript𝒳1superscript𝐑𝐷vec(\mathcal{X})=re(\mathcal{X})^{(\{-1\})}\in\mathbf{R}^{D}italic_v italic_e italic_c ( caligraphic_X ) = italic_r italic_e ( caligraphic_X ) start_POSTSUPERSCRIPT ( { - 1 } ) end_POSTSUPERSCRIPT ∈ bold_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT, where D=n=1NDn𝐷superscriptsubscriptproduct𝑛1𝑁subscript𝐷𝑛D=\prod_{n=1}^{N}D_{n}italic_D = ∏ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and {1}1\{-1\}{ - 1 } refers to the remaining unset modes. Mode-n matricization happens when G=2𝐺2G=2italic_G = 2 and P(1)superscript𝑃1P^{(1)}italic_P start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT is a singleton. mat(𝒳)(n)=re(𝒳)({n},{1})𝐑Dn×D(\n)𝑚𝑎𝑡superscript𝒳𝑛𝑟𝑒superscript𝒳𝑛1superscript𝐑subscript𝐷𝑛superscript𝐷\absent𝑛mat(\mathcal{X})^{(n)}=re(\mathcal{X})^{(\{n\},\{-1\})}\in\mathbf{R}^{D_{n}% \times D^{(\backslash n)}}italic_m italic_a italic_t ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT = italic_r italic_e ( caligraphic_X ) start_POSTSUPERSCRIPT ( { italic_n } , { - 1 } ) end_POSTSUPERSCRIPT ∈ bold_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT, where D(\n)=m=1(mn)NDmsuperscript𝐷\absent𝑛superscriptsubscriptproduct𝑚1𝑚𝑛𝑁subscript𝐷𝑚D^{(\backslash n)}=\prod_{m=1(m\neq n)}^{N}D_{m}italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT = ∏ start_POSTSUBSCRIPT italic_m = 1 ( italic_m ≠ italic_n ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT.

3.2. Graphical lasso

We use graphical lasso as a part of our model. Given the mode-(N+1) matricization of the (N+1)thsuperscript𝑁1𝑡(N+1)^{th}( italic_N + 1 ) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS, mat(𝒳)(N+1)T×D𝑚𝑎𝑡superscript𝒳𝑁1superscript𝑇𝐷mat(\mathcal{X})^{(N+1)}\in\mathbb{R}^{T\times D}italic_m italic_a italic_t ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_N + 1 ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_D end_POSTSUPERSCRIPT, the graphical lasso (Friedman et al., 2008) estimates the sparse Gaussian inverse covariance matrix (i.e., network) θD×D𝜃superscript𝐷𝐷\theta\in\mathbb{R}^{D\times D}italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_D end_POSTSUPERSCRIPT, also known as the precision matrix, with which we can interpret pairwise conditional independencies among D𝐷Ditalic_D variables, e.g., if θi,j=0subscript𝜃𝑖𝑗0\theta_{i,j}=0italic_θ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = 0 then variables i𝑖iitalic_i and j𝑗jitalic_j are conditionally independent given the values of all the other variables. The optimization problem is given as follows:

(1) minimizeθS++psubscriptminimize𝜃subscriptsuperscript𝑆𝑝absent\displaystyle\textrm{minimize}_{\theta\in S^{p}_{++}}minimize start_POSTSUBSCRIPT italic_θ ∈ italic_S start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT end_POSTSUBSCRIPT λθod,1t=1Tll(mat(𝒳)t,(N+1),θ),𝜆subscriptnorm𝜃𝑜𝑑1superscriptsubscript𝑡1𝑇𝑙𝑙𝑚𝑎𝑡subscriptsuperscript𝒳𝑁1𝑡𝜃\displaystyle\lambda||\theta||_{od,1}-\sum_{t=1}^{T}ll(mat(\mathcal{X})^{(N+1)% }_{t,},\theta),italic_λ | | italic_θ | | start_POSTSUBSCRIPT italic_o italic_d , 1 end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_l italic_l ( italic_m italic_a italic_t ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_N + 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , end_POSTSUBSCRIPT , italic_θ ) ,
ll(x,θ)=𝑙𝑙𝑥𝜃absent\displaystyle ll(x,\theta)=italic_l italic_l ( italic_x , italic_θ ) = 12(xμ)Tθ(xμ)12superscript𝑥𝜇𝑇𝜃𝑥𝜇\displaystyle-\frac{1}{2}(x-\mu)^{T}\theta(x-\mu)- divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_x - italic_μ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_θ ( italic_x - italic_μ )
(2) +12logdetθD2log(2π),12det𝜃𝐷22𝜋\displaystyle+\frac{1}{2}\log\textrm{det}\theta-\frac{D}{2}\log(2\pi),+ divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_log det italic_θ - divide start_ARG italic_D end_ARG start_ARG 2 end_ARG roman_log ( 2 italic_π ) ,

where θ𝜃\thetaitalic_θ must be a symmetric positive definite (S++psubscriptsuperscript𝑆𝑝absentS^{p}_{++}italic_S start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT). ll(x,θ)𝑙𝑙𝑥𝜃ll(x,\theta)italic_l italic_l ( italic_x , italic_θ ) is the log-likelihood and μ𝐑D𝜇superscript𝐑𝐷\mu\in\mathbf{R}^{D}italic_μ ∈ bold_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT is the empirical mean of mat(𝒳)(N+1)𝑚𝑎𝑡superscript𝒳𝑁1mat(\mathcal{X})^{(N+1)}italic_m italic_a italic_t ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_N + 1 ) end_POSTSUPERSCRIPT. λ0𝜆0\lambda\geq 0italic_λ ≥ 0 is a hyperparameter for determining the sparsity level of the network, and od,1\|\cdot\|_{od,1}∥ ⋅ ∥ start_POSTSUBSCRIPT italic_o italic_d , 1 end_POSTSUBSCRIPT indicates the off-diagonal 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm. Since Eq. (1) is a convex optimization problem, its solution is guaranteed to converge to the global optimum with the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) and can speed up the solution time.

3.3. Network-based tensor time series clustering

A real-world complex 𝒳𝒳\mathcal{X}caligraphic_X cannot be expressed by a single static network because it contains multiple sequence patterns, each of which has a distinct relationship/network. Moreover, we rarely know the optimal number of clusters and cluster assignments in advance. To address this issue, we want to provide an appropriate cost function and achieve subsequence clustering by minimizing the cost function. We now formulate the network-based TTS clustering problem. It assumes that T𝑇Titalic_T time steps of 𝒳𝒳\mathcal{X}caligraphic_X can be divided into m𝑚mitalic_m time segments based on K𝐾Kitalic_K networks (i.e., clusters). Let cp𝑐𝑝cpitalic_c italic_p denote a starting point set of segments, i.e., cp={cp1,cp2,,cpm}𝑐𝑝𝑐subscript𝑝1𝑐subscript𝑝2𝑐subscript𝑝𝑚cp=\{cp_{1},cp_{2},\dots,cp_{m}\}italic_c italic_p = { italic_c italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c italic_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT }, the i𝑖iitalic_i-th segment of 𝒳𝒳\mathcal{X}caligraphic_X is denoted as 𝒳cpi:cpi+1subscript𝒳:𝑐subscript𝑝𝑖𝑐subscript𝑝𝑖1\mathcal{X}_{cp_{i}:cp_{i+1}}caligraphic_X start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT where cpm+1=T+1𝑐subscript𝑝𝑚1𝑇1cp_{m+1}=T+1italic_c italic_p start_POSTSUBSCRIPT italic_m + 1 end_POSTSUBSCRIPT = italic_T + 1. We group each of the T𝑇Titalic_T points into one of the K𝐾Kitalic_K clusters denoted by a cluster assignment set ={f1,f2,,fK}subscript𝑓1subscript𝑓2subscript𝑓𝐾\mathcal{F}=\{f_{1},f_{2},\dots,f_{K}\}caligraphic_F = { italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_f start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT }, where fk{1,2,,T}subscript𝑓𝑘12𝑇f_{k}\subset\{1,2,\dots,T\}italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⊂ { 1 , 2 , … , italic_T }, and we refer to all subsequences in the cluster k𝑘kitalic_k as 𝒳[fk]𝒳𝒳delimited-[]subscript𝑓𝑘𝒳\mathcal{X}[f_{k}]\subset\mathcal{X}caligraphic_X [ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊂ caligraphic_X. Then, letting ΘΘ\Thetaroman_Θ be a model parameter set, i.e., Θ={θ1,θ2,,θK}Θsubscript𝜃1subscript𝜃2subscript𝜃𝐾\Theta=\{\theta_{1},\theta_{2},\dots,\theta_{K}\}roman_Θ = { italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_θ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT }, each θkD×Dsubscript𝜃𝑘superscript𝐷𝐷\theta_{k}\in\mathbb{R}^{D\times D}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_D end_POSTSUPERSCRIPT is a sparse Gaussian inverse covariance matrix that summarizes the relationships of variables in 𝒳[fk]𝒳delimited-[]subscript𝑓𝑘\mathcal{X}[f_{k}]caligraphic_X [ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ]. Therefore, the entire cluster parameter set is given by ={1,2,,K}subscript1subscript2subscript𝐾\mathcal{M}=\{\mathcal{M}_{1},\mathcal{M}_{2},\dots,\mathcal{M}_{K}\}caligraphic_M = { caligraphic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , caligraphic_M start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , caligraphic_M start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT }, consisting of k={θk,fk}subscript𝑘subscript𝜃𝑘subscript𝑓𝑘\mathcal{M}_{k}=\{\theta_{k},f_{k}\}caligraphic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = { italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }. Overall, the problem that we want to solve is written as follows.

Problem 1 ().

Given a tensor time series 𝒳𝒳\mathcal{X}caligraphic_X, estimate:

  • a cluster assignment set, ={fk}k=1Ksuperscriptsubscriptsubscript𝑓𝑘𝑘1𝐾\mathcal{F}=\{f_{k}\}_{k=1}^{K}caligraphic_F = { italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT

  • a model parameter set, Θ={θk}k=1KΘsuperscriptsubscriptsubscript𝜃𝑘𝑘1𝐾\Theta=\{\theta_{k}\}_{k=1}^{K}roman_Θ = { italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT

  • the number of clusters K𝐾Kitalic_K

that minimizes the cost function Eq. (10).

4. Proposed DMM

In this section, we propose a new model with which to realize network-based TTS clustering, namely, DMM. We first describe our model θ𝜃\thetaitalic_θ, and then we define the criterion for determining the cluster assignments and the number of clusters.

4.1. Multimode graphical lasso

Assume K,𝐾K,\mathcal{F}italic_K , caligraphic_F are given, here, we address how to define and infer the model θksubscript𝜃𝑘\theta_{k}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. The original graphical lasso allows θksubscript𝜃𝑘\theta_{k}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT to connect any pairs of variables in a tensor; however, it is too high-dimensional to reveal relationships separately in terms of the non-temporal modes. To avoid the over-representation, we aim to capture the multi-aspect relationships by separating θksubscript𝜃𝑘\theta_{k}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT into multimode to which we add a desired constraint for interpretability.

We assume that θ𝜃\thetaitalic_θ is derived from N𝑁Nitalic_N networks, {A(1),,A(N)}superscript𝐴1superscript𝐴𝑁\{A^{(1)},\dots,A^{(N)}\}{ italic_A start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_A start_POSTSUPERSCRIPT ( italic_N ) end_POSTSUPERSCRIPT }, where A(n)𝐑Dn×Dnsuperscript𝐴𝑛superscript𝐑subscript𝐷𝑛subscript𝐷𝑛A^{(n)}\in\mathbf{R}^{D_{n}\times D_{n}}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ bold_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the n𝑛nitalic_n-th network. For example, an element ai,j(n)A(n)subscriptsuperscript𝑎𝑛𝑖𝑗superscript𝐴𝑛a^{(n)}_{i,j}\in A^{(n)}italic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT refers to the relationship between the i𝑖iitalic_i-th and j𝑗jitalic_j-th variables of mode-n, In each network, the goal is to capture the dependencies between Dnsubscript𝐷𝑛D_{n}italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT variables. We also assume that there are no relationships except among variables that differ only at mode-n. Thus, θ=θ(N)𝜃superscript𝜃𝑁\theta=\theta^{(N)}italic_θ = italic_θ start_POSTSUPERSCRIPT ( italic_N ) end_POSTSUPERSCRIPT becomes an Nthsuperscript𝑁𝑡N^{th}italic_N start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT hierarchical matrix of shape D×D𝐷𝐷D\times Ditalic_D × italic_D. θ(n)superscript𝜃𝑛\theta^{(n)}italic_θ start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT can be written as follows:

θ(n)=(θ(n1)C1,2(n)C1,Dn(n)C2,1(n)θ(n1)C3,1(n)C3,2(n)CDn2,Dn1(n)CDn2,Dn(n)θ(n1)CDn1,Dn(n)CDn,1(n)CDn,Dn1(n)θ(n1)),superscript𝜃𝑛matrixsuperscript𝜃𝑛1subscriptsuperscript𝐶𝑛12subscriptsuperscript𝐶𝑛1subscript𝐷𝑛subscriptsuperscript𝐶𝑛21superscript𝜃𝑛1missing-subexpressionsubscriptsuperscript𝐶𝑛31subscriptsuperscript𝐶𝑛32subscriptsuperscript𝐶𝑛subscript𝐷𝑛2subscript𝐷𝑛1subscriptsuperscript𝐶𝑛subscript𝐷𝑛2subscript𝐷𝑛missing-subexpressionsuperscript𝜃𝑛1subscriptsuperscript𝐶𝑛subscript𝐷𝑛1subscript𝐷𝑛subscriptsuperscript𝐶𝑛subscript𝐷𝑛1subscriptsuperscript𝐶𝑛subscript𝐷𝑛subscript𝐷𝑛1superscript𝜃𝑛1\displaystyle\theta^{(n)}=\begin{pmatrix}\theta^{(n-1)}&C^{(n)}_{1,2}&\cdots&% \cdots&C^{(n)}_{1,D_{n}}\\ C^{(n)}_{2,1}&\theta^{(n-1)}&\cdots&&\vdots\\ C^{(n)}_{3,1}&C^{(n)}_{3,2}&\cdots&\ddots&\vdots\\ \vdots&\ddots&\cdots&C^{(n)}_{D_{n}-2,D_{n}-1}&C^{(n)}_{D_{n}-2,D_{n}}\\ \vdots&&\cdots&\theta^{(n-1)}&C^{(n)}_{D_{n}-1,D_{n}}\\ C^{(n)}_{D_{n},1}&\ddots&\cdots&C^{(n)}_{D_{n},D_{n}-1}&\theta^{(n-1)}\\ \end{pmatrix},italic_θ start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT = ( start_ARG start_ROW start_CELL italic_θ start_POSTSUPERSCRIPT ( italic_n - 1 ) end_POSTSUPERSCRIPT end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 , 2 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL ⋯ end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 , 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_θ start_POSTSUPERSCRIPT ( italic_n - 1 ) end_POSTSUPERSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 , 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 , 2 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋯ end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 2 , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 2 , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL end_CELL start_CELL ⋯ end_CELL start_CELL italic_θ start_POSTSUPERSCRIPT ( italic_n - 1 ) end_POSTSUPERSCRIPT end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 1 , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , 1 end_POSTSUBSCRIPT end_CELL start_CELL ⋱ end_CELL start_CELL ⋯ end_CELL start_CELL italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_θ start_POSTSUPERSCRIPT ( italic_n - 1 ) end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) ,

where θ(1)=A(1)superscript𝜃1superscript𝐴1\theta^{(1)}=A^{(1)}italic_θ start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT = italic_A start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT and Ci,j(n)m=1n1Dm×m=1n1Dmsubscriptsuperscript𝐶𝑛𝑖𝑗superscriptsuperscriptsubscriptproduct𝑚1𝑛1subscript𝐷𝑚superscriptsubscriptproduct𝑚1𝑛1subscript𝐷𝑚C^{(n)}_{i,j}\in\mathbb{R}^{\prod_{m=1}^{n-1}D_{m}\times\prod_{m=1}^{n-1}D_{m}}italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is a diagonal matrix whose diagonal element is ai,j(n)A(n)subscriptsuperscript𝑎𝑛𝑖𝑗superscript𝐴𝑛a^{(n)}_{i,j}\in A^{(n)}italic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT, i.e., Ci,j(n)=ai,j(n)δi.jsubscriptsuperscript𝐶𝑛𝑖𝑗subscriptsuperscript𝑎𝑛𝑖𝑗subscript𝛿formulae-sequence𝑖𝑗C^{(n)}_{i,j}=a^{(n)}_{i,j}\cdot\delta_{i.j}italic_C start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = italic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ⋅ italic_δ start_POSTSUBSCRIPT italic_i . italic_j end_POSTSUBSCRIPT allows edges that differ only at mode-n, where δi.jsubscript𝛿formulae-sequence𝑖𝑗\delta_{i.j}italic_δ start_POSTSUBSCRIPT italic_i . italic_j end_POSTSUBSCRIPT is the Kronecker delta.

We extend graphical lasso to obtain θ𝜃\thetaitalic_θ by inferring a sparse A(n)superscript𝐴𝑛A^{(n)}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT from a TTS. The optimization problem is written as follows:

minimizeA(n)S++psubscriptminimizesuperscript𝐴𝑛subscriptsuperscript𝑆𝑝absent\displaystyle\textrm{minimize}_{A^{(n)}\in S^{p}_{++}}minimize start_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ italic_S start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT end_POSTSUBSCRIPT λA(n)od,1𝜆subscriptnormsuperscript𝐴𝑛𝑜𝑑1\displaystyle\lambda||A^{(n)}||_{od,1}italic_λ | | italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT | | start_POSTSUBSCRIPT italic_o italic_d , 1 end_POSTSUBSCRIPT
(8) tTlln(re(𝒳)t,:,:({N+1},{1},{n}),A(n)),superscriptsubscript𝑡𝑇𝑙subscript𝑙𝑛𝑟𝑒subscriptsuperscript𝒳𝑁11𝑛𝑡::superscript𝐴𝑛\displaystyle-\sum_{t}^{T}ll_{n}(re(\mathcal{X})^{(\{N+1\},\{-1\},\{n\})}_{t,:% ,:},A^{(n)}),- ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_l italic_l start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_r italic_e ( caligraphic_X ) start_POSTSUPERSCRIPT ( { italic_N + 1 } , { - 1 } , { italic_n } ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , : , : end_POSTSUBSCRIPT , italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ) ,
lln(re(𝒳)t,:,:,A(n))=d=1D(\n)𝑙subscript𝑙𝑛𝑟𝑒subscript𝒳𝑡::superscript𝐴𝑛superscriptsubscript𝑑1superscript𝐷\absent𝑛\displaystyle ll_{n}(re(\mathcal{X})_{t,:,:},A^{(n)})=\sum_{d=1}^{D^{(% \backslash n)}}italic_l italic_l start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_r italic_e ( caligraphic_X ) start_POSTSUBSCRIPT italic_t , : , : end_POSTSUBSCRIPT , italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_d = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT {12(re(𝒳)t,d,:μd)TA(n)(re(𝒳)t,d,:μd)\displaystyle\{-\frac{1}{2}(re(\mathcal{X})_{t,d,:}-\mu_{d})^{T}A^{(n)}(re(% \mathcal{X})_{t,d,:}-\mu_{d}){ - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_r italic_e ( caligraphic_X ) start_POSTSUBSCRIPT italic_t , italic_d , : end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ( italic_r italic_e ( caligraphic_X ) start_POSTSUBSCRIPT italic_t , italic_d , : end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT )
(9) +12logdetA(n)Dn2log(2π)}/D(\n),\displaystyle+\frac{1}{2}\log\mathrm{det}A^{(n)}-\frac{D_{n}}{2}\log(2\pi)\}/D% ^{(\backslash n)},+ divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_log roman_det italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT - divide start_ARG italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG roman_log ( 2 italic_π ) } / italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT ,

where μdDnsubscript𝜇𝑑superscriptsubscript𝐷𝑛\mu_{d}\in\mathbb{R}^{D_{n}}italic_μ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the empirical mean of the variable re(𝒳):,d,:T×Dn𝑟𝑒subscript𝒳:𝑑:superscript𝑇subscript𝐷𝑛re(\mathcal{X})_{:,d,:}\in\mathbb{R}^{T\times D_{n}}italic_r italic_e ( caligraphic_X ) start_POSTSUBSCRIPT : , italic_d , : end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. Eq. (8) is a convex optimization problem solved by ADMM. We divide the log-likelihood by D(\n)superscript𝐷\absent𝑛D^{(\backslash n)}italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT to scale the sample size.

4.2. Data compression

To determine the cluster assignment set \mathcal{F}caligraphic_F and the number of clusters K𝐾Kitalic_K, we use the MDL principle (Grünwald, 2007), which follows the assumption that the more we compress the data, the more we generalize its underlying structures. The goodness of the model \mathcal{M}caligraphic_M can be described with the following total description cost:

CostT(𝒳;)=𝐶𝑜𝑠subscript𝑡𝑇𝒳absent\displaystyle Cost_{T}(\mathcal{X};\mathcal{M})\ =italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; caligraphic_M ) = CostA()+CostM(Θ)+𝐶𝑜𝑠subscript𝑡𝐴limit-from𝐶𝑜𝑠subscript𝑡𝑀Θ\displaystyle Cost_{A}(\mathcal{F})+Cost_{M}(\Theta)+italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( caligraphic_F ) + italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( roman_Θ ) +
(10) CostC(𝒳|)+Cost1(Θ).𝐶𝑜𝑠subscript𝑡𝐶conditional𝒳𝐶𝑜𝑠subscript𝑡subscript1Θ\displaystyle Cost_{C}(\mathcal{X}|\mathcal{M})+Cost_{\ell_{1}}(\Theta).italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( caligraphic_X | caligraphic_M ) + italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_Θ ) .

We describe the four terms that appear in Eq. (10).

Coding length cost. CostA()𝐶𝑜𝑠subscript𝑡𝐴Cost_{A}(\mathcal{F})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( caligraphic_F ) is the description complexity of the cluster assignment set \mathcal{F}caligraphic_F, which consists of the following elements: the number of clusters K𝐾Kitalic_K and segments m𝑚mitalic_m require log*(K)+log*(m)superscript𝐾superscript𝑚\log^{*}(K)+\log^{*}(m)roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_K ) + roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_m ). 222Here, log*superscript\log^{*}roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT is the universal code length for integers. The assignments of the segments to clusters require m×log*(K)𝑚superscript𝐾m\times\log^{*}(K)italic_m × roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_K ). The number of observations of each cluster requires k=1Klog*(|fk|)superscriptsubscript𝑘1𝐾superscriptsubscript𝑓𝑘\sum_{k=1}^{K}\log^{*}(|f_{k}|)∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( | italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | ).

CostA()=𝐶𝑜𝑠subscript𝑡𝐴absent\displaystyle Cost_{A}(\mathcal{F})=italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( caligraphic_F ) = log*(K)+log*(m)+superscript𝐾limit-fromsuperscript𝑚\displaystyle\log^{*}(K)+\log^{*}(m)+roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_K ) + roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_m ) +
(11) m×log*(K)+k=1Klog*(|fk|).𝑚superscript𝐾superscriptsubscript𝑘1𝐾superscriptsubscript𝑓𝑘\displaystyle m\times\log^{*}(K)+\sum_{k=1}^{K}\log^{*}(|f_{k}|).italic_m × roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_K ) + ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( | italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | ) .

Model coding cost. CostM(Θ)𝐶𝑜𝑠subscript𝑡𝑀ΘCost_{M}(\Theta)italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( roman_Θ ) is the description complexity of the model parameter set ΘΘ\Thetaroman_Θ, which consists of the following elements: the diagonal values of each cluster at each hierarchy, which has sizes Dn×1subscript𝐷𝑛1D_{n}\times 1italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × 1, require Dn(log(Dn)+cF)subscript𝐷𝑛subscript𝐷𝑛subscript𝑐𝐹D_{n}(\log(D_{n})+c_{F})italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( roman_log ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) + italic_c start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ), where cFsubscript𝑐𝐹c_{F}italic_c start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT is the floating point cost. 333We used 4×8484\times 84 × 8 bits in our setting. The positive values of A(n)𝐑Dn×Dnsuperscript𝐴𝑛superscript𝐑subscript𝐷𝑛subscript𝐷𝑛A^{(n)}\in\mathbf{R}^{D_{n}\times D_{n}}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ bold_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT require |Ak(n)|0(log(Dn(Dn1)/2)+cF)subscriptsubscriptsuperscript𝐴𝑛𝑘absent0subscript𝐷𝑛subscript𝐷𝑛12subscript𝑐𝐹|A^{(n)}_{k}|_{\neq 0}(\log(D_{n}(D_{n}-1)/2)+c_{F})| italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ≠ 0 end_POSTSUBSCRIPT ( roman_log ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 1 ) / 2 ) + italic_c start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ), where ||0|\cdot|_{\neq 0}| ⋅ | start_POSTSUBSCRIPT ≠ 0 end_POSTSUBSCRIPT describes the number of non-zero elements in a matrix.

CostM(Θ)=𝐶𝑜𝑠subscript𝑡𝑀Θabsent\displaystyle Cost_{M}(\Theta)=italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( roman_Θ ) = k=1Kn=1N{Dn(log(Dn)+cF)+log*(|Ak(n)|0)+\displaystyle\sum_{k=1}^{K}\sum_{n=1}^{N}\{D_{n}(\log(D_{n})+c_{F})+\log^{*}(|% A^{(n)}_{k}|_{\neq 0})+∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT { italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( roman_log ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) + italic_c start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ) + roman_log start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( | italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ≠ 0 end_POSTSUBSCRIPT ) +
(12) |Ak(n)|0(log(Dn(Dn1)/2)+cF)}/(Dn2N).\displaystyle|A^{(n)}_{k}|_{\neq 0}(\log(D_{n}(D_{n}-1)/2)+c_{F})\}/(D_{n}^{2}% N).| italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | start_POSTSUBSCRIPT ≠ 0 end_POSTSUBSCRIPT ( roman_log ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - 1 ) / 2 ) + italic_c start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ) } / ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_N ) .

We divide by Dn2Nsuperscriptsubscript𝐷𝑛2𝑁D_{n}^{2}Nitalic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_N to deal with the change of data scale.

Data coding cost. CostC(𝒳|)𝐶𝑜𝑠subscript𝑡𝐶conditional𝒳Cost_{C}(\mathcal{X}|\mathcal{M})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( caligraphic_X | caligraphic_M ) is the data encoding cost of 𝒳𝒳\mathcal{X}caligraphic_X given the cluster parameter set \mathcal{M}caligraphic_M. Huffman coding (Böhm et al., 2007) uses the logarithm of the inverse of probability (i.e., the negative log-likelihood) of the values.

(13) CostC(𝒳|)=k=1Kn=1Ntfklln(re(𝒳)t,:,:({N+1},{1},{n}),Ak(n)).𝐶𝑜𝑠subscript𝑡𝐶conditional𝒳superscriptsubscript𝑘1𝐾superscriptsubscript𝑛1𝑁subscript𝑡subscript𝑓𝑘𝑙subscript𝑙𝑛𝑟𝑒subscriptsuperscript𝒳𝑁11𝑛𝑡::subscriptsuperscript𝐴𝑛𝑘\displaystyle Cost_{C}(\mathcal{X}|\mathcal{M})=\sum_{k=1}^{K}\sum_{n=1}^{N}% \sum_{t\in f_{k}}ll_{n}(re(\mathcal{X})^{(\{N+1\},\{-1\},\{n\})}_{t,:,:},A^{(n% )}_{k}).italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( caligraphic_X | caligraphic_M ) = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_t ∈ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_l italic_l start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_r italic_e ( caligraphic_X ) start_POSTSUPERSCRIPT ( { italic_N + 1 } , { - 1 } , { italic_n } ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , : , : end_POSTSUBSCRIPT , italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) .

1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm cost. Cost1(Θ)𝐶𝑜𝑠subscript𝑡subscript1ΘCost_{\ell_{1}}(\Theta)italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_Θ ) is the 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm cost given a model ΘΘ\Thetaroman_Θ.

(14) Cost1(Θ)=k=1Kn=1N𝐶𝑜𝑠subscript𝑡subscript1Θsuperscriptsubscript𝑘1𝐾superscriptsubscript𝑛1𝑁\displaystyle Cost_{\ell_{1}}(\Theta)=\sum_{k=1}^{K}\sum_{n=1}^{N}italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_Θ ) = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT λAk(n)od,1.𝜆subscriptnormsubscriptsuperscript𝐴𝑛𝑘𝑜𝑑1\displaystyle\lambda||A^{(n)}_{k}||_{od,1}.italic_λ | | italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT italic_o italic_d , 1 end_POSTSUBSCRIPT .

Discovering an optimal sparse parameter λ𝜆\lambdaitalic_λ capable of modeling data is a challenge as it affects clustering results. However, the parameter value can be determined by using MDL to choose the minimum total cost (Miyaguchi et al., 2017).

Our next goal is to find the best cluster parameter set \mathcal{M}caligraphic_M that minimizes the total description cost Eq. (10).

5. Optimization algorithms

Algorithm 1 DMM(𝒳,𝐰)𝒳𝐰(\mathcal{X},\mathbf{w})( caligraphic_X , bold_w )
1:  Input: (N+1)thsuperscript𝑁1𝑡(N+1)^{th}( italic_N + 1 ) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS 𝒳𝒳\mathcal{X}caligraphic_X and initial segment sizes set 𝐰𝐰\mathbf{w}bold_w
2:  Output: Cluster parameters ΘΘ\Thetaroman_Θ and cluster assignments \mathcal{F}caligraphic_F
3:  Initialize cp𝑐𝑝cpitalic_c italic_p with 𝐰𝐰\mathbf{w}bold_w;
4:  cp=𝑐𝑝absentcp=italic_c italic_p = CutPointDetector(𝒳,cp)𝒳𝑐𝑝(\mathcal{X},cp)( caligraphic_X , italic_c italic_p );   /* Finds the best cut point set */
5:  /* ClusterDetector */
6:  K=1𝐾1K=1italic_K = 1;   Initialize Θ={θ1}Θsubscript𝜃1\Theta=\{\theta_{1}\}roman_Θ = { italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT };   ={{1,,T}}1𝑇\mathcal{F}=\{\{1,\dots,T\}\}caligraphic_F = { { 1 , … , italic_T } };
7:  Compute CostT(𝒳;{Θ,})𝐶𝑜𝑠subscript𝑡𝑇𝒳ΘCost_{T}(\mathcal{X};\{\Theta,\mathcal{F}\})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ , caligraphic_F } );
8:  repeat
9:     K=K+1𝐾𝐾1K=K+1italic_K = italic_K + 1;   Initialize ΘΘ\Thetaroman_Θ for K𝐾Kitalic_K clusters;
10:     repeat
11:        =absent\mathcal{F}=caligraphic_F = SegmentAssignment(𝒳,Θ,cp)𝒳normal-Θ𝑐𝑝(\mathcal{X},\Theta,cp)( caligraphic_X , roman_Θ , italic_c italic_p );   /* E-step */
12:        Θ=Θabsent\Theta=roman_Θ = NetworkInference(𝒳,)𝒳(\mathcal{X},\mathcal{F})( caligraphic_X , caligraphic_F );   /* M-step */
13:     until \mathcal{F}caligraphic_F is stable;
14:     Compute CostT(𝒳;{Θ,})𝐶𝑜𝑠subscript𝑡𝑇𝒳ΘCost_{T}(\mathcal{X};\{\Theta,\mathcal{F}\})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ , caligraphic_F } );
15:  until CostT(𝒳;{Θ,})𝐶𝑜𝑠subscript𝑡𝑇𝒳ΘCost_{T}(\mathcal{X};\{\Theta,\mathcal{F}\})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ , caligraphic_F } ) converges;
16:  return  ={Θ,}Θ\mathcal{M}=\{\Theta,\mathcal{F}\}caligraphic_M = { roman_Θ , caligraphic_F };

Thus far, we have described our model based on graphical lasso and a criterion based on MDL. The most important question is how to discover good segmentation and clustering. Here, we propose an effective and scalable algorithm, which finds the local optimal of Eq. (10). The overall procedure is summarized in Alg. 1. Given an (N+1)thsuperscript𝑁1𝑡(N+1)^{th}( italic_N + 1 ) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS 𝒳𝒳\mathcal{X}caligraphic_X, the total description cost Eq. (10) is minimized using the following two sub-algorithms.

  1. (1)

    CutPointDetector: finds the number of segments m𝑚mitalic_m and their cut points, i.e., the best cut point set cp𝑐𝑝cpitalic_c italic_p of 𝒳𝒳\mathcal{X}caligraphic_X.

  2. (2)

    ClusterDetector: finds the number of clusters K𝐾Kitalic_K and the cluster parameter set \mathcal{M}caligraphic_M.

5.1. CutPointDetector

The first goal is to divide a given 𝒳𝒳\mathcal{X}caligraphic_X into m𝑚mitalic_m segments (i.e., patterns), but we assume that no information is known about them in advance. Therefore, to prevent a pattern explosion when searching for their optimal cut points, we introduce CutPointDetector based on the divide-and-conquer method (Keogh et al., 2001).

Specifically, it recursively merges a small segment set of 𝒳𝒳\mathcal{X}caligraphic_X while reducing its total description cost, because neighboring subsequences typically exhibit the same pattern. We define 𝐰𝐰\mathbf{w}bold_w as a set of user-defined initial segment sizes, i.e., 𝐰={wi}i=1m𝐰superscriptsubscriptsubscript𝑤𝑖𝑖1𝑚\mathbf{w}=\{w_{i}\}_{i=1}^{m}bold_w = { italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, such as the number of days in each month or any small constant. An example illustration is shown in Fig. 2. Let θi:i+1subscript𝜃:𝑖𝑖1\theta_{i:i+1}italic_θ start_POSTSUBSCRIPT italic_i : italic_i + 1 end_POSTSUBSCRIPT be a model of 𝒳{cpi:cpi+1}𝒳conditional-set𝑐subscript𝑝𝑖𝑐subscript𝑝𝑖1\mathcal{X}\{cp_{i}:cp_{i+1}\}caligraphic_X { italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT } at the ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT segment. Given the three subsequent segments illustrated in Fig. 2 (a), we evaluate whether to merge the middle segment with either of the side segments (Fig. 2 (b)(c)). The total description cost for Fig. 2 (a) is given by CostT(𝒳;{θi:i+1,θi+1:i+2,θi+2:i+3})𝐶𝑜𝑠subscript𝑡𝑇𝒳subscript𝜃:𝑖𝑖1subscript𝜃:𝑖1𝑖2subscript𝜃:𝑖2𝑖3Cost_{T}(\mathcal{X};\{\theta_{i:i+1},\theta_{i+1:i+2},\theta_{i+2:i+3}\})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { italic_θ start_POSTSUBSCRIPT italic_i : italic_i + 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_i + 1 : italic_i + 2 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_i + 2 : italic_i + 3 end_POSTSUBSCRIPT } ), where we omit the cluster assignment (e.g., {j}j=cpicpi+11}\{j\}_{j=cp_{i}}^{cp_{i+1}-1}\}{ italic_j } start_POSTSUBSCRIPT italic_j = italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT }) from the cost for clarity. If the cost for the original three segments is reduced by merging, it eliminates the unnecessary cut point and employs a new model θ𝜃\thetaitalic_θ for the merged segment. By repeating this procedure for each segment, m𝑚mitalic_m decreases monotonically until convergence. See Appendix B.1 for the detailed procedure.

Refer to caption

(a) Original cp

(b) Left merge

(c) Right merge

Figure 2. Illustration of the three candidates. We compare the total description cost of each of these candidates.

5.2. ClusterDetector

DMM searches for the best number of clusters by increasing K=1,2,,m𝐾12𝑚K=1,2,\dots,mitalic_K = 1 , 2 , … , italic_m, while the total description cost CostT(𝒳;)𝐶𝑜𝑠subscript𝑡𝑇𝒳Cost_{T}(\mathcal{X};\mathcal{M})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; caligraphic_M ) is decreasing. To compute the cost, however, we must solve two problems, namely obtain the cluster assignment set \mathcal{F}caligraphic_F and the model parameter set ΘΘ\Thetaroman_Θ, either of which affects the optimization of the other. Therefore, we design ClusterDetector with the expectation and maximization (EM) algorithm. In the E-step, it determines \mathcal{F}caligraphic_F to minimize the data coding cost, CostC(𝒳|)𝐶𝑜𝑠subscript𝑡𝐶conditional𝒳Cost_{C}(\mathcal{X}|\mathcal{M})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( caligraphic_X | caligraphic_M ), which is achieved by solving:

(15) argmink{1,,K}CostC(𝒳|{θk,{j}j=cpicpi+11}),subscriptargmin𝑘1𝐾𝐶𝑜𝑠subscript𝑡𝐶conditional𝒳subscript𝜃𝑘superscriptsubscript𝑗𝑗𝑐subscript𝑝𝑖𝑐subscript𝑝𝑖11\displaystyle\mathop{\rm arg~{}min}\limits_{k\in\{1,\dots,K\}}Cost_{C}(% \mathcal{X}|\{\theta_{k},\{j\}_{j=cp_{i}}^{cp_{i+1}-1}\}),start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT italic_k ∈ { 1 , … , italic_K } end_POSTSUBSCRIPT italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( caligraphic_X | { italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , { italic_j } start_POSTSUBSCRIPT italic_j = italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT } ) ,

for the i𝑖iitalic_i-th segment, and then inserts time points from cpi𝑐subscript𝑝𝑖cp_{i}italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to cpi+1𝑐subscript𝑝𝑖1cp_{i+1}italic_c italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT (i.e., {j}j=cpicpi+11superscriptsubscript𝑗𝑗𝑐subscript𝑝𝑖𝑐subscript𝑝𝑖11\{j\}_{j=cp_{i}}^{cp_{i+1}-1}{ italic_j } start_POSTSUBSCRIPT italic_j = italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT) to the best k𝑘kitalic_k-th cluster fksubscript𝑓𝑘f_{k}\in\mathcal{F}italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ caligraphic_F. In the M-step, for 1kK1𝑘𝐾1\leq k\leq K1 ≤ italic_k ≤ italic_K the algorithm infers Ak(n)(1nN)subscriptsuperscript𝐴𝑛𝑘1𝑛𝑁A^{(n)}_{k}(1\leq n\leq N)italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( 1 ≤ italic_n ≤ italic_N ) according to Eq. (8) to obtain θkΘsubscript𝜃𝑘Θ\theta_{k}\in\Thetaitalic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ roman_Θ for a given 𝒳[fk]𝒳delimited-[]subscript𝑓𝑘\mathcal{X}[f_{k}]caligraphic_X [ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ]. Note that ClusterDetector starts by randomly initializing ΘΘ\Thetaroman_Θ.

Theoretical analysis.

Lemma 0 ().

The time complexity of DMM is O(Tm=1NDm)𝑂𝑇superscriptsubscriptproduct𝑚1𝑁subscript𝐷𝑚O(T\prod_{m=1}^{N}D_{m})italic_O ( italic_T ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ), where T𝑇Titalic_T is the data length, and Dmsubscript𝐷𝑚D_{m}italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is the number of variables at mode-m in (normal-(((N+1)th)^{th}) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS 𝒳D1××DN×T𝒳superscriptsubscript𝐷1normal-⋯subscript𝐷𝑁𝑇\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}caligraphic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT × italic_T end_POSTSUPERSCRIPT.

Proof.

Please see Appendix B.2. ∎

6. Experiments

In this section, we demonstrate the effectiveness of DMM on synthetic data. We use synthetic data because there are clear ground truth networks with which to test the clustering accuracy.

6.1. Experimental setting

6.1.1. Synthetic datasets

We randomly generate synthetic ((((N+1)th)^{th}) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS, 𝒳D1××DN×T𝒳superscriptsubscript𝐷1subscript𝐷𝑁𝑇\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}caligraphic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT × italic_T end_POSTSUPERSCRIPT, which follows a multivariate normal distribution vec(𝒳t)𝒩(0,θ1)similar-to𝑣𝑒𝑐subscript𝒳𝑡𝒩0superscript𝜃1vec(\mathcal{X}_{t})\sim\mathcal{N}(0,\theta^{-1})italic_v italic_e italic_c ( caligraphic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∼ caligraphic_N ( 0 , italic_θ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ). Each of the K𝐾Kitalic_K clusters has a mean of 00\vec{0}over→ start_ARG 0 end_ARG, so that the clustering results are based entirely on the structure of the data. For each cluster, we generate a random ground truth inverse covariance matrix θ𝜃\thetaitalic_θ as follows (Mohan et al., 2014; Hallac et al., 2017b):

  1. (1)

    For n=1,N𝑛1𝑁n=1,\dots Nitalic_n = 1 , … italic_N, set A(n)Dn×Dnsuperscript𝐴𝑛superscriptsubscript𝐷𝑛subscript𝐷𝑛A^{(n)}\in\mathbb{R}^{D_{n}\times D_{n}}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT equal to the adjacency matrix of an Erdős-Rényi directed random graph, where every edge has a 20%percent2020\%20 % chance of being selected.

  2. (2)

    For every selected edge in A(n)superscript𝐴𝑛A^{(n)}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT, set ai,j(n)similar-tosubscriptsuperscript𝑎𝑛𝑖𝑗absenta^{(n)}_{i,j}\simitalic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∼ Uniform([0.6,0.3][0.3,0.6])0.60.30.30.6([-0.6,-0.3]\cup[0.3,0.6])( [ - 0.6 , - 0.3 ] ∪ [ 0.3 , 0.6 ] ). We enforce a symmetry constraint whereby every ai,j(n)=aj,i(n)subscriptsuperscript𝑎𝑛𝑖𝑗subscriptsuperscript𝑎𝑛𝑗𝑖a^{(n)}_{i,j}=a^{(n)}_{j,i}italic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = italic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT.

  3. (3)

    Construct a hierarchical matrix θtemD×Dsubscript𝜃𝑡𝑒𝑚superscript𝐷𝐷\theta_{tem}\in\mathbb{R}^{D\times D}italic_θ start_POSTSUBSCRIPT italic_t italic_e italic_m end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_D end_POSTSUPERSCRIPT using {A(n)}n=1Nsuperscriptsubscriptsuperscript𝐴𝑛𝑛1𝑁\{A^{(n)}\}_{n=1}^{N}{ italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT.

  4. (4)

    Let c𝑐citalic_c be the smallest eigenvalue of θtemsubscript𝜃𝑡𝑒𝑚\theta_{tem}italic_θ start_POSTSUBSCRIPT italic_t italic_e italic_m end_POSTSUBSCRIPT, and set θ=θtem+(0.1+|c|)I𝜃subscript𝜃𝑡𝑒𝑚0.1𝑐𝐼\theta=\theta_{tem}+(0.1+|c|)Iitalic_θ = italic_θ start_POSTSUBSCRIPT italic_t italic_e italic_m end_POSTSUBSCRIPT + ( 0.1 + | italic_c | ) italic_I, where I𝐼Iitalic_I is an identity matrix. This ensures that θ𝜃\thetaitalic_θ is invertible.

6.1.2. Evaluation metrics

We run our experiments on four different temporal sequences: 𝖠𝖠\mathsf{A}sansserif_A: “1,2,1”, 𝖡𝖡\mathsf{B}sansserif_B: “1,2,3,2,1”, 𝖢𝖢\mathsf{C}sansserif_C: “1,2,3,4,1,2,3,4”, 𝖣𝖣\mathsf{D}sansserif_D: “1,2,2,1,3,3,3,1”, (for example, 𝖠𝖠\mathsf{A}sansserif_A consists of three segments and two clusters θ1subscript𝜃1\theta_{1}italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and θ2subscript𝜃2\theta_{2}italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.) We set each cluster in each example to have 100G100𝐺100G100 italic_G observations, where G𝐺Gitalic_G is the number of segments in each cluster (e.g., 𝖠𝖠\mathsf{A}sansserif_A has T=300𝑇300T=300italic_T = 300), and cut points are set randomly. We generate each dataset ten times and report the mean of the macro-F1 score.

6.1.3. Baselines

We compare our method with the following two state-of-the-art methods for time series clustering using the graphical lasso as their model.

  • TAGM (Tozzo et al., 2021): combines HMM with a graphical lasso by modeling each cluster as a graphical lasso and assuming clusters as hidden states of HMM.

  • TICC (Hallac et al., 2017b): uses the Toeplitz matrix to capture lag correlations and inter-variable correlations and penalizes changing clusters to assign the neighboring segments to the same cluster.

We do not compare with other clustering methods that ignore the network, such as K-means and DTW, because they do not show good results (Hallac et al., 2017b).

6.1.4. Parameter tuning

DMM and the baselines require a sparsity parameter for 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm. We varied λ={0.5,1,2,4}𝜆0.5124\lambda=\{0.5,1,2,4\}italic_λ = { 0.5 , 1 , 2 , 4 } and set λ=4𝜆4\lambda=4italic_λ = 4 for DMM and λ=0.5𝜆0.5\lambda=0.5italic_λ = 0.5 for the baselines, which produces the best results. A matricization of tensor mat(𝒳)(N+1)T×D𝑚𝑎𝑡superscript𝒳𝑁1superscript𝑇𝐷mat(\mathcal{X})^{(N+1)}\in\mathbb{R}^{T\times D}italic_m italic_a italic_t ( caligraphic_X ) start_POSTSUPERSCRIPT ( italic_N + 1 ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_D end_POSTSUPERSCRIPT and the true number of clusters are given to the baselines since the number of clusters need to be set. To tune TICC, we varied the regularization parameter β={4,16,64,256}𝛽41664256\beta=\{4,16,64,256\}italic_β = { 4 , 16 , 64 , 256 } and set β=16𝛽16\beta=16italic_β = 16, and set the window size w=1𝑤1w=1italic_w = 1, which is the correct assumption considering the data generation process. DMM requires us to specify 𝐰𝐰\mathbf{w}bold_w. We use the same wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (s.t., i=1,,m𝑖1𝑚i=1,\dots,mitalic_i = 1 , … , italic_m) for all initial segments, and we set wi=4subscript𝑤𝑖4w_{i}=4italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 4.

6.2. Results

6.2.1. Clustering accuracy

We take four different temporal sequences 𝖠𝖠\mathsf{A}sansserif_A similar-to\sim 𝖣𝖣\mathsf{D}sansserif_D, and two different data sizes (i) and (ii) to observe the ability of DMM as regards clustering TTS. Table 1 shows the clustering accuracy for the macro-F1 scores for each dataset. {}^{\dagger}start_FLOATSUPERSCRIPT † end_FLOATSUPERSCRIPT shows TAGM and TICC set the number of clusters K={2,3,4,5}𝐾2345K=\{2,3,4,5\}italic_K = { 2 , 3 , 4 , 5 } by Bayesian information criterion (BIC). As shown, DMM outperforms the baselines in most of the datasets, even for the (i) 2ndsuperscript2𝑛𝑑2^{nd}2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT-order TTS datasets. In particular, the difference in (ii) is even more noteworthy. Because TAGM and TICC cannot handle 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order TTS due to the limitation imposed by the matricization of the tensor.

Table 1. Macro-F1 score of clustering accuracy for eight different temporal sequences, comparing DMM with state-of-the-art methods (higher score is better). Best results are in bold, and second best results are underlined. {}^{\dagger}start_FLOATSUPERSCRIPT † end_FLOATSUPERSCRIPT indicates a method where the number of clusters is set by BIC. (i): 2ndsuperscript2𝑛𝑑2^{nd}2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT-order TTS D1=10subscript𝐷110D_{1}=10italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 10, (ii): 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order TTS D1=D2=10subscript𝐷1subscript𝐷210D_{1}=D_{2}=10italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 10, 𝖠𝖠\mathsf{A}sansserif_A: “1,2,1”, 𝖡𝖡\mathsf{B}sansserif_B: “1,2,3,2,1”, 𝖢𝖢\mathsf{C}sansserif_C: “1,2,3,4,1,2,3,4”, 𝖣𝖣\mathsf{D}sansserif_D: “1,2,2,1,3,3,3,1.”
Data DMM TAGM TAGM {}^{\dagger}start_FLOATSUPERSCRIPT † end_FLOATSUPERSCRIPT TICC TICC {}^{\dagger}start_FLOATSUPERSCRIPT † end_FLOATSUPERSCRIPT
(i) 𝖠𝖠\mathsf{A}sansserif_A 0.955¯¯0.955\underline{0.955}under¯ start_ARG 0.955 end_ARG 0.9150.9150.9150.915 0.9150.9150.9150.915 0.9970.997\mathbf{0.997}bold_0.997 0.9970.997\mathbf{0.997}bold_0.997
𝖡𝖡\mathsf{B}sansserif_B 0.9260.926\mathbf{0.926}bold_0.926 0.897¯¯0.897\underline{0.897}under¯ start_ARG 0.897 end_ARG 0.7560.7560.7560.756 0.8840.8840.8840.884 0.8250.8250.8250.825
𝖢𝖢\mathsf{C}sansserif_C 0.9560.956\mathbf{0.956}bold_0.956 0.7700.7700.7700.770 0.811¯¯0.811\underline{0.811}under¯ start_ARG 0.811 end_ARG 0.7250.7250.7250.725 0.7560.7560.7560.756
𝖣𝖣\mathsf{D}sansserif_D 0.9600.960\mathbf{0.960}bold_0.960 0.9070.9070.9070.907 0.9120.9120.9120.912 0.8570.8570.8570.857 0.952¯¯0.952\underline{0.952}under¯ start_ARG 0.952 end_ARG
(ii) 𝖠𝖠\mathsf{A}sansserif_A 0.9610.961\mathbf{0.961}bold_0.961 0.5140.5140.5140.514 0.5140.5140.5140.514 0.932¯¯0.932\underline{0.932}under¯ start_ARG 0.932 end_ARG 0.9230.9230.9230.923
𝖡𝖡\mathsf{B}sansserif_B 0.9620.962\mathbf{0.962}bold_0.962 0.4620.4620.4620.462 0.4310.4310.4310.431 0.844¯¯0.844\underline{0.844}under¯ start_ARG 0.844 end_ARG 0.7700.7700.7700.770
𝖢𝖢\mathsf{C}sansserif_C 0.9410.941\mathbf{0.941}bold_0.941 0.3590.3590.3590.359 0.3960.3960.3960.396 0.704¯¯0.704\underline{0.704}under¯ start_ARG 0.704 end_ARG 0.5940.5940.5940.594
𝖣𝖣\mathsf{D}sansserif_D 0.9800.980\mathbf{0.980}bold_0.980 0.4380.4380.4380.438 0.4320.4320.4320.432 0.838¯¯0.838\underline{0.838}under¯ start_ARG 0.838 end_ARG 0.7410.7410.7410.741

6.2.2. Effect of total number of variables

We next examine how the number of variables D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT affects each method as regards accurately finding clusters. We take the 𝖢𝖢\mathsf{C}sansserif_C example and vary D1=550subscript𝐷15similar-to50D_{1}=5\sim 50italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 5 ∼ 50 for (a) 2ndsuperscript2𝑛𝑑2^{nd}2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT-order TTS and (b) 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order TTS. As shown in Fig. 3, our method outperforms the baselines for all D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT in both tensors. The performance of TAGM and TICC worsens as D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT increases, while DMM maintains its performance even though D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT increases due to our well-defined total description cost that can handle the change in data scale. TAGM and TICC are less accurate in Fig. 3 (b) than Fig. 3 (a) since they cannot deal with 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order TTS.

6.2.3. Scalability

We perform experiments to verify the time complexity of DMM. As described in Lemma 1, the time complexity of DMM scales linearly in terms of the data size. Fig. 4 shows the computation time of DMM when we vary D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (Fig. 4 (a)) and T𝑇Titalic_T (Fig. 4 (b)). Thanks to our proposed optimization algorithm, the time complexity of DMM scales linearly with Dnsubscript𝐷𝑛D_{n}italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and T𝑇Titalic_T.

Refer to caption (a) vs. 2ndsuperscript2𝑛𝑑2^{nd}2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT-order TTS Refer to caption (b) vs. 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order TTS
Figure 3. DMM outperforms the state-of-the-art methods: Clustering accuracy for synthetic data, macro-F1 score vs. data size, i.e., (a) 2ndsuperscript2𝑛𝑑2^{nd}2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT-order TTS (D1,T)=(550,800)subscript𝐷1𝑇similar-to550800(D_{1},T)=(5\sim 50,800)( italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T ) = ( 5 ∼ 50 , 800 ), (b) 3rdsuperscript3𝑟𝑑3^{rd}3 start_POSTSUPERSCRIPT italic_r italic_d end_POSTSUPERSCRIPT-order TTS (D1,D2,T)=(550,5,800)subscript𝐷1subscript𝐷2𝑇similar-to5505800(D_{1},D_{2},T)=(5\sim 50,5,800)( italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_T ) = ( 5 ∼ 50 , 5 , 800 ).
Refer to caption (a) vs. D1=550subscript𝐷15similar-to50D_{1}=5\sim 50italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 5 ∼ 50 Refer to caption (b) vs. T=80080000𝑇800similar-to80000T=800\sim 80000italic_T = 800 ∼ 80000
Figure 4. DMM scales linearly: Computation time vs. data size, i.e., we vary (a) D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (D1=550,D2=5,T=800formulae-sequencesubscript𝐷15similar-to50formulae-sequencesubscript𝐷25𝑇800D_{1}=5\sim 50,D_{2}=5,T=800italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 5 ∼ 50 , italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 5 , italic_T = 800) and (b) T𝑇Titalic_T (D1=5,D2=5,T=80080000formulae-sequencesubscript𝐷15formulae-sequencesubscript𝐷25𝑇800similar-to80000D_{1}=5,D_{2}=5,T=800\sim 80000italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 5 , italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 5 , italic_T = 800 ∼ 80000).

7. Case study

We perform experiments on real data to show the applicability of DMM and demonstrate how DMM can be used to obtain meaningful insights from TTS.

7.1. Experimental setting

7.1.1. Datasets

We describe our datasets in detail.

Table 2. The data size and attributes for each dataset.
ID Dataset Size Description
#1 E-commerce (11, 10, 1796) (query, state, day)
#2 VoD (8, 10, 1796)
#3 Sweets (9, 10, 1796)
#4 Covid (6, 10, 3652) (query, country, day)
#5 GAFAM (5, 10, 1796)
#6 Air (6, 12, 1461) (pollutant, site, day)
#7 Car-A (6, 10, 4, 3241) (sensor, lap, driver, meter)
#8 Car-H (6, 10, 4, 4000)

Google Trends (#1 similar-to\sim #5). We use the data from Google Trends. Each tensor contains daily web-search counts. #4 Covid was collected over 10101010 years from Jan. 1st 2013201320132013 to Dec. 31st 2022202220222022 to include the effect of COVID-19. Other datasets are from Jan. 1st 2015201520152015 to Dec. 31st 2019201920192019 to avoid the effect of COVID-19. The datasets include five query sets (Appendix C.1). We collect the data from two target areas: three datasets from the top 10101010 populated US states and two from the top 10101010 countries ranked by GDP score. We normalize the data every month to achieve clustering that only considers the network.

Air (#6). We use Air data that collected daily concentrations of six pollutants at 12121212 nationally-controlled monitoring sites in Beijing, China from Mar. 1st 2013201320132013 to Feb. 29th 2016201620162016 (Zhang et al., 2017). We fill the missing values by linear interpolation and normalize the data every month.

Automobile (#7, #8). We use two automobile datasets with different driving courses. #7 Car-A is a city course and #8 Car-H is a highway course. We observe six sensors every meter: Brake, Speed, GX (X Accel), GY (Y Accel), Steering angle, Fuel Economy. Four drivers drive 10101010 laps of the same course, hence each dataset forms a 4thsuperscript4𝑡4^{th}4 start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order tensor. We normalize the data every 10101010 meters.

The size and attributes of the datasets are given in Table 2.

7.1.2. Hyperparameter

To tune DMM, we vary the sparsity parameter λ={0.5,1,2,4}𝜆0.5124\lambda=\{0.5,1,2,4\}italic_λ = { 0.5 , 1 , 2 , 4 } and set the value that produces the minimum total description cost Eq. (10). We fix the initial window size w𝑤witalic_w depending on the dataset, equal to the normalization period. For a fair comparison, for TAGM and TICC, we set the sparse parameter equal to DMM, and the number of clusters equal to that found by DMM. For TICC, we vary the regularization parameter β={4,16,64,256}𝛽41664256\beta=\{4,16,64,256\}italic_β = { 4 , 16 , 64 , 256 } and set the parameter with BIC.

7.2. Results

7.2.1. Applicability

We show the usefulness of DMM for analyzing real-world TTS.

Table 3. The number of clusters (# Cl.) and segments (# Seg.), and log-likelihood (LL) of eight real-world datasets, comparing DMM with state-of-the-art methods. The bold font and underlines show methods providing the best and second best LL, respectively (higher is better).
DMM TAGM TICC
Data # Cl. # Seg. LL # Seg. LL # Seg. LL
#1 2222 10101010 1.89e𝟓1.89e5\mathbf{{-1.89}\mathrm{e}{5}}- bold_1.89 roman_e bold_5 485485485485 1.92e5¯¯1.92e5\underline{{-1.92}\mathrm{e}{5}}under¯ start_ARG - 1.92 e5 end_ARG 3333 1.97e51.97e5{-1.97}\mathrm{e}{5}- 1.97 e5
#2 2222 2222 1.68e5¯¯1.68e5\underline{{-1.68}\mathrm{e}{5}}under¯ start_ARG - 1.68 e5 end_ARG 527527527527 1.65e𝟓1.65e5\mathbf{{-1.65}\mathrm{e}{5}}- bold_1.65 roman_e bold_5 2222 1.68e5¯¯1.68e5\underline{{-1.68}\mathrm{e}{5}}under¯ start_ARG - 1.68 e5 end_ARG
#3 2222 7777 1.90e𝟓1.90e5\mathbf{{-1.90}\mathrm{e}{5}}- bold_1.90 roman_e bold_5 502502502502 1.90e𝟓1.90e5\mathbf{{-1.90}\mathrm{e}{5}}- bold_1.90 roman_e bold_5 17171717 1.90e𝟓1.90e5\mathbf{{-1.90}\mathrm{e}{5}}- bold_1.90 roman_e bold_5
#4 4444 4444 2.85e5¯¯2.85e5\underline{{-2.85}\mathrm{e}{5}}under¯ start_ARG - 2.85 e5 end_ARG 1778177817781778 2.73e𝟓2.73e5\mathbf{{-2.73}\mathrm{e}{5}}- bold_2.73 roman_e bold_5 5555 2.88e52.88e5{-2.88}\mathrm{e}{5}- 2.88 e5
#5 2222 2222 9.28e4¯¯9.28e4\underline{{-9.28}\mathrm{e}{4}}under¯ start_ARG - 9.28 e4 end_ARG 519519519519 9.10e𝟒9.10e4\mathbf{{-9.10}\mathrm{e}{4}}- bold_9.10 roman_e bold_4 3333 9.48e49.48e4{-9.48}\mathrm{e}{4}- 9.48 e4
#6 6666 13131313 5.19e4¯¯5.19e4\underline{{-5.19}\mathrm{e}{4}}under¯ start_ARG - 5.19 e4 end_ARG 929929929929 4.82e𝟒4.82e4\mathbf{{-4.82}\mathrm{e}{4}}- bold_4.82 roman_e bold_4 10101010 6.34e46.34e4{-6.34}\mathrm{e}{4}- 6.34 e4
#7 11111111 11111111 5.89e𝟓5.89e5\mathbf{{-5.89}\mathrm{e}{5}}- bold_5.89 roman_e bold_5 1300130013001300 6.33e5¯¯6.33e5\underline{{-6.33}\mathrm{e}{5}}under¯ start_ARG - 6.33 e5 end_ARG 12121212 9.36e59.36e5{-9.36}\mathrm{e}{5}- 9.36 e5
#8 5555 12121212 1.06e6¯¯1.06e6\underline{{-1.06}\mathrm{e}{6}}under¯ start_ARG - 1.06 e6 end_ARG 974974974974 1.02e𝟔1.02e6\mathbf{{-1.02}\mathrm{e}{6}}- bold_1.02 roman_e bold_6 6666 1.16e61.16e6{-1.16}\mathrm{e}{6}- 1.16 e6

Modeling accuracy. Since there are no labels for TTS, we review the modeling accuracy of DMM by comparing the number of segments and the log-likelihood, which explains the goodness of clustering according to our objective function based on MDL. We use cluster assignments to calculate the log-likelihood (Eq. (2)). Table 3 shows the results. DMM finds a reasonable number of segments and a higher log-likelihood than TICC. TAGM switches clusters with the transition matrix of HMM. This works well on synthetic datasets when there are clear transitions. However, it is not suitable for real-world datasets, which contain noises and whose network changes gradually. As a result, TAGM finds the cluster assignments that maximize the log-likelihood regardless of the number of segments. TICC assigns neighboring time steps to the same cluster using a penalty β𝛽\betaitalic_β. Thus, its number of segments is close to DMM. However, TICC is not suitable for tensors, and the log-likelihood is worse than DMM for most datasets.

Computation time.

Refer to caption
Figure 5. Computation time of DMM: our method surpasses its baselines. It is up to 300×300\times300 × faster than TICC.

We compare the computation time needed for processing real data in Fig. 5. DMM is the fastest for most datasets since it infers the network for each mode. In contrast, TAGM and TICC compute the entire network at once. Therefore, they are more affected by the number of variables at each mode than DMM, resulting in a longer computation time. Note that the computation time of TAGM and TICC at 2ndsuperscript2𝑛𝑑2^{nd}2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT-order TTS is comparable to DMM.

7.2.2. Interpretability

Refer to caption

(a) Original sensor data at a site (Aoti Zhongxin) Refer to caption
(b) DMM assigns every Apr. similar-to\sim Oct. to cluster #2 Refer to caption
(c) TAGM causes frequent switching Refer to caption
(d) TICC assigns most periods to cluster #4

Figure 6. DMM demonstrates effective cluster assignments on the #6 Air dataset. (a) Original tensor time series data. Cluster assignments of (b) DMM, (c) TAGM, and (d) TICC.
Refer to caption (a) DMM networks of cluster #2 Refer to caption (b) TAGMRefer to caption (c) TICC
Figure 7. Networks obtained for each method for the #6 Air dataset: (a) DMM detects a pollutant network and a location network, where it is easy to understand the key relationships within the cluster. (b) TAGM (cluster #6) and (c) TICC (cluster #4) find a complex network, which is difficult to interpret.

We show how the clustering results presented by DMM make sense. We have already shown the results of DMM for clustering over #4 Covid in Section 1 (see Fig. 1). Please also see the results in #1 E-commerce in Appendix C.2.

Air. We compare the clustering results of DMM, TAGM and, TICC over #6 Air regarding cluster assignments (Fig. 6) and obtained networks (Fig. 7). Fig. 6 (a) shows the original sensor data at Aoti Zhongxin. Fig. 6 (b) shows that DMM assigns Apr. through Oct. of each year to cluster #2, capturing the yearly seasonality (Zhang et al., 2017). The cluster assignments of TAGM (see Fig. 6 (c)) switch frequently, and TICC (see Fig. 6 (d)) assigns most of the period to cluster #4. Both cluster assignments are far from interpretable. Fig. 7 shows the networks obtained with each method. The cluster of DMM (see Fig. 7 (a)) includes the pollutant network and the location network. The pollutant network has a strong edge between PM2.5 and PM10, and the location network, whose nodes are plotted on the map, has edges only between closely located nodes, both of which match our expectation and accordingly indicate that DMM discovers interpretable networks. TAGM and TICC (see Fig. 7 (b)(c)) find a network for all variables. Although the networks are sparse, the large number of nodes and edges hampers our understanding of the networks. Due to the simplicity of networks generated by DMM, their interpretability surpasses those of other methods (Du et al., 2019). Consequently, DMM provides interpretable clustering results that can reveal underlying relationships among variables of each mode and is suitable for modeling and clustering TTS.

8. Conclusion

In this paper, we proposed an efficient tensor time series subsequence clustering method, namely DMM. Our method characterizes each cluster by multiple networks, each of which is the dependency network of a corresponding non-temporal mode. These networks make our results visible and interpretable, enabling the multifaceted analysis and understanding of tensor time series. We defined a criterion based on MDL that allows us to find clusters of data and determine all user-defined parameters. Our algorithm scales linearly with the input size and thus can apply to the massive data size of a tensor. We showed the effectiveness of DMM via extensive experiments using synthetic and real datasets.

Acknowledgements.
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research Number JP21H03446, JP22K17896, NICT JPJ012368C03501, JST CREST JPMJCR23M3, JST-AIP JPMJCR21U4.

References

  • (1)
  • Aghabozorgi et al. (2015) Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. 2015. Time-series clustering–a decade review. Information systems 53 (2015), 16–38.
  • Alaee et al. (2021) Sara Alaee, Ryan Mercer, Kaveh Kamgar, and Eamonn Keogh. 2021. Time series motifs discovery under DTW allows more robust discovery of conserved structure. Data Mining and Knowledge Discovery 35 (2021), 863–910.
  • Bai et al. (2019) Lei Bai, Lina Yao, Salil S. Kanhere, Xianzhi Wang, and Quan Z. Sheng. 2019. STG2Seq: Spatial-Temporal Graph to Sequence Model for Multi-step Passenger Demand Forecasting. In IJCAI. 1981–1987.
  • Batalo et al. (2022) Bojan Batalo, Lincon S Souza, Bernardo B Gatto, Naoya Sogi, and Kazuhiro Fukui. 2022. Analysis of Temporal Tensor Datasets on Product Grassmann Manifold. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4869–4877.
  • Berndt and Clifford (1994) Donald J. Berndt and James Clifford. 1994. Using Dynamic Time Warping to Find Patterns in Time Series. In Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop, Seattle, Washington, USA, July 1994. Technical Report WS-94-03. 359–370.
  • Böhm et al. (2007) Christian Böhm, Christos Faloutsos, Jia-Yu Pan, and Claudia Plant. 2007. Ric: Parameter-free noise-robust clustering. TKDD 1, 3 (2007), 10–es.
  • Boyd et al. (2011) Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 3, 1 (2011), 1–122.
  • Cai et al. (2015) Yongjie Cai, Hanghang Tong, Wei Fan, Ping Ji, and Qing He. 2015. Facets: Fast Comprehensive Mining of Coevolving High-order Time Series. In KDD. 79–88.
  • Du et al. (2019) Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Commun. ACM 63, 1 (2019), 68–77.
  • Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 3 (2008), 432–441.
  • Gatto et al. (2021) Bernardo B Gatto, Eulanda M dos Santos, Alessandro L Koerich, Kazuhiro Fukui, and Waldir SS Junior. 2021. Tensor analysis with n-mode generalized difference subspace. Expert Systems with Applications 171 (2021), 114559.
  • Grünwald (2007) Peter D Grünwald. 2007. The minimum description length principle. MIT press.
  • Hallac et al. (2017a) David Hallac, Youngsuk Park, Stephen P. Boyd, and Jure Leskovec. 2017a. Network Inference via the Time-Varying Graphical Lasso. In KDD. 205–213.
  • Hallac et al. (2017b) David Hallac, Sagar Vare, Stephen P. Boyd, and Jure Leskovec. 2017b. Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data. In KDD. 215–223.
  • Harutyunyan et al. (2019) Hrayr Harutyunyan, Daniel Moyer, Hrant Khachatrian, Greg Ver Steeg, and Aram Galstyan. 2019. Efficient Covariance Estimation from Temporal Data. arXiv preprint arXiv:1905.13276 (2019).
  • Hirano and Tsumoto (2006) Shoji Hirano and Shusaku Tsumoto. 2006. Cluster analysis of time-series medical data based on the trajectory representation and multiscale comparison techniques. In ICDM. IEEE, 896–901.
  • Jing et al. (2021) Baoyu Jing, Hanghang Tong, and Yada Zhu. 2021. Network of Tensor Time Series. In WWW, Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia (Eds.). 2425–2437.
  • Kawabata et al. (2021) Koki Kawabata, Siddharth Bhatia, Rui Liu, Mohit Wadhwa, and Bryan Hooi. 2021. Ssmf: Shifting seasonal matrix factorization. Advances in Neural Information Processing Systems 34 (2021), 3863–3873.
  • Keogh (2002) Eamonn Keogh. 2002. Exact Indexing of Dynamic Time Warping. In VLDB (Hong Kong, China). 406–417.
  • Keogh et al. (2001) Eamonn J. Keogh, Selina Chu, David M. Hart, and Michael J. Pazzani. 2001. An Online Algorithm for Segmenting Time Series. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November - 2 December 2001, San Jose, California, USA. IEEE Computer Society, 289–296.
  • Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
  • Liu et al. (2020) Yu Liu, Quanming Yao, and Yong Li. 2020. Generalizing tensor decomposition for n-ary relational knowledge bases. In WWW. 1104–1114.
  • Madabhushi and Lee (2016) Anant Madabhushi and George Lee. 2016. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical image analysis 33 (2016), 170–175.
  • Matsubara et al. (2014) Yasuko Matsubara, Yasushi Sakurai, and Christos Faloutsos. 2014. AutoPlait: Automatic Mining of Co-Evolving Time Sequences. In SIGMOD. 193–204.
  • Matsubara et al. (2016) Yasuko Matsubara, Yasushi Sakurai, and Christos Faloutsos. 2016. Non-Linear Mining of Competing Local Activities. In WWW.
  • Miyaguchi et al. (2017) Kohei Miyaguchi, Shin Matsushima, and Kenji Yamanishi. 2017. Sparse graphical modeling via stochastic complexity. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 723–731.
  • Miyajima et al. (2007) Chiyomi Miyajima, Yoshihiro Nishiwaki, Koji Ozawa, Toshihiro Wakita, Katsunobu Itou, Kazuya Takeda, and Fumitada Itakura. 2007. Driver modeling based on driving behavior and its evaluation in driver identification. IEEE 95, 2 (2007), 427–437.
  • Mohan et al. (2014) Karthik Mohan, Palma London, Maryam Fazel, Daniela Witten, and Su-In Lee. 2014. Node-Based Learning of Multiple Gaussian Graphical Models. J. Mach. Learn. Res. 15, 1 (jan 2014), 445–488.
  • Monti et al. (2014) Ricardo Pio Monti, Peter Hellyer, David Sharp, Robert Leech, Christoforos Anagnostopoulos, and Giovanni Montana. 2014. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage 103 (2014), 427–443.
  • Nakamura et al. (2023) Kota Nakamura, Yasuko Matsubara, Koki Kawabata, Yuhei Umeda, Yuichiro Wada, and Yasushi Sakurai. 2023. Fast and Multi-aspect Mining of Complex Time-stamped Event Streams. In WWW. 1638–1649.
  • Namaki et al. (2011) A. Namaki, A.H. Shirazi, R. Raei, and G.R. Jafari. 2011. Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Statistical Mechanics and its Applications 390, 21 (2011), 3835–3841.
  • Papadimitriou et al. (2005) Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos. 2005. Streaming pattern discovery in multiple time-series. (2005).
  • Plant and Böhm (2011) Claudia Plant and Christian Böhm. 2011. Inconco: interpretable clustering of numerical and categorical objects. In KDD. 1127–1135.
  • Ramoni et al. (2000) Marco Ramoni, Paola Sebastiani, and Paul R. Cohen. 2000. Multivariate Clustering by Dynamics. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. AAAI Press, 633–638.
  • Rogers et al. (2013) Mark Rogers, Lei Li, and Stuart J Russell. 2013. Multilinear Dynamical Systems for Tensor Time Series. In NIPS. 2634–2642.
  • Rudin (2019) Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence 1, 5 (2019), 206–215.
  • Rue and Held (2005) Havard Rue and Leonhard Held. 2005. Gaussian Markov random fields: theory and applications. CRC press.
  • Ruiz et al. (2012) Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. 2012. Correlating Financial Time Series with Micro-Blogging Activity. In WSDM (Seattle, Washington, USA). Association for Computing Machinery, New York, NY, USA, 513–522.
  • Takahashi et al. (2017) Tsubasa Takahashi, Bryan Hooi, and Christos Faloutsos. 2017. AutoCyclone: Automatic Mining of Cyclic Online Activities with Robust Tensor Factorization. In WWW (Perth, Australia). 213–221.
  • Tan et al. (2015) Kean Ming Tan, Daniela Witten, and Ali Shojaie. 2015. The cluster graphical lasso for improved estimation of Gaussian graphical models. Computational statistics & data analysis 85 (2015), 23–36.
  • Tomasi et al. (2021) Federico Tomasi, Veronica Tozzo, and Annalisa Barla. 2021. Temporal Pattern Detection in Time-Varying Graphical Models. In ICPR. 4481–4488.
  • Tomasi et al. (2018) Federico Tomasi, Veronica Tozzo, Saverio Salzo, and Alessandro Verri. 2018. Latent Variable Time-varying Network Inference. In KDD. 2338–2346.
  • Tozzo et al. (2021) Veronica Tozzo, Federico Ciech, Davide Garbarino, and Alessandro Verri. 2021. Statistical Models Coupling Allows for Complex Local Multivariate Time Series Analysis. In KDD. 1593–1603.
  • Vlachos et al. (2002) Michail Vlachos, George Kollios, and Dimitrios Gunopulos. 2002. Discovering similar multidimensional trajectories. In Proceedings 18th international conference on data engineering. IEEE, 673–684.
  • Wu et al. (2019) Xunxian Wu, Tong Xu, Hengshu Zhu, Le Zhang, Enhong Chen, and Hui Xiong. 2019. Trend-Aware Tensor Factorization for Job Skill Demand Analysis.. In IJCAI. 3891–3897.
  • Wytock and Kolter (2013) Matt Wytock and Zico Kolter. 2013. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In International conference on machine learning. PMLR, 1265–1273.
  • Xiong and Yeung (2004) Yimin Xiong and Dit-Yan Yeung. 2004. Time series clustering with ARMA mixtures. Pattern Recognition 37, 8 (2004), 1675–1689.
  • Xuan and Murphy (2007) Xiang Xuan and Kevin Murphy. 2007. Modeling Changing Dependency Structure in Multivariate Time Series. In ICML (Corvalis, Oregon, USA). Association for Computing Machinery, New York, NY, USA, 1055–1062.
  • Yuan and Lin (2006) Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology 68, 1 (2006), 49–67.
  • Zhang et al. (2017) Shuyi Zhang, Bin Guo, Anlan Dong, Jing He, Ziping Xu, and Song Xi Chen. 2017. Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473, 2205 (2017), 20170457.
  • Zolhavarieh et al. (2014) Seyedjamal Zolhavarieh, Saeed Aghabozorgi, Ying Wah Teh, et al. 2014. A review of subsequence time series clustering. The Scientific World Journal 2014 (2014).

Appendix A Proposed Model

Table 4 lists the main symbols we use throughout this paper.

Appendix B Algorithms

B.1. CutPointDetector

Alg. 2 shows the overall procedure for CutPointDetector, which is a subalgorithm of Alg. 1. For clarity, we describe the total description cost as CostT(𝒳;{Θ})𝐶𝑜𝑠subscript𝑡𝑇𝒳ΘCost_{T}(\mathcal{X};\{\Theta\})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ } ). The cluster assignment set for Θ[id]Θdelimited-[]𝑖𝑑\Theta[id]roman_Θ [ italic_i italic_d ] is a corresponding segment.

B.2. Proof of Lemma 1

Proof.

The computational cost of the DMM depends largely on the number of CutPointDetector iterations and the cost of inferring ΘΘ\Thetaroman_Θ at each iteration. Consider that all segments are eventually merged. Since the total computational time needed to infer ΘΘ\Thetaroman_Θ is the sum of {A(1),,A(N)}superscript𝐴1superscript𝐴𝑁\{A^{(1)},\cdots,A^{(N)}\}{ italic_A start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , ⋯ , italic_A start_POSTSUPERSCRIPT ( italic_N ) end_POSTSUPERSCRIPT } inferences, we discuss the case of A(n)superscript𝐴𝑛A^{(n)}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT. When Tm=1(mn)NDmDnmuch-greater-than𝑇superscriptsubscriptproduct𝑚1𝑚𝑛𝑁subscript𝐷𝑚subscript𝐷𝑛T\prod_{m=1(m\neq n)}^{N}D_{m}\gg D_{n}italic_T ∏ start_POSTSUBSCRIPT italic_m = 1 ( italic_m ≠ italic_n ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ≫ italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, at each iteration, inferring A(n)superscript𝐴𝑛A^{(n)}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT for all segments takes O(DnTm=1(mn)NDm)𝑂subscript𝐷𝑛𝑇superscriptsubscriptproduct𝑚1𝑚𝑛𝑁subscript𝐷𝑚O(D_{n}T\prod_{m=1(m\neq n)}^{N}D_{m})italic_O ( italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_T ∏ start_POSTSUBSCRIPT italic_m = 1 ( italic_m ≠ italic_n ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) thanks to ADMM. If the number of segments is halved at each iteration, the number of iterations is log2|𝐰|subscript2𝐰\log_{2}|\mathbf{w}|roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | bold_w |. If the number of segments decreases by one at each iteration, the number of iterations is |𝐰|𝐰|\mathbf{w}|| bold_w |, but this is unlikely to happen. Tlog2|𝐰|much-greater-than𝑇subscript2𝐰T\gg\log_{2}|\mathbf{w}|italic_T ≫ roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | bold_w |, and so the computation cost related to A(n)superscript𝐴𝑛A^{(n)}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT is O(Tm=1NDm)𝑂𝑇superscriptsubscriptproduct𝑚1𝑁subscript𝐷𝑚O(T\prod_{m=1}^{N}D_{m})italic_O ( italic_T ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ). Since T,DnNmuch-greater-than𝑇subscript𝐷𝑛𝑁T,D_{n}\gg Nitalic_T , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≫ italic_N, the repetition of inference for each mode is negligible. Therefore, the time complexity of DMM is O(Tm=1NDm)𝑂𝑇superscriptsubscriptproduct𝑚1𝑁subscript𝐷𝑚O(T\prod_{m=1}^{N}D_{m})italic_O ( italic_T ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ). ∎

Appendix C Case Study

C.1. Datasets

We describe the query set we used for Google Trends in Table 5.

C.2. Results

Total description cost. We compare the total description cost of DMM with TAGM and TICC on real-world datasets in Fig. 8. As shown, DMM achieves the lowest total description cost of all the datasets. TAGM has many segments, which results in the large coding length cost. TICC is not capable of handling tensor, which results in higher data coding cost compared with DMM.

E-commerce. We demonstrate how effectively DMM works on the #1 E-commerce dataset. Fig. 9 shows the result of DMM for clustering over #1 E-commerce. Fig. 9 (a) shows the clustering results of the original TTS, where each color represents a cluster. DMM finds 10 segments and two clusters. We name the blue cluster “Dairy products” and the pink cluster “Online sales.DMM assigns every Nov. to “Online sales”, the period of Black Friday and Cyber Monday. Fig. 9 (b) shows the query and state networks for each cluster. The query network of “Daily products” shows that there are edges between the local daily products companies ( “costco”, “walmart”, and “target”). On the other hand, with the query network of “Online sales”, there are many edges, especially related to large e-commerce companies ( “amazon” and “ebay”), and the state network shows that the top four populated states ( “CA”, “TX”, “FL”, and “NY”) form edges, indicating the similarity of online shopping among the big states.

Table 4. Symbols and definitions.
Symbol

Definition

Dnsubscript𝐷𝑛D_{n}italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT

Number of variables at mode-n

N𝑁Nitalic_N

Number of modes excluding temporal mode

T𝑇Titalic_T

Number of timestamp

𝒳𝒳\mathcal{X}caligraphic_X

((((N+1)th)^{th}) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS, i.e., 𝒳={𝒳1,𝒳2,,𝒳T}D1××DN×T𝒳subscript𝒳1subscript𝒳2subscript𝒳𝑇superscriptsubscript𝐷1subscript𝐷𝑁𝑇\mathcal{X}=\{\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{T}\}\in% \mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}caligraphic_X = { caligraphic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , caligraphic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , caligraphic_X start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT } ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT × italic_T end_POSTSUPERSCRIPT

𝒳tsubscript𝒳𝑡\mathcal{X}_{t}caligraphic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT

Nthsuperscript𝑁𝑡N^{th}italic_N start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order tensor at tthsuperscript𝑡𝑡t^{th}italic_t start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT time step, i.e., 𝒳tD1××DNsubscript𝒳𝑡superscriptsubscript𝐷1subscript𝐷𝑁\mathcal{X}_{t}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}}caligraphic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ⋯ × italic_D start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUPERSCRIPT

D𝐷Ditalic_D

Total product of variables excluding T𝑇Titalic_T, i.e., D=n=1NDn𝐷superscriptsubscriptproduct𝑛1𝑁subscript𝐷𝑛D=\prod_{n=1}^{N}D_{n}italic_D = ∏ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT

D(\n)superscript𝐷\absent𝑛D^{(\backslash n)}italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT

Total product of variables excluding Dnsubscript𝐷𝑛D_{n}italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and T𝑇Titalic_T, i.e., D(\n)=m=1(mn)NDmsuperscript𝐷\absent𝑛superscriptsubscriptproduct𝑚1𝑚𝑛𝑁subscript𝐷𝑚D^{(\backslash n)}=\prod_{m=1(m\neq n)}^{N}D_{m}italic_D start_POSTSUPERSCRIPT ( \ italic_n ) end_POSTSUPERSCRIPT = ∏ start_POSTSUBSCRIPT italic_m = 1 ( italic_m ≠ italic_n ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT

K𝐾Kitalic_K

Number of clusters

m𝑚mitalic_m

Number of segments

cp𝑐𝑝cpitalic_c italic_p

Cut points, i.e., cp={cp1,cp2,,cpm}𝑐𝑝𝑐subscript𝑝1𝑐subscript𝑝2𝑐subscript𝑝𝑚cp=\{cp_{1},cp_{2},\dots,cp_{m}\}italic_c italic_p = { italic_c italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c italic_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT }

cpi𝑐subscript𝑝𝑖cp_{i}italic_c italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT

Starting point of segment i𝑖iitalic_i, i.e., cp1=1,cpm+1=T+1formulae-sequence𝑐subscript𝑝11𝑐subscript𝑝𝑚1𝑇1cp_{1}=1,cp_{m+1}=T+1italic_c italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 1 , italic_c italic_p start_POSTSUBSCRIPT italic_m + 1 end_POSTSUBSCRIPT = italic_T + 1

ΘΘ\Thetaroman_Θ

Model parameter set, i.e., Θ={θ1,θ2,,θK}Θsubscript𝜃1subscript𝜃2subscript𝜃𝐾\Theta=\{\theta_{1},\theta_{2},\dots,\theta_{K}\}roman_Θ = { italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_θ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT }

θ𝜃\thetaitalic_θ

Hierarchical Teoplitz matrix of shape θD×D𝜃superscript𝐷𝐷\theta\in\mathbb{R}^{D\times D}italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_D end_POSTSUPERSCRIPT consists of {A(1),,A(N)}superscript𝐴1superscript𝐴𝑁\{A^{(1)},\cdots,A^{(N)}\}{ italic_A start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , ⋯ , italic_A start_POSTSUPERSCRIPT ( italic_N ) end_POSTSUPERSCRIPT }

A(n)superscript𝐴𝑛A^{(n)}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT

Precision matrix of mode-n, i.e., A(n)Dn×Dnsuperscript𝐴𝑛superscriptsubscript𝐷𝑛subscript𝐷𝑛A^{(n)}\in\mathbb{R}^{D_{n}\times D_{n}}italic_A start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT

\mathcal{F}caligraphic_F

Cluster assignment set, i.e., ={f1,f2,,fK}subscript𝑓1subscript𝑓2subscript𝑓𝐾\mathcal{F}=\{f_{1},f_{2},\dots,f_{K}\}caligraphic_F = { italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_f start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT }

\mathcal{M}caligraphic_M

Cluster parameter set, i.e., ={,Θ}Θ\mathcal{M}=\{\mathcal{F},\Theta\}caligraphic_M = { caligraphic_F , roman_Θ }

CostA()𝐶𝑜𝑠subscript𝑡𝐴Cost_{A}(\mathcal{F})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( caligraphic_F )

Coding length cost: description complexity of \mathcal{F}caligraphic_F

CostM(Θ)𝐶𝑜𝑠subscript𝑡𝑀ΘCost_{M}(\Theta)italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( roman_Θ )

Model coding cost: description complexity of ΘΘ\Thetaroman_Θ

CostC(𝒳|)𝐶𝑜𝑠subscript𝑡𝐶conditional𝒳Cost_{C}(\mathcal{X}|\mathcal{M})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( caligraphic_X | caligraphic_M )

Data coding cost: negative log-likelihood of 𝒳𝒳\mathcal{X}caligraphic_X given \mathcal{M}caligraphic_M

Cost1(Θ)𝐶𝑜𝑠subscript𝑡subscript1ΘCost_{\ell_{1}}(\Theta)italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_Θ )

1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm cost: penalty for ΘΘ\Thetaroman_Θ

CostT(𝒳;)𝐶𝑜𝑠subscript𝑡𝑇𝒳Cost_{T}(\mathcal{X};\mathcal{M})italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; caligraphic_M )

Total description cost: total cost of 𝒳𝒳\mathcal{X}caligraphic_X given \mathcal{M}caligraphic_M

Algorithm 2 CutPointDetector(𝒳,cp)𝒳𝑐𝑝(\mathcal{X},cp)( caligraphic_X , italic_c italic_p )
1:  Input: (N+1)thsuperscript𝑁1𝑡(N+1)^{th}( italic_N + 1 ) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT-order TTS 𝒳𝒳\mathcal{X}caligraphic_X and initial cut points set cp𝑐𝑝cpitalic_c italic_p
2:  Output: The best cut point set cp𝑐𝑝cpitalic_c italic_p
3:  repeat
4:     id=0𝑖𝑑0id=0italic_i italic_d = 0cpnew=ϕ𝑐subscript𝑝𝑛𝑒𝑤italic-ϕcp_{new}=\phiitalic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT = italic_ϕ;
5:     ΘS={θcp0:cp1,θcp1:cp2,,θcpm:cpm+1}subscriptΘ𝑆subscript𝜃:𝑐subscript𝑝0𝑐subscript𝑝1subscript𝜃:𝑐subscript𝑝1𝑐subscript𝑝2subscript𝜃:𝑐subscript𝑝𝑚𝑐subscript𝑝𝑚1\Theta_{S}=\{\theta_{cp_{0}:cp_{1}},\theta_{cp_{1}:cp_{2}},\dots,\theta_{cp_{m% }:cp_{m+1}}\}roman_Θ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = { italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT italic_m + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT }
6:     ΘE={θcp0:cp2,θcp2:cp4,}subscriptΘ𝐸subscript𝜃:𝑐subscript𝑝0𝑐subscript𝑝2subscript𝜃:𝑐subscript𝑝2𝑐subscript𝑝4\Theta_{E}=\{\theta_{cp_{0}:cp_{2}},\theta_{cp_{2}:cp_{4}},\dots\}roman_Θ start_POSTSUBSCRIPT italic_E end_POSTSUBSCRIPT = { italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … }
7:     ΘO={θcp1:cp3,θcp3:cp5,}subscriptΘ𝑂subscript𝜃:𝑐subscript𝑝1𝑐subscript𝑝3subscript𝜃:𝑐subscript𝑝3𝑐subscript𝑝5\Theta_{O}=\{\theta_{cp_{1}:cp_{3}},\theta_{cp_{3}:cp_{5}},\dots\}roman_Θ start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT = { italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_c italic_p start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT : italic_c italic_p start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … }
8:     while id<length(𝒳)𝑖𝑑𝑙𝑒𝑛𝑔𝑡𝒳id<length(\mathcal{X})italic_i italic_d < italic_l italic_e italic_n italic_g italic_t italic_h ( caligraphic_X ) do
9:        if id𝑖𝑑iditalic_i italic_d is even then
10:           ΘLeft=ΘOsubscriptΘ𝐿𝑒𝑓𝑡subscriptΘ𝑂\Theta_{Left}=\Theta_{O}roman_Θ start_POSTSUBSCRIPT italic_L italic_e italic_f italic_t end_POSTSUBSCRIPT = roman_Θ start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT;    ΘRight=ΘEsubscriptΘ𝑅𝑖𝑔𝑡subscriptΘ𝐸\Theta_{Right}=\Theta_{E}roman_Θ start_POSTSUBSCRIPT italic_R italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT = roman_Θ start_POSTSUBSCRIPT italic_E end_POSTSUBSCRIPT;
11:           idLeft=id/2𝑖subscript𝑑𝐿𝑒𝑓𝑡𝑖𝑑2id_{Left}=\lfloor id/2\rflooritalic_i italic_d start_POSTSUBSCRIPT italic_L italic_e italic_f italic_t end_POSTSUBSCRIPT = ⌊ italic_i italic_d / 2 ⌋;     idRight=id/2+1𝑖subscript𝑑𝑅𝑖𝑔𝑡𝑖𝑑21id_{Right}=\lfloor id/2\rfloor+1italic_i italic_d start_POSTSUBSCRIPT italic_R italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT = ⌊ italic_i italic_d / 2 ⌋ + 1;
12:        else if id𝑖𝑑iditalic_i italic_d is odd then
13:           ΘLeft=ΘEsubscriptΘ𝐿𝑒𝑓𝑡subscriptΘ𝐸\Theta_{Left}=\Theta_{E}roman_Θ start_POSTSUBSCRIPT italic_L italic_e italic_f italic_t end_POSTSUBSCRIPT = roman_Θ start_POSTSUBSCRIPT italic_E end_POSTSUBSCRIPT;     ΘRight=ΘOsubscriptΘ𝑅𝑖𝑔𝑡subscriptΘ𝑂\Theta_{Right}=\Theta_{O}roman_Θ start_POSTSUBSCRIPT italic_R italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT = roman_Θ start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT;
14:           idLeft=id/2+1𝑖subscript𝑑𝐿𝑒𝑓𝑡𝑖𝑑21id_{Left}=\lfloor id/2\rfloor+1italic_i italic_d start_POSTSUBSCRIPT italic_L italic_e italic_f italic_t end_POSTSUBSCRIPT = ⌊ italic_i italic_d / 2 ⌋ + 1;     idRight=id/2+1𝑖subscript𝑑𝑅𝑖𝑔𝑡𝑖𝑑21id_{Right}=\lfloor id/2\rfloor+1italic_i italic_d start_POSTSUBSCRIPT italic_R italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT = ⌊ italic_i italic_d / 2 ⌋ + 1;
15:        end if
16:        Csolo=CostT(𝒳;{ΘS[id],ΘS[id+1],ΘS[id+2]})subscript𝐶𝑠𝑜𝑙𝑜𝐶𝑜𝑠subscript𝑡𝑇𝒳subscriptΘ𝑆delimited-[]𝑖𝑑subscriptΘ𝑆delimited-[]𝑖𝑑1subscriptΘ𝑆delimited-[]𝑖𝑑2C_{solo}=Cost_{T}(\mathcal{X};\{\Theta_{S}[id],\Theta_{S}[id+1],\Theta_{S}[id+% 2]\})italic_C start_POSTSUBSCRIPT italic_s italic_o italic_l italic_o end_POSTSUBSCRIPT = italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT [ italic_i italic_d ] , roman_Θ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT [ italic_i italic_d + 1 ] , roman_Θ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT [ italic_i italic_d + 2 ] } );
17:        Cleft=CostT(𝒳;{ΘLeft[idLeft],ΘS[id+2]})subscript𝐶𝑙𝑒𝑓𝑡𝐶𝑜𝑠subscript𝑡𝑇𝒳subscriptΘ𝐿𝑒𝑓𝑡delimited-[]𝑖subscript𝑑𝐿𝑒𝑓𝑡subscriptΘ𝑆delimited-[]𝑖𝑑2C_{left}=Cost_{T}(\mathcal{X};\{\Theta_{Left}[id_{Left}],\Theta_{S}[id+2]\})italic_C start_POSTSUBSCRIPT italic_l italic_e italic_f italic_t end_POSTSUBSCRIPT = italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ start_POSTSUBSCRIPT italic_L italic_e italic_f italic_t end_POSTSUBSCRIPT [ italic_i italic_d start_POSTSUBSCRIPT italic_L italic_e italic_f italic_t end_POSTSUBSCRIPT ] , roman_Θ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT [ italic_i italic_d + 2 ] } );
18:        Cright=CostT(𝒳;{ΘS[id],ΘRight[idRight]})subscript𝐶𝑟𝑖𝑔𝑡𝐶𝑜𝑠subscript𝑡𝑇𝒳subscriptΘ𝑆delimited-[]𝑖𝑑subscriptΘ𝑅𝑖𝑔𝑡delimited-[]𝑖subscript𝑑𝑅𝑖𝑔𝑡C_{right}=Cost_{T}(\mathcal{X};\{\Theta_{S}[id],\Theta_{Right}[id_{Right}]\})italic_C start_POSTSUBSCRIPT italic_r italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT = italic_C italic_o italic_s italic_t start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( caligraphic_X ; { roman_Θ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT [ italic_i italic_d ] , roman_Θ start_POSTSUBSCRIPT italic_R italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT [ italic_i italic_d start_POSTSUBSCRIPT italic_R italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT ] } );
19:        if min(Csolo,Cleft,Cright)=Csolo𝑚𝑖𝑛subscript𝐶𝑠𝑜𝑙𝑜subscript𝐶𝑙𝑒𝑓𝑡subscript𝐶𝑟𝑖𝑔𝑡subscript𝐶𝑠𝑜𝑙𝑜min(C_{solo},C_{left},C_{right})=C_{solo}italic_m italic_i italic_n ( italic_C start_POSTSUBSCRIPT italic_s italic_o italic_l italic_o end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_l italic_e italic_f italic_t end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_r italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT ) = italic_C start_POSTSUBSCRIPT italic_s italic_o italic_l italic_o end_POSTSUBSCRIPT then
20:           cpnew=cpnewcp[id]𝑐subscript𝑝𝑛𝑒𝑤𝑐subscript𝑝𝑛𝑒𝑤𝑐𝑝delimited-[]𝑖𝑑cp_{new}=cp_{new}\cup cp[id]italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT = italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ∪ italic_c italic_p [ italic_i italic_d ];     id+=1limit-from𝑖𝑑1id+=1italic_i italic_d + = 1;
21:        else if min(Csolo,Cleft,Cright)=Cleft𝑚𝑖𝑛subscript𝐶𝑠𝑜𝑙𝑜subscript𝐶𝑙𝑒𝑓𝑡subscript𝐶𝑟𝑖𝑔𝑡subscript𝐶𝑙𝑒𝑓𝑡min(C_{solo},C_{left},C_{right})=C_{left}italic_m italic_i italic_n ( italic_C start_POSTSUBSCRIPT italic_s italic_o italic_l italic_o end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_l italic_e italic_f italic_t end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_r italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT ) = italic_C start_POSTSUBSCRIPT italic_l italic_e italic_f italic_t end_POSTSUBSCRIPT then
22:           cpnew=cpnewcp[id+1]𝑐subscript𝑝𝑛𝑒𝑤𝑐subscript𝑝𝑛𝑒𝑤𝑐𝑝delimited-[]𝑖𝑑1cp_{new}=cp_{new}\cup cp[id+1]italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT = italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ∪ italic_c italic_p [ italic_i italic_d + 1 ];     id+=2limit-from𝑖𝑑2id+=2italic_i italic_d + = 2;
23:        else if min(Csolo,Cleft,Cright)=Cright𝑚𝑖𝑛subscript𝐶𝑠𝑜𝑙𝑜subscript𝐶𝑙𝑒𝑓𝑡subscript𝐶𝑟𝑖𝑔𝑡subscript𝐶𝑟𝑖𝑔𝑡min(C_{solo},C_{left},C_{right})=C_{right}italic_m italic_i italic_n ( italic_C start_POSTSUBSCRIPT italic_s italic_o italic_l italic_o end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_l italic_e italic_f italic_t end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_r italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT ) = italic_C start_POSTSUBSCRIPT italic_r italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT then
24:           cpnew=cpnewcp[id],cp[id+2]𝑐subscript𝑝𝑛𝑒𝑤𝑐subscript𝑝𝑛𝑒𝑤𝑐𝑝delimited-[]𝑖𝑑𝑐𝑝delimited-[]𝑖𝑑2cp_{new}=cp_{new}\cup cp[id],cp[id+2]italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT = italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ∪ italic_c italic_p [ italic_i italic_d ] , italic_c italic_p [ italic_i italic_d + 2 ];     id+=3limit-from𝑖𝑑3id+=3italic_i italic_d + = 3;
25:        end if
26:     end while
27:     cp=cpnew𝑐𝑝𝑐subscript𝑝𝑛𝑒𝑤cp=cp_{new}italic_c italic_p = italic_c italic_p start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT;
28:  until cp𝑐𝑝cpitalic_c italic_p is stable;
29:  return  cp𝑐𝑝cpitalic_c italic_p;
Table 5. Google Trends query set.
Name Query
#1 E-commerce
Amazon/Apple/BestBuy/Costco/Craigslist/Ebay/
Homedepot/Kohls/Macys/Target/Walmart
#2 VoD
AppleTV/ESPN/HBO/Hulu/Netflix/Sling/
Vudu/YouTube
#3 Sweets
Cake/Candy/Chocolate/Cookie/Cupcake/
Gum/Icecream/Pie/Pudding
#4 Covid
Covid/Corona/Flu/Influenza/Vaccine/Virus
#5 GAFAM
Amazon/Apple/Facebook/Google/Microsoft
Refer to caption
Figure 8. Total description cost of DMM: our method consistently outperforms its baselines (lower is better).
Refer to caption

(a) Clustering results on the original tensor time series Refer to caption
(b) State and query networks

Figure 9. Effectiveness of DMM on #1 E-commerce dataset: (a) it splits the tensor into two clusters shown by colors (i.e., #blue\rightarrowDairy products” and #pink\rightarrowOnline sales”). (b) each cluster has distinct state and query networks.