Dynamic Multi-Network Mining of Tensor Time Series

Kohei Obata, Koki Kawabata, Yasuko Matsubara, and Yasushi Sakurai (all SANKEN, Osaka University, Japan)

(2024)

Abstract.

Subsequence clustering of time series is an essential task in data mining, and interpreting the resulting clusters is also crucial since we generally do not have prior knowledge of the data. Thus, given a large collection of tensor time series consisting of multiple modes, including timestamps, how can we achieve subsequence clustering for tensor time series and provide interpretable insights? In this paper, we propose a new method, Dynamic Multi-network Mining (DMM), that converts a tensor time series into a set of segment groups of various lengths (i.e., clusters) characterized by a dependency network constrained with $\ell_{1}$ -norm. Our method has the following properties. (a) Interpretable: it characterizes the cluster with multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode, and thus provides visible and interpretable insights into the key relationships. (b) Accurate: it discovers the clusters with distinct networks from tensor time series according to the minimum description length (MDL). (c) Scalable: it scales linearly in terms of the input data size when solving a non-convex problem to optimize the number of segments and clusters, and thus it is applicable to long-range and high-dimensional tensors. Extensive experiments with synthetic datasets confirm that our method outperforms the state-of-the-art methods in terms of clustering accuracy. We then use real datasets to demonstrate that DMM is useful for providing interpretable insights from tensor time series.

Tensor time series, Clustering, Network inference, Graphical lasso

Published in: Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore. Copyright: ACM licensed. ISBN: 979-8-4007-0171-9/24/05. DOI: 10.1145/3589334.3645461. CCS concepts: Information systems → Data mining; Information systems → Clustering.

1. Introduction

The development of IoT has facilitated the collection of time series data, including data related to automobiles (Miyajima et al., 2007), medicine (Hirano and Tsumoto, 2006; Monti et al., 2014), and finance (Namaki et al., 2011; Ruiz et al., 2012), from multiple modes such as sensor type, locations and users, which we call tensor time series (TTS). An instance of such data is online activity data, which records search volumes in three modes {Query, Location, Timestamp}. These TTS can often be divided and grouped into subsequences that have similar traits (i.e., clusters). Time series subsequence clustering (Aghabozorgi et al., 2015; Zolhavarieh et al., 2014) is a useful unsupervised exploratory approach for recognizing dynamic changes and uncovering interesting patterns in time series. As well as clustering data, the interpretability of the results is also important since we rarely know what each cluster refers to (Plant and Böhm, 2011; Rudin, 2019). Modeling a cluster as a dependency network (Hallac et al., 2017b; Tozzo et al., 2021; Tan et al., 2015), where nodes are variables and an edge expresses a relationship between variables, gives a clear explanation of what the cluster refers to. Considering that a TTS consists of multiple modes (Madabhushi and Lee, 2016; Gatto et al., 2021; Batalo et al., 2022), a cluster should be modeled as multiple networks, where each is a dependency network of a corresponding non-temporal mode, to provide a good explanation. In the above example, a cluster can be modeled as query and location networks, where each explains the relationships among queries/locations. With these networks, we can understand why a particular cluster distinguishes itself from another and speculate about what happened during a period belonging to the cluster. Given such a TTS, how can we find clusters with interpretability contributing to a better understanding of the data?

Research on time series subsequence clustering has mainly focused on univariate or multivariate time series (UTS and MTS). TTS is a generalization of time series and includes UTS and MTS. Here, we mainly assume that TTS has three or more modes. Generally, UTS clustering methods use distance-based metrics such as dynamic time warping (Berndt and Clifford, 1994). These methods focus on matching raw values and do not consider relationships among variables, which is essential if we are to interpret the MTS and TTS clustering. MTS clustering methods usually employ model-based clustering, which assumes, for example, a Gaussian (Matsubara et al., 2014) or an ARMA (Xiong and Yeung, 2004) model and attempts to find clusters that recover the data from the model. The interpretability of the clustering results depends on the model they assume. As a technique for interpretable clustering, TICC (Hallac et al., 2017b) models an MTS with a dependency network and discovers interpretable clusters that previously developed methods cannot find. Nevertheless, TTS clustering is a more challenging problem, and MTS methods cannot simply be applied to it because the multiple modes of a TTS introduce intricate dependencies and a massive data size. To employ an MTS clustering method (e.g., TICC) for TTS, the TTS must be flattened to form a high-dimensional MTS. As a result, the method processes the flattened MTS and mixes up all the relationships between variables, which may capture spurious relationships and unnecessarily degrade interpretability. Moreover, its computational time increases greatly as the number of variables in a mode increases.

In this paper, we propose a new method for TTS subsequence clustering, which we call Dynamic Multi-network Mining (DMM) (our source code and datasets are publicly available at https://github.com/KoheiObata/DMM). In our method, we define each cluster as multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode and thus can be seen as visual images that can help users quickly understand the data structure. Our algorithm scales linearly with the input data size while employing the divide-and-conquer method and is thus applicable to long-range and high-dimensional tensors. Furthermore, the clustering results and every user-defined parameter of our method can be determined by a single criterion based on the Minimum Description Length (MDL) principle (Grünwald, 2007). DMM is a useful tool for TTS subsequence clustering that enables multifaceted analysis and understanding of TTS.

1.1. Preview of our results

Fig. 1 shows the DMM results for clustering over Google Trends data, which consists of $10$ years of daily web search counts for six queries related to COVID-19 across $10$ countries, forming a $3^{rd}$ -order tensor. Fig. 1 (a) shows the cluster assignments of the TTS, where each color represents a cluster. DMM splits the tensor into four segments and groups them into four clusters, each of which can be interpreted as a distinct phase corresponding to the evolving social response to COVID-19; thus, we name these phases “Before Covid,” “Outbreak,” “Vaccine,” and “Adaptation.” It is worth noting that this result is obtained with no prior knowledge.

Fig. 1 (b) presents the networks of each cluster, i.e., a country network, whose nodes are plotted on the world map and which reflects dependencies between different countries, and a query network, which reflects dependencies between queries. These networks, also known as Markov random fields (MRFs) (Rue and Held, 2005), illustrate how each node affects the other nodes. The thickness and color of the edges in the network indicate the strength of the partial correlation between the nodes, which captures a more direct relationship than a simple correlation. We learn the networks by estimating a Gaussian inverse covariance matrix. Then, by definition, if there is an edge between two nodes, the nodes are directly dependent on each other. Otherwise, they are conditionally independent, given the rest of the nodes. Moreover, we impose an $\ell_{1}$-norm penalty on the networks to promote sparsity, which helps to recover the true networks, improves interpretability, and makes the method robust to noise (Wytock and Kolter, 2013; Yuan and Lin, 2006). These networks provide visible and interpretable insights into the key relationships that characterize clusters.

Figure 1. Effectiveness of DMM on Google Trends (#4 Covid) dataset: (a) DMM can split the tensor time series into meaningful subsequence clusters shown by colors (i.e., #green $\rightarrow$ “Before Covid”, #pink $\rightarrow$ “Outbreak”, #gray $\rightarrow$ “Vaccine”, #blue $\rightarrow$ “Adaptation”), and (b) their important relationships between variables are summarized with country and query networks, where the nodes show individual variables, and the thickness and color of the edges are partial correlations showing the importance of each interaction.

We see that each of the four clusters exhibits unique networks that evolve with the different phases. In the “Before Covid” phase, the country network displays edges between English-speaking countries, indicating their interconnectedness. In the query network, the query “vaccine” correlates with “influenza.” However, during the “Outbreak” starting in $2020$ , many countries respond to the COVID-19 pandemic, leading to various edges in the country network. In the query network of this phase, new edges related to “coronavirus” appear, and “coronavirus” and “virus” have a particularly strong connection. In the “Vaccine” phase, as people become more concerned about protection from COVID-19, the query “vaccine” forms an edge with “covid.” Moreover, since flu infects fewer people than in the past, “influenza” loses its edges. Lastly, during the “Adaptation” phase, as the world becomes accustomed to the situation, the country network reduces the number of edges, and the edges related to “influenza” reappear, reflecting a return to the networks observed in the “Before Covid” phase.

1.2. Contributions

In summary, we propose DMM as a subsequence clustering method for TTS based on the MDL principle that enables each cluster to be characterized by multiple networks. The contributions of this paper can be summarized as follows.

• Interpretable: DMM realizes the meaningful subsequence clustering of TTS, where each cluster is characterized by sparse dependency networks for each non-temporal mode, which facilitates the interpretation of the cluster from important relationships between variables.
• Accurate: We define a criterion based on MDL to discover clusters with distinct networks. Thanks to the proposed criterion, any user-defined parameters can be determined, and DMM outperforms its state-of-the-art competitors in terms of clustering accuracy on synthetic data.
• Scalable: The proposed clustering algorithm in DMM scales linearly as regards the input data size and is thus applicable to long-range and high-dimensional tensors.

Outline. The rest of the paper is organized as follows. After introducing related work in Section 2, we present our problem and basic background in Section 3. We then propose our model and algorithm in Sections 4 and 5, respectively. We report our experimental results in Sections 6 and 7.

2. Related work

We review previous studies that are closely related to our work.

Time series subsequence clustering. Subsequence clustering is an important task in time series data mining: it extracts interesting patterns, provides valuable information, and can also be used as a subroutine of other tasks such as forecasting (Takahashi et al., 2017; Papadimitriou et al., 2005). Time series subsequence clustering methods can be roughly separated into distance-based methods and model-based methods. Distance-based methods use metrics such as dynamic time warping (Berndt and Clifford, 1994; Keogh, 2002; Alaee et al., 2021) and longest common subsequence (Vlachos et al., 2002) and find clusters by focusing on matching raw values rather than structure in the data. Model-based methods assume a model for each cluster and find the best fit of the data to the model. They cover a wide variety of models such as ARMA (Xiong and Yeung, 2004), Markov chain (Ramoni et al., 2000), and Gaussian (Matsubara et al., 2014). However, most previous work has focused on MTS and is not suitable for TTS. Few studies have focused on TTS clustering. For example, CubeScope (Nakamura et al., 2023) uses a Dirichlet prior as a model to achieve online TTS clustering, but it only supports sparse categorical data. In summary, existing methods are not particularly well-suited to handling TTS and discovering interpretable clusters.

Tensor time series. TTS are ubiquitous and appear in a variety of applications, such as recommendation and demand prediction (Bai et al., 2019; Wu et al., 2019; Matsubara et al., 2016). To model a tensor, tensor/matrix decomposition, such as Tucker/CP decomposition (Kolda and Bader, 2009) and SVD, is a commonly used technique. Although it obtains a lower-dimensional representation that summarizes important patterns from a tensor, it struggles to capture temporal information (Liu et al., 2020). Therefore, it is often combined with dynamical systems to handle temporal information (Rogers et al., 2013; Cai et al., 2015; Jing et al., 2021). For example, SSMF (Kawabata et al., 2021), which is an online forecasting method that uses clustering as a subroutine, combines a dynamical system with non-negative matrix factorization (NMF) to capture seasonal patterns from a TTS. Each cluster in SSMF is characterized by a lower-dimensional representation of a TTS; however, understanding the representation is demanding. Thus, tensor/matrix decomposition is not suitable for an interpretable model.

Sparse network inference. Inferring a sparse inverse covariance matrix (i.e., network) from data helps us to understand the dependency of variables in a statistical way. Graphical lasso (Friedman et al., 2008), which maximizes the Gaussian log-likelihood while imposing an $\ell_{1}$-norm penalty, is one of the most commonly used techniques for estimating the sparse network from static data. However, time series data are normally non-stationary, and the network varies over time; thus, to infer time-varying networks, temporal similarity with the neighboring networks is usually considered (Hallac et al., 2017a). The monitoring of such time-varying networks has been studied with the aim of analyzing economic data (Namaki et al., 2011) and biological signal data (Monti et al., 2014) because of the high interpretability of the network (Tomasi et al., 2021). Although the inference of time-varying networks is able to find change points by comparing the networks before and after a change, it cannot find clusters (Tomasi et al., 2018; Harutyunyan et al., 2019; Xuan and Murphy, 2007). TICC (Hallac et al., 2017b) and TAGM (Tozzo et al., 2021) use graphical lasso and find clusters from time series based on the network of each subsequence, providing the clusters with interpretability and allowing us to discover clusters that other traditional clustering methods cannot find. However, they cannot provide interpretable insights when dealing with TTS. Consequently, past studies have yet to find networks for TTS and a way to cluster TTS based on the networks. Our method uses a graphical lasso-based model modified to provide interpretable clustering results from TTS.

3. Problem formulation

In this section, we describe the TTS we want to analyze, introduce some necessary background material, and define the formal problem of TTS clustering.

The main symbols employed in this paper are described in Appendix A. Consider an $(N+1)^{th}$-order TTS $\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}$, where the mode-$(N+1)$ is the time and its dimension is $T$. We can also rewrite the TTS as a sequence of $N^{th}$-order tensors $\mathcal{X}=\{\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{T}\}$, where each $\mathcal{X}_{t}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}}$ $(1\leq t\leq T)$ denotes the observed data at the $t^{th}$ time step.

3.1. Tensor algebra

We briefly introduce some definitions in tensor algebra from tensor related literature (Kolda and Bader, 2009; Cai et al., 2015).

Definition 1 (Reorder).

Let the ordered sets $P^{(1)},\dots,P^{(G)}$, where $P^{(g)}=\{p^{(g)}_{1},\dots,p^{(g)}_{n_{g}}\}\subset\{1,2,\dots,N\}$, be a partitioning of the modes $\{1,2,\dots,N\}$ s.t. $\sum_{g=1}^{G}n_{g}=N$. The reordering of an $N^{th}$-order tensor $\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}}$ into ordered sets is defined as $re(\mathcal{X})^{(P^{(1)},\dots,P^{(G)})}\in\mathbb{R}^{J^{(1)}\times\dots\times J^{(G)}}$, where $J^{(g)}=\prod_{n\in P^{(g)}}D_{n}$.

Given a tensor $\mathcal{X}\in\mathbb{R}^{D^{(1)}_{1}\times\cdots\times D^{(1)}_{N}\times D^{(2)}_{1}\times\cdots\times D^{(G)}_{N}}$, we partition the modes into $G$ groups, $P^{(g)}=\{(g-1)N+1,\dots,gN\}$. The element is given by $re(\mathcal{X})^{(P^{(1)},\dots,P^{(G)})}_{i^{(1)},\dots,i^{(G)}}=\mathcal{X}_{d^{(1)}_{1},\dots,d^{(1)}_{N},d^{(2)}_{1},\dots,d^{(G)}_{N}}$, where, for example, $i^{(1)}=1+\sum_{n=1}^{N}(d^{(1)}_{n}-1)\prod_{m=1}^{n-1}D^{(1)}_{m}$, and the other indices $i^{(g)}$ are defined analogously.

Special cases of reordering are vectorization and matricization. Vectorization happens when $G=1$: $vec(\mathcal{X})=re(\mathcal{X})^{(\{-1\})}\in\mathbb{R}^{D}$, where $D=\prod_{n=1}^{N}D_{n}$ and $\{-1\}$ refers to the remaining unset modes. Mode-$n$ matricization happens when $G=2$ and $P^{(1)}$ is a singleton: $mat(\mathcal{X})^{(n)}=re(\mathcal{X})^{(\{n\},\{-1\})}\in\mathbb{R}^{D_{n}\times D^{(\backslash n)}}$, where $D^{(\backslash n)}=\prod_{m=1(m\neq n)}^{N}D_{m}$.
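The reordering, vectorization, and matricization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `reorder`, `vec`, and `mat` are hypothetical, modes are 0-based (unlike the 1-based notation in the text), and we assume that within a group the first listed mode varies fastest, which is how we read the index formula for $i^{(1)}$.

```python
import numpy as np

def reorder(x, groups):
    """Merge each group of modes of x into a single axis.

    groups: list of lists of 0-based mode indices that together
    partition range(x.ndim). Within a group, the first listed mode
    varies fastest (assumed reading of the index formula in the text).
    """
    perm = [m for g in groups for m in g]
    assert sorted(perm) == list(range(x.ndim)), "groups must partition the modes"
    sizes = [int(np.prod([x.shape[m] for m in g])) for g in groups]
    # Fortran order makes the first index of each group the fastest one.
    return np.transpose(x, perm).reshape(sizes, order="F")

def vec(x):
    """Vectorization: a single group containing every mode (G = 1)."""
    return reorder(x, [list(range(x.ndim))])

def mat(x, n):
    """Mode-n matricization: group {n} first, all remaining modes second."""
    rest = [m for m in range(x.ndim) if m != n]
    return reorder(x, [[n], rest])
```

For a TTS stored with time as the last axis, `mat(x, x.ndim - 1)` would give the $T\times D$ matrix used by the graphical lasso in the next subsection.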

3.2. Graphical lasso

We use graphical lasso as a part of our model. Given the mode-(N+1) matricization of the $(N+1)^{th}$ -order TTS, $mat(\mathcal{X})^{(N+1)}\in\mathbb{R}^{T\times D}$ , the graphical lasso (Friedman et al., 2008) estimates the sparse Gaussian inverse covariance matrix (i.e., network) $\theta\in\mathbb{R}^{D\times D}$ , also known as the precision matrix, with which we can interpret pairwise conditional independencies among $D$ variables, e.g., if $\theta_{i,j}=0$ then variables $i$ and $j$ are conditionally independent given the values of all the other variables. The optimization problem is given as follows:

$$\textrm{minimize}_{\theta\in S^{p}_{++}}\quad\lambda\|\theta\|_{od,1}-\sum_{t=1}^{T}ll(mat(\mathcal{X})^{(N+1)}_{t,:},\theta), \tag{1}$$

$$ll(x,\theta)=-\frac{1}{2}(x-\mu)^{T}\theta(x-\mu)+\frac{1}{2}\log\det\theta-\frac{D}{2}\log(2\pi), \tag{2}$$

where $\theta$ must be a symmetric positive definite matrix ($S^{p}_{++}$), $ll(x,\theta)$ is the log-likelihood, and $\mu\in\mathbb{R}^{D}$ is the empirical mean of $mat(\mathcal{X})^{(N+1)}$. $\lambda\geq 0$ is a hyperparameter for determining the sparsity level of the network, and $\|\cdot\|_{od,1}$ indicates the off-diagonal $\ell_{1}$-norm. Since Eq. (1) is a convex optimization problem, the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) is guaranteed to converge to its global optimum, and it can also shorten the solution time.
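To make Eqs. (1) and (2) concrete, the following Python sketch evaluates the graphical lasso objective for a given candidate $\theta$. It only computes the objective value and does not implement the ADMM solver; the function names are hypothetical and introduced here purely for illustration.

```python
import numpy as np

def log_likelihood(x, mu, theta):
    """Gaussian log-likelihood ll(x, theta) of Eq. (2) for one sample x."""
    d = x.shape[0]
    diff = x - mu
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        raise ValueError("theta must be positive definite")
    return -0.5 * diff @ theta @ diff + 0.5 * logdet - 0.5 * d * np.log(2 * np.pi)

def off_diagonal_l1(theta):
    """Off-diagonal l1-norm: sum of |theta_ij| over i != j."""
    return np.abs(theta).sum() - np.abs(np.diag(theta)).sum()

def glasso_objective(data, theta, lam):
    """Objective of Eq. (1) for data of shape (T, D)."""
    mu = data.mean(axis=0)  # empirical mean, as in the text
    total_ll = sum(log_likelihood(row, mu, theta) for row in data)
    return lam * off_diagonal_l1(theta) - total_ll
```

A solver would search over symmetric positive definite $\theta$ to minimize this value; larger `lam` pushes more off-diagonal entries to zero.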

3.3. Network-based tensor time series clustering

A complex real-world $\mathcal{X}$ cannot be expressed by a single static network because it contains multiple sequence patterns, each of which has a distinct relationship/network. Moreover, we rarely know the optimal number of clusters and cluster assignments in advance. To address this issue, we want to provide an appropriate cost function and achieve subsequence clustering by minimizing the cost function. We now formulate the network-based TTS clustering problem. It assumes that $T$ time steps of $\mathcal{X}$ can be divided into $m$ time segments based on $K$ networks (i.e., clusters). Let $cp$ denote the set of starting points of the segments, i.e., $cp=\{cp_{1},cp_{2},\dots,cp_{m}\}$; the $i$-th segment of $\mathcal{X}$ is denoted as $\mathcal{X}_{cp_{i}:cp_{i+1}}$, where $cp_{m+1}=T+1$. We group each of the $T$ points into one of the $K$ clusters denoted by a cluster assignment set $\mathcal{F}=\{f_{1},f_{2},\dots,f_{K}\}$, where $f_{k}\subset\{1,2,\dots,T\}$, and we refer to all subsequences in the cluster $k$ as $\mathcal{X}[f_{k}]\subset\mathcal{X}$. Then, letting $\Theta$ be a model parameter set, i.e., $\Theta=\{\theta_{1},\theta_{2},\dots,\theta_{K}\}$, each $\theta_{k}\in\mathbb{R}^{D\times D}$ is a sparse Gaussian inverse covariance matrix that summarizes the relationships of variables in $\mathcal{X}[f_{k}]$. Therefore, the entire cluster parameter set is given by $\mathcal{M}=\{\mathcal{M}_{1},\mathcal{M}_{2},\dots,\mathcal{M}_{K}\}$, consisting of $\mathcal{M}_{k}=\{\theta_{k},f_{k}\}$. Overall, the problem that we want to solve is written as follows.

Problem 1.

Given a tensor time series $\mathcal{X}$, estimate:

• a cluster assignment set, $\mathcal{F}=\{f_{k}\}_{k=1}^{K}$
• a model parameter set, $\Theta=\{\theta_{k}\}_{k=1}^{K}$
• the number of clusters $K$

that minimizes the cost function Eq. (10).

4. Proposed DMM

In this section, we propose a new model with which to realize network-based TTS clustering, namely, DMM. We first describe our model $\theta$ , and then we define the criterion for determining the cluster assignments and the number of clusters.

4.1. Multimode graphical lasso

Assuming that $K$ and $\mathcal{F}$ are given, we address how to define and infer the model $\theta_{k}$. The original graphical lasso allows $\theta_{k}$ to connect any pair of variables in a tensor; however, it is too high-dimensional to reveal relationships separately in terms of the non-temporal modes. To avoid this over-representation, we aim to capture the multi-aspect relationships by separating $\theta_{k}$ into multiple modes, to which we add a desired constraint for interpretability.

We assume that $\theta$ is derived from $N$ networks, $\{A^{(1)},\dots,A^{(N)}\}$, where $A^{(n)}\in\mathbb{R}^{D_{n}\times D_{n}}$ is the $n$-th network. For example, an element $a^{(n)}_{i,j}\in A^{(n)}$ refers to the relationship between the $i$-th and $j$-th variables of mode-$n$. In each network, the goal is to capture the dependencies between $D_{n}$ variables. We also assume that there are no relationships except among variables that differ only at mode-$n$. Thus, $\theta=\theta^{(N)}$ becomes an $N^{th}$-order hierarchical matrix of shape $D\times D$. $\theta^{(n)}$ can be written as follows:

$$\theta^{(n)}=\begin{pmatrix}\theta^{(n-1)}&C^{(n)}_{1,2}&\cdots&C^{(n)}_{1,D_{n}-1}&C^{(n)}_{1,D_{n}}\\ C^{(n)}_{2,1}&\theta^{(n-1)}&\cdots&C^{(n)}_{2,D_{n}-1}&C^{(n)}_{2,D_{n}}\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ C^{(n)}_{D_{n}-1,1}&C^{(n)}_{D_{n}-1,2}&\cdots&\theta^{(n-1)}&C^{(n)}_{D_{n}-1,D_{n}}\\ C^{(n)}_{D_{n},1}&C^{(n)}_{D_{n},2}&\cdots&C^{(n)}_{D_{n},D_{n}-1}&\theta^{(n-1)}\end{pmatrix},$$

where $\theta^{(1)}=A^{(1)}$ and $C^{(n)}_{i,j}\in\mathbb{R}^{\prod_{m=1}^{n-1}D_{m}\times\prod_{m=1}^{n-1}D_{m}}$ is a diagonal matrix whose diagonal elements all equal $a^{(n)}_{i,j}\in A^{(n)}$, i.e., $(C^{(n)}_{i,j})_{p,q}=a^{(n)}_{i,j}\cdot\delta_{p,q}$, where $\delta_{p,q}$ is the Kronecker delta. This structure allows only edges between variables that differ only at mode-$n$.
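The recursive block structure can be assembled with Kronecker products: the diagonal blocks repeat $\theta^{(n-1)}$, and each off-diagonal block is a scaled identity. The sketch below is a minimal illustration under our reading of the block matrix (in particular, the diagonal entries of $A^{(n)}$ for $n\geq 2$ do not enter $\theta$ in this construction); `build_theta` is a hypothetical name, not part of the authors' code.

```python
import numpy as np

def build_theta(networks):
    """Assemble the hierarchical matrix theta^{(N)} from [A^{(1)}, ..., A^{(N)}].

    theta^{(1)} = A^{(1)}. For n >= 2, the diagonal blocks of theta^{(n)}
    are theta^{(n-1)} and the (i, j) off-diagonal block is a^{(n)}_{i,j} * I.
    """
    theta = np.asarray(networks[0], dtype=float)
    for a in networks[1:]:
        a = np.asarray(a, dtype=float)
        off_diag = a - np.diag(np.diag(a))  # keep only the edges of A^{(n)}
        prev = theta.shape[0]
        theta = np.kron(np.eye(a.shape[0]), theta) + np.kron(off_diag, np.eye(prev))
    return theta
```

With $D_{1}=2$ and $D_{2}=3$, for instance, the result is a $6\times 6$ matrix made of $3\times 3$ blocks of size $2\times 2$.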

We extend graphical lasso to obtain $\theta$ by inferring a sparse $A^{(n)}$ from a TTS. The optimization problem is written as follows:

$$\textrm{minimize}_{A^{(n)}\in S^{p}_{++}}\quad\lambda\|A^{(n)}\|_{od,1}-\sum_{t=1}^{T}ll_{n}(re(\mathcal{X})^{(\{N+1\},\{-1\},\{n\})}_{t,:,:},A^{(n)}), \tag{8}$$

$$ll_{n}(re(\mathcal{X})_{t,:,:},A^{(n)})=\sum_{d=1}^{D^{(\backslash n)}}\Big\{-\frac{1}{2}(re(\mathcal{X})_{t,d,:}-\mu_{d})^{T}A^{(n)}(re(\mathcal{X})_{t,d,:}-\mu_{d})+\frac{1}{2}\log\det A^{(n)}-\frac{D_{n}}{2}\log(2\pi)\Big\}/D^{(\backslash n)}, \tag{9}$$

where $\mu_{d}\in\mathbb{R}^{D_{n}}$ is the empirical mean of the variable $re(\mathcal{X})_{:,d,:}\in\mathbb{R}^{T\times D_{n}}$ . Eq. (8) is a convex optimization problem solved by ADMM. We divide the log-likelihood by $D^{(\backslash n)}$ to scale the sample size.
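As a rough illustration of Eq. (9), the sketch below evaluates the mode-$n$ log-likelihood summed over all time steps, which is the quantity that enters Eq. (8). It assumes the TTS is stored as a NumPy array of shape $(D_{1},\dots,D_{N},T)$ with 0-based modes; the function names are hypothetical. The order in which the remaining modes are flattened does not matter here because the likelihood sums over $d$.

```python
import numpy as np

def mode_n_view(x, n):
    """Return re(X)^{({N+1},{-1},{n})} as an array of shape (T, D_rest, D_n).

    x has shape (D_1, ..., D_N, T); n is a 0-based non-temporal mode.
    """
    t_axis = x.ndim - 1
    rest = [m for m in range(t_axis) if m != n]
    moved = np.transpose(x, [t_axis] + rest + [n])
    return moved.reshape(x.shape[t_axis], -1, x.shape[n])

def mode_n_log_likelihood(x, n, a):
    """Sum over t of ll_n in Eq. (9) for a mode-n network a of shape (D_n, D_n)."""
    r = mode_n_view(x, n)
    diff = r - r.mean(axis=0, keepdims=True)  # subtract mu_d for each slice d
    quad = np.einsum("tdi,ij,tdj->", diff, a, diff)
    sign, logdet = np.linalg.slogdet(a)
    if sign <= 0:
        raise ValueError("a must be positive definite")
    t, d_rest, d_n = r.shape
    const = 0.5 * logdet - 0.5 * d_n * np.log(2 * np.pi)
    # Dividing by D_rest rescales the effective sample size, as in Eq. (9).
    return (-0.5 * quad + t * d_rest * const) / d_rest
```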

4.2. Data compression

To determine the cluster assignment set $\mathcal{F}$ and the number of clusters $K$ , we use the MDL principle (Grünwald, 2007), which follows the assumption that the more we compress the data, the more we generalize its underlying structures. The goodness of the model $\mathcal{M}$ can be described with the following total description cost:

$$Cost_{T}(\mathcal{X};\mathcal{M})=Cost_{A}(\mathcal{F})+Cost_{M}(\Theta)+Cost_{C}(\mathcal{X}|\mathcal{M})+Cost_{\ell_{1}}(\Theta). \tag{10}$$

We describe the four terms that appear in Eq. (10).

Coding length cost. $Cost_{A}(\mathcal{F})$ is the description complexity of the cluster assignment set $\mathcal{F}$, which consists of the following elements: the number of clusters $K$ and segments $m$ require $\log^{*}(K)+\log^{*}(m)$ (here, $\log^{*}$ is the universal code length for integers). The assignments of the segments to clusters require $m\times\log^{*}(K)$. The number of observations of each cluster requires $\sum_{k=1}^{K}\log^{*}(|f_{k}|)$.

$$Cost_{A}(\mathcal{F})=\log^{*}(K)+\log^{*}(m)+m\times\log^{*}(K)+\sum_{k=1}^{K}\log^{*}(|f_{k}|). \tag{11}$$
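The coding length cost of Eq. (11) is simple to compute once $\log^{*}$ is fixed. The sketch below assumes Rissanen's form of the universal integer code length, $\log^{*}(n)=\log_{2}c_{0}+\log_{2}n+\log_{2}\log_{2}n+\cdots$ (positive terms only, $c_{0}\approx 2.865064$), measured in bits; the text does not spell out which variant is used, so treat this choice and the function names as assumptions.

```python
import math

def log_star(n):
    """Universal code length in bits for a positive integer n.

    Assumed form: log2(c0) + log2(n) + log2(log2(n)) + ..., keeping only
    the positive terms, with c0 ~= 2.865064.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    total = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def assignment_cost(segment_labels, cluster_sizes):
    """Cost_A of Eq. (11).

    segment_labels: cluster index assigned to each of the m segments.
    cluster_sizes: number of observations |f_k| in each of the K clusters.
    """
    m = len(segment_labels)
    k = len(cluster_sizes)
    return (log_star(k) + log_star(m) + m * log_star(k)
            + sum(log_star(size) for size in cluster_sizes))
```

For sequence "1,2,1" with 100 observations per segment, one would call `assignment_cost([0, 1, 0], [200, 100])`.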

Model coding cost. $Cost_{M}(\Theta)$ is the description complexity of the model parameter set $\Theta$, which consists of the following elements: the diagonal values of each cluster at each hierarchy, which have size $D_{n}\times 1$, require $D_{n}(\log(D_{n})+c_{F})$, where $c_{F}$ is the floating point cost (we used $4\times 8$ bits in our setting). The non-zero values of $A^{(n)}_{k}\in\mathbb{R}^{D_{n}\times D_{n}}$ require $|A^{(n)}_{k}|_{\neq 0}(\log(D_{n}(D_{n}-1)/2)+c_{F})$, and their number requires $\log^{*}(|A^{(n)}_{k}|_{\neq 0})$, where $|\cdot|_{\neq 0}$ describes the number of non-zero elements in a matrix.

$$Cost_{M}(\Theta)=\sum_{k=1}^{K}\sum_{n=1}^{N}\{D_{n}(\log(D_{n})+c_{F})+\log^{*}(|A^{(n)}_{k}|_{\neq 0})+|A^{(n)}_{k}|_{\neq 0}(\log(D_{n}(D_{n}-1)/2)+c_{F})\}/(D_{n}^{2}N). \tag{12}$$

We divide by $D_{n}^{2}N$ to deal with the change of data scale.

Data coding cost. $Cost_{C}(\mathcal{X}|\mathcal{M})$ is the data encoding cost of $\mathcal{X}$ given the cluster parameter set $\mathcal{M}$ . Huffman coding (Böhm et al., 2007) uses the logarithm of the inverse of probability (i.e., the negative log-likelihood) of the values.

$$Cost_{C}(\mathcal{X}|\mathcal{M})=-\sum_{k=1}^{K}\sum_{n=1}^{N}\sum_{t\in f_{k}}ll_{n}(re(\mathcal{X})^{(\{N+1\},\{-1\},\{n\})}_{t,:,:},A^{(n)}_{k}). \tag{13}$$

$\ell_{1}$ -norm cost. $Cost_{\ell_{1}}(\Theta)$ is the $\ell_{1}$ -norm cost given a model $\Theta$ .

$$Cost_{\ell_{1}}(\Theta)=\sum_{k=1}^{K}\sum_{n=1}^{N}\lambda\|A^{(n)}_{k}\|_{od,1}. \tag{14}$$

Discovering an optimal sparsity parameter $\lambda$ capable of modeling the data is a challenge, as it affects the clustering results. However, the parameter value can be determined by using MDL to choose the value that gives the minimum total cost (Miyaguchi et al., 2017).

Our next goal is to find the best cluster parameter set $\mathcal{M}$ that minimizes the total description cost Eq. (10).

5. Optimization algorithms

Algorithm 1 DMM $(\mathcal{X},\mathbf{w})$

Input: $(N+1)^{th}$-order TTS $\mathcal{X}$ and initial segment sizes set $\mathbf{w}$
Output: Cluster parameters $\Theta$ and cluster assignments $\mathcal{F}$
Initialize $cp$ with $\mathbf{w}$;
$cp=$ CutPointDetector $(\mathcal{X},cp)$; /* Finds the best cut point set */
/* ClusterDetector */
$K=1$; Initialize $\Theta=\{\theta_{1}\}$; $\mathcal{F}=\{\{1,\dots,T\}\}$;
Compute $Cost_{T}(\mathcal{X};\{\Theta,\mathcal{F}\})$;
repeat
    $K=K+1$; Initialize $\Theta$ for $K$ clusters;
    repeat
        $\mathcal{F}=$ SegmentAssignment $(\mathcal{X},\Theta,cp)$; /* E-step */
        $\Theta=$ NetworkInference $(\mathcal{X},\mathcal{F})$; /* M-step */
    until $\mathcal{F}$ is stable;
    Compute $Cost_{T}(\mathcal{X};\{\Theta,\mathcal{F}\})$;
until $Cost_{T}(\mathcal{X};\{\Theta,\mathcal{F}\})$ converges;
return $\mathcal{M}=\{\Theta,\mathcal{F}\}$;

Thus far, we have described our model based on graphical lasso and a criterion based on MDL. The most important question is how to discover good segmentation and clustering. Here, we propose an effective and scalable algorithm that finds a local optimum of Eq. (10). The overall procedure is summarized in Alg. 1. Given an $(N+1)^{th}$-order TTS $\mathcal{X}$, the total description cost Eq. (10) is minimized using the following two sub-algorithms.

(1) CutPointDetector: finds the number of segments $m$ and their cut points, i.e., the best cut point set $cp$ of $\mathcal{X}$.
(2) ClusterDetector: finds the number of clusters $K$ and the cluster parameter set $\mathcal{M}$.

5.1. CutPointDetector

The first goal is to divide a given $\mathcal{X}$ into $m$ segments (i.e., patterns), but we assume that no information is known about them in advance. Therefore, to prevent a pattern explosion when searching for their optimal cut points, we introduce CutPointDetector based on the divide-and-conquer method (Keogh et al., 2001).

Specifically, it recursively merges a small segment set of $\mathcal{X}$ while reducing its total description cost, because neighboring subsequences typically exhibit the same pattern. We define $\mathbf{w}$ as a set of user-defined initial segment sizes, i.e., $\mathbf{w}=\{w_{i}\}_{i=1}^{m}$, such as the number of days in each month or any small constant. An example illustration is shown in Fig. 2. Let $\theta_{i:i+1}$ be a model of the $i^{th}$ segment $\mathcal{X}_{cp_{i}:cp_{i+1}}$. Given the three consecutive segments illustrated in Fig. 2 (a), we evaluate whether to merge the middle segment with either of the side segments (Fig. 2 (b)(c)). The total description cost for Fig. 2 (a) is given by $Cost_{T}(\mathcal{X};\{\theta_{i:i+1},\theta_{i+1:i+2},\theta_{i+2:i+3}\})$, where we omit the cluster assignment (e.g., $\{j\}_{j=cp_{i}}^{cp_{i+1}-1}$) from the cost for clarity. If merging reduces the cost of the original three segments, CutPointDetector eliminates the unnecessary cut point and employs a new model $\theta$ for the merged segment. By repeating this procedure for each segment, $m$ decreases monotonically until convergence. See Appendix B.1 for the detailed procedure.
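The merging idea can be sketched with a simplified greedy loop. This is not the procedure of Appendix B.1: it compares only pairs of adjacent segments rather than the three-segment window of Fig. 2, and `segment_cost` is a hypothetical stand-in for the total description cost of modeling one segment with a single network.

```python
def merge_segments(cut_points, end, segment_cost):
    """Greedily drop cut points whose removal lowers the total cost.

    cut_points: sorted 0-based segment starts (the first one is 0).
    end: one past the last time index.
    segment_cost(start, stop): hypothetical cost of modeling the
        half-open range [start, stop) with a single model.
    """
    cps = list(cut_points)
    changed = True
    while changed and len(cps) > 1:
        changed = False
        i = 1
        while i < len(cps):
            left = cps[i - 1]
            right = cps[i + 1] if i + 1 < len(cps) else end
            separate = segment_cost(left, cps[i]) + segment_cost(cps[i], right)
            if segment_cost(left, right) < separate:
                del cps[i]  # merging is cheaper, so remove this cut point
                changed = True
            else:
                i += 1
    return cps
```

Because a cut point is only ever removed, the number of segments decreases monotonically, mirroring the behavior described above.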

5.2. ClusterDetector

DMM searches for the best number of clusters by increasing $K=1,2,\dots,m$ while the total description cost $Cost_{T}(\mathcal{X};\mathcal{M})$ keeps decreasing. To compute the cost, however, we must solve two problems, namely obtaining the cluster assignment set $\mathcal{F}$ and the model parameter set $\Theta$, each of which affects the optimization of the other. Therefore, we design ClusterDetector with the expectation-maximization (EM) algorithm. In the E-step, it determines $\mathcal{F}$ to minimize the data coding cost, $Cost_{C}(\mathcal{X}|\mathcal{M})$, which is achieved by solving:

$$\mathop{\rm arg\,min}\limits_{k\in\{1,\dots,K\}}Cost_{C}(\mathcal{X}|\{\theta_{k},\{j\}_{j=cp_{i}}^{cp_{i+1}-1}\}), \tag{15}$$

for the $i$ -th segment, and then inserts time points from $cp_{i}$ to $cp_{i+1}$ (i.e., $\{j\}_{j=cp_{i}}^{cp_{i+1}-1}$ ) to the best $k$ -th cluster $f_{k}\in\mathcal{F}$ . In the M-step, for $1\leq k\leq K$ the algorithm infers $A^{(n)}_{k}(1\leq n\leq N)$ according to Eq. (8) to obtain $\theta_{k}\in\Theta$ for a given $\mathcal{X}[f_{k}]$ . Note that ClusterDetector starts by randomly initializing $\Theta$ .
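The E-step of Eq. (15) amounts to an arg min over clusters for each segment. The sketch below is a simplification: it treats the data as a flattened $T\times D$ matrix with zero mean and uses one precision matrix per cluster, instead of summing the per-mode likelihoods of Eq. (13). The function names are hypothetical.

```python
import numpy as np

def neg_log_likelihood(segment, theta):
    """Negative Gaussian log-likelihood of the rows of segment under N(0, theta^{-1})."""
    t, d = segment.shape
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        raise ValueError("theta must be positive definite")
    quad = np.einsum("ti,ij,tj->", segment, theta, segment)
    return 0.5 * quad - 0.5 * t * logdet + 0.5 * t * d * np.log(2 * np.pi)

def assign_segments(data, cut_points, thetas):
    """For each segment, return the index of the cluster with the lowest cost.

    data: array of shape (T, D); cut_points: sorted 0-based segment starts;
    thetas: list of K precision matrices of shape (D, D).
    """
    bounds = list(cut_points) + [data.shape[0]]
    labels = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        costs = [neg_log_likelihood(data[start:stop], th) for th in thetas]
        labels.append(int(np.argmin(costs)))
    return labels
```

The M-step would then refit each cluster's networks on the segments assigned to it, and the two steps alternate until the assignments stop changing.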

Theoretical analysis.

Lemma 1.

The time complexity of DMM is $O(T\prod_{m=1}^{N}D_{m})$ , where $T$ is the data length, and $D_{m}$ is the number of variables at mode-m in $(N+1)^{th}$ -order TTS $\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}$ .

Proof.

Please see Appendix B.2. ∎

6. Experiments

In this section, we demonstrate the effectiveness of DMM on synthetic data. We use synthetic data because there are clear ground truth networks with which to test the clustering accuracy.

6.1. Experimental setting

6.1.1. Synthetic datasets

We randomly generate synthetic $(N+1)^{th}$ -order TTS, $\mathcal{X}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}$ , which follows a multivariate normal distribution $vec(\mathcal{X}_{t})\sim\mathcal{N}(0,\theta^{-1})$ . Each of the $K$ clusters has a mean of $\vec{0}$ , so that the clustering results are based entirely on the structure of the data. For each cluster, we generate a random ground truth inverse covariance matrix $\theta$ as follows (Mohan et al., 2014; Hallac et al., 2017b):

(1) For $n=1,\dots,N$ , set $A^{(n)}\in\mathbb{R}^{D_{n}\times D_{n}}$ equal to the adjacency matrix of an Erdős-Rényi directed random graph, where every edge has a $20\%$ chance of being selected.

(2) For every selected edge in $A^{(n)}$ , set $a^{(n)}_{i,j}\sim$ Uniform $([-0.6,-0.3]\cup[0.3,0.6])$ . We enforce a symmetry constraint whereby every $a^{(n)}_{i,j}=a^{(n)}_{j,i}$ .

(3) Construct a hierarchical matrix $\theta_{tem}\in\mathbb{R}^{D\times D}$ using $\{A^{(n)}\}_{n=1}^{N}$ .

(4) Let $c$ be the smallest eigenvalue of $\theta_{tem}$ , and set $\theta=\theta_{tem}+(0.1+|c|)I$ , where $I$ is an identity matrix. This ensures that $\theta$ is invertible.
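The four steps above can be sketched as follows. Two parts are assumptions rather than the paper's procedure: symmetry is enforced by mirroring the upper triangle, and step (3) is replaced by a Kronecker sum of the mode matrices, since the exact hierarchical construction is defined elsewhere in the paper. Function names are hypothetical.

```python
import numpy as np

def random_mode_network(d, rng, p=0.2):
    """Steps (1)-(2): sparse symmetric weights with magnitudes in [0.3, 0.6] and random sign."""
    mask = rng.random((d, d)) < p
    magnitude = rng.uniform(0.3, 0.6, size=(d, d))
    sign = rng.choice([-1.0, 1.0], size=(d, d))
    upper = np.triu(mask * magnitude * sign, k=1)  # assumption: mirror the upper triangle
    return upper + upper.T

def combine_modes(mats):
    """Assumed stand-in for step (3): Kronecker sum of the mode matrices."""
    dims = [m.shape[0] for m in mats]
    D = int(np.prod(dims))
    total = np.zeros((D, D))
    for n, A in enumerate(mats):
        term = np.eye(1)
        for m, d in enumerate(dims):
            term = np.kron(term, A if m == n else np.eye(d))
        total += term
    return total

def ground_truth_precision(dims, seed=0):
    """Step (4): shift the spectrum so the smallest eigenvalue is at least 0.1."""
    rng = np.random.default_rng(seed)
    mats = [random_mode_network(d, rng) for d in dims]
    theta_tem = combine_modes(mats)
    c = np.linalg.eigvalsh(theta_tem).min()
    theta = theta_tem + (0.1 + abs(c)) * np.eye(theta_tem.shape[0])
    return theta, mats

theta, mats = ground_truth_precision([4, 3])
print(theta.shape, np.linalg.eigvalsh(theta).min())
```

Adding $(0.1+|c|)I$ moves every eigenvalue up by $0.1+|c|$ , so the smallest eigenvalue of $\theta$ becomes $c+|c|+0.1\geq 0.1$ , which makes $\theta$ positive definite and hence invertible.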

6.1.2. Evaluation metrics

We run our experiments on four different temporal sequences: $\mathsf{A}$ : “1,2,1”, $\mathsf{B}$ : “1,2,3,2,1”, $\mathsf{C}$ : “1,2,3,4,1,2,3,4”, and $\mathsf{D}$ : “1,2,2,1,3,3,3,1” (for example, $\mathsf{A}$ consists of three segments and two clusters $\theta_{1}$ and $\theta_{2}$ ). We set each cluster in each example to have $100G$ observations, where $G$ is the number of segments in each cluster (e.g., $\mathsf{A}$ has $T=300$ ), and cut points are set randomly. We generate each dataset ten times and report the mean of the macro-F1 score.
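For reference, a minimal sketch of a macro-F1 computation for clustering output is given below. Because cluster indices are arbitrary, the sketch assumes that predicted labels are matched to ground-truth labels by the permutation that maximizes the score; the text does not state which matching is used, so this choice and the function names are assumptions.

```python
from itertools import permutations

def macro_f1(true, pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def best_macro_f1(true, pred):
    """Macro-F1 under the best one-to-one relabeling of predicted clusters (assumed matching)."""
    labels = sorted(set(true) | set(pred))
    classes = sorted(set(true))
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        best = max(best, macro_f1(true, [mapping[p] for p in pred], classes))
    return best

# Sequence A ("1,2,1") recovered up to a relabeling of the two clusters.
true = [1] * 3 + [2] * 3 + [1] * 3
pred = [2] * 3 + [1] * 3 + [2] * 3
print(best_macro_f1(true, pred))
```

Enumerating permutations is only practical for the small numbers of clusters used here (at most five labels).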

6.1.3. Baselines

We compare our method with the following two state-of-the-art methods for time series clustering using the graphical lasso as their model.

• TAGM (Tozzo et al., 2021): combines HMM with a graphical lasso by modeling each cluster as a graphical lasso and assuming clusters as hidden states of HMM.

• TICC (Hallac et al., 2017b): uses the Toeplitz matrix to capture lag correlations and inter-variable correlations and penalizes changing clusters to assign the neighboring segments to the same cluster.

We do not compare with other clustering methods that ignore the network, such as K-means and DTW, because they do not show good results (Hallac et al., 2017b).

6.1.4. Parameter tuning

DMM and the baselines require a sparsity parameter for the $\ell_{1}$ -norm. We varied $\lambda\in\{0.5,1,2,4\}$ and set $\lambda=4$ for DMM and $\lambda=0.5$ for the baselines, which produced the best results. The baselines are given a matricization of the tensor, $mat(\mathcal{X})^{(N+1)}\in\mathbb{R}^{T\times D}$ , and the true number of clusters, since they require the number of clusters to be set. To tune TICC, we varied the regularization parameter $\beta\in\{4,16,64,256\}$ and set $\beta=16$ , and we set the window size $w=1$ , which is the correct assumption considering the data generation process. DMM requires us to specify $\mathbf{w}$ . We use the same $w_{i}$ ( $i=1,\dots,m$ ) for all initial segments, and we set $w_{i}=4$ .
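As an illustration of the input handed to the baselines, the sketch below unfolds a tensor with the temporal mode last into a $T\times D$ matrix. The row-major ordering of the flattened non-temporal modes is an assumption, since any fixed ordering of $vec(\mathcal{X}_{t})$ would serve; the function name is hypothetical.

```python
import numpy as np

def matricize_time(X):
    """Unfold a (D_1, ..., D_N, T) tensor into a (T, D) matrix with D = D_1 * ... * D_N."""
    T = X.shape[-1]
    # Move the temporal mode to the front, then flatten the remaining modes row-major.
    return np.moveaxis(X, -1, 0).reshape(T, -1)

X = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # D_1 = 2, D_2 = 3, T = 4
M = matricize_time(X)
print(M.shape)
```

Each row of the result is one time step, so the mode structure that DMM exploits is no longer visible to a method that only sees this matrix.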

6.2. Results

6.2.1. Clustering accuracy

We take four different temporal sequences $\mathsf{A}$ $\sim$ $\mathsf{D}$ , and two different data sizes (i) and (ii) to observe the ability of DMM as regards clustering TTS. Table 1 shows the macro-F1 scores for each dataset. ${}^{\dagger}$ indicates that TAGM and TICC select the number of clusters from $K\in\{2,3,4,5\}$ using the Bayesian information criterion (BIC). As shown, DMM outperforms the baselines on most of the datasets, even for the (i) $2^{nd}$ -order TTS datasets. The difference in (ii) is even more noteworthy, because TAGM and TICC cannot handle $3^{rd}$ -order TTS due to the limitation imposed by the matricization of the tensor.

Table 1. Macro-F1 score of clustering accuracy on eight synthetic datasets (four temporal sequences at two data sizes), comparing DMM with state-of-the-art methods (higher score is better). Best results are in bold, and second best results are underlined. ${}^{\dagger}$ indicates a method where the number of clusters is set by BIC. (i): $2^{nd}$ -order TTS ( $D_{1}=10$ ), (ii): $3^{rd}$ -order TTS ( $D_{1}=D_{2}=10$ ). $\mathsf{A}$ : “1,2,1”, $\mathsf{B}$ : “1,2,3,2,1”, $\mathsf{C}$ : “1,2,3,4,1,2,3,4”, $\mathsf{D}$ : “1,2,2,1,3,3,3,1.”

Data		DMM	TAGM	TAGM ${}^{\dagger}$	TICC	TICC ${}^{\dagger}$
(i)	$\mathsf{A}$	$\underline{0.955}$	$0.915$	$0.915$	$\mathbf{0.997}$	$\mathbf{0.997}$
	$\mathsf{B}$	$\mathbf{0.926}$	$\underline{0.897}$	$0.756$	$0.884$	$0.825$
	$\mathsf{C}$	$\mathbf{0.956}$	$0.770$	$\underline{0.811}$	$0.725$	$0.756$
	$\mathsf{D}$	$\mathbf{0.960}$	$0.907$	$0.912$	$0.857$	$\underline{0.952}$
(ii)	$\mathsf{A}$	$\mathbf{0.961}$	$0.514$	$0.514$	$\underline{0.932}$	$0.923$
	$\mathsf{B}$	$\mathbf{0.962}$	$0.462$	$0.431$	$\underline{0.844}$	$0.770$
	$\mathsf{C}$	$\mathbf{0.941}$	$0.359$	$0.396$	$\underline{0.704}$	$0.594$
	$\mathsf{D}$	$\mathbf{0.980}$	$0.438$	$0.432$	$\underline{0.838}$	$0.741$

6.2.2. Effect of total number of variables

We next examine how the number of variables $D_{1}$ affects the clustering accuracy of each method. We take the $\mathsf{C}$ example and vary $D_{1}$ from $5$ to $50$ for (a) $2^{nd}$ -order TTS and (b) $3^{rd}$ -order TTS. As shown in Fig. 3, our method outperforms the baselines for all $D_{1}$ in both tensors. The performance of TAGM and TICC worsens as $D_{1}$ increases, while DMM maintains its performance, which we attribute to our total description cost being able to handle changes in data scale. TAGM and TICC are less accurate in Fig. 3 (b) than in Fig. 3 (a) since they cannot deal with $3^{rd}$ -order TTS.

6.2.3. Scalability

We perform experiments to verify the time complexity of DMM. As described in Lemma 1, the time complexity of DMM scales linearly in terms of the data size. Fig. 4 shows the computation time of DMM when we vary $D_{1}$ (Fig. 4 (a)) and $T$ (Fig. 4 (b)). Thanks to our proposed optimization algorithm, the computation time of DMM scales linearly with $D_{n}$ and $T$ .

7. Case study

We perform experiments on real data to show the applicability of DMM and demonstrate how DMM can be used to obtain meaningful insights from TTS.

7.1. Experimental setting

7.1.1. Datasets

We describe our datasets in detail.

Table 2. The data size and attributes for each dataset.

ID	Dataset	Size	Description
#1	E-commerce	(11, 10, 1796)	(query, state, day)
#2	VoD	(8, 10, 1796)	(query, state, day)
#3	Sweets	(9, 10, 1796)	(query, state, day)
#4	Covid	(6, 10, 3652)	(query, country, day)
#5	GAFAM	(5, 10, 1796)	(query, country, day)
#6	Air	(6, 12, 1461)	(pollutant, site, day)
#7	Car-A	(6, 10, 4, 3241)	(sensor, lap, driver, meter)
#8	Car-H	(6, 10, 4, 4000)	(sensor, lap, driver, meter)

Google Trends (#1 $\sim$ #5). We use data from Google Trends. Each tensor contains daily web-search counts. #4 Covid was collected over $10$ years from Jan. 1st $2013$ to Dec. 31st $2022$ to include the effect of COVID-19. The other datasets cover Jan. 1st $2015$ to Dec. 31st $2019$ to avoid the effect of COVID-19. The datasets include five query sets (Appendix C.1). We collect the data from two target areas: three datasets from the $10$ most populous US states and two from the top $10$ countries ranked by GDP. We normalize the data every month to achieve clustering that only considers the network.

Air (#6). We use the Air dataset, which contains daily concentrations of six pollutants at $12$ nationally controlled monitoring sites in Beijing, China from Mar. 1st $2013$ to Feb. 29th $2016$ (Zhang et al., 2017). We fill the missing values by linear interpolation and normalize the data every month.

Automobile (#7, #8). We use two automobile datasets with different driving courses. #7 Car-A is a city course and #8 Car-H is a highway course. We observe six sensors every meter: Brake, Speed, GX (X Accel), GY (Y Accel), Steering angle, Fuel Economy. Four drivers drive $10$ laps of the same course, hence each dataset forms a $4^{th}$ -order tensor. We normalize the data every $10$ meters.

The size and attributes of the datasets are given in Table 2.
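The preprocessing described above (linear interpolation of missing values and periodic normalization) can be sketched for a single series as follows. The text does not specify the normalization, so z-scoring within each block is an assumption, and fixed-length blocks are used for simplicity, which matches the every-10-meters case but only approximates calendar months. Function names are hypothetical.

```python
import numpy as np

def fill_linear(series):
    """Replace NaN entries by linear interpolation between observed neighbors."""
    s = np.asarray(series, dtype=float).copy()
    missing = np.isnan(s)
    idx = np.arange(len(s))
    s[missing] = np.interp(idx[missing], idx[~missing], s[~missing])
    return s

def normalize_blocks(series, block):
    """Z-score each consecutive block of `block` samples (assumed normalization)."""
    s = np.asarray(series, dtype=float).copy()
    for start in range(0, len(s), block):
        chunk = s[start:start + block]
        std = chunk.std()
        s[start:start + block] = (chunk - chunk.mean()) / (std if std > 0 else 1.0)
    return s

filled = fill_linear([1.0, float("nan"), 3.0])
scaled = normalize_blocks([1, 2, 3, 10, 20, 30], block=3)
print(filled, scaled)
```

Normalizing per block removes level and scale differences between periods, so that clusters are separated by dependency structure rather than by magnitude.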

7.1.2. Hyperparameter

To tune DMM, we vary the sparsity parameter $\lambda\in\{0.5,1,2,4\}$ and choose the value that produces the minimum total description cost (Eq. (10)). We fix the initial window size $w$ for each dataset to its normalization period. For a fair comparison, for TAGM and TICC, we set the sparsity parameter equal to that of DMM, and the number of clusters equal to that found by DMM. For TICC, we vary the regularization parameter $\beta\in\{4,16,64,256\}$ and select the value with BIC.

7.2. Results

7.2.1. Applicability

We show the usefulness of DMM for analyzing real-world TTS.

Table 3. The number of clusters (# Cl.) and segments (# Seg.), and log-likelihood (LL) of eight real-world datasets, comparing DMM with state-of-the-art methods. The bold font and underlines show methods providing the best and second best LL, respectively (higher is better).

		DMM		TAGM		TICC
Data	# Cl.	# Seg.	LL	# Seg.	LL	# Seg.	LL
#1	$2$	$10$	$\mathbf{{-1.89}\mathrm{e}{5}}$	$485$	$\underline{{-1.92}\mathrm{e}{5}}$	$3$	${-1.97}\mathrm{e}{5}$
#2	$2$	$2$	$\underline{{-1.68}\mathrm{e}{5}}$	$527$	$\mathbf{{-1.65}\mathrm{e}{5}}$	$2$	$\underline{{-1.68}\mathrm{e}{5}}$
#3	$2$	$7$	$\mathbf{{-1.90}\mathrm{e}{5}}$	$502$	$\mathbf{{-1.90}\mathrm{e}{5}}$	$17$	$\mathbf{{-1.90}\mathrm{e}{5}}$
#4	$4$	$4$	$\underline{{-2.85}\mathrm{e}{5}}$	$1778$	$\mathbf{{-2.73}\mathrm{e}{5}}$	$5$	${-2.88}\mathrm{e}{5}$
#5	$2$	$2$	$\underline{{-9.28}\mathrm{e}{4}}$	$519$	$\mathbf{{-9.10}\mathrm{e}{4}}$	$3$	${-9.48}\mathrm{e}{4}$
#6	$6$	$13$	$\underline{{-5.19}\mathrm{e}{4}}$	$929$	$\mathbf{{-4.82}\mathrm{e}{4}}$	$10$	${-6.34}\mathrm{e}{4}$
#7	$11$	$11$	$\mathbf{{-5.89}\mathrm{e}{5}}$	$1300$	$\underline{{-6.33}\mathrm{e}{5}}$	$12$	${-9.36}\mathrm{e}{5}$
#8	$5$	$12$	$\underline{{-1.06}\mathrm{e}{6}}$	$974$	$\mathbf{{-1.02}\mathrm{e}{6}}$	$6$	${-1.16}\mathrm{e}{6}$

Modeling accuracy. Since there are no labels for TTS, we review the modeling accuracy of DMM by comparing the number of segments and the log-likelihood, which explains the goodness of clustering according to our objective function based on MDL. We use cluster assignments to calculate the log-likelihood (Eq. (2)). Table 3 shows the results. DMM finds a reasonable number of segments and a higher log-likelihood than TICC. TAGM switches clusters with the transition matrix of HMM. This works well on synthetic datasets when there are clear transitions. However, it is not suitable for real-world datasets, which contain noise and whose networks change gradually. As a result, TAGM finds the cluster assignments that maximize the log-likelihood regardless of the number of segments. TICC assigns neighboring time steps to the same cluster using a penalty $\beta$ . Thus, its number of segments is close to that of DMM. However, TICC is not suitable for tensors, and its log-likelihood is worse than that of DMM for most datasets.
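The "# Seg." columns of Table 3 can be read as the number of maximal runs of identical labels in a per-time-step assignment sequence. A minimal sketch of that count, with a hypothetical function name, is:

```python
def count_segments(labels):
    """Number of maximal runs of identical consecutive cluster labels."""
    if not labels:
        return 0
    return 1 + sum(a != b for a, b in zip(labels, labels[1:]))

print(count_segments([1, 1, 2, 2, 1]))
```

Under this reading, a method whose assignment flips at nearly every time step, as described for TAGM above, accumulates hundreds of segments even when only a few clusters are used.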

Computation time.

We compare the computation time needed for processing real data in Fig. 5. DMM is the fastest for most datasets since it infers the network for each mode separately. In contrast, TAGM and TICC compute the entire network at once. Therefore, they are more affected by the number of variables at each mode than DMM, resulting in a longer computation time. Note that the computation time of TAGM and TICC for $2^{nd}$ -order TTS is comparable to that of DMM.

7.2.2. Interpretability

We show how the clustering results presented by DMM make sense. We have already shown the results of DMM for clustering over #4 Covid in Section 1 (see Fig. 1). Please also see the results in #1 E-commerce in Appendix C.2.

Air. We compare the clustering results of DMM, TAGM, and TICC over #6 Air regarding cluster assignments (Fig. 6) and obtained networks (Fig. 7). Fig. 6 (a) shows the original sensor data at Aoti Zhongxin. Fig. 6 (b) shows that DMM assigns Apr. through Oct. of each year to cluster #2, capturing the yearly seasonality (Zhang et al., 2017). The cluster assignments of TAGM (see Fig. 6 (c)) switch frequently, and TICC (see Fig. 6 (d)) assigns most of the period to cluster #4. Both cluster assignments are far from interpretable. Fig. 7 shows the networks obtained with each method. The cluster of DMM (see Fig. 7 (a)) includes the pollutant network and the location network. The pollutant network has a strong edge between PM2.5 and PM10, and the location network, whose nodes are plotted on the map, has edges only between closely located nodes. Both observations match our expectation and accordingly indicate that DMM discovers interpretable networks. TAGM and TICC (see Fig. 7 (b)(c)) find a single network over all variables. Although the networks are sparse, the large number of nodes and edges hampers our understanding of the networks. Due to the simplicity of the networks generated by DMM, their interpretability surpasses that of the other methods (Du et al., 2019). Consequently, DMM provides interpretable clustering results that can reveal underlying relationships among variables of each mode and is suitable for modeling and clustering TTS.

8. Conclusion

In this paper, we proposed an efficient tensor time series subsequence clustering method, namely DMM. Our method characterizes each cluster by multiple networks, each of which is the dependency network of a corresponding non-temporal mode. These networks make our results visible and interpretable, enabling the multifaceted analysis and understanding of tensor time series. We defined a criterion based on MDL that allows us to find clusters of data and determine all user-defined parameters. Our algorithm scales linearly with the input size and thus can be applied to massive tensors. We showed the effectiveness of DMM via extensive experiments using synthetic and real datasets.

Acknowledgements.

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research Number JP21H03446, JP22K17896, NICT JPJ012368C03501, JST CREST JPMJCR23M3, JST-AIP JPMJCR21U4.

References

Aghabozorgi et al. (2015) Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. 2015. Time-series clustering–a decade review. Information systems 53 (2015), 16–38.
Alaee et al. (2021) Sara Alaee, Ryan Mercer, Kaveh Kamgar, and Eamonn Keogh. 2021. Time series motifs discovery under DTW allows more robust discovery of conserved structure. Data Mining and Knowledge Discovery 35 (2021), 863–910.
Bai et al. (2019) Lei Bai, Lina Yao, Salil S. Kanhere, Xianzhi Wang, and Quan Z. Sheng. 2019. STG2Seq: Spatial-Temporal Graph to Sequence Model for Multi-step Passenger Demand Forecasting. In IJCAI. 1981–1987.
Batalo et al. (2022) Bojan Batalo, Lincon S Souza, Bernardo B Gatto, Naoya Sogi, and Kazuhiro Fukui. 2022. Analysis of Temporal Tensor Datasets on Product Grassmann Manifold. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4869–4877.
Berndt and Clifford (1994) Donald J. Berndt and James Clifford. 1994. Using Dynamic Time Warping to Find Patterns in Time Series. In Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop, Seattle, Washington, USA, July 1994. Technical Report WS-94-03. 359–370.
Böhm et al. (2007) Christian Böhm, Christos Faloutsos, Jia-Yu Pan, and Claudia Plant. 2007. Ric: Parameter-free noise-robust clustering. TKDD 1, 3 (2007), 10–es.
Boyd et al. (2011) Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 3, 1 (2011), 1–122.
Cai et al. (2015) Yongjie Cai, Hanghang Tong, Wei Fan, Ping Ji, and Qing He. 2015. Facets: Fast Comprehensive Mining of Coevolving High-order Time Series. In KDD. 79–88.
Du et al. (2019) Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Commun. ACM 63, 1 (2019), 68–77.
Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 3 (2008), 432–441.
Gatto et al. (2021) Bernardo B Gatto, Eulanda M dos Santos, Alessandro L Koerich, Kazuhiro Fukui, and Waldir SS Junior. 2021. Tensor analysis with n-mode generalized difference subspace. Expert Systems with Applications 171 (2021), 114559.
Grünwald (2007) Peter D Grünwald. 2007. The minimum description length principle. MIT press.
Hallac et al. (2017a) David Hallac, Youngsuk Park, Stephen P. Boyd, and Jure Leskovec. 2017a. Network Inference via the Time-Varying Graphical Lasso. In KDD. 205–213.
Hallac et al. (2017b) David Hallac, Sagar Vare, Stephen P. Boyd, and Jure Leskovec. 2017b. Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data. In KDD. 215–223.
Harutyunyan et al. (2019) Hrayr Harutyunyan, Daniel Moyer, Hrant Khachatrian, Greg Ver Steeg, and Aram Galstyan. 2019. Efficient Covariance Estimation from Temporal Data. arXiv preprint arXiv:1905.13276 (2019).
Hirano and Tsumoto (2006) Shoji Hirano and Shusaku Tsumoto. 2006. Cluster analysis of time-series medical data based on the trajectory representation and multiscale comparison techniques. In ICDM. IEEE, 896–901.
Jing et al. (2021) Baoyu Jing, Hanghang Tong, and Yada Zhu. 2021. Network of Tensor Time Series. In WWW, Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia (Eds.). 2425–2437.
Kawabata et al. (2021) Koki Kawabata, Siddharth Bhatia, Rui Liu, Mohit Wadhwa, and Bryan Hooi. 2021. Ssmf: Shifting seasonal matrix factorization. Advances in Neural Information Processing Systems 34 (2021), 3863–3873.
Keogh (2002) Eamonn Keogh. 2002. Exact Indexing of Dynamic Time Warping. In VLDB (Hong Kong, China). 406–417.
Keogh et al. (2001) Eamonn J. Keogh, Selina Chu, David M. Hart, and Michael J. Pazzani. 2001. An Online Algorithm for Segmenting Time Series. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November - 2 December 2001, San Jose, California, USA. IEEE Computer Society, 289–296.
Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
Liu et al. (2020) Yu Liu, Quanming Yao, and Yong Li. 2020. Generalizing tensor decomposition for n-ary relational knowledge bases. In WWW. 1104–1114.
Madabhushi and Lee (2016) Anant Madabhushi and George Lee. 2016. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical image analysis 33 (2016), 170–175.
Matsubara et al. (2014) Yasuko Matsubara, Yasushi Sakurai, and Christos Faloutsos. 2014. AutoPlait: Automatic Mining of Co-Evolving Time Sequences. In SIGMOD. 193–204.
Matsubara et al. (2016) Yasuko Matsubara, Yasushi Sakurai, and Christos Faloutsos. 2016. Non-Linear Mining of Competing Local Activities. In WWW.
Miyaguchi et al. (2017) Kohei Miyaguchi, Shin Matsushima, and Kenji Yamanishi. 2017. Sparse graphical modeling via stochastic complexity. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 723–731.
Miyajima et al. (2007) Chiyomi Miyajima, Yoshihiro Nishiwaki, Koji Ozawa, Toshihiro Wakita, Katsunobu Itou, Kazuya Takeda, and Fumitada Itakura. 2007. Driver modeling based on driving behavior and its evaluation in driver identification. IEEE 95, 2 (2007), 427–437.
Mohan et al. (2014) Karthik Mohan, Palma London, Maryam Fazel, Daniela Witten, and Su-In Lee. 2014. Node-Based Learning of Multiple Gaussian Graphical Models. J. Mach. Learn. Res. 15, 1 (jan 2014), 445–488.
Monti et al. (2014) Ricardo Pio Monti, Peter Hellyer, David Sharp, Robert Leech, Christoforos Anagnostopoulos, and Giovanni Montana. 2014. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage 103 (2014), 427–443.
Nakamura et al. (2023) Kota Nakamura, Yasuko Matsubara, Koki Kawabata, Yuhei Umeda, Yuichiro Wada, and Yasushi Sakurai. 2023. Fast and Multi-aspect Mining of Complex Time-stamped Event Streams. In WWW. 1638–1649.
Namaki et al. (2011) A. Namaki, A.H. Shirazi, R. Raei, and G.R. Jafari. 2011. Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Statistical Mechanics and its Applications 390, 21 (2011), 3835–3841.
Papadimitriou et al. (2005) Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos. 2005. Streaming pattern discovery in multiple time-series. (2005).
Plant and Böhm (2011) Claudia Plant and Christian Böhm. 2011. Inconco: interpretable clustering of numerical and categorical objects. In KDD. 1127–1135.
Ramoni et al. (2000) Marco Ramoni, Paola Sebastiani, and Paul R. Cohen. 2000. Multivariate Clustering by Dynamics. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. AAAI Press, 633–638.
Rogers et al. (2013) Mark Rogers, Lei Li, and Stuart J Russell. 2013. Multilinear Dynamical Systems for Tensor Time Series. In NIPS. 2634–2642.
Rudin (2019) Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence 1, 5 (2019), 206–215.
Rue and Held (2005) Havard Rue and Leonhard Held. 2005. Gaussian Markov random fields: theory and applications. CRC press.
Ruiz et al. (2012) Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. 2012. Correlating Financial Time Series with Micro-Blogging Activity. In WSDM (Seattle, Washington, USA). Association for Computing Machinery, New York, NY, USA, 513–522.
Takahashi et al. (2017) Tsubasa Takahashi, Bryan Hooi, and Christos Faloutsos. 2017. AutoCyclone: Automatic Mining of Cyclic Online Activities with Robust Tensor Factorization. In WWW (Perth, Australia). 213–221.
Tan et al. (2015) Kean Ming Tan, Daniela Witten, and Ali Shojaie. 2015. The cluster graphical lasso for improved estimation of Gaussian graphical models. Computational statistics & data analysis 85 (2015), 23–36.
Tomasi et al. (2021) Federico Tomasi, Veronica Tozzo, and Annalisa Barla. 2021. Temporal Pattern Detection in Time-Varying Graphical Models. In ICPR. 4481–4488.
Tomasi et al. (2018) Federico Tomasi, Veronica Tozzo, Saverio Salzo, and Alessandro Verri. 2018. Latent Variable Time-varying Network Inference. In KDD. 2338–2346.
Tozzo et al. (2021) Veronica Tozzo, Federico Ciech, Davide Garbarino, and Alessandro Verri. 2021. Statistical Models Coupling Allows for Complex Local Multivariate Time Series Analysis. In KDD. 1593–1603.
Vlachos et al. (2002) Michail Vlachos, George Kollios, and Dimitrios Gunopulos. 2002. Discovering similar multidimensional trajectories. In Proceedings 18th international conference on data engineering. IEEE, 673–684.
Wu et al. (2019) Xunxian Wu, Tong Xu, Hengshu Zhu, Le Zhang, Enhong Chen, and Hui Xiong. 2019. Trend-Aware Tensor Factorization for Job Skill Demand Analysis. In IJCAI. 3891–3897.
Wytock and Kolter (2013) Matt Wytock and Zico Kolter. 2013. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In International conference on machine learning. PMLR, 1265–1273.
Xiong and Yeung (2004) Yimin Xiong and Dit-Yan Yeung. 2004. Time series clustering with ARMA mixtures. Pattern Recognition 37, 8 (2004), 1675–1689.
Xuan and Murphy (2007) Xiang Xuan and Kevin Murphy. 2007. Modeling Changing Dependency Structure in Multivariate Time Series. In ICML (Corvalis, Oregon, USA). Association for Computing Machinery, New York, NY, USA, 1055–1062.
Yuan and Lin (2006) Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology 68, 1 (2006), 49–67.
Zhang et al. (2017) Shuyi Zhang, Bin Guo, Anlan Dong, Jing He, Ziping Xu, and Song Xi Chen. 2017. Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473, 2205 (2017), 20170457.
Zolhavarieh et al. (2014) Seyedjamal Zolhavarieh, Saeed Aghabozorgi, Ying Wah Teh, et al. 2014. A review of subsequence time series clustering. The Scientific World Journal 2014 (2014).

Appendix A Proposed Model

Table 4 lists the main symbols we use throughout this paper.

Appendix B Algorithms

B.1. CutPointDetector

Alg. 2 shows the overall procedure for CutPointDetector, which is a subalgorithm of Alg. 1. For clarity, we describe the total description cost as $Cost_{T}(\mathcal{X};\{\Theta\})$ . The cluster assignment set for $\Theta[id]$ is a corresponding segment.

B.2. Proof of Lemma 1

Proof.

The computational cost of DMM depends largely on the number of CutPointDetector iterations and the cost of inferring $\Theta$ at each iteration. Consider the case in which all segments are eventually merged. Since the total computational time needed to infer $\Theta$ is the sum of the times needed to infer $\{A^{(1)},\cdots,A^{(N)}\}$ , we discuss the case of $A^{(n)}$ . When $T\prod_{m=1(m\neq n)}^{N}D_{m}\gg D_{n}$ , at each iteration, inferring $A^{(n)}$ for all segments takes $O(D_{n}T\prod_{m=1(m\neq n)}^{N}D_{m})$ thanks to ADMM. If the number of segments is halved at each iteration, the number of iterations is $\log_{2}|\mathbf{w}|$ . If the number of segments decreases by one at each iteration, the number of iterations is $|\mathbf{w}|$ , but this is unlikely to happen. Because $T\gg\log_{2}|\mathbf{w}|$ , the computation cost related to $A^{(n)}$ is $O(T\prod_{m=1}^{N}D_{m})$ . Since $T,D_{n}\gg N$ , the repetition of inference for each mode is negligible. Therefore, the time complexity of DMM is $O(T\prod_{m=1}^{N}D_{m})$ . ∎
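The accounting in the proof can be written out as below. This is only a restatement of the argument above, not an additional result; the symbol $I$ for the number of CutPointDetector iterations is introduced here for illustration.

```latex
\begin{align*}
\text{cost per iteration for mode } n
  &= O\Big(D_{n}\,T\prod_{m=1(m\neq n)}^{N}D_{m}\Big)
   = O\Big(T\prod_{m=1}^{N}D_{m}\Big),\\
\text{total cost}
  &= I\cdot N\cdot O\Big(T\prod_{m=1}^{N}D_{m}\Big),
  \qquad \log_{2}|\mathbf{w}|\le I\le|\mathbf{w}|.
\end{align*}
```

The stated bound $O(T\prod_{m=1}^{N}D_{m})$ follows when the factors $I$ and $N$ are treated as negligible relative to $T$ and $D_{n}$ , as the proof assumes.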

Appendix C Case Study

C.1. Datasets

We describe the query set we used for Google Trends in Table 5.

C.2. Results

Total description cost. We compare the total description cost of DMM with TAGM and TICC on real-world datasets in Fig. 8. As shown, DMM achieves the lowest total description cost on all the datasets. TAGM has many segments, which results in a large coding length cost. TICC is not capable of handling tensors, which results in a higher data coding cost compared with DMM.

E-commerce. We demonstrate how effectively DMM works on the #1 E-commerce dataset. Fig. 9 shows the result of DMM for clustering over #1 E-commerce. Fig. 9 (a) shows the clustering results of the original TTS, where each color represents a cluster. DMM finds 10 segments and two clusters. We name the blue cluster “Daily products” and the pink cluster “Online sales.” DMM assigns every Nov., the period of Black Friday and Cyber Monday, to “Online sales”. Fig. 9 (b) shows the query and state networks for each cluster. The query network of “Daily products” shows that there are edges between the local daily products companies ( “costco”, “walmart”, and “target”). On the other hand, the query network of “Online sales” has many edges, especially ones related to large e-commerce companies ( “amazon” and “ebay”), and the state network shows that the four most populous states ( “CA”, “TX”, “FL”, and “NY”) form edges, indicating the similarity of online shopping among the big states.

Table 4. Symbols and definitions.

Symbol	Definition
$D_{n}$	Number of variables at mode-n
$N$	Number of modes excluding temporal mode
$T$	Number of timestamps
$\mathcal{X}$	$(N+1)^{th}$ -order TTS, i.e., $\mathcal{X}=\{\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{T}\}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}\times T}$
$\mathcal{X}_{t}$	$N^{th}$ -order tensor at $t^{th}$ time step, i.e., $\mathcal{X}_{t}\in\mathbb{R}^{D_{1}\times\cdots\times D_{N}}$
$D$	Total product of variables excluding $T$ , i.e., $D=\prod_{n=1}^{N}D_{n}$
$D^{(\backslash n)}$	Total product of variables excluding $D_{n}$ and $T$ , i.e., $D^{(\backslash n)}=\prod_{m=1(m\neq n)}^{N}D_{m}$
$K$	Number of clusters
$m$	Number of segments
$cp$	Cut points, i.e., $cp=\{cp_{1},cp_{2},\dots,cp_{m}\}$
$cp_{i}$	Starting point of segment $i$ , i.e., $cp_{1}=1,cp_{m+1}=T+1$
$\Theta$	Model parameter set, i.e., $\Theta=\{\theta_{1},\theta_{2},\dots,\theta_{K}\}$
$\theta$	Hierarchical Toeplitz matrix $\theta\in\mathbb{R}^{D\times D}$ consisting of $\{A^{(1)},\cdots,A^{(N)}\}$
$A^{(n)}$	Precision matrix of mode-n, i.e., $A^{(n)}\in\mathbb{R}^{D_{n}\times D_{n}}$
$\mathcal{F}$	Cluster assignment set, i.e., $\mathcal{F}=\{f_{1},f_{2},\dots,f_{K}\}$
$\mathcal{M}$	Cluster parameter set, i.e., $\mathcal{M}=\{\mathcal{F},\Theta\}$
$Cost_{A}(\mathcal{F})$	Coding length cost: description complexity of $\mathcal{F}$
$Cost_{M}(\Theta)$	Model coding cost: description complexity of $\Theta$
$Cost_{C}(\mathcal{X}|\mathcal{M})$	Data coding cost: negative log-likelihood of $\mathcal{X}$ given $\mathcal{M}$
$Cost_{\ell_{1}}(\Theta)$	$\ell_{1}$ -norm cost: penalty for $\Theta$
$Cost_{T}(\mathcal{X};\mathcal{M})$	Total description cost: total cost of $\mathcal{X}$ given $\mathcal{M}$

Algorithm 2 CutPointDetector $(\mathcal{X},cp)$

1: Input: $(N+1)^{th}$ -order TTS $\mathcal{X}$ and initial cut points set $cp$
2: Output: The best cut point set $cp$
3: repeat
4:   $id=0$ ; $cp_{new}=\phi$ ;
5:   $\Theta_{S}=\{\theta_{cp_{0}:cp_{1}},\theta_{cp_{1}:cp_{2}},\dots,\theta_{cp_{m}:cp_{m+1}}\}$ ;
6:   $\Theta_{E}=\{\theta_{cp_{0}:cp_{2}},\theta_{cp_{2}:cp_{4}},\dots\}$ ;
7:   $\Theta_{O}=\{\theta_{cp_{1}:cp_{3}},\theta_{cp_{3}:cp_{5}},\dots\}$ ;
8:   while $id<length(\mathcal{X})$
9:     if $id$ is even then
10:       $\Theta_{Left}=\Theta_{O}$ ; $\Theta_{Right}=\Theta_{E}$ ;
11:       $id_{Left}=\lfloor id/2\rfloor$ ; $id_{Right}=\lfloor id/2\rfloor+1$ ;
12:     else if $id$ is odd then
13:       $\Theta_{Left}=\Theta_{E}$ ; $\Theta_{Right}=\Theta_{O}$ ;
14:       $id_{Left}=\lfloor id/2\rfloor+1$ ; $id_{Right}=\lfloor id/2\rfloor+1$ ;
15:     end if
16:     $C_{solo}=Cost_{T}(\mathcal{X};\{\Theta_{S}[id],\Theta_{S}[id+1],\Theta_{S}[id+2]\})$ ;
17:     $C_{left}=Cost_{T}(\mathcal{X};\{\Theta_{Left}[id_{Left}],\Theta_{S}[id+2]\})$ ;
18:     $C_{right}=Cost_{T}(\mathcal{X};\{\Theta_{S}[id],\Theta_{Right}[id_{Right}]\})$ ;
19:     if $min(C_{solo},C_{left},C_{right})=C_{solo}$ then
20:       $cp_{new}=cp_{new}\cup cp[id]$ ; $id+=1$ ;
21:     else if $min(C_{solo},C_{left},C_{right})=C_{left}$ then
22:       $cp_{new}=cp_{new}\cup cp[id+1]$ ; $id+=2$ ;
23:     else if $min(C_{solo},C_{left},C_{right})=C_{right}$ then
24:       $cp_{new}=cp_{new}\cup cp[id],cp[id+2]$ ; $id+=3$ ;
25:     end if
26:   end while
27:   $cp=cp_{new}$ ;
28: until $cp$ is stable;
29: return $cp$ ;

Table 5. Google Trends query set.

Name	Query
#1 E-commerce	Amazon/Apple/BestBuy/Costco/Craigslist/Ebay/Homedepot/Kohls/Macys/Target/Walmart
#2 VoD	AppleTV/ESPN/HBO/Hulu/Netflix/Sling/Vudu/YouTube
#3 Sweets	Cake/Candy/Chocolate/Cookie/Cupcake/Gum/Icecream/Pie/Pudding
#4 Covid	Covid/Corona/Flu/Influenza/Vaccine/Virus
#5 GAFAM	Amazon/Apple/Facebook/Google/Microsoft