Abstract
Multimodal fusion is becoming more common as it proves to be a powerful approach to identify complementary information from multimodal datasets. However, simulation of joint information is not straightforward. Published approaches mostly employ limited, provisional designs that often break the link between the model assumptions and the data for the sake of demonstrating properties of fusion techniques. This work introduces a new approach to synthetic data generation which allows full-compliance between data and model while still representing realistic spatiotemporal features in accordance with the current neuroimaging literature. The focus is on the simulation of joint information for verification of stochastic linear models, particularly those used in multimodal data fusion of brain imaging data.
Our first goal is to obtain a benchmark ground-truth in which estimation errors due to model mismatch are minimal or none. Then we move on to assess how estimation is affected by gradually increasing model discrepancies towards a more realistic dataset. The key aspect of our approach is that it permits complete control over the type and level of model mismatch, allowing for more educated inferences about the limitations and caveats of select stochastic linear models. As a result, impartial comparison of models is possible based on their performance in multiple different scenarios.
Our proposed method uses the commonly overlooked theory of copulas to enable full control of the type and level of dependence/association between modalities, with no occurrence of spurious multimodal associations. Moreover, our approach allows for arbitrary single-modality marginal distributions for any fixed choice of dependence/association between multimodal features. Using our simulation framework, we can rigorously challenge the assumptions of several existing multimodal fusion approaches.
Our study brings a new perspective to the problem of simulating multimodal data that can be used for ground-truth verification of various stochastic multimodal models available in the literature, and reveals some important aspects that are not captured or are overlooked by ad hoc simulations that lack a firm statistical motivation.
Keywords: Multimodal, simulation, fusion, ICA, stochastic, copula, multidimensional
1. Introduction
Multimodal data fusion by stochastic linear models of the form shown in Equation (1) is becoming increasingly popular in neuroimaging research. Put simply, these models assume separability of each modality’s data Xm into latent, unobservable stochastic variables Sm which are linearly mixed through an unknown mixing matrix Am, where m=1, …,M and M is the total number of modalities available. Typically, multimodal data consists of a collection of multimodal features pooled from a number of subjects. These features are often the result from single-modality first-level (i.e., subject-level) analyses. Thus, most data fusion approaches are characterized as second-level analyses (i.e., pooled group analyses). The joint decompositions of the multimodal Xm can then be made sensitive to multimodal associations among the mixing matrices Am (Calhoun et al., 2006a; Correa et al., 2010; Sui et al., 2011) or among the component variables Sm (Groves et al., 2011; Sui et al., 2010). Here, we consider Am is a (N × C) matrix and Sm is a (C × V) matrix, where N is the number of subjects, C is the number of joint multimodal components, and V is the number of data points (e.g., voxels or timepoints).
(1) |
Our interest is in the question “How does the performance of a multimodal model change as the properties of Am and Sm become less compliant with the underlying assumptions?” Historically, multimodal fusion studies have relied on the simulation of artificial datasets to showcase new models but seldom to address this question. Surely, well-designed simulations are useful to “debug” new algorithms, and make comparison and proof-of-concept demonstrations. Nevertheless, we have only recently observed a greater interest in carefully designed neuroimaging simulations that could venture after particular limitations and caveats of some popular models (Allen et al., 2012; Erhardt et al., 2012). Unfortunately, multimodal simulation studies have time and again neglected the virtues from the field of computerized simulation (Banks, 1998). This is a well-developed area that aims at making accurate predictions about the evolution and behavior of complex systems over time, typically modeling changes and state transitions via differential equations. Evidently, the class of stochastic linear models we consider here is not nearly as elaborate. Still, we believe some of the same principles, especially those related with ground-truth verification (Oberkampf and Roy, 2010), apply when synthetic multimodal data is to be generated from these simpler models. We would expect this to improve our ability to rigorously evaluate any limitations of such models and, thus, better address the question above.
Translating this into practice involves making complete and gradual assessments of each multimodal model in different scenarios, both model-inspired and realistic ones. In realistic designs, multimodal associations are simulated based on the current literature, using prior knowledge about how associations are formed in real data. Model-based designs, on the other hand, simulate associations in agreement with the chosen model, complying with all its assumptions and limitations. Generally, when real data is well understood, a realistic design can help select the model that best recovers real data properties, whereas a model-based design allows ground-truth verification of a model implementation. Current multimodal simulation approaches are, for the most part, a blend between these two types of design (see examples in Appendix A). To our knowledge, however, no multimodal investigation has attempted to procedurally explore the spectrum of designs spanned between these extremes. Our premise throughout this work is that the optimal procedure to study the strengths and limitations of any model is by starting with a faithful model-based design, and gradually add discrepancies and violations to specific assumptions towards more realistic designs. In practice, however, we have observed that designing general multimodal associations in a manner that enables controlled model discrepancies is a great challenge. Lack of control leads to ad hoc solutions that often are rigid and only reflect the assumptions of a single stochastic model. We believe that addressing these issues would considerably increase our ability to recognize the key limitations of each model, especially in the anticipated cases where real data fails to comply with the model assumptions.
With that in mind, we introduce a new framework for synthetic data generation flexible enough to comply with many stochastic linear models of the form given in Equation (1) while admitting gradual, controlled transitions from model-based to realistic designs. This is attained in a way that still provides realistic neurophysiological features. As we demonstrate in our results, our approach to simulation helps identify unreported limitations and success conditions of two popular multimodal data fusion methods in neuroimaging: joint independent component analysis (jICA) (Calhoun et al., 2006a) and canonical correlation analysis (CCA) (Li et al., 2007b). Below, we outline which kinds of multimodal models are supported and the guiding principles backing our approach. We end this section with an overview of our framework.
1.1. Supported Multimodal Data Fusion Models
Data fusion employs exploratory analysis to infer associations from multimodal data. A recent review (Sui et al., 2012) indicated data fusion via data-driven stochastic linear models as one of the main approaches for joint analysis of multiple brain imaging modalities. Single-modality examples include popular methods like PCA (Andersen et al., 1999; Moeller and Strother, 1991), CCA (Friman et al., 2001; Li et al., 2007b), and ICA (Calhoun and Adali, 2006; Calhoun et al., 2001; McKeown et al., 1998). In the case of multimodal data, the latest methods have proposed generalizations that accommodate and extend the principles and techniques from the single-modality cases. For example, multi-set CCA (Correa et al., 2010; Li et al., 2009) generalizes CCA to the case of three or more modalities, and, likewise, jICA attempts to generalize single-modality ICA. Generally speaking, the challenge with multimodal approaches resides in the estimation of relevant statistics from the unknown, typically higher-dimensional multimodal distributions that drive multimodal associations.
In order to clarify the extent of our approach to simulations, we build on the work of Sui et al. (Sui et al., 2012) and classify multimodal methods using the following criteria: nature, goal, approach, mechanism, input type, number of modalities, and use of priors. Nature refers to whether or not a method is stochastic. Goal refers to what kind of multimodal association is explicitly pursued (mixing-level, component-level, or both). The approach describes how the multimodal data is handled and can be principled (pursuing distinct decompositions for each dataset Xm), ad hoc (imposing same A or same S on all (or a subset of) modalities, e.g., concatenating the multimodal datasets), or mixed. The mechanism utilized to obtain results is data-driven, hypotheses-driven, or hybrid. The input type is commonly problem-specific and can be first-level data (i.e., the same data used for single modality subject-level analysis) or second-level data (typically results pooled from many subject-level analyses). The number of modalities simultaneously handled by a method is either two (2-way) or more (M -way, M ≥2). Some methods may or may not make use of priors, rendering them semi-blind or blind, respectively. Table 1 summarizes various popular and upcoming stochastic methods in the literature and how they classify under each of these criteria. In this work, we will consider only stochastic, second-level, data-driven methods, with any form of goal, approach, number of modalities, and use or not of priors.
Table 1:
Name | Author, Year | Nature | Goal | Approach | Mechanism | Input Type | Number of Modalities | Use of Priors |
---|---|---|---|---|---|---|---|---|
Joint ICA | (Calhoun et al., 2006a) | Stochastic | Mixings | Ad hoc | Data-driven | 2nd-level | M-way | Blind |
Parallel ICA | (Liu and Calhoun, 2006) | Stochastic | Mixings | Principled | Data-driven | 2nd-level | 2-way | Blind |
Multimodal CCA | (Correa et al., 2008) | Stochastic | Mixings | Principled | Data-driven | 2nd-level | 2-way | Blind |
CC-ICA | (Sui et al., 2009) | Stochastic | Mixings | Ad hoc | Hybrid | 2nd-level | 2-way | Semi-Blind |
CCA+ICA | (Sui et al., 2010) | Stochastic | Both | Principled | Data-driven | 2nd-level | 2-way | Blind |
Multi-set CCA | (Correa et al., 2010) | Stochastic | Mixings | Principled | Data-driven | Both | M-way | Blind |
Linked ICA | (Groves et al., 2011) | Stochastic | Both | Mixed | Data-driven | 2nd-level | M-way | Blind |
mCCA+jICA | (Sui et al., 2011) | Stochastic | Mixings | Mixed | Data-driven | 2nd-level | M-way | Blind |
1.2. Five Principles for Synthetic Multimodal Data Generation
Ad hoc approaches to synthetic multimodal data generation are, to no surprise, typically biased in favor some models simply because of the wide variety of models and increased difficulty of simulating high-dimensional multimodal distributions. A statistically rigorous design of multidimensional distributions, together with a careful experimental design (Myers et al., 2009), should have been prioritized in previous publications. In order to address this issue, we define basic simulation design principles for statistically adequate, unbiased assessments about the strengths and weaknesses of high-dimensional data fusion models. A highlight of these principles is the emphasis on careful design of the (joint) distributions assigned to each multimodal feature. As we will show, these guiding principles improve the contrast between advantages and limitations of the stochastic models under evaluation.
We establish five guiding principles for synthetic multimodal data generation in order to help experimenters make conscious decisions about their simulation choices:
Declare your intent: simulations should be designed with a clear purpose in mind (e.g., demonstrate the strengths of a model, compare models in a specific scenario, assess the limitations of a model, etc.). Understanding the purpose of a simulation helps identify its requirements.
Be fair: simulations must be as fair as possible to the models being studied, taking into account their limitations, strengths, and assumptions. Unfairness occurs whenever conclusions about the quality of the final estimation are derived only from synthetic data that does not fully comply with the model. For example, consider an experimenter interested in testing a certain model in a challenging scenario, but his model’s implementation is not fit for discontinuous distributions. Using data simulated from discontinuous distributions would, therefore, incur additional estimation error to the results on top of the “scenario effect,” acting like an uncontrolled factor in a poor experimental design and hampering any conclusive assertions about the model. In such case, it would be a better approach to limit the simulation to continuous distributions, and study the case of discontinuous distributions with a separate, independent experiment instead.
Control all key aspects of the simulation: the experimenter must try to be conscious of all decisions (e.g., choice of joint distributions, underlying dependences, and noise levels) and strive to control the possible sources of estimation error. This is critical in attaining fairness and allows rightful assertions about the performance of a model.
Commit to simple: this is expressed in two ways: 1) make the simulation only as complex and challenging as necessary to assess the effects of interest (preferably only one effect at a time) and to accomplish the declared intent; 2) start with the simplest case and gradually increase complexity. The latter gives a better picture of gradual performance variations.
Make realistic simulations: simulations should be designed to reflect real physiological features as much as possible, according to what is being studied. This is quite possibly the most challenging principle because it can easily conflict with the principles of fairness and simplicity (e.g., assigning random samples into feature maps in an ad hoc fashion may lead to unexpected, spurious dependences across different maps; see the example in Figure 3).
This work is focused on how to successfully comply with these principles. For a brief assessment of how relevant previous works have failed to comply with these principles and the implications of ignoring the points we raise, see Appendix A.
1.3. Summary of Our Strategy
In light of the outlined principles, we conjecture that there are two minimal requirements for sound simulation of stochastic multimodal models: 1) synthetic multimodal data should be randomly sampled from carefully selected joint distributions; and 2) the process utilized for assignment of randomly sampled values into “feature-maps” (e.g., spatial maps or timecourses) must not disrupt the intended joint distributions. A multimodal simulation adhering to these requirements can offer full control over the multidimensional joint distribution of multimodal features, leading to more accurate performance assessments of various stochastic multimodal data fusion models.
Our new framework for generation of synthetic multimodal data starts with the design of the underlying joint distribution of each multimodal source. First, we define the statistical properties of each modality in a given multimodal source, such as mode, variance, skewness, and kurtosis, and then pick the level and type of association to be imposed between the modalities in that source. Multimodal association is a crucial point of our approach. Unlike prior studies, we explicitly account for associations at the joint distribution level. This significantly expands the set of conditions testable in a simulation study and enables exploring implicit limitations of multimodal approaches. In particular, we use the commonly overlooked theory of copulas (Joe, 1997; Nelsen, 2006) to model the type and level of association between modalities at the joint probability density function (pdf) level. This is a novel, powerful approach which, to our knowledge, has not yet been explored in the multimodal simulation setting. The main benefit of copulas is their ability to assign controlled levels of dependency between modalities while simultaneously allowing for any arbitrary choice of marginal distribution for each individual modality.
After random sampling from all joint distributions, the sampled values are assigned into feature maps (spatial maps, timecourses, other) to give them meaning in the context of neuroimaging. The idea behind it is simple: the sampled values for a given source should be assigned to voxels in a spatial map (or points in a timecourse) so that it resembles real functional or structural features. For example, in the case of brain functional magnetic resonance imaging (fMRI), larger sample values would be “clumped” together to look, as much as possible, like realistic activation maps. However, previous works exploring this avenue (Groves et al., 2011) have overlooked key implications to the joint multimodal source distributions, in particular the emergence of uncontrolled cross-component dependences. This is a crafty issue that can very easily occur whenever the assignment process is done independently for each feature. Unlike existing approaches, we perform the assignment of each multimodal sample value to the same voxel/timepoint of every multimodal source in the simulation. This way, joint samples are moved together during the assignment process, preserving the originally designed cross-component associations and preventing any uncontrolled ones. We show that this assignment task can be formulated as a discrete optimization problem, which we solve using simulated annealing (SA) (Ross, 2006).
Finally, we have noticed that in simulation studies quality assessments often occur in two forms: quantitatively, focusing greatly on the precision of recovered mixing coefficients, and qualitatively, focusing on the visual resemblance to the original feature configuration. When the synthetic data is stochastic, however, the estimated distributions can be assessed as well. Thus, we also focus our attention on the quality of the estimated source distributions, and introduce some additional quality assessment measures that are general as well as more technical than simple visual inspection of the recovered features.
Our simulation framework therefore provides unique solutions to the problems of controlling dependence in joint multimodal distributions and securing their structure during the assignment process. Many benefits stem from such an approach, particularly the increased reliance of simulation results and the wider range of testable scenarios exploring particular aspects of novel and classical multimodal models at new, greater depths. Indeed, our previous work (Silva and Calhoun, 2012) suggests that current approaches to multimodal data simulation are inadequate, since previously unreported limitations became evident under our novel, principled framework. Additionally, our approach is backwards compatible with several relevant previous works, admitting for example the distribution choices from Groves et al. (Groves et al., 2011) and the mixing matrix configurations from Sui et al. (Sui et al., 2010).
Our contributions include: 1) establishing principles and minimal requirements for multimodal synthetic data generation; 2) providing a general framework that reliably attains these requirements; and 3) showing the improved assessments regarding strengths and limitations of popular data fusion methods attainable with our framework. The remainder of this work gives detailed descriptions of our proposed simulation framework (Materials and Methods), illustrates its benefits and application in both hybrid- and fully-simulated settings (Results), and discourses about the significance of the obtained results (Discussion). Finally, we outline our conclusions and the advances made to the field (Conclusions).
2. Materials and Methods
We would like to generate synthetic data that can be used for verification of data-driven stochastic linear models, which attempt to factorize the data Xm according to Equation (1). There are two elements to be simulated: the mixing coefficients (columns) of each mixing matrix Am, and the source variables (rows) of each source matrix Sm. Associations at the mixing level are generally defined by elevated correlation levels between corresponding columns of Am from each modality, or by assigning the same matrix A to all or some of the modalities. Associations at the source (or component) level are determined by the dependence type and strength in the joint multimodal distribution of each joint source (and in some cases by assigning the same sources S to all or some of the modalities). A joint source is the collection of multimodal features defined by corresponding rows of Sm from each modality. In the following, we describe how our approach to principled synthetic multimodal data generation constructs these associations. Am is a (N × C) matrix and Sm is a (C × V) matrix. Also, we write Am in terms of its columns (Ami) as and Sm in terms of its rows (Smi) as for m=1, …,M. Likewise, the i-th joint multimodal source is defined as , and the i-th set of multimodal associated mixings as .
2.1. Source Design and Generation
Our strategy for multimodal simulation can be split in two stages. Stage 1 consists of random sampling C joint multimodal components from one or more well-defined joint distributions. After random sampling, Stage 2 consists of assigning the sampled values into spatiotemporal feature maps. For the remainder of this work, we focus on the case of spatial maps for ease of explanation. However, the approach can also be used for timecourses without any changes.
2.1.1. Stage 1
The first step consists of selecting the marginal and joint pdfs from which data will be sampled as well as the type and level of association (if any) which will be imposed between the different modalities. For simplicity, we focus on the case of C independent multimodal components throughout. In this case, the total joint distribution of all components factors into C independent (and smaller) joint distributions. Thus, each of the C joint sources can be sampled independently of the others. Also, each joint component must contain M co-registered modalities. Co-registration of the modalities is a mandatory requirement in our approach since otherwise the joint distributions are not well-defined.
To illustrate how we proceed with the design of a single M -dimensional joint distribution, consider the following example with M =2. First, we select the marginal distributions of the i-th joint component, e.g., we may choose Modality 1 (S1i) to be gammadistributed and Modality 2 (S2i) to be t-distributed. Second, we choose the type and level of association between the two modalities, e.g., we may choose a form of association that is described by a Frank copula with Kendall rank correlation τ=0.65.
Copulas are simply a class of multivariate distributions with uniformly distributed marginals. Many such distributions exist in the literature such as the Archimedean and elliptical copulas. The term “copula” means a link or bond and was first used by Sklar to refer to this class of joint distributions. The concept, however, was explored before using different nomenclature (Nelsen, 2006). The virtue of copulas stems from Sklar’s theorem, which states that all continuously differentiable joint cumulative distribution functions (CDF) can be written as a (multivariable) function of their marginals. This multivariable function is called copula. Specifically, in the bivariate case (Sklar, 1959),
(2) |
where and are the univariate CDFs of S1i and S2i, and are the uniformly distributed random variables obtained from the CDF transform (also known as the probability integral transform) of S1i and S2i, and is the joint CDF of U1i and U2i (i.e., the copula). The CDF transform is a typically non-linear and widely employed transform in image processing (known as histogram equalization, (Bovik, 2009)) and neural networks (used as a neuron’s output nonlinearity).
Under the continuously differentiable assumption for and , it can be shown that the joint probability density function (pdf) can be written as:
(3) |
where is the copula (joint) pdf of U1i and U2i. This is a powerful statement for two reasons: 1) the joint pdf can be recast as the product of its marginals times the copula pdf and, therefore, 2) the dependence structure between S1i and S2i is entirely captured by . Note that the case of independence is attained when .
Essentially, Equations (2) and (3) outline the strategy for generation of simulated joint sources. First, generate random samples from . Second, use the inverse CDFs (ICDF, or quantile function) of S1i and S2i, according to the desired marginal distribution, to transform U1i and U2i into S1i and S2i, respectively. This generally non-linear transformation is well-known in statistics as the inverse probability integral transform (Casella and Berger, 2002) and in image processing as histogram shaping (Bovik, 2009).
The benefit of such an approach to data generation is that it allows full control over the type and level of dependence between S1i and S2i solely by controlling the dependence between U1i and U2i via the parameters of the selected copula . Once the dependence structure is established, the transformation in the second step allows for arbitrary choice of marginal distribution, without altering the already specified dependence structure.
Thus, sampling from a bivariate copula gives M =2 sets of values, each set with V uniformly distributed values, which are associated in the way described by the specific copula. Then, by means of the ICDF, the uniformly distributed samples can be assigned any desired marginal distribution. This procedure always produces M sets of values, each with a selected marginal distribution and jointly distributed according to the choice of copula (see Figure 1). We consider the final collection of V M-dimensional values as a joint sample of the i-th joint multimodal source Si, i =1,2, …,C. In a way, these M -dimensional values are analogous to multivariate measurements (i.e., multiple dependent variables) in a multivariate analysis of covariance (MANCOVA) setting. In the case of C independent joint components, this procedure is repeated for each joint component independently. Needless to say, the same approach can be used to generate the set of columns Ai.
2.1.2. Stage 2
Once all C joint sources are sampled we assign the sampled values into spatiotemporal feature maps. This process starts by initially defining reference spatial maps (or time courses) describing the shape and form of each modality in each joint source. For spatial maps, these references are generated in similar fashion as the artificial sources in (Calhoun et al., 2006a). However, in the framework we propose, they only represent the expected spatial configuration of the joint sources rather than the actual sources themselves. Here, we consider the most constraint case with a total of MC references, one per modality for each joint component.
Together, all C joint components, each with M modalities, constitute a larger MC-dimensional joint distribution with each single feature Smi corresponding to one dimension. In order to guarantee that this joint distribution is not altered during the assignment process, the assignment of the k-th joint sample value to the v-th voxel (k,v=1,2, …,V) is made simultaneously for all Smi. That is, the k-th sample value from all MC sample sets is assigned to the same v-th voxel in each Smi, at every step of the assignment procedure (Figure 4(a)).
Before proceeding with more details, consider this toy example. Suppose there is only C=1 source and M =1 modality with V voxels. After random sampling V values from a onedimensional distribution, the goal is to assign the obtained samples into a feature map such that it looks like a reference map. Let’s call the final source map S11 and the reference map R11. The straightforward solution to this problem is to vectorize both S11 and R11 (recall that both have the same number of values V), rank the values in the R11 vector, and organize the random sample values into S11 such that in the end the ranking of S11 equals the ranking of R11. In other words, solving this problem is just a matter of matching the ranks of R11 and S11 (Figure 2). The benefit of such an approach is that it admits any probability distribution and also grants control over the final spatiotemporal configuration of the final source.
Extending this idea to the multimodal case is slightly more complicated, however. In the previous toy example, the hardest part of solving the assignment problem was sorting R11 to find the ranks of its elements. In the case of M = 2, the solution is not as obvious since references are necessary for every Smi (one per modality: R1i and R2i). Clearly, the ranks of R1i are more than likely to differ from the ranks of R2i. Consequently, organizing the joint samples into S1i and S2i so as to individually match the ranks of R1i and R2i, respectively, would cause some joint samples to be split apart, destroying the dependencies inherited from the original joint distribution (Figure 3). Clearly, any solution to the assignment problem in the multimodal case must be able to preserve the connections embedded in the joint samples.
Our approach to the assignment problem ensures that, above all, joint samples are not split apart, which guarantees that the experiment is statistically valid. Specifically, we frame the assignment problem as a discrete optimization problem. The function we propose to optimize is the L1-norm of the difference between each Smi and its corresponding reference map Rmi. This cost function is to be minimized over the space of all possible shuffles of the V joint sample values. We avoid the problem of breaking the joint samples by imposing a constraint that every element of a joint sample must be always assigned to the same voxel location (Figure 4).
This effectively restricts the search space to the set of all possible permutations (or shuffles) of V values into V voxel locations. Thus, each permutation d corresponds to one point of a finite discrete search space of size . In order to find a shuffle that simultaneously approximates all MC samples to their corresponding reference maps, we search over this discrete space utilizing a discrete optimization tool called simulated annealing (SA).
2.1.3. Simulated Annealing (SA)
The roots of simulated annealing are in the theory of Markov chain random processes and, particularly, in the Metropolis-Hastings (M-H) algorithm (Ross, 2009). Markov chain is a type of random process (in fact, a random sequence) defined by a transition matrix Q that describes the probability of alternating between neighboring states. Ultimately, a careful choice of Q confers the random process a limiting distribution π roughly corresponding to the frequency that each state is visited, in the limit of an infinite sequence. Careful design of Q allows control over various properties of π, which is the driving principle of both M-H and, particularly, SA. In our problem, we consider every possible permutation d as a different state and then design Q so that the limiting distribution π = π(d) is mostly concentrated on a set of optimal d* which more closely approximate all reference maps simultaneously. The similarity between samples and reference maps is measured by some cost function F(d), which is maximized by the optimal d*.
Let be the maximum attainable value of the cost function F(d), the solution set be the set of all points in that yield a maximum F*, be the number of points , and d* denote any optimal solution in . Then, define a probability mass function (pmf) with parameter λ > 0 over all as:
(4) |
It is simple to verify (Ross, 2006, p.240) that
(5) |
where is an indicator function with if and 0 otherwise. Clearly, this particular choice of distribution for q(d|λ) can be easily tuned to concentrate virtually all of the probability uniformly over d* by simply making λ very large. Back to the mechanics of Markov chains, and particularly M-H, the question becomes whether it would be possible to design Q in such a way that π(d) converges to q(d|λ) in the long run. Generating such a Markov chain is interesting because, effectively, it enables the generation of random sequences of visited states d from q(d|λ), most belonging to the set d*. Indeed, an approach exists which warrants such property: it is called the Metropolis-Hastings algorithm (Appendix C).
Simulated annealing (SA) is a practical work-around to the problem of slowness resulting from simply setting λ to a very high number. It replaces the fixed λ by a gradually increasing one: , where n is the iteration number. SA effectively allows more states to be visited before convergence to the limiting distribution, which considerably increases the chances of visiting an optimal state d faster. After a finite amount of time n = nmax, we select the state (permutation) d with maximal F(d) from the set of visited states and the problem is (approximately) solved. Using this clever formulation the resulting stochastic process can be guaranteed to visit an optimal solution point in with probability 1, as long as the Markov chain is run “long enough.”
In order to use this technique to solve the assignment problem, we need to clearly define the cost function, 2) the parameter K, and 3) how long is “long enough” for such problem. As mentioned earlier, the cost-function we choose is the L1-norm between the C joint sources and the corresponding reference maps :
(6) |
where smi,v and rmi,v are the values of Smi and Rmi, respectively. The notation Si(d) is to indicate a reshuffle d of Si, and smi,v(d) is to indicate the value at voxel v for the current reshuffle d. In order to minimize G(d), we maximize F(d)= −G(d).
For better stability, we normalize each reference Rmi so that its minimum and maximum values match the minimum and maximum of the corresponding Smi using the range operator range(·) to indicate the difference between the maximum and the minimum values, that is,
(7) |
where is the normalized reference.
K must be set according to the magnitude of the change in the cost function when F(d = j) < F(d = i) and essentially affects how likely the Markov chain is to move to state j in such case. Things to consider in the choice of K are the desired probability α(j,i) (Appendix C) of moving to a state that is worse than the current, at some specific time step n = nk (here, ), for some reasonable fixed cost function change F(d = j) − F(d = i) (which may require running the Markov chain a few times to identify a typical value).
Finally, we set nmax to correspond to a reasonable amount of time (which is problem dependent). In our case all 110 spatial maps presented in Figure 5, inset D, were simultaneously obtained with nmax = 3·106 iterations, which took about 17 hours in a PC Intel Core i7 CPU 870 at 2.93GHz, 3GB RAM. We also implemented some other stop controls which consider whether or not a stationary point has been reached (i.e., no jumps from current state d=i).
Figure 5 summarizes our proposed framework and presents the synthetic joint sources generated for our “fully-simulated” experiments in Section 2.3. A high resolution version of the final spatial maps is available as supplementary material (see Inline Supplementary Figure 1 for full appreciation).
2.2. Hybrid-Data Experiment
Here, we replicate the original simulation from Calhoun et al. (Calhoun et al., 2006a) but design the joint components with controlled dependence between modalities instead (Figure 6). Our intent is to showcase some attributes of our framework on an already-published setting, even though it violates some of the principles from Section 1.2 (see Appendix A). Thus, we generate one two-dimensional (C = 1, M = 2) artificial joint source using our proposed framework and add it to real data from the same 30 subjects as in Calhoun et al., using the same data-generation parameters of that work (Calhoun et al., 2006a). Whereas in the original work no attention was given to the joint distribution of the artificial source S1, here we use our proposed framework to control exactly to which extent the two modalities are related in the joint source. As in the original work, the goal is to check if the artificial source added to the real data can be recovered by jICA. Note, however, that jICA (using the FIT toolbox, http://mialab.mrn.org/software/fit) concatenates multimodal data into a single virtual dataset (an ad hoc approach) (Calhoun et al., 2006b). It assumes the same underlying marginal distribution (pdf) for all modalities, and also independence between all modalities, which follows from the use of an ICA algorithm based on i.i.d. samples. The only multimodal association that is explicitly pursued is, thus, via the shared mixing matrix A, set to be identical for both modalities (A1 = A2 = A).
In contrast to that, our artificial joint source is generated from a non-independent bivariate distribution with non-equal marginals. The reference maps we used were obtained from the SimTB simulation toolbox (available at http://mialab.mrn.org/software/simtb). Our choice of marginals is such that, after concatenation, the resulting (mixed) distribution seen by the ICA algorithm is bimodal (i.e., has two modes). Altogether, this design makes for a very interesting setting that challenges the non-linearity used in Infomax (the sigmoid function, not recommended for bimodal distributions), as well as the independence and same-marginal assumptions of jICA. We acknowledge that such design does not meet the principle of simplicity since it changes three different factors simultaneously. Nevertheless, our strive is to illustrate how differently the original simulation could have been designed and we defer the more principled design for our next set of simulations.
After adding this artificial source to real data features (the GLM beta maps from fMRI) from 30 healthy individuals who had performed an auditory oddball task and a Sternberg memory task (Calhoun et al., 2006a), jICA was run 20 times on the resulting hybrid data in an ICASSO framework (Himberg et al., 2004) with a different random initialization in each run. The number of components was automatically estimated to be 9 by the minimum description length (MDL) algorithm (Li et al., 2007a). The sources estimated by jICA were spatially correlated with the artificial ground-truth and the one with highest correlation was considered for further analysis.
2.3. Fully-Simulated Data Experiment
In this section, we demonstrate the full capabilities of our multimodal simulation approach. Our focus is on illustrating all careful considerations that go into properly designing synthetic sources and experiments, according to the outlined principles from Section 1.2. In order to contrast with the hybrid simulation design from Section 2.2, we choose to further test the assumptions of jICA in this section. Thus, our intent is to identify what assumptions in the jICA model are critical to guarantee good, robust performance. Our simulation choices will, therefore, be tailored to support this intent. As mentioned earlier, in addition to the obvious independence between the joint sources Si, i = 1, …,C, there are three key assumptions in the jICA model for any fixed joint source Si : 1) independence of the multimodal sources Smi, m=1,…,M, 2) identical marginal distribution for every Smi, m=1,…,M, and 3) identical mixing coefficients Ami, m=1,…,M. In order to properly carry out this simulation, we will consider each of these three assumptions as a factor in our experimental design and, thus, we need to define the factor levels.
For factor 1, the two obvious extremes are “dependent” and “independent.” However, dependence can be hard to quantify as it may occur in many forms and at arbitrary levels. Complete dependence reduces to a deterministic relationship (linear or not), which is out of scope for this work and will not be pursued. Instead, we prefer to qualify dependence using the concept of copulas introduced in Section 2.1.1, and since jICA is recommended only for 2-way fusion, as indicated in Table 1, we consider only two-dimensional copulas. Different copulas capture different forms (or types) of dependence. Moreover, their parameters typically determine the strength (or level) of dependence. One case we deem interesting is when dependence is entirely due to higher order statistics (HOS) and not to simple linear relationships (i.e., no linear correlation). Currently, however, the literature on the expected type of dependence between multimodal sources is virtually inexistent. Thus, we elect to test three copulas which illustrate dependence due only to HOS (namely, the circularly symmetric (cs), the non-linear circularly symmetric (nlcs), and the Ferguson copulas), and two that are largely driven by linear correlation (t, Frank). The level of dependence is tuned by the Kendall tau (τ) in t and Frank copulas. The parameter choices in cs and nlcs copulas (Perlman and Wellner, 2011), however, mostly introduce linear correlation to the dependence structure, so we opt to use only their “no-correlation” forms. The Ferguson copula (Ferguson, 1995) is defined in terms of a pdf g(z) over the interval z ∈ [0,1]. We intentionally make g(z) bimodal by setting , which gives the pattern in Figure 8 (e). Further details about these copulas are provided in Appendix B.
For factor 2, the range of level options is either “same” or “different.” Thus, we need to choose a “baseline” distribution for one of the modalities and a “different” distribution for the second modality (recall that jICA is recommended only for 2-way fusion). Also, the default jICA setup in the FIT toolbox uses the Infomax algorithm to perform the ICA decomposition. Thus, in order to be fair to jICA and reduce the sources of estimation error, we must select distributions which Infomax can separate well. This prevents uncontrolled factors from interfering with our factor of interest. The Infomax implementation in the FIT toolbox uses the sigmoid non-linearity by default, which separates continuous super-Gaussian distributions well. Super-Gaussianity roughly corresponds to excess-kurtosis above 0. Thus, we elect to use distributions with excess-kurtosis tuned to 1.5: the zero-mean, unit-variance scale-location Student-t distribution with ν=8 degrees of freedom as the “baseline” and the unit-variance left-skew generalized extreme value (GEV) distribution with shape parameter ξ=−0.83 as the “different” distribution. The GEV distribution is evaluated with either zero-mean or zero-mode.
For factor 3, the level choices are also “same” or “different.” For simplicity, however, we opt to use the Pearson correlation to determine the similarity between A1i and A2i, and a bivariate Normal distribution with zero mean, unit variance to generate random draws for both columns. With this setting, ρ=1 makes both columns identical, ρ=0.9 makes them similar, and ρ=0.3 makes them very different.
Note that each of the three factors is always present simultaneously (though at different levels) in each joint source Si. In order to better keep track of the 23 =8 combinations, we define the following eight source types below:
Type 1: A joint source whose spatial maps Smi, m=1,…,M, are independent. The mixing coefficients Ami, m=1,…,M, in this joint source are identical, and so are the marginal distributions , m=1,…,M. This is the only source type fully compliant with all jICA assumptions.
Type 2: A joint source with spatial maps Smi, m=1,…,M, that are dependent between modalities. The mixing coefficients Ami, m=1,…,M, in this source are identical, and so are the marginal distributions , m=1,…,M. The type/level of dependence is controlled by the choice of copula and its parameters.
Type 3: A joint source with different marginal distributions , m=1,…,M. The spatial maps Smi, m=1,…,M, are independent and the mixing coefficients Ami, m=1,…,M, are identical. The marginal pdf of S1i has the same t-distribution as in Type 1 sources but S2i has a left-skew GEV distribution with either the same mode or the same mean as S1i.
Type 4: A joint source with different mixing coefficients Ami, m=1,…,M. The spatial maps Smi, m=1,…,M, are independent and the marginal distributions , m=1,…,M are identical (as in Type 1 sources). The differences are controlled by correlation, which can be high or low.
Types 5–8: These joint sources combine the properties of Types 2–4 according to Table 2.
Table 2: Source Types Generated for the Fully-Simulated Experiment.
Source Types Generated for Fully-Simulated Experiment | ||||
---|---|---|---|---|
Source Properties | ||||
Type | Independence | Same Marginals | Same Mixing Coeffs. | Number of Occurrences |
1 | √ | √ | √ | 5 |
2 | x | √ | √ | 10 |
3 | √ | x | √ | 10 |
4 | √ | √ | x | 10 |
5 | x | x | √ | 5 |
6 | √ | x | x | 5 |
7 | x | √ | x | 5 |
8 | x | x | x | 5 |
Total | 55 |
Sources of Types 1–4 are “pure” in that they either have only one factor misaligned with the assumptions of jICA, or none. Sources of Type 5–8 combine one or more of these misalignments into a single joint source. Put together, these eight source types give rise to eleven different source instances (see joint sources 1–11 in Table B.1), which are then replicated for each of the five choices of copula. This results in a final pool of 55 joint multimodal source instances. After random sampling, reshuffling is performed based on 110 reference maps indicating the spatial configuration for each source Smi. Figure 5, inset D, illustrates one out of ten source pool replicas we used in this work.
Once the factors and their levels are established and incorporated into the joint source pool, we need to design the different test scenarios (or test cases) that can probe the effect of each factor on the performance of jICA. Here we use the principle of simplicity, starting with a fully-compliant, model-based design, and then gradually dropping assumptions by including non-compliant sources from the pool into the experiment. For brevity, we focus on the “pure” effect cases, which imply we will work only with sources of Type 1–4 from the source pool. Note, however, how our approach nicely allows for ample-sized, carefully controlled, and fairly realistic source pools. Table 3 summarizes the eight test cases studied in this work. Below is a description of each test case, using the term “mixtures” to refer to the rows of Xm, m=1,…,M :
Table 3: Test Cases for the Fully-Simulated Experiment.
Test Cases for Fully-Simulated Experiment | ||
---|---|---|
Test Case | Source Types Present in Xm | Total Number of Sources in Xm |
1 | Type 1 only | 6 |
2 | Types 1 + 2 | 16 |
3 | Types 1 + 3 | 16 |
4 | Types 1 + 4 | 16 |
5 | Types 1 + 2 + 3 | 26 |
6 | Types 1 + 2 + 4 | 26 |
7 | Types 1 + 3 + 4 | 26 |
8 | Types 1 + 2 + 3 + 4 | 36 |
Test Case 1: N =6 mixtures of 5 joint sources of Type 1, plus a Noise source (C=6).
Test Case 2: N=16 mixtures of 5 joint sources of Type 1, and 10 joint sources of Type 2, plus a Noise source (C=16).
Test Case 3: N=16 mixtures of 5 joint sources of Type 1, and 10 joint sources of Type 3, plus a Noise source (C=16).
Test Case 4: N=16 mixtures of 5 joint sources of Type 1, and 10 joint sources of Type 4, plus a Noise source (C=16).
Test Case 5: N=26 mixtures of 5 joint sources of Type 1, 10 joint sources of Type 2, and 10 joint sources of Type 3, plus a Noise source (C=26).
Test Case 6: N=26 mixtures of 5 joint sources of Type 1, 10 joint sources of Type 2, and 10 joint sources of Type 4, plus a Noise source (C=26).
Test Case 7: N=26 mixtures of 5 joint sources of Type 1, 10 joint sources of Type 3, and 10 joint sources of Type 4, plus a Noise source (C=26).
Test Case 8: N=26 mixtures of 5 joint sources of Type 1, 10 joint sources of Type 2, 10 joint sources of Type 3, and 10 joint sources of Type 4, plus a Noise source (C=26).
As implied by the Test Case descriptions above, we also included a noise source in all our experiments. However, using the principle of simplicity, we take a conservative approach and consider only a single additive multimodal noise source and model it as an extra joint source (in fact, the C -th one, SC=Z) following a zero-mean, unit-variance, uncorrelated multivariate Gaussian distribution. It is well-established in the literature that ICA can recover at most one Gaussian source, so we do not anticipate our choice of additive Gaussian noise would have any significant effect on our results. The l-th mixture Xml, l=1,…,N, is computed as Xml =Yml +amlZm, where is the l-th noiseless mixture from modality m, amli is the mixing coefficient of the i-th joint source for Xml, and aml is the mixing coefficient of Zm for Xml. Together, amli and aml constitute the l-th row of Am. First, the amli are generated according to the description for Factor 3 above (Note: ). The signal-to-noise ratio (SNR) of each Xml is conservatively set to be 30dB (low noise), which is attained by appropriate selection of aml. In turn, for any fixed l, aml may differ across modalities. However, allowing that is a violation of the mixing coefficient assumption in jICA (i.e., Factor 3 set to level “different”). We therefore enforce the same value for all modalities (aml =al, m=1,…,M). We still guarantee the minimum desired SNR of 30dB in all modalities by using the Yml with smallest second moment to compute al (see details in Appendix D). While the mixing coefficients (Ai) of all other 55 joint sources are generated only once for each source pool replica, the constants al also have to be computed separately for each Test Case in Table 3.
For each Test Case, the number of mixtures (N) is set to be equal to the total number of joint sources (C). In turn, the mixing matrix Am is always square. This is advantageous since it eliminates the need for dimensionality reduction (and especially PCA), which is required when N >C. Altogether, the choices of square Am, high SNR (low noise), and aml =al, m=1,…,M, mark the last piece the of effort to minimize the presence of uncontrolled sources of error while attaining minimally realistic sources.
As indicated earlier, we generate a total of 10 replicas of the source pool described above. JICA was run 21 times, utilizing a clustering-based estimation reliability package called ICASSO (Himberg et al., 2004), on each of the 10 replicas of Xm, for each of the 8 Test Cases above, always extracting a number of sources C=N, utilizing the Infomax algorithm and the default feature normalization in the FIT toolbox.
2.4. A Comparison Experiment
Here we give an example of a comparison study. We opt for comparing jICA against CCA. Our intent is to compare the performances of jICA and CCA in the scenarios described below. As we show in Section 3.2, jICA’s performance is heavily dependent on Factor 3 and generally fails to recover the sources when A1i ≠ A2i, which is the case for source Types 4, 6, 7, and 8. On the other hand, CCA is constructed to find maximally correlated (i.e., linearly dependent; see Factor 1) sources, regardless of their mixing coefficients Ai. Therefore, we should expect it to perform better on sources with dependence, i.e., source Types 2, 5, 7, and 8.
We will consider the following 4 scenarios, exploring the main effects of Factors 1 and 3 on the performance of jICA and CCA:
N =6 mixtures of 5 joint sources of Type 1, plus a Noise source (C=6);
N=11 mixtures of 10 joint sources of Type 2, plus a Noise source (C=11);
N=11 mixtures of 10 joint sources of Type 4, plus a Noise source (C=11);
N=6 mixtures of 5 joint sources of Type 7, plus a Noise source (C=6).
The sources used in this experiment come from the same 10 source pool replicas described in Section 2.3.
2.5. Quality Assessment:
Here we consider the problem of defining appropriate quality measures. The first thing to note is that the basic model from Equation (1), Xm =AmSm, makes no statement about the ordering or the expected scale of the sources Sm. This means that methods attempting to recover Sm and Am from Xm have some freedom and may find valid solutions up to permutation and scaling ambiguities. This is exactly the case with ICA-based approaches. However, even methods like PCA and CCA, which are constructed to find a very specific ordering and scaling of the estimated sources, are not guaranteed to find the scale and permutation of the ground-truth sources. For that reason, quality assessment of any solution to Equation (1) should first remove the scaling and permutation ambiguities by means of matching and rescaling techniques. Of course, this is only relevant when the ground-truth is available to the experimenter. In the following, we introduce an approach for matching and rescaling source estimates in the nontrivial case of multimodal data. This helps minimize the presence of spurious, misleading errors due to invalid comparisons between estimates that are incorrectly matched to, or out of scale with, a GT.
2.5.1. Matching Source Estimates
Here we describe how we match the estimated joint sources to the ground-truth (GT) Si. Our strategy consists of computing a correlation matrix between the columns of the GT Am and those from the estimate , utilizing Kendall’s rank correlation τ. The Kendall correlation τ is more reliable than the Pearson correlation ρ (Embrechts et al., 2002; Joe, 1997), and a high τ means there is very good agreement between the GT and the estimate. While matching based on correlation between the rows of the GT Sm and those from the estimate is also possible, and should give similar results when the sources are successfully recovered, Kendall correlation is very time consuming on very large Smi (i.e., large V) and thus we choose not to do so. Note, however, that when the estimation is bad (i.e., the method performed badly), matching based on Am or Sm would likely result in different matches. We believe this is not much of an issue, however, since either way the conclusions would probably be the same, that the method performed badly. Also, correlations based on Smi are more accurate when N<V, and should be used when maximal accuracy is preferred over efficiency. Needless to say, in such cases mutual information (MI) may be an even better choice over Kendall correlation if V is in the order of thousands. In general, unless the researcher has special interest in the quality of either the sources or the mixing coefficients, the choice over matching based on Sm or Am, respectively, should be made based on how computationally fast and accurate the measurement (here, Kendall correlation) will be. In this work, we opted for computational speed, using Am to establish the matching since V >>N. Note that correlations based on Sm would have been more accurate but the robustness of Kendall correlations somewhat mitigates for the lower accuracy of using Am. An alternative would be to compute both source- and mixing-based Kendall correlations and combine them by either selecting the largest or the smallest correlation for each element of the correlation matrix. This would either favor strengths or penalize weaknesses of multimodal methods, respectively. Ultimately, the decision is dependent on the problem and the needs/interests of the simulation study being conducted.
In our experiments, the information contained in the estimated mixing coefficients sufficed to correctly match to Si. However, we have also observed that multiple can be matched to a single Si in some cases, particularly when the multimodal method used to obtain fails to preserve the joint distribution and splits the modalities from a single Si into multiple . This typically happens when the method assumes identical, shared mixing coefficients Am, m=1,…,M, but in reality they are not (e.g., when jICA is applied on Test Cases that include sources of Type 4; see Figure 12). The problem in such cases is that the Smi in the i-th joint source are each estimated by a different . In order to account for that, we have worked out a set of rules that define a successful match: 1) an estimate index j may be matched to some GT index i only once; 2) at most M matches can be made to a GT index i, and only one per modality; 3) finally, there can be at most C matches in total. Below, we provide an algorithm that permits multiple to match a single Si using these three rules:
compute a matrix Tm of absolute value Kendall correlations for each modality. Each Tm contains the correlations between all Ami and all ;
compute a summary correlation matrix Tmax consisting only of the maximum correlations across modalities;
define a new matrix Jmax that contains the modality number m that originated each entry of Tmax. If , then the correlation value in was obtained from T2. In general, contains the highest (in absolute value) out of M correlations between Si and , i,j=1,…,C. contains the number of the modality with the highest correlation;
start an empty list of matches. Each entry in the list will contain a triplet: the row i of Tmax, the column j of Tmax, and the corresponding modality number in
- scan through the correlations in Tmax in descending order. For each correlation do:
- IF an estimate index j has not been matched yet, AND (a GT index i occurred less than M times AND not for the modality in )), THEN insert the current triplet in the list.
- IF the list contains C rows, THEN the list is full, quit the algorithm. ELSE repeat step 5 on the next largest .
It is worth noting that this algorithm will work for the case of M=1 as well. We have observed that applying the proposed matching algorithm on each modality separately (using Tm) leads to matchings that differ from those obtained using Tmax, which we indicate in Figure 12 by “Alt-Match”.
2.5.2. Rescaling Source Estimates
Once we have found a correspondence between estimate and GT it is simple to address the scaling issue. First, for each item in the list of matches, if we choose to fit a least-squares linear regression model between the estimate and the ground-truth Smi. This is valid since we are assuming the model from Equation (1). Once we have the model fit, we take only the scaling parameter βmj and use it to rescale . When , we prefer to adjust the interquartile range (IQR) of to match the IQR of Smi instead, since the linear fit approach works well only when estimation is at least somewhat good.
Note, however, that for the same joint source , βmj might differ across modalities. This is fine in most cases, except for methods that assume the same mixing matrix for all modalities (including jICA). For example, given that the manner in which jICA concatenates all modalities into a single virtual dataset, then only one βj should be computed and used for all modalities in . In such cases, the modality from which we compute βj is selected from .
2.5.3. Assessing Performance
We consider two forms of performance evaluation: one global and the other specific. A global performance indicator can be obtained by understanding that the estimation of Am and Sm is obtained via an unmixing matrix . Since the ground-truth Am is a square, invertible matrix by design, it is reasonable to expect that when the estimation is successful (I is the identity matrix), after correcting for scaling and permutation ambiguities. The normalized Moreau-Amari’s inter-symbol interference (ISI) (Amari et al., 1996; Macchi and Moreau, 1995) is invariant to scaling and permutation ambiguities, as opposed to, say, the metric used in (Daubechies et al., 2009), which makes it advantageous for assessing estimation accuracy. The normalized ISI is defined as:
Here, hik are the elements of the matrix , and C is the number of sources. The normalized ISI is bounded between 0 and 1 and the lower the value the better the separation performance. ISI is zero if and only if the model is identified up to the scaling and permutation ambiguities. Since Am, m=1,…,M, can differ from each other, the ISI may differ across modalities and, thus, we report each separately.
Global performance indicators such as the ISI are great for quick assessments but lack specificity. We believe it is also important to assess the quality of individual estimates . For that reason, we adopt two measures: the normalized mean square error (NMSE) and the normalized mutual information (NMI) between estimates and GTs Smi, m=1,…,M. Note that both NMSE and NMI are sensitive to scaling and permutation ambiguities and, therefore, we use only matched and rescaled estimates with these measures. A low NMSE indicates good performance. From experimental observation, NMSE < 0.1 (< 10% error) is indicative of “good” estimation. NMSE < 0.001 (< 0.1% error) is “excellent.” NMSE > 0.5 (> 50% error) is “very poor.” NMSE between 0.1 and 0.5 is just “poor.” On the other hand, low NMI values mean “not related” and, thus, “very poor” estimation. Large values mean “identical” and, thus, “excellent” estimation performance. We estimated mutual information using the estimator proposed in (Peng et al., 2005) and observed that NMI > 0.6 is equivalent to NMSE < 0.1, NMI > 0.9 is equivalent to NMSE < 0.001, and NMI < 0.37 is equivalent to NMSE > 0.5. Both NMI and NMSE have behaved equivalently in all reported results, although their relationship is clearly non-linear. Further details about NMSE and NMI are provided in Appendix E.
Both NMSE and NMI were computed for every matched pair , m=1,…,M. Since there are only two modalities in our experiments, we reported the obtained values in a scatterplot with the value obtained in Modality 1 and Modality 2 along the x-axis and y-axis, respectively.
3. Results
In the following, we review the results for the experiments described in Sections 2.2–2.4. In the hybrid case, the synthetic source is designed to display some level of multimodal association at the joint distribution level in addition to the usual mixing level association. In the fully-simulated case, we are much more careful with the joint source design, in an attempt to assess each of the basic jICA assumptions. In particular, we consider the main effects of source Types 1–4 in Test Cases 1–4, and their interactions with each other in Test Cases 5–8. In all fully-simulated test cases we include sources of Type 1 to serve as a “baseline” for comparison purposes (recall that Type 1 sources are fully compliant with the jICA assumptions). Finally, in the comparison study, we take a basic approach and consider only the main effects of source Types 1,2,4, and 7 on the performances of both jICA and CCA.
3.1. Hybrid-Data Experiment
We start with the spatial maps of the estimated joint sources (Figure 9).
The salient spatial map features of Modalities 1 (S11) and 2 (S21) are very nicely captured by the jICA estimates while only S11 had its background more correctly estimated (see also the scatter plots in Figure 11). Careful analysis of these spatial maps reveals the existence of relationships between the two ground-truth spatial maps which are apparently lost after the estimation (notice how the “activation regions” from S11 are slightly present in S21 but not so in ). This is indication that the multimodal association that was present in the ground-truth joint pdf was not entirely captured in the estimate.
Figure 10 (a) also indicates the effect of the sigmoid non-linearity used in Infomax and its limitation in modeling distributions with multiple dominant modes. Notice how the algorithm selects one of the modes of the concatenated distribution and not the other (the estimated distribution is centered on the left mode, which corresponds to S11). This is indication that the results favor S11 over S21.
Indeed, Figure 11 (a–b) further illustrates this point: the distribution of S11 is better estimated than that of S21. However, Figure 11 (c–d) shows that the estimation is not entirely good, otherwise all points would have fallen on top of the red line. These plots indicate that the voxels closer in value to the mode (in each ground-truth) were more likely to be incorrectly estimated than those further out in the tails of the marginal distribution. This is the reason why the salient visible features (larger absolute values) of each modality have been nicely recovered in both maps. Nevertheless, the bulk of the voxels (in what might seem as just background noise) were incorrectly estimated, and the multimodal associations present in those voxels were lost.
3.2. Fully-Simulated Data Experiment
We start with the Kendall rank correlation matrices for each test case of one dataset replica. The values for τ indicate the level of association between ground-truth and estimates. |τ|=1 means that the estimated mixing coefficients and the ground-truth mixing coefficients are completely dependent (which is the case when the estimation is successful). The row numbers indicate the joint components (from the list of 55 described in Table B.1 and illustrated in Figure 5) that are present in that test case. Component 56 is the Gaussian noise source described previously and is present in all test cases. The column numbers indicate the component numbers in the order in which they were estimated by jICA, and we display them in the order they were matched to the ground-truth. We threshold the images at |τ|>0.375 for display purposes. The circles and stars indicate to which ground-truth the estimates were matched. We use stars for GT sources of Type 4 and cyan dots for all other types. The green dots (Alt-Match) correspond to the matching obtained when we consider each modality individually, and illustrates how they can differ from the proposed matching algorithm results. The green and yellow stars indicate the lowly and highly correlated versions of Type 4 GT sources, respectively.
For all test cases that did not include sources of Type 4, the matching results were the same for our algorithm and for those obtained from each modality individually. However, when sources of Type 4 with ρ=0.3 were present in the mixtures, we observed that sometimes more than one estimate were matched to a single ground-truth, once in each modality (e.g., Test Case 4, GT source 17). This is an interesting and novel result. It suggests that jICA “wastes” two component estimates to recover the information from both modalities of a single GT source. As a result of that, some GT sources end up with no matches, which we interpret as a miss by jICA. This is possibly why in Test Case 7, GT source 39 had a match in only one modality; the other modality required an extra component to be estimated but was missed because there were no sources left to estimate. This kind of “omission” behavior from jICA was very seldom observed in sources of Type 4 with ρ=0.9, possibly because good estimation of the mixing coefficients from either modality led to the same match in the other modality (recall that sources of Type 4 have different mixing coefficients but independent, identical marginal pdfs).
Next, in Figure 13, we present the global performance measurement ISI for both modalities in one replica of all test cases presented in Table 3. We have used color coding to identify the Test Case, and shape to identify the number of modalities with “good” estimation (since in some cases each modality may have a different mixing matrix Am). Smaller values indicate better performance. Test cases which include sources of Type 4 tend to perform much worse than all other test cases.
Next, we consider the more specific performance measures: NMSE and NMI. Here we want to assess the quality of each by comparing it to its matched GT source Smi. Notice that when several estimates are matched to the same ground-truth, we plot the NMSE for each match, which is why in such cases the GT source index number (i) appears more than once in the same plot. NMI plots were consistent with the NMSE plots and are available in the supplemental material. Figure 14 (a) shows the NMSE between estimated sources and ground-truth sources Smi, while Figure 14 (b) shows the NMSE between estimated mixing coefficients and ground-truth mixing coefficients Ami. For Test Case 1 we see that if all model assumptions are met, the quality of the estimation is remarkably good. In Test Case 2, only the assumption of independence at the joint source level is violated, leading to only a small change in performance. In Test Case 3, only the assumption of equal marginal distributions is violated, leading to a considerable reduction in performance. In Test Case 4, we observe the worst results among the first four “pure” test cases, suggesting that considerable differences in the mixing coefficients between modalities (i.e., low mixing level associations) lead to incorrect estimations by jICA.
In Test Cases 4, 6, 7, and 8 the presence of Type 4 sources had an interesting effect: it amplified the errors in non-Type 4 sources. This is an important observation as it suggests that the validity of the identical mixing level assumption is fundamental to warrant the use of jICA in practice.
In Figure 15 we illustrate the “baseline” case of source Type 1. Here, the GT sources are fully compliant with the jICA assumptions (i.e., only strong mixing level associations are present: A1 = A2). No surprise, the results are excellent. In particular, panel (a) indicates that the recovered spatial maps are nearly identical to the ground-truth (the error is near zero). Panel (b) illustrates the nearly perfect recovery of the marginal distributions for each modality and for the concatenated case as well. Panel (c) is a scatter plot of the estimated versus ground-truth values for each modality. The near-one Kendall τ suggests the estimate is very accurate, which is consistent with the fact that all points in the scatter plot fall right on top of the dashed diagonal line. Panel (d) compares the estimated and ground-truth mixing coefficient values (recall that these are identical between modalities (A1 = A2) for source Types 1, 2, and 3 but not for source Type 4 (A1 ≠ A2)). Finally, panel (e) compares the estimated and GT multimodal joint distributions. The estimated joint distribution is obtained by “deconcatenating” the joint source post hoc (i.e., after estimation). Clearly, all results tell a concise story: sources of Type 1 are easily recovered by jICA with very good, accurate estimation.
In Figure 16, we illustrate the case of Type 3 sources. These sources have only mixing level associations and their marginals have either different means or different modes. Figure 16 illustrates the case of different means (GT source 4). We observed that the estimation is poor (high NMSE) but mostly because jICA incorrectly estimated the center (mean) of the marginal distributions. This is reflected in the scatter plots of panel (c), which show the cloud of dots are parallel to the dashed diagonal line. Nevertheless, jICA seems to capture the marginal distribution shapes quite well. Moreover, we notice that the cloud of dots in the scatter plots of panel (c) suggest that values near the mode of the marginals are estimated with less accuracy than the values near their tails. This is consistent with panel (a), which shows that the salient features (“activation regions”) in the spatial maps are nicely recovered. The estimation of mixing coefficients is also remarkably good (Kendall τ =0.933). Contrary to our expectations, we observed that jICA was able to estimate sources whose marginals had the same mean but different modes.
In Figure 17 we illustrate the details of sources of Type 4 with low mixing level association (ρ =0.3). Note the two estimates that were necessary to extract the information from a single joint source. We observed no cross-component contamination, however. Also, once we realize that sources of Type 4 are actually sources with no multimodal associations, it is no surprise that the modalities end up in two separate components. Although we have not tested it, we conjecture that in cases were mixing level associations are not present but joint source level associations are (sources of Type 7 and 8), the results are likely to be non satisfactory. This is based on the fact that in the presence of Type 4 sources, even Type 1 sources can be very poorly estimated (see NMSE of Type 1 sources in Test Case 4, Figure 14).
3.3. Comparison Experiment
The estimations obtained with CCA and jICA were compared for quality on four different scenarios, each representing a combination of levels from Factors 1 and 3 (Section 2.3). Based on the results of the fully-simulated experiment above, jICA should perform well on sources of Type 1 and 2, but not on sources of Type 4 and 7 since these have different mixing coefficients (A1i ≠ A2i). This is exactly what we conclude about jICA’s performance based on the ISI plot in Figure 18. CCA, on the other hand, is constructed to find maximally (linearly) correlated sources, so we would expect it to show some improvement over jICA on sources of Type 2 and 7 since these joint sources have dependence between modalities. Indeed, Figure 18 indicates CCA does better at finding sources with dependence than sources without it (Types 1 and 4). However, for sources with dependence and identical mixing coefficients (Type 2) jICA still performs much better than CCA. It is only for sources of Type 7 (dendence with different mixing coefficients) that CCA outperforms jICA. This was expected. The surprising aspect of the result is that we expected a wider performance gap between the two methods in this case, but that was not the case.
In order to better understand the results from above, we explore the individual source estimates using NMSE. Figure 19 gives the complete picture. As indicated by the CCA results on Type 2 sources, CCA did perform as well as jICA on sources that contained linear dependence in their structure (sources 13, 14, 24, and 25). Similarly, CCA was able to attain good estimation of Type 7 sources that had linear dependence between the modalities (sources 21 and 32), clearly outperforming jICA as was expected.
4. Discussion
4.1. What have we learned?
We have demonstrated our new, statistically motivated approach to synthetic multimodal data generation and its benefits to the study of stochastic linear data fusion models. The approach was developed in an attempt to more accurately control the type and strength of multimodal associations, especially for associations occurring at the joint distribution level. As a result, we were able to systematically construct joint sources with arbitrary combinations of mixing and joint distribution level associations (sources of Type 1–8). This was necessary in order to enable proper performance assessment of the stochastic linear data fusion model we chose to study, jICA.
Our assessments of jICA indicated that successful performance is heavily dependent upon the validity of the mixing-level assumption (Factor 3) in the given dataset. The model performance collapses even if this assumption is violated in only a small subset of the joint sources (about 1/3 of all joint sources in Test Case 8 were of Type 4).
The hybrid-simulation results indicated that jICA estimates the tails of ‘well-behaved’ pdfs quite well. In particular, we have noticed that the jICA estimate tends to choose one modality over the other when two very salient modes exist (Figure 10 (a)). Further evaluation of the marginal distributions of each modality clearly indicates that the estimated distribution is quite biased toward Modality 1 in Figure 11 (a–b). Moreover, the scatter plots in that figure demonstrate that the jICA estimation is reliable and accurate mainly for voxels with larger absolute values (tails of the marginal pdfs). These results suggest that more careful attention to details (in what may appear only as background) is necessary since interesting associations may be obscured or even lost otherwise. Violations of the assumptions of independent and identical marginals of a joint source are not unlikely to occur in real data and may be missed to some extent in the jICA estimation.
The fully-simulated data results, on the other hand, give a deeper understanding of the effect of each assumption violation we tested. We observed that when all other assumptions hold, the violation of the independence assumption at the joint source level has little effect on the results, contrary to our initial belief. After careful analysis of the implementation of jICA, and especially the concatenation approach, we conclude that retrieving the joint information post hoc (i.e., by “deconcatenating” the joint sources) after the estimation is complete provides satisfactory results when strong mixing level associations exist. Further investigation with sources of Type 5 through 8 would help clarify the necessary conditions.
We also observed that all Type 3 sources with zero mode had their center (mean) estimated in error, although the shape of the marginals were nicely recovered. Our interpretation is that these errors occur when the means of the marginals differ from each other. Recall that in all tests the marginal of the first modality was always a t-distribution with both mean and mode at zero, and no skew. Also, the second modality in sources of Type 3 always had a left skew GEV distribution, with either the mean or the mode at zero. Thus, if the mean is at zero, skewness and mode are different between the two modalities, and if the mode is at zero, skewness and mean are different. The conclusion is that the two distributions can be different in many ways but when their means are different the jICA estimate suffers more. Lastly, notice that the marginal distributions we tested here are reasonably simple and do not differ too much between modalities. Our conclusion is that in real data this might not be the case and the jICA estimates might be in (greater) error. One way to circumvent this issue in the future would be to directly model the multimodal distribution of each joint component (instead of the concatenated weighted average of the marginals). In such case, the jICA estimates (using concatenation) could be used to initialize the solver of such higher-complexity model.
The results from varying mixing coefficients across modalities are by far the most compelling. They suggest that the mere presence of a few sources with no mixing level associations in the dataset can lead to poor estimation of all other sources, regardless of their type. This is an observation that, to the best of our knowledge, had not been reported in the literature before. Its greatest impact is in the implications to the case with jointly associated sources that have no mixing level associations (sources of Type 7 and 8). Indeed, the results from the comparison experiment confirm that jICA fails to recover sources of Type 7, even though these sources contain multimodal associations in the form of dependence in the multimodal distribution. Finally, the comparison between jICA and CCA indicated the latter improves upon jICA on sources of Type 7 that contain linear dependence. Therefore, we conjecture that direct modeling of the multimodal distributions of each joint source could effectively lead to further improvements.
4.2. Multimodal applications and beyond
The jICA demonstrations above were illustrative of the in-depth evaluations that can be made with our simulation framework. A number of different models from the literature could be considered under many different scenarios utilizing the same principled design approach and synthetic source generation mechanisms we put forward here. The ability to carefully control the form and level of dependence between modality features makes our approach particularly unique and valuable to the neuroimaging community because it enables experimenters to directly test novel forms of meaningful multimodal associations. As we indicated here, even hybrid simulation designs could benefit from our approach. While hybrid designs do not necessarily obey the principles we outlined, they are useful for the challenge they represent and their closeness to real data. And they can be even more useful if the synthetic portion of the simulation does obey those principles.
Furthermore, the principles and strategies outlined here can be easily adapted to allow investigators to pursue new, challenging questions about classical stochastic latent variable models such as PCA, CCA, and ICA. For example, our data generation framework can be adjusted to the single-modality case, like we did in (Calhoun et al., 2013), where we focused on showing how little effect sparsity has on ICA of brain fMRI. Surely, the controversy of (Daubechies et al., 2009) should not reflect into this study. However, it is important to highlight that the referenced work had some serious issues with respect to the interpretation of sparsity and the notion of sources. Our recent publication provides further details (Calhoun et al., 2013). In summary, in light of the principles we introduce here, if an experimenter’s intent is to assess the effect of sparsity on the performance of ICA, then all other factors (including joint and marginal distributions, mixing coefficients, noise, among others) should be controlled and kept fixed while sparsity with a carefully selected measure, alone, is varied from one test case to another.
Also in the single-modality case, our approach could be adjusted to investigate how dependence between sources alters the behavior of methods like PCA and ICA, which has been the focus of other generalizations of ICA such as multidimensional ICA (Cardoso, 1998), independent subspace analysis (ISA) (Hyvärinen et al., 2009), and independent vector analysis (IVA) (Anderson et al., 2014 (to appear); Kim et al., 2006), among others. Even further, our framework is relevant for analysis in general, and could be extended to assess the effects of different dependence structures in different groups of subjects and, ultimately, test complex forms of group differences.
4.3. Sampling of high-dimensional distributions and sample dependence
In this work we focused on sampling from independent joint distributions, in which case the MC-dimensional joint distribution factorizes as the product of C M-dimensional joint sources. If M>2, M-dimensional copulas would be required. However, M -dimensional copulas are scarcer in the literature, which limits the types of dependence that can be tested in higher dimensions. Also, one could choose to introduce dependence between joint sources as well and, thus, require sampling directly from the large MC-dimensional distribution. In such cases, approaches such as rejection sampling, slice sampling, Metropolis-Hastings, and Gibbs sampling might be a better alternative. However, many of these sampling approaches are based on the theory of Markov Chains and, therefore, generate dependent rather than i.i.d. samples. With such methods it is not possible to completely eliminate sample dependence. This could be an issue in the more complex case of spatial and/or temporal processes, which have a natural, controllable dependence structure between neighboring voxels/timepoints. The additional sample dependence from the sampling technique could interfere with the dependence structure of the underlying process and render the experiment less informative than it could be.
We also acknowledge that classical stochastic models such as ICA, CCA, and PCA are ultimately concerned with the stationary statistics of the underlying sources, not their spatial/temporal configuration, which is why a reordering of the data values does not entail a different solution for these methods. In fact, random shuffles would likely make the data more i.i.d. However, a reordering of the values towards a specific spatial/temporal configuration, like the one we proposed here, does cause samples to become dependent. We argue that the type of sample dependence that emerges from our reshuffling strategy is actually desirable in some cases. Since we start from i.i.d. samples, the dependence emerges only after reshuffling and is actually favorable to stochastic methods that are sensitive to spatial/temporal configuration information, such as the class of semi-blind approaches, including ICA-R (Lin et al., 2007; Lu and Rajapakse, 2000). Specifically, ICA-R introduces modifications to the original ICA model that enforce spatial correlation constraints with known a priori spatial maps (Du and Fan, 2013; Lin et al., 2010), favoring the identification of specific spatial patterns. Other examples include algorithms that directly take into account the correlation among the samples (i.e., among neighboring voxel values) along with higher-order statistics (Du et al., 2011; Li and Adali, 2010). A reordering of the data values would change the results of such algorithms. This is further indication of the importance of designing realistic sources with reasonable activation regions, even though the simpler models may fail to take such structure into account. To conclude, we note that a different approach called conditional sampling (Daubechies et al., 2009; Groves et al., 2011) entails multiple sampling distributions and a final distribution of mixture type, which is typically undesirable and leads to a dependence structure much harder to control in a simulation setting.
5. Conclusions
Our work describes a series of issues associated with current ad hoc approaches utilized for generation of synthetic multimodal data in demonstrations of new multimodal models. We have observed an overall lack of statistical rigor in previous simulation studies, which we believe have often resulted in an incomplete picture about the strengths and, in particular, the limitations of these models. Our work attempts to fill this gap by borrowing ideas from the field of computerized simulation, which we break down into a set of principles for synthetic data generation. Based on these principles and with a firm grasp on statistical rigor, we proposed a novel framework for simulation of stochastic multimodal models that is generic, admits a gradual, controlled transition between various designs, and is able to accommodate realistic neurological features. With this approach it becomes possible to rightfully assess specific scenarios and model assumption violations.
The simulation framework introduced here opens a wide window into how multivariate stochastic processes must be tested and studied before their application to neuroimaging data and widespread interpretation of ensuing results takes place. The takeaway is that careful design of the simulated datasets, in agreement with the proposed principles outlined here, is fundamental to address specific questions in a careful, incremental, and controlled fashion.
We showcase this in the case of joint independent component analysis and present novel results about its limitations and conditions for success that have not been reported before. Specifically, we have demonstrated the impact violations to the main assumptions of jICA may have in real datasets. Our results indicate that jICA tends to correctly estimate mainly the tails of the underlying marginal pdfs while also arbitrarily favoring the center of one of the modalities when two modes are equally salient. Results also suggest that physiological features distributed with “well-behaved” tails could be nicely captured even when the ground-truth modalities have different distributions. More importantly, our results suggest that the mere presence of only a few sources with no mixing level associations in the dataset can lead to poor estimation of all other sources, regardless of their type.
We believe our work makes an important contribution to the field by providing fair grounds for assessment and comparison of new, upcoming multimodal data fusion models, and serves as a new baseline for multimodal simulation design. The implication to neuroimaging is quite straightforward: better understanding of a model’s strengths and limitations leads to reliable interpretation of the underlying forces driving multimodal associations. Therefore, we recommend that future simulation works be developed with grater care and in line with principles and examples we outline here.
Supplementary Material
Highlights.
Current ad hoc simulations hinder rigorous evaluation of fusion methods
We establish guiding principles for proper generation of synthetic multimodal data
We introduce a solid, principled simulation framework for data-driven models
Framework showcase: we push the assumptions of leading methods, joint ICA and CCA
We bring more clarity about joint ICA and CCA’s performance under certain conditions
Acknowledgements
The authors would like thank Prof. Marios Pattichis, PhD at the Electrical and Computer Engineering Dept. in The University of New Mexico (UNM) for valuable input and orientation.
This work was funded by NIH/NIBIB 1R01EB006841 (Calhoun).
Abbreviations:
- SA
Simulated Annealing
- MH
Metropolis-Hastings
probability distribution function
- FIT
fusion ICA toolbox
- ISI
inter-symbol interference
- ICDF
inverse cumulative distribution function
Appendix A
A Brief Review of Previous Multimodal Simulation Works
Here, we review how simulations have been conducted in previous works and identify in what ways they have been in agreement (or not) with the principles outlined in Section 1.2. We start with the first multimodal simulation designed for jICA (Calhoun et al., 2006a), which proposed a very simple approach, combining real data with artificial signals to demonstrate a strength of that model. By construction, jICA identifies joint multimodal components that share identical mixing coefficients Am. However, the proposed simulations did not account for the stochastic nature of the components. Particularly, the designed joint components were not random samples from a joint pdf but, rather, points from a single deterministic 2D function (cosine with exponential decay), and the joint distribution dependence was unknown and not controlled. One limitation is that this approach cannot guarantee the simulated components will conform to the jICA assumption of independent modalities because of the undefined, arbitrary dependence structure of deterministic sources. Also, this approach lacks the benefit of multiple replications because it cannot generate different random sampling instances from the same distribution, even though in practice deterministic values can be thought of as one particular instance of random sampling. As a result, ensemble averages from multiple replications cannot be reported. This simulation approach, therefore, fails to meet the principles of fairness and control. To be considered fair, the method of data generation should have narrowed possible estimation errors by controlling the dependence structure between the multimodal spatial maps. Lastly, the marginal distributions had a point mass function at zero, which means the pdfs were actually discontinuous and, thus, not in line with the Infomax implementation used for ICA.
Next, we consider the simulation strategy for CCA+ICA (Sui et al., 2010), and notice that it also did not fully consider the joint structure of the source components, using deterministic 8-bit gray-scale images as the multimodal spatial maps rather than random samples from a controlled joint pdf. In contrast to Calhoun et al. (Calhoun et al., 2006a), this work explores some limitations of jICA by violating some of its assumptions. Specifically, the simulation setup was intended to challenge the jICA assumption of identical pdfs for all modalities in the same joint source component. This was achieved by, in some cases, assigning different images (with different pdfs) to each modality in the same joint component. However, that work also used different mixing matrices for each modality to challenge the jICA assumption of identical mixing coefficients. This was particularly relevant (and somewhat realistic) in the context of that work but also induced an additional source of estimation error, making it ambiguous whether jICA failed due to the different mixing matrices or the different modality distributions in the same joint source component. In addition to that, it is unclear how different the distributions of each modality are (in the case they are assigned different images), neither to what degree the modalities are independent. As in (Calhoun et al., 2006a), this simulation is not entirely compliant to the principles of fairness and control. More importantly, it is not committed to simple since it tries to simultaneously account for both mixing- and component-level effects.
Finally, we consider the work on linked ICA (Groves et al., 2011). Although their multimodal simulations use random samples from carefully designed marginal distributions, the sampling strategy uses conditional sampling (see Section 4.3). This is due to the implicit use of an indicator function to define “active regions” in the spatial maps (the sampling distribution is not identical but, instead, conditioned on the spatial location of each random sample). The same issue can be observed in the simulations used to support some inaccurate claims about ICA in Daubechies et al. (Calhoun et al., 2013; Daubechies et al., 2009). The concern is that the final source distributions become of mixture type, and in Groves et al. (Groves et al., 2011) they present discontinuities, with probability masses at zero. Like other ad hoc simulations, this is not following the principle of fairness since common implementations of ICA are not optimized for discontinuous distributions. This may act as an additional source of unintended errors, and could lead to faulty conclusions too. Moreover, the work provides no description of how the samples are assigned into spatial maps. This procedure is very important since it plays a major role in controlling dependences in the multimodal joint distributions. When different modality groups (as defined in Groves et al. (Groves et al., 2011)) are co-registered (e.g., smoothed summary structural and functional MRI maps from the same subject) the assignment of sample values into spatial maps, one map at a time, would likely lead to uncontrolled dependencies (see Figure 3).
We conclude that current approaches to synthetic multimodal data generation are perhaps intuitive and straightforward in the way they are designed but do not enable full control of the underlying joint distributions, which renders such experiments less informative than they could be. Also, a recurring issue in all multimodal simulation experiments we have reviewed is that most of the source design effort is directed to the spatial configuration of the “active regions” in the source rather than how its underlying joint distribution may relate to other sources. This is like prioritizing the principle of realistic sources over the principle of fairness, both of which have naturally conflicting goals but are equally important. Currently, the assignment of sample values into maps can be generally qualified as an ad hoc procedure lacking any particular care for the resulting joint distribution between modalities at the component level, or even across components. It is likely that unwanted/uncontrolled dependencies between modalities will be present in such simulations, possibly working as yet another source of estimation errors. Also, a full evaluation of the quality of the estimated sources was not presented (the estimated source distribution is often not compared against the ground-truth distribution); only the mixing coefficients were compared to the ground-truth. All of these motivate the pursuit of a more careful design that guarantees total control of the joint distributions and more in depth performance evaluations.
Appendix B
Equation Chapter (Next) Section 1
Distribution Formulas and Detailed Source Definitions
Copula Formulations
The bivariate t-Copula: this is an elliptical copula defined as (Demarta and McNeil, 2005):
(B.1) |
where is the quantile function (or ICDF) of a standard univariate t-distribution with ν degrees of freedom, and ρ is the linear correlation coefficient and admits the following relationship to Kendall’s τ:
(B.2) |
which we used to parameterize the level of dependence. In our experiments we used the copula random sampler available in MATLAB’s Statistics Toolbox using ν=1 and τ=0.4 or τ=0.8.
The Frank Copula: this copula is a symmetric Archimedean copula defined as (Nelsen, 2006):
(B.3) |
where α admits the following relationship to Kendall’s τ:
(B.4) |
and D1(α) is a Debye function of first kind. In our experiments we used the copula random sampler available in MATLAB’s Statistics Toolbox using τ=0.4 or τ=0.8.
Circularly Symmetric (cs): this is a unique copula with a dependence structure that is not due to linear correlation (ρ=0). Its copula pdf is defined as (Perlman and Wellner, 2011):
(B.5) |
Non-Linear Circular Symmetry (nlcs): based on the cs copula above, the nlcs copula also has a dependence structure that is not due to linear correlation (ρ=0). Its pdf is defined as (Perlman and Wellner, 2011):
(B.6) |
Ferguson copulas: this class of copula pdf is defined as (Ferguson, 1995):
(B.7) |
where g(·) is a well-defined pdf in the interval [0,1]. In our experiments, we are interested in dependence structure that is not due to linear correlation (i.e., we want ρ=0). Thus, we choose g(z) to be symmetric about 1/2. Furthermore, in order to explore a different distribution form, we choose g(z) to be oscillatory with two positive peaks and zero-valued valleys in z ∈ [0,1] as .
Table B.1 below gives detailed descriptions of the distribution choices used to generate the source pool described in Section 2.3.
Table B.1:
Joint Source Number | Source Type | Copula | Marginal pdf of S2i | Mixing Coeff Association | |
---|---|---|---|---|---|
[S1i ~ tν=8(μ=0,σ=1)] | Type | ρ | |||
1 | 1 | Independence | same as S1i | Identical | 1 |
2 | 2 | cs | same as S1i | Identical | 1 |
3 | 2 | cs | same as S1i | Identical | 1 |
4 | 3 | Independence | gev(mode = 0) | Identical | 1 |
5 | 3 | Independence | gev(μ = 0) | Identical | 1 |
6 | 4 | Independence | same as S1i | Correlated | 0.3 |
7 | 4 | Independence | same as S1i | Correlated | 0.9 |
8 | 5 | cs | gev(μ = 0) | Identical | 1 |
9 | 6 | Independence | gev(μ = 0) | Correlated | 0.3 |
10 | 7 | cs | same as S1i | Correlated | 0.3 |
11 | 8 | cs | gev(μ = 0) | Correlated | 0.3 |
12 | 1 | Independence | same as S1i | Identical | 1 |
13 | 2 | t(ν=1,τ=0.4) | same as S1i | Identical | 1 |
14 | 2 | t(ν=1,τ=0.8) | same as S1i | Identical | 1 |
15 | 3 | Independence | gev(mode = 0) | Identical | 1 |
16 | 3 | Independence | gev(μ = 0) | Identical | 1 |
17 | 4 | Independence | same as S1i | Correlated | 0.3 |
18 | 4 | Independence | same as S1i | Correlated | 0.9 |
19 | 5 | t(ν=1,τ=0.4) | gev(μ = 0) | Identical | 1 |
20 | 6 | Independence | gev(μ = 0) | Correlated | 0.3 |
21 | 7 | t(ν=1,τ=0.4) | same as S1i | Correlated | 0.3 |
22 | 8 | t(ν=1,τ=0.4) | gev(μ = 0) | Correlated | 0.3 |
23 | 1 | Independence | same as S1i | Identical | 1 |
24 | 2 | Frank(τ=0.4) | same as S1i | Identical | 1 |
25 | 2 | Frank(τ=0.8) | same as S1i | Identical | 1 |
26 | 3 | Independence | gev(mode = 0) | Identical | 1 |
27 | 3 | Independence | gev(μ = 0) | Identical | 1 |
28 | 4 | Independence | same as S1i | Correlated | 0.3 |
29 | 4 | Independence | same as S1i | Correlated | 0.9 |
30 | 5 | Frank(τ=0.4) | gev(μ = 0) | Identical | 1 |
31 | 6 | Independence | gev(μ = 0) | Correlated | 0.3 |
32 | 7 | Frank(τ=0.4) | same as S1i | Correlated | 0.3 |
33 | 8 | Frank(τ=0.4) | gev(μ = 0) | Correlated | 0.3 |
34 | 1 | Independence | same as S1i | Identical | 1 |
35 | 2 | nlcs | same as S1i | Identical | 1 |
36 | 2 | nlcs | same as S1i | Identical | 1 |
37 | 3 | Independence | gev(mode = 0) | Identical | 1 |
38 | 3 | Independence | gev(μ = 0) | Identical | 1 |
39 | 4 | Independence | same as S1i | Correlated | 0.3 |
40 | 4 | Independence | same as S1i | Correlated | 0.9 |
41 | 5 | nlcs | gev(μ = 0) | Identical | 1 |
42 | 6 | Independence | gev(μ = 0) | Correlated | 0.3 |
43 | 7 | nlcs | same as S1i | Correlated | 0.3 |
44 | 8 | nlcs | gev(μ = 0) | Correlated | 0.3 |
45 | 1 | Independence | same as S1i | Identical | 1 |
46 | 2 | Ferg(g(z)) | same as S1i | Identical | 1 |
47 | 2 | Ferg(g(z)) | same as S1i | Identical | 1 |
48 | 3 | Independence | gev(mode = 0) | Identical | 1 |
49 | 3 | Independence | gev(μ = 0) | Identical | 1 |
50 | 4 | Independence | same as S1i | Correlated | 0.3 |
51 | 4 | Independence | same as S1i | Correlated | 0.9 |
52 | 5 | Ferg(g(z)) | gev(μ = 0) | Identical | 1 |
53 | 6 | Independence | gev(μ = 0) | Correlated | 0.3 |
54 | 7 | Ferg(g(z)) | same as S1i | Correlated | 0.3 |
55 | 8 | Ferg(g(z)) | gev(μ = 0) | Correlated | 0.3 |
56 | 9 | Independence | Gaussian noise | Identical | 1 |
Appendix C
Equation Chapter (Next) Section 1
The Metropolis-Hastings Algorithm
The Metropolis-Hastings (M-H) algorithm approach works by constructing a time-reversible Markov chain whose stationary probabilities are π(d)=q(d|λ). The first step is to define a neighboring system over to determine which are the neighboring states of current state (permutation) d=i. The neighboring system we adopt here is the one defined by swapping any two voxel locations. In other words, the current spatial configuration of voxel values corresponds to the current state and the swap of any two voxel values defines a new spatial configuration and, thus, a new state, which is considered an immediate neighbor of the current state. It is not hard to show that under this definition the number of neighbors for any given d is . Next, we consider how to control the probability of moving between neighboring states in the discrete search space in a way that yields a stochastic process analogous to a random-walk over .
First, consider a Markov chain with transition matrix T for which p(i,j), indicates the probability of moving from state i to a neighbor state j. Then define a Markov chain (sequence) {Xn, n>0} as follows: when Xn=i, generate a random sample (a new state) from a random variable X with probability P[X = j|i] = p(i,j), j = 1, …,b. Here, and we set P[X = j|i] to be uniformly distributed over all b immediate neighbors of i and 0 otherwise. If the new sampled state is X = j then set Xn+1 = j with probability α(i,j) and Xn+1=i with probability 1 − α(i,j). It can be shown (Ross, 2009, p.262) that by setting
(C.1) |
the Markov chain will have a stationary probability π(d)=q(d|λ) which is also the limiting probability when n→∞, as we desired. However, setting λ to a very high value makes the convergence to the limiting distribution considerably slower. A practical work-around which gives rise to simulated annealing (SA) is to replace the fixed λ by a gradually increasing one, such as , which leads to
(C.2) |
Appendix D
Equation Chapter (Next) Section 1
The variance of a sum of independent random variables is the sum of the variances of each random variable. Also, the variance of a random variable multiplied by a constant is the constant squared times the variance of the random variable. Lastly, the second moment of a random variable Y corresponds to its average power and can be written as:
(D.1) |
where denotes the expectation operator, and Var[·] denotes the variance operator.
With that in mind, recall that the marginal distributions of all joint sources are known and set to have unit variance. Thus, we can use the fact that all joint sources are independent to derive the variance of any particular mixture Yml, l =1,…,N, where, is the l-th noiseless linear mixture from modality m, and
(D.2) |
where amli are the mixing coefficients of Smi for Xml.
Let Xml =Yml +amlZm, where Zm is a zero-mean, unit variance random Gaussian noise independent of Yml, and aml is the mixing coefficient of Zm for the l -th mixture Xml. Our goal is to determine aml such that the final SNR is 30dB. We formulate the SNR as the ratio of average powers between the corrupted signal Xml and the noise Zm :
(D.3) |
Solving for αml:
(D.4) |
The constant aml can be different for each modality because may differ for each modality m. This can happen whenever the factor levels are not set to “same” for either factor 3 (the mixing coefficients) or factor 2 (the marginal pdfs). We would like to enforce the same mixing coefficient for all modalities (i.e., aml =al, m=1,…,M). We can still guarantee a SNR of at least 30dB in each modality by selecting the smallest , m=1,…,M, to compute al.
Appendix E
Equation Chapter (Next) Section 1
The NMSE, for each modality m=1,…,M, is simply the mean square error (MSE), i.e., the average power of the error between estimated and ground-truth Sm, normalized by the average power of Smi:
(E.1) |
NMI is the mutual information MI(·) between and Smi, divided (i.e., normalized) by the sum of the entropies H(·) of and Smi:
(E.2) |
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
The authors declare no conflict of interest.
References
- Allen E, Erhardt E, Wei Y, Eichele T, Calhoun V, 2012. Capturing inter-subject variability with group independent component analysis of fMRI data: a simulation study. NeuroImage 59, 4141–4159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Amari S, Cichocki A, Yang HH, 1996. A New Learning Algorithm for Blind Signal Separation. Advances in Neural Information Processing Systems 8, 757–763. [Google Scholar]
- Andersen AH, Gash DM, Avison MJ, 1999. Principal component analysis of the dynamic response measured by fMRI: a generalized linear systems framework. 17, 795–815. [DOI] [PubMed] [Google Scholar]
- Anderson M, Fu G-S, Phlypo R, Adalı T, 2014. (to appear). Independent vector analysis: Identification conditions and performance bounds. IEEE Trans. Signal Processing [Google Scholar]
- Banks J, 1998. Handbook of Simulation: principles, methodology, advances, applications, and practice, 1 ed. Wiley-Interscience, New York, NY. [Google Scholar]
- Bovik A, 2009. The Essential Guide to Image Processing, 1 ed. Academic Press, Boston, MA. [Google Scholar]
- Calhoun V, Potluru V, Phlypo R, Silva R, Pearlmutter B, Caprihan A, Plis S, Adalı T, 2013. Independent component analysis for brain fMRI does indeed select for maximal independence 19th Annual Meeting of the Organization for Human Brain Mapping (OHBM), Seattle, WA. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Calhoun VD, Adali T, 2006. ‘Unmixing’ Functional Magnetic Resonance Imaging with Independent Component Analysis. IEEE Eng.in Medicine and Biology 25, 79–90. [DOI] [PubMed] [Google Scholar]
- Calhoun VD, Adali T, Kiehl KA, Astur RS, Pekar JJ, Pearlson GD, 2006a. A Method for Multi-task fMRI Data Fusion Applied to Schizophrenia. Hum.Brain Map 27, 598–610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Calhoun VD, Adali T, Liu J, 2006b. A Feature-based Approach to Combine Functional MRI, Structural MRI, and EEG Brain Imaging Data. Proc.EMBS [DOI] [PubMed] [Google Scholar]
- Calhoun VD, Adali T, Pearlson GD, Pekar JJ, 2001. A Method for Making Group Inferences from Functional MRI Data Using Independent Component Analysis. Hum.Brain Map 14, 140–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cardoso JF, 1998. Multidimensional Independent Component Analysis. Proc.IEEE Int.Conf.Acoustics, Speech, Signal Processing (ICASSP) [Google Scholar]
- Casella G, Berger R, 2002. Statistical Inference, 2 ed. Duxbury Press, Thomson Learning. [Google Scholar]
- Correa NM, Eichele T, Adalı T, Li Y-O, Calhoun VD, 2010. Multi-set canonical correlation analysis for the fusion of concurrent single trial ERP and functional MRI. Neuroimage 50, 1438–1445. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Correa NM, Li Y-O, Adalı T, Calhoun VD, 2008. Canonical Correlation Analysis for Feature-Based Fusion of Biomedical Imaging Modalities and Its Application to Detection of Associative Networks in Schizophrenia. IEEE J Sel Topics Signal Process 2, 998–1007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daubechies I, Roussos E, Takerkart S, Benharrosh M, Golden C, D’Ardenne K, Richter W, Cohen J, Haxby J, 2009. Independent component analysis for brain fMRI does not select for independence. PNAS 106, 10395–10397. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Demarta S, McNeil A, 2005. The t Copula and Related Copulas. International Statistical Review 73, 111–129. [Google Scholar]
- Du W, Li H, Li X-L, Calhoun V, Adalı T, 2011. ICA of fMRI data: Performance of Three ICA Algorithms and the Importance of Taking Correlation Information into Account IEEE Symp. on Biomedical Imaging, Chicago, IL, pp. 1573–1576. [Google Scholar]
- Du Y, Fan Y, 2013. Group information guided ICA for fMRI data analysis. NeuroImage 69, 157–197. [DOI] [PubMed] [Google Scholar]
- Embrechts P, Mcneil A, Straumann D, 2002. Correlation and Dependency in Risk Management: Properties and Pitfalls Risk Management: Value at Risk and Beyond. Cambridge University Press, Cambridge, UK, pp. 176–223. [Google Scholar]
- Erhardt E, Allen E, Wei Y, Eichele T, Calhoun V, 2012. SimTB, a simulation toolbox for fMRI data under a model of spatiotemporal separability. NeuroImage 59, 4160–4167. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ferguson T, 1995. A class of symmetric bivariate uniform distributions. Stat Pap 36, 31–40. [Google Scholar]
- Friman O, Cedefamn J, Lundberg P, Borga M, Knutsson H, 2001. Detection of neural activity in functional MRI using canonical correlation analysis. Magn Reson Med 45, 323–330. [DOI] [PubMed] [Google Scholar]
- Groves AR, Beckmann CF, Smith SM, Woolrich MW, 2011. Linked independent component analysis for multimodal data fusion. NeuroImage 54, 2198–2217. [DOI] [PubMed] [Google Scholar]
- Himberg J, Hyvarinen A, Esposito F, 2004. Validating the independent components of neuroimaging time series via clustering and visualization. Neuroimage. 22, 1214–1222. [DOI] [PubMed] [Google Scholar]
- Hyvärinen A, Hurri J, Hoyer P, 2009. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, 1 ed. Springer. [Google Scholar]
- Joe H, 1997. Multivariate Models and Multivariate Dependence Concepts, 1 ed. CRC Press, New York, NY. [Google Scholar]
- Kim T, Lee I, Lee T-W, 2006. Independent vector analysis: Definition and algorithms 40th Asilomar Conf. Signals, Systems, Comput, Pacific Grove, CA, pp. 1393–1396. [Google Scholar]
- Li X, Adali T, 2010. Blind spatiotemporal separation of second and/or higher-order correlated sources by entropy rate minimization. ICASSP, Dallas, TX, pp. 1934–1937. [Google Scholar]
- Li Y, Adali T, Calhoun VD, 2007a. Estimating the Number of Independent Components for Functional Magnetic Resonance Imaging Data. Hum. Brain Map [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Y-O, Adali T, Calhoun VD, 2007b. A Multivariate Model for Comparison of Two Datasets and its Application to FMRI Analysis In: Diamantaras K (Ed.), Machine Learning for Signal Processing. IEEE, Thessaloniki, pp. 217–222. [Google Scholar]
- Li Y-O, Adalı T, Wang W, Calhoun VD, 2009. Joint Blind Source Separation by Multiset Canonical Correlation Analysis. IEEE Trans Signal Process 57, 3918–3929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin Q, Liu J, Zheng Y, Liang H, Calhoun VD, 2010. Semiblind spatial ICA of fMRI using spatial constraints. Hum. Brain Map 31, 1076–1088. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin Q, Zheng Y, Yin F, Liang H, Calhoun VD, 2007. A Fast Algorithm for One-unit ICA-R. Information Sciences 177, 1265–1275. [Google Scholar]
- Liu J, Calhoun VD, 2006. A novel approach to analyzing fMRI and SNP data via parallel independent component analysis. Proc.SPIE [Google Scholar]
- Lu W, Rajapakse JC, 2000. Constrained Independent Component Analysis Adv.Neural Inf.Proc.Sys MIT Press, Cambridge, MA, pp. 570–576. [Google Scholar]
- Macchi O, Moreau E, 1995. Self-Adaptive Source Separation by Direct or Recursive Networks. Proc. International Conference on Digital Signal Processing, Limasol, Cyprus, pp. 122–129. [Google Scholar]
- McKeown MJ, Makeig S, Brown GG, Jung TP, Kindermann SS, Bell AJ, Sejnowski TJ, 1998. Analysis of fMRI Data by Blind Separation Into Independent Spatial Components. Hum.Brain Map 6, 160–188. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moeller JR, Strother SC, 1991. A Regional Covariance Approach to the Analysis of Functional Patterns in Positron Emission Tomographic Data. J Cereb Blood Flow Metab 11, A121–A135. [DOI] [PubMed] [Google Scholar]
- Myers RH, Montgomery DC, Anderson-Cook CM, 2009. Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 3 ed. John Wiley & Sons, Hoboken, NJ. [Google Scholar]
- Nelsen RB, 2006. An Introduction to Copulas, 2 ed. Springer; New York, New York, NY. [Google Scholar]
- Oberkampf W, Roy C, 2010. Verification and Validation in Scientific Computing, 1 ed. Cambridge University Press, Cambridge. [Google Scholar]
- Peng H, Long F, Ding C, 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27, 1226–1238. [DOI] [PubMed] [Google Scholar]
- Perlman M, Wellner J, 2011. Squaring the Circle and Cubing the Sphere: Circular and Spherical Copulas. Symmetry 3, 574–599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ross SM, 2006. Simulation, 4 ed. Academic Press, New York, NY. [Google Scholar]
- Ross SM, 2009. Introduction to Probability Models, 10 ed. Academic Press, New York, NY. [Google Scholar]
- Silva R, Calhoun V, 2012. An Assessment of the Limitations of Joint ICA in Multimodal Data Fusion 18th Annual Meeting of the Organization on Human Brain Mapping, HBM 2011. OHBM, Beijing, China. [Google Scholar]
- Sklar A, 1959. Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris 8, 229–231. [Google Scholar]
- Sui J, Adali T, Pearlson G, Yange H, Sponheim SR, White T, Calhoun VD, 2010. A CCA + ICA based model for multi-task brain imaging data fusion and its application to schizophrenia. Neuroimage 51, 123–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sui J, Adali T, Pearlson GD, Clark VP, Calhoun VD, 2009. A method for accurate group difference detection by constraining the mixing coefficients in an ICA framework. Hum. Brain Map 30, 2953–2970. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sui J, Adali T, Yu Q, Chen J, Calhoun VD, 2012. A Review of Multivariate Methods for Multimodal Fusion of Brain Imaging Data. Journal of Neuroscience Methods 204, 68–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sui J, Pearlson G, Caprihan A, Adali T, Kiehl KA, Liu J, Yamamoto J, Calhoun VD, 2011. Discriminating schizophrenia and bipolar disorder by fusing fMRI and DTI in a multimodal CCA+ joint ICA model. Neuroimage 57, 839–855. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.