The Top-K Tau-Path Screen for Monotone Association: Yu et al. 2011
Abstract
A pair of variables that tend to rise and fall either together or in opposition
are said to be monotonically associated. For certain phenomena, this tendency
is causally restricted to a subpopulation, as, for example, an allergic reaction
to an irritant. Previously, Yu et al. (2011) devised a method of rearranging
observations to test paired data to see if such an association might be present in
a subpopulation. However, the computational intensity of the method limited
its application to relatively small samples of data, and the test itself only judges
if association is present in some subpopulation; it does not clearly identify the
subsample that came from this subpopulation, especially when the whole sam-
ple tests positive. The present paper adds a “top-K” feature (Sampath and
Verducci (2013)), based on a multistage ranking model, that identifies a concise
subsample that is likely to contain a high proportion of observations from the
subpopulation in which the association is supported. Computational improve-
ments incorporated into this top-K tau-path (TKTP) algorithm now allow the
method to be extended to thousands of pairs of variables measured on sample
sizes in the thousands. A description of the new algorithm along with measures
of computational complexity and practical efficiency help to gauge its poten-
tial use in different settings. Simulation studies catalog its accuracy in various
settings, and an example from finance illustrates its step-by-step use.
Keywords: Kendall’s tau, nonparametric correlation, ranking, copula,
mixtures of distributions, unsupervised classification, computational
complexity
1. Introduction
The emergence of “big data” has provided the opportunity for scientists to
focus on key subpopulations. These are often identified by having higher mean
values on some relevant aspect that is measurable by variables in the database.
Here we change the criterion to having higher correlational values.
∗ Corresponding author.
Email address: [email protected] (Srinath Sampath)
There are many potential applications. In cancer research, the mechanisms
whereby some cancers become chemoresistant are inherent in their gene net-
works, not all of which have been identified. One way to discover novel gene
networks is to look for correlations between pairs of genes across many different
types of cancer cells. The subpopulation of cancers that possess this network
will match the subpopulation that supports the correlations. In marketing, a
goal is to identify subpopulations that will be most responsive to a particular
campaign. In finance, association between a company’s earnings and a key com-
modity price may be restricted to certain macroeconomic conditions which are
not known or directly measured. In all these cases, and more, the basic prob-
lem is to identify, as concisely as possible, a highly correlated subsample that
includes most of the members of the subpopulation that supports association
between the variables. This is thus a kind of unsupervised classification, with a
novel criterion.
Previous work on this problem is relatively recent. Yu et al. (2011) developed
the tau-path algorithm for re-sorting a sample so that the Kendall tau coefficient
is (optimally) decreasing. In checking the behavior of the tail of this tau-path,
they devised a test for correlation in a subsample versus the null hypothesis of
independence. A limitation of operating under this null hypothesis is that the
whole population tends to test positive when the supporting subpopulation is
large or when the association is strong, which is not uncommon in datasets with
large sample sizes. Bamattre et al. (2015) discussed some procedures for the
much more difficult problem of testing for heterogeneity of association in which
the distribution under the null hypothesis may have many different forms. A
more modestly-scoped approach, suggested in Sampath and Verducci (2013),
assumes a positive test for overall correlation, and attempts to discover if this
correlation really is supported only within a proper subpopulation. They applied
a multistage model of agreement between rankings to the re-sorted sample to
indicate the point where predictive information stops. This method, called
the top-K tau-path (TKTP), is intended to provide good coverage of the true
underlying supporting population, but this has not been investigated in detail
until now, mainly because the original implementation was not scalable to large
datasets.
The current paper covers two issues not previously resolved. The first is
extending the tau-path method to large samples. The original code was practical
only up to sample sizes n < 100; the new code easily handles n up to 10,000
on a personal computer. This extension to large samples makes it possible to
address the second issue: how to characterize the performance of TKTP beyond
the simple task of testing against independence. Simulation studies demonstrate
the relationship between the size of the selected subsample and the percentage
of the true subpopulation members included.
Section 2 provides a thorough description of TKTP. Section 3 details the
design of the new algorithms, including their computational complexity and
runtime efficiency. Performance characteristics of their coverage proportions
appear in Section 4, and a new example of predicting trends in stock prices in
Section 5. The last section of the paper summarizes all findings and discusses
mathematical underpinnings.
2.2. Seriation and the Tau-Path
A simple yet powerful clustering technique to discover hidden structure and
underlying patterns in data involves the permutation of multidimensional data
along a single dimension. In archeology, the chronological ordering of similar
relics could yield new insight into the different eras based on the similarity of
adjacent objects. Known as seriation, this technique finds application in a wide
variety of fields of scientific inquiry. Liiv (2010) provides an in-depth history of
this technique and its use in many disciplines. In unsupervised learning, seri-
ation often manifests in the form of matrix reordering, where multidimensional
matrices are permuted based on the magnitude of a particular characteristic,
potentially leading to insights into underlying associations in the data.
In the context of Kendall’s τ , Yu et al. (2011) permuted the sample concor-
dance matrix in a very specific manner described in the next section. By means
of either of two backward conditional search (BCS) algorithms, the Fast Back-
ward Conditional Search (FastBCS) and the Full Backward Conditional Search
(FullBCS) algorithms, they selected increasing subsets of the sample matrix in
such a way that the associated tau coefficients become monotonically decreas-
ing; the tau coefficient corresponding to the 2 × 2 subset is therefore at least
as large as the one corresponding to the 3 × 3 subset, and so on. The resulting
ordering of the sample matrix is referred to as the tau-path.
Using the tau-path statistic, Yu et al. (2011) also propose a test for independence
between the two variables versus the alternative that the variables are correlated
only in a subpopulation. The test works by simulating a large number of
independent samples of size n from independent X and Y populations,
establishing the tau-path for each sample, calculating the percentiles of Kendall's
tau coefficient at each step along the tau-path, and picking a fixed percentile
to serve as the boundary at each step. For any fixed percentile q, the boundary
points {qi | i = 2, . . . , n} are calibrated so that no more than a proportion α of
the random tau-paths exceed the boundaries. Note that the same simulation
may be achieved by sampling pairs of independent permutations of {1, . . . , n}.
The tau-path approach thus provides an ordering of the sample bivariate
data along the path of strongest association. So far the focus of this subsection
has been on uncovering positive associations between the two variables; should
a negative association between the variables be sought, this is readily accom-
plished by multiplying the Y ’s by −1 before performing the tau-path analysis.
Another aspect of the ordered data that is worth computing is the stopping
stage or endpoint of association: the stage in the tau-path ordered data where
association ends and randomness sets in. Especially in these times when large
datasets are readily available, and speed of computation and immediacy of re-
sults are important considerations, the need to prune the full dataset down to
the appropriately-sized subset for further analysis is critical. The key notion
here is that as one progresses down the tau-path from most- to least-associated
sample pairs, the two data elements in any pair should provide largely consis-
tent information about their strength of association. When adjacent data pairs
routinely provide inconsistent information about their strength of association,
it is possible that the level of association has deteriorated and randomness has
taken over. In the context of the tau-path ordering, multistage ranking models
provide an appropriate framework for extracting the endpoint of agreement.
2.3. The Multistage Ranking Model and the TKTP Stopping Rule
The primary challenge with traditional models that assess the degree of as-
sociation between two ranked lists is their mathematical complexity. Whereas a
multinomial distribution provides a full characterization of the population, the
large number of ensuing parameters leads to intractable likelihood equations,
even when the number of objects being ranked is small. Fligner and Verducci
(1988) proposed a family of forward multistage ranking models that vastly re-
duce the computational effort and also provide clear insights into the underlying
processes. The multistage ranking framework is outlined here and the TKTP
stopping rule then developed.
Given a group of n objects, the relative value of the objects to an assessor—
whether qualitative or numerical—can be expressed by the assignment of ranks
to the objects. Two representations of the assessor's preferences are popular.
The first is the ranking or permutation, a vector of length n that gives, for each
object on the original list, the assessor's respective rank, namely the quantity
(1 + the number of other objects that the assessor considers superior). This
representation of the ranked preferences is denoted

π = [π(1), . . . , π(n)].

The second is the ordering, the inverse permutation π⁻¹, which lists the objects
from best to worst:

π⁻¹(j) = i if π(i) = j, i = 1, . . . , n, j = 1, . . . , n.
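As a small illustration (ours, not notation from the paper turned into code), the ordering can be computed directly from the ranking:

    // Given a ranking pi (ranks 1..n), return the ordering pi^{-1}:
    // inv[j-1] = i means the object ranked j is object i.
    static int[] inverse(int[] pi) {
        int[] inv = new int[pi.length];
        for (int i = 0; i < pi.length; i++) {
            inv[pi[i] - 1] = i + 1;   // ranks are 1-based, arrays 0-based
        }
        return inv;
    }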
Our interest lies in measuring the extent of agreement between the prefer-
ences of two assessors who independently rank the n objects, and in determining
the last stage where agreement still exists and random noise follows, the stop-
ping stage or endpoint of agreement between the two assessors. This is achieved
by anchoring one assessor’s ranking of the objects as the reference ranking or
ground truth, and then examining the stage-wise departures of the second asses-
sor’s ranks from the corresponding reference ranks; the latter is the generated
or observed ranking π. Following the approach given by Fligner and Verducci
(1988), the computations for the first stage and a generic stage j are given below.
The stages are assumed to be independent, and penalties and truncated geo-
metric probabilities are assigned to the second assessor’s ranks commensurate
with their stage-wise deviations from the first assessor’s ranks as follows:
Stage 1: Since all n objects are available, the second assessor selects the
(1 + v)th best object overall, as specified by π −1 , and incurs the penalty V1 = v
with probability

P(V1 = v) = [(1 − r1)/(1 − r1^n)] r1^v ,  v = 0, . . . , n − 1,  0 < r1 < 1.

Stage j: With n − j + 1 objects remaining, the second assessor selects the
(1 + v)th best of the remaining objects and incurs the penalty Vj = v with
probability

P(Vj = v) = [(1 − rj)/(1 − rj^(n−j+1))] rj^v ,  v = 0, . . . , n − j,  0 < rj < 1.
The vector {V1 , . . . , Vn−1 } thus captures the stage-wise deviations between
the two sets of ranks, and is referred to as the discordance or penalty vector
between the ranking schemes. The ratios {rj} likewise represent the stage-wise
disagreement between the two assessors. Rather than focus on rj , the
transformation

θj = − log rj , j = 1, . . . , n − 1

is used; a higher estimated θj reveals closer agreement between the assessors'
ranks at stage j.
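A minimal Java sketch (ours) of sampling a stage-wise penalty from this truncated geometric distribution; the method name and signature are our own:

    import java.util.Random;

    // At a stage with m available objects, V follows a geometric distribution
    // with ratio r, truncated to {0, ..., m-1}. A larger theta = -log(r)
    // concentrates the penalty near 0, i.e., closer agreement.
    static int samplePenalty(double r, int m, Random rng) {
        double norm = (1.0 - r) / (1.0 - Math.pow(r, m)); // truncation constant
        double u = rng.nextDouble();
        double cum = 0.0;
        for (int v = 0; v < m; v++) {
            cum += norm * Math.pow(r, v);  // P(V = v) = norm * r^v
            if (u <= cum) return v;
        }
        return m - 1;                      // guard against rounding error
    }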
It remains to determine the stopping stage or endpoint K, the last stage
in the list of objects where the second assessor’s rank still shows some form
of agreement with that of the first. Beyond this stage, the second assessor
posts random stage-wise ranks relative to the first, and the previous association
between the assessors’ ranks has faded. Subsection 3.1 of Sampath and Verducci
(2013) provides the moving average maximum likelihood estimator (MAMLE)
for the rj , calculated iteratively over overlapping backward-looking windows of
stages i, j − w + 1 ≤ i ≤ j, of fixed width w, until all the stages are exhausted.
The resulting curve {r̂j } and its inverted counterpart {θ̂j } are locally smooth
estimators of the stage-wise agreement between the two assessors.
The rejection region for the MAMLE, which enables the stopping rule, is now
straightforward. Briefly, a large number of simulations are generated from the
multistage model under the assumption that all the θj ’s are 0, and stage-wise
θ̂j ’s are computed using the MAMLE method. For each stage j, the (1 − α)th
quantile q(j) is computed. An estimate of the endpoint K is given by
K̂ = the earliest stage at which θ̂_{K̂+w} > q(K̂), and θ̂_j > q(j) for at most
α percent of the remaining j > K̂ + w.    (1)
Additional diagnostics and guidelines on using the stopping rule are provided
in Subsection 3.2 of Sampath and Verducci (2013). An alternative approach,
based on a data-analytic method, is covered by Hall and Schimek (2012).
The two techniques discussed above accomplish very distinct goals; the tau-
path algorithm organizes bivariate data along the path from strongest to weakest
association, whereas the top-K MAMLE approach detects the endpoint of as-
sociation between two ranked lists. Synthesis of the algorithms is accomplished
by using the tau-path ordering from the first algorithm as the ordering of the
stages of the second.
Here the framework for the multistage model is that the pair (X, Y ) of
variables plays the role of a rater, ranking the observations {1, . . . , n} according
to the tau-path ordering. The actual values of Kendall’s tau coefficient Tk over
the first k stages along the tau-path are used to infer the strength of association
at each stage. Choosing the stages according to the tau-path guarantees that
the estimated θj will be decreasing. This happens because the number of new
discordances introduced at each tau-path stage must be non-decreasing. In
this context, stopping rule (1) identifies observations in the remaining stages to
have (X, Y ) association no greater than noise. Since the ordered data is used as
the input into the top-K algorithm, the computed endpoint gives the stopping
stage where the strongly associated subsample ends. This subsample may be
studied further to try to infer the underlying subpopulation. Section 5 provides
an example where several pairs of variables are used to estimate a common
subsample whose import may be identified by the behavior of an explanatory
variable on this subsample.
For large datasets, it is infeasible to examine computationally all pairs of
variables and the population subsets within each pair. Critical to the design
of the top-K and tau-path algorithms was an understanding of their order of
growth. The variety of software languages and hardware architectures we con-
sidered required care not to bias the specification of the algorithms toward
a particular implementation context since programming languages, compilers,
processor architectures, and memory hierarchies all affect the performance of
any implementation.
Figure 1: The TKTP algorithm.
The TKTP algorithm is the top-level algorithm. A flowchart for this algo-
rithm appears in [Figure 1]. In Subsection 3.1 we provide an overview of the
major steps of the TKTP algorithm and the functions it invokes. Since FastBCS
and its optimized successor FastBCS2 are critical to the runtime performance of
TKTP, these algorithms are described in detail in Subsection 3.1.2 along with
a simple example showing how they work. In Subsection 3.2 we analyze the
computational complexity of the FastBCS* algorithms, and in Subsection 3.3
describe their efficiency and the various strategies we explored for implement-
ing them. The pseudocode used to describe the algorithms mostly follows the
conventions specified in Cormen et al. (2009).
For each pair (X, Y ) of variables, the TKTP algorithm invokes four major
functions to find a subset of strongly associated observations within the pair:
FastBCS, TaupathMAMLE, GenerateRejectBoundary, and StoppingPoint:
Algorithm 1 TKTP(X, Y)
Input: A random sample S of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
Output: A possibly empty subset s of observations in S in which Xs and Ys are strongly associated.
Require: length(X) = length(Y)
1: WINDOW ← 5
   SIGLVL ← 0.05
   NSIM ← 10000
// Find the permuted order of the observation pairs and the tau-path for X and Y using FastBCS*.
2: pi, taupath ← FastBCS2(X, Y)
// Get moving average maximum likelihood estimators (MAMLEs) for X and Y.
3: θ̂ ← TaupathMAMLE(taupath, WINDOW)
// Simulate the MAMLE boundary at the (1 − α)th quantile of the null hypothesis.
4: boundary ← GenerateRejectBoundary(n, WINDOW, NSIM, SIGLVL)
// Estimate the stopping point K̂.
5: K̂ ← StoppingPoint(θ̂, boundary, SIGLVL)
6: return {pi[j] | [j ≥ K̂] ∧ [θ̂j > q(j)]}, where q(j) is the (1 − α)th quantile of θ̂j
Algorithm 2 FastBCS(X, Y)
Input: A random sample of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
Output: The final permutation pi and the tau-path statistic derived from the concordance matrix constructed from pi.
Require: n = length(X) = length(Y); n > 1
// The function indexOf(a, v) returns the index of the value v in array a.
1: C ← the concordance matrix for (X1, Y1), . . . , (Xn, Yn)
   pi ← [1 . . . n]
   i ← n
   tie[1 . . . n] ← ∅
2: repeat                                          ▷ backward conditional search
3:   for j ← 1 . . . i do                          ▷ backward elimination
4:     colsum[j] ← Σ_{u=1..i} C[pi[u], pi[j]]
5:   minsum ← minimum value in colsum
6:   ties_i ← {pi[l], l ← 1 . . . i | colsum[pi[l]] = minsum}
7:   if |ties_i| > 1 then
8:     tie[i] ← ties_i
9:     r_tie ← randomly select a member of ties_i
10:    l ← indexOf(pi, r_tie)
11:    transpose pi[i] and pi[l]
12:  else
13:    l ← indexOf(pi, ties_i[1])
14:    transpose pi[i] and pi[l]
15:  for k ← n downto (i + 1) do                   ▷ tie logic
16:    if pi[i] ∈ tie[k] then
17:      qi ← [Σ_{u=1..i} C[pi[u], pi[i]], Σ_{u=1..i+1} C[pi[u], pi[i]], . . . , Σ_{u=1..k} C[pi[u], pi[i]]]
18:      qk ← [Σ_{u=1..i} C[pi[u], pi[k]], Σ_{u=1..i+1} C[pi[u], pi[k]], . . . , Σ_{u=1..k} C[pi[u], pi[k]]]
19:      if All(qk[u] ≥ qi[u]) and Any(qk[u] > qi[u]), u = 1 . . . k − i + 1 then
20:        transpose pi[i] and pi[k]
21:        i ← k − 1                               ▷ forward step
22:        tie[m] ← [ ], ∀ m ≤ k
23:        GOTO repeat
24:  i ← i − 1                                     ▷ backward step
25: until Σ_{j=1..i} Σ_{k=1..i} C[pi[k], pi[j]] = i(i − 1)
26: k ← i
27: T[2] ← · · · ← T[k] ← 1
28: return the final permutation pi and the tau-path statistic {T[2] . . . T[n]}
Algorithm 3 TaupathMAMLE(tauPath, window)
Input:
- The tau-path of a random sample of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
- The size of the window.
Output: An array of MAMLEs.
Require: n = length(tauPath); n > 1
1: ma.theta ← meanDiffs ← [ ]
2: diffs ← [0]
3: for i ← 2 . . . n do
4:   totaldiscord[i] ← ((1 − tauPath[i])/2) · (i(i − 1)/2)
5:   diffs[i] ← totaldiscord[i] − totaldiscord[i − 1]
6: for i ← 0 . . . (n − window − 1) do
7:   meanDiffs ← diffs[(i + 1) . . . (i + window)]
8:   ma.theta[i + window + 1] ← theta.scale(meanDiffs, i, window, n + 1)
9: return ma.theta
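The discordance bookkeeping in lines 3–5 can be made concrete with a short Java sketch (ours); the rescaling helper theta.scale is not specified in the text and is therefore omitted:

    // The tau value T_i over the first i tau-path stages determines the
    // total number of discordant pairs, D_i = ((1 - T_i)/2) * i(i-1)/2;
    // successive differences give the new discordances at each stage.
    static double[] stageDiscordances(double[] tauPath) {
        int n = tauPath.length;        // tauPath[i] holds T_{i+1} (0-based)
        double[] total = new double[n];
        double[] diffs = new double[n];
        for (int i = 1; i < n; i++) {
            int stage = i + 1;         // stages 2..n
            total[i] = (1.0 - tauPath[i]) / 2.0 * stage * (stage - 1) / 2.0;
            diffs[i] = total[i] - total[i - 1];
        }
        return diffs;
    }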
Algorithm 4 StoppingPoint(xyMamles, boundary, significanceLevel)
Input:
- The MAMLEs generated for the X and Y variables.
- The rejection boundary under the null hypothesis.
- The significance level of the rejection boundary.
Output: The estimated value of the stopping point K, or 0 if none was found.
Require: n = length(xyMamles) = length(boundary); n > 1
1: exceed ← [ ]
   j ← 1
2: for i in 1 . . . n do
3:   if xyMamles[i] > boundary[i] then
4:     exceed[j] ← i
5:     j ← j + 1
6: numExceed ← length(exceed)
7: sort(exceed)
8: for i in 1 . . . length(exceed) do
9:   tail ← n − exceed[i]
10:  left ← numExceed − i
11:  if left ≤ significanceLevel ∗ tail then
12:    return exceed[i]
13: return 0
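A direct Java rendering (ours, with 0-based indices) of Algorithm 4 may help fix the ideas:

    import java.util.ArrayList;
    import java.util.List;

    // Scan the stages whose MAMLE exceeds the boundary and return the
    // earliest one for which at most significanceLevel of the later stages
    // also exceed it.
    static int stoppingPoint(double[] xyMamles, double[] boundary, double sig) {
        int n = xyMamles.length;
        List<Integer> exceed = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (xyMamles[i] > boundary[i]) exceed.add(i);
        }
        for (int i = 0; i < exceed.size(); i++) {
            int tail = n - 1 - exceed.get(i);   // stages after this one
            int left = exceed.size() - 1 - i;   // later exceedances
            if (left <= sig * tail) return exceed.get(i);
        }
        return -1;  // no stopping point found (0 in the 1-based pseudocode)
    }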
The concordance matrix C is constructed from the observation pairs input to
the algorithm. The values of the permuted index pi used to track the reordering
of the observations at each stage are initialized to the natural ordering of X
and Y. At each stage i, an i × i permuted matrix will be derived from C using
pi; this is shown in the figures below as Ci. The row and column headings of
the permuted concordance matrix are the values indexed by pi and refer to the
observation IDs shown in the headings of C. The tie array maintains a tieset
for each stage; each of its elements is initialized to the empty tieset.
The search begins with backward elimination. The subset Si of observations
pi[1 . . . i] for any stage i is fixed. The goal of backward elimination is to find
and eliminate the observation in the subset pi[1 . . . i] that contributes least to
Kendall’s tau coefficient—or increases most the value of D in Equation (2)—of
the remaining subset of observations Si−1 . Elimination is done by transposing
this observation with the observation at pi[i]. The result of the transposition
guarantees that Ti−1 ≥ Ti in this stage. If pi[i], the observation represented by
the i th stage, is not a member of any prior stages’ tiesets, a backward step is
taken by setting i to i − 1 [24]. These steps are repeated until either Kendall’s
tau coefficient for stage i becomes 1, or i = 2 [25].
A tie occurs at stage i if more than one observation could be eliminated
from Si [7]. The tied observations form a tieset that is associated with stage
i [8]. The observations in tiesets may be reexamined in the tie logic of later
stages [16]. To complete the backward elimination at stage i in the presence of
ties, one observation is arbitrarily chosen and eliminated from Si [9]. Because
the tau-path is constructed in reverse order, the algorithm cannot determine
the effect of future rearrangements on local maximal monotonicity at the time
a selection from the tieset is made. In subsequent stages k < i, the tie logic
will reexamine the choice made in stage i [17–18] and take a forward step [21],
if necessary, to ensure monotonicity in the tau-path. To convey an intuitive
understanding for how the algorithm works, we walk through a simple example.
Initialization:
  C ← concordance matrix of X, Y
  pi ← [1, 2, 3, 4, 5]
  i ← 5
  tie[1 . . . 5] ← ∅

      C    1   2   3   4   5
      1    0  -1  -1   1  -1
      2   -1   0  -1   1  -1
      3   -1  -1   0  -1   1
      4    1   1  -1   0  -1
      5   -1  -1   1  -1   0

Figure 2: Example: the initialization of the concordance matrix C calculated for a pair of
variables X and Y with five observations.
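To make the initialization concrete, here is a minimal Java sketch (ours, not the authors' implementation) that builds a concordance matrix and recovers Kendall's tau from it; the five-observation data are hypothetical:

    // C[i][j] = sign((X_i - X_j)(Y_i - Y_j)), with zeros on the diagonal.
    // Summing the upper triangle and dividing by n(n-1)/2 gives Kendall's tau.
    public class Concordance {
        static int[][] concordanceMatrix(double[] x, double[] y) {
            int n = x.length;
            int[][] c = new int[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (i != j)
                        c[i][j] = (int) Math.signum((x[i] - x[j]) * (y[i] - y[j]));
            return c;
        }

        public static void main(String[] args) {
            double[] x = {1, 2, 3, 4, 5};   // hypothetical sample
            double[] y = {3, 4, 1, 5, 2};
            int[][] c = concordanceMatrix(x, y);
            int sum = 0, n = x.length;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)
                    sum += c[i][j];
            System.out.println("tau = " + 2.0 * sum / (n * (n - 1.0)));
        }
    }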
Stage 5

Backward Elimination: Calculate the column sums. Create a list of ties:
ties_i ← [1, 2, 3, 5]. Transpose pi[5] and pi[1].

      C5        1   2   3   4   5
      1         0  -1  -1   1  -1
      2        -1   0  -1   1  -1
      3        -1  -1   0  -1   1
      4         1   1  -1   0  -1
      5        -1  -1   1  -1   0
      colsums  -2  -2  -2   0  -2

Tie Logic: There are no tiesets to examine in stage 5.
Backward step: i ← 4.

Stage 4

Backward Elimination: Calculate the column sums. Create a list of ties:
ties_i ← [5, 2, 3, 4]. Transpose pi[4] and pi[1].

      C4        5   2   3   4   1
      5         0  -1   1  -1  -1
      2        -1   0  -1   1  -1
      3         1  -1   0  -1  -1
      4        -1   1  -1   0   1
      1        -1  -1  -1   1   0
      colsums  -1  -1  -1  -1

Tie Logic: pi[4] = 4 is not a member of any tieset for k > 4.
Backward step: i ← 3.

Figure 3: Example: the main steps taken in stages 5 and 4, each resulting in a backward step.
Stage 3 is shown in [Figure 4]. The steps in backward elimination find the sole
observation 3 to contribute the least. No transposition is required. To ensure
that pi[3] = 3 locally optimizes the tau-path, the tie logic [15–23] reexamines
all tiesets from k down to i + 1 in which 3 is a member. Since observation 3 is
a member of the tieset from stage 5, the processing continues by attempting to
determine whether the tau coefficient at stage 3 could have been improved had
the observation at pi[k = 5] been chosen instead. The calculations are shown in
the tie logic of [Figure 4]. In effect, the algorithm compares the cumulative sums
of stage 3 of two different concordance matrices, shown as C3 and C3′ in [Figures
4(a) and 4(b)], respectively. C3 represents the permuted order from the choice
that was made in stage 5. C3′ represents the concordance matrix that would
have resulted from choosing the observation pi[k = 5], and is constructed by a
transposition of the observations at pi[3] and pi[5]. The cumulative sum qi is
calculated from C3 and qk from C3′ [17–18]. Two tests must pass [19] if C3′ is to
be considered the better choice: All(qk[u] ≥ qi[u]) and Any(qk[u] > qi[u]) for
u = 3, . . . , 5. Both tests succeed and the algorithm continues with C3′ as shown
in [Figure 4(c)].
Stage 3

Backward Elimination: Calculate column sums. No ties to save.
Tie Logic: pi[3] = 3 is a member of the stage 5 tieset. Determine the locally
optimal observation.

      C3        4   2   3   5   1
      4         0   1  -1  -1   1
      2         1   0  -1  -1  -1
      3        -1  -1   0   1  -1
      5        -1  -1   1   0  -1
      1         1  -1  -1  -1   0
      colsums   0   0  -2

(a) Calculate the cumulative sum qi of pi[i = 3] from C3.

      C3′       4   2   1   5   3
      4         0   1   1  -1  -1
      2         1   0  -1  -1  -1
      1        -1  -1   0   1  -1
      5        -1  -1  -1   0   1
      3         1  -1  -1  -1   0

(b) Calculate the cumulative sum qk of pi[k = 5] from C3′.

      qi   qk
      -2    0
      -1   -1
      -2   -2

(c) The All(qk ≥ qi) AND Any(qk > qi) test passes.

Figure 4: Example: the main steps taken in stage 3 that result in a forward step to stage 4.
The final stages, as shown in [Figure 5], follow a series of backward steps
from stage 4 to stage 2 where the search ends [25]. The algorithm outputs pi
and the tau-path statistic calculated from the concordance matrix generated
from pi [28].
Stage 4f

Backward Elimination: There are no ties to save.
Tie Logic: pi[4] = 4 is not a member of any tieset for k > 4.

      C4f       4   2   1   5   3
      4         0   1   1  -1  -1
      2         1   0  -1  -1  -1
      1         1  -1   0  -1  -1
      5        -1  -1  -1   0   1
      3        -1  -1  -1   1   0
      colsums   1  -1  -1  -3

Stage 3f

Backward Elimination.

Figure 5: Example: the second pass through stages 4f and 3f resulting from a forward step.
Algorithm 5 FastBCS2(X, Y)
Input: A random sample of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
Output: The final permutation pi and the tau-path statistic derived from the concordance matrix constructed from pi.
Require: n = length(X) = length(Y); n > 1
// The function indexOf(a, v) returns the index of the value v in array a.
1: C ← the concordance matrix for (X1, Y1), . . . , (Xn, Yn)
   pi ← [1 . . . n]; tie[1 . . . n] ← ∅
   i ← n
2: for j ← 1 . . . n do
3:   colsum[j] ← Σ_{u=1..n} C[pi[u], pi[j]]
4: repeat
5:   minsum ← minimum value in colsum[1 . . . i]; ties_i ← {pi[l], l ← 1 . . . i | colsum[pi[l]] = minsum}
6:   if |ties_i| > 1 then
7:     tie[i] ← ties_i
8:     r_tie ← randomly select a member of ties_i
9:     l ← indexOf(pi, r_tie)
10:    transpose pi[i] and pi[l]; transpose colsum[i] and colsum[l]
11:  else
12:    l ← indexOf(pi, ties_i[1])
13:    transpose pi[i] and pi[l]; transpose colsum[i] and colsum[l]
14:  for j ← 1 . . . i do
15:    colsum[j] ← colsum[j] − C[pi[i], pi[j]]
16:  for k ← n downto (i + 1) do
17:    if pi[i] ∈ tie[k] then
18:      qi ← [Σ_{u=1..i} C[pi[u], pi[i]], Σ_{u=1..i+1} C[pi[u], pi[i]], . . . , Σ_{u=1..k} C[pi[u], pi[i]]]
19:      qk ← [Σ_{u=1..i} C[pi[u], pi[k]], Σ_{u=1..i+1} C[pi[u], pi[k]], . . . , Σ_{u=1..k} C[pi[u], pi[k]]]
20:      if All(qk[u] ≥ qi[u]) and Any(qk[u] > qi[u]), u = 1 . . . k − i + 1 then
21:        for j ← i . . . k do
22:          for l ← 1 . . . j do
23:            colsum[l] ← colsum[l] + C[pi[j], pi[l]]
24:        transpose pi[i] and pi[k]; transpose colsum[i] and colsum[k]
25:        i ← k − 1
26:        tie[m] ← [ ], ∀ m ≤ k
27:        for j ← 1 . . . i do
28:          colsum[j] ← colsum[j] − C[pi[i], pi[j]]
29:        GOTO repeat
30:  i ← i − 1
31: until Σ_{j=1..i} Σ_{k=1..i} C[pi[k], pi[j]] = i(i − 1)
32: k ← i
33: T[2] ← · · · ← T[k] ← 1
34: return the final permutation pi and the tau-path statistic {T[2] . . . T[n]}
3.2. Computational Complexity
The analysis of an algorithm requires computational models of the target
platforms that will be used for execution of the implementations. Our anal-
ysis of FastBCS was based on a random-access machine (RAM) model where
operations are assumed to be performed strictly in sequence in constant time,
and a data-parallel model, which defines computation as a sequence of
instructions, or kernels, applied synchronously to sets of processing elements,
each with access to private data memory. The programming model for parallelism will
be discussed in more detail in Subsection 3.3.3 in the context of performance
data collected from multicore and manycore devices. No attempts were made
to model memory architectures, although the hierarchies and distribution of
memory play an important role in the runtime cost of implementations of the
FastBCS* algorithms.
Line  Statement                                             ~ Frequency              Big-Oh
2     repeat
3       for j ← 1 . . . i do
4         for u ← 1 . . . i do
4a          colsum[j] ← colsum[j] + C[pi[u], pi[j]]         E[XR] · i²               O(n³)
5       minsum ← min value in colsum
6       for l ← 1 . . . i do
6a        if (colsum[pi[l]] = minsum)
6b          add pi[l] to ties_i                             E[XR] · Σ_{k=1..i} k     O(n²)
7       if (|ties_i| > 1)
8         tie[i] ← ties_i
9         r_tie ← randomly select an observation
10        l ← indexOf(pi, r_tie)                            E[XR]/2 · n/2            O(n²)
11        transpose pi[i] and pi[l]
12      else
13        l ← indexOf(pi, ties_i[1])                        E[XR]/2 · n/2            O(n²)
14        transpose pi[i] and pi[l]
15      for k ← n downto (i + 1) do
16        if pi[i] ∈ tie[k]                                 E[XR] · Σ_{k=1..n−2} k   O(n²)
17          qi ← cumsum(C, pi[i], pi[i], pi[k]), . . .
18          qk ← cumsum(C, pi[k], pi[i], pi[k]), . . .
19          if All(qk ≥ qi) and Any(qk > qi)
20            transpose pi[i] and pi[k]
21            i ← k − 1
22            for 1 . . . k do
22a             tie[m] ← []
23            GOTO repeat
24      i ← i − 1
25    until Σ_{j=1..i} Σ_{k=1..i} C[pi[k], pi[j]] = i(i − 1)  E[XR] · i²             O(n³)
The number of backward steps is at most n − 1, since the repeat loop will
surely terminate [25] when i = 1. Whether a forward step is taken depends on
the number and size of tiesets created, on whether the observation at stage i
is a member of a tieset generated in a previous stage [7, 16] and on whether a
correction to the tau-path under construction needs to be made [19]. A forward
step erases all tiesets up to stage k [22], and restarts the backward search at
k − 1 [23].
Since backward elimination is performed every iteration and summing the
columns [3–4a] dominates the order of growth with O(n3 ), it was important to
understand the contribution of backward and forward steps to the frequency of
n      NR (avg)  NR (max)  NT (avg)  Tieset size (avg)  iH (avg)  NT/NR (avg)
200    177.44    199       86.92     3.6                22.30     0.49
400    367.22    394       182.59    3.8                32.71     0.50
600    559.76    589       280.69    3.9                40.74     0.50
800    753.51    794       380.03    4.1                47.41     0.50
1000   948.04    1028      480.20    4.1                53.38     0.51
1200   1142.79   1209      580.73    4.2                58.89     0.51
1400   1338.15   1410      680.78    4.3                63.66     0.51
1600   1533.73   1601      782.30    4.3                68.15     0.51
1800   1729.84   1788      883.69    4.4                72.81     0.51
2000   1925.58   2008      986.21    4.4                76.75     0.51

Table 2: Statistics of the count of iterations of the repeat loop (NR), the count of the times
ties were found (NT), the size of a tieset (Tieset size), the value of i when the algorithm halted
(iH), and the proportion of times ties were found within the repeat loop (NT/NR).
the outermost loop. Statistics for the frequency of the loop and tiesets are shown
in [Table 2]. The number of forward steps is negligible: empirically it was at
most 4 for n ≤ 2,000 (Column NFS in [Table 4]). Since the average number
of iterations (Column NR) is less than n, we assumed E[XR] ≈ n. Although
tiesets are generated about half of the time (Column NT/NR), they are small
in size (Column “Tieset size”), decreasing the likelihood that an observation
associated with stage i will be a member of a previous tieset [17].
Any increase in the number of iterations due to forward steps is offset by early
termination when permutations result in fully concordant matrices before i = 1.
The distributions of i at termination for different n are shown in [Figure 8].
Figure 6: The distribution of the average kth reset position in a forward step.
Figure 8: The distribution of the expected value of i when the algorithm halts.
      C         1   2   3   4   5
      1         0  -1  -1   1  -1
      2        -1   0  -1   1  -1
      3        -1  -1   0  -1   1
      4         1   1  -1   0  -1
      5        -1  -1   1  -1   0
      colsums  -2  -2  -2   0  -2  ←

(a) The initialization of colsums for C.

      C5        5   2   3   4   1
      5         0  -1   1  -1  -1
      2        -1   0  -1   1  -1
      3         1  -1   0  -1  -1
      4        -1   1  -1   0   1
      1        -1  -1  -1   1   0  ←
      colsums  -1  -1  -1  -1  -2  ←

(b) The colsums for C5 after transposing pi[5] and pi[1] in stage 5.

Figure 9: An example of how FastBCS2 distributes the column sum calculations. (a) The
initial values for the column sums are calculated before the repeat loop [Line 4]; 25 operations
are performed. (b) The values of the row pi[5] = 1 are subtracted from the colsums of C to
produce the colsums of C5 [Lines 14–15]; 5 operations are performed.
In principle, the update is a single vector operation between the first i elements
of the column sum array of C5 and the transposed row pi[5] = 1. In practice, it
is not straightforward. The next section discusses why.
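In its idealized form, the update is the one-pass subtraction sketched below (our illustration, with 0-based indices, not the paper's implementation):

    // Rather than recomputing every column sum over all i rows at each stage
    // (O(i^2) work), carry the sums forward and subtract the single row of
    // the observation eliminated at stage i (O(i) work).
    static void updateColumnSums(int[][] c, int[] pi, long[] colsum, int i) {
        int removed = pi[i];                 // observation just moved to position i
        for (int j = 0; j < i; j++) {
            colsum[j] -= c[removed][pi[j]];  // one subtraction per remaining column
        }
    }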
To exploit the parallelism made possible by the introduction of Java 8, we
created a faithful reproduction in Java of the original R implementation, which
became the baseline for runtime performance analysis.
3.3.3. Methods and Contexts
The OpenCL standard defines a computing model that views a system as a
collection of compute devices, such as CPUs or GPUs, each of which may contain
multiple processing elements. OpenCL 1.0, based on the C99 specification, was
a language for describing program fragments called kernels that execute on
compute devices. Although we considered C++/OpenCL 1.0 for specifying
kernels, we chose Java. In Java version 8, Oracle introduced parallel streams,
lambda expressions, and language extensions that provide a declarative style
of programming for parallelism. An experimental Java library, Aparapi (Frost
(2015)), was also being developed at AMD to provide support for GPU
parallelization. OpenCL, Java 8, and Aparapi became the foundation for our
initial work on sequential-parallel algorithms.
The first attempt to parallelize FastBCS was based on the Aparapi library.
The sequential-parallel implementation was specified in Java. At runtime in the
Java virtual machine (JVM), the GPU-parallel fragments compiled into Java
bytecode were converted by Aparapi into the instructions and dispatched to the
computing system’s GPU device. For the software developer, this eliminated
the need to write and debug the OpenCL code that managed the dispatching
of work items into queues. Our initial attempts at speedup were disappointing.
Because of updates to the critical data structures for each stage of FastBCS2,
the cost of data transfers and Aparapi’s overhead were significant and offset
the gains from GPU computation. We refocused our efforts on what could be
achieved using the extensions to Java 8 in multicore systems.
The gains in FastBCS2 came from redesigning FastBCS to move the initial
calculation of column sums outside the scope of the repeat loop and to distribute
only the updates across the algorithm, reducing the order of growth in the
average case from O(n³) to O(n²), as evidenced in [Table 5] of Subsection 3.3.4.
This redesign also made possible the use of the parallel streams framework in
Java 8. The FastBCS2
implementation is parameterized to execute either sequentially or with parallel
fragments on multicore CPUs. In sequential-parallel mode, the parallel streams
are executed. The creation and initialization of the concordance matrix, column
summation, and column sum updates attempt to run on as many cores as are
available.
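As an illustration of the parallel-stream decomposition just described (ours, not the authors' code), each column sum over the first i rows is independent of the others and can be mapped across cores with Java 8 parallel streams:

    import java.util.stream.IntStream;

    // Compute the i column sums of the permuted concordance matrix in
    // parallel; the fork-join pool uses as many cores as are available.
    static long[] columnSums(int[][] c, int[] pi, int i) {
        return IntStream.range(0, i)
                .parallel()
                .mapToLong(j -> {
                    long s = 0;
                    for (int u = 0; u < i; u++) {
                        s += c[pi[u]][pi[j]];
                    }
                    return s;
                })
                .toArray();
    }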
To separate the effects of algorithm design, programming language, com-
putational platform, and parallelism on runtime performance, the data were
collected in six contexts on three platforms as described in [Table 4]. A context
consisted of an implementation of either the FastBCS or FastBCS2 algorithm
(FB or FB2); the programming language used to author the implementation
and the runtime environment or virtual machine in which it was executed (R
or Java); the hardware and operating system (P1, P2, or P3); and an execution
mode as either sequential (s) or sequential-parallel (sp).
3.3.4. Results
Three performance metrics were generated for each context: runtime, speed-
up, and the doubling ratio. The runtime measures the total amount of work
Context      Algorithm  Lang.  SDK      OS     Mode  Device    Cores  Mem.
FB-RP1s      FastBCS    R      RStudio  OSX    s     MP-I7     4      16GB
FB-JP2s      FastBCS    Java   Java 8   OSX    s     MP-I7     4      16GB
FB2-JP2s     FastBCS2   Java   Java 8   OSX    s     MP-I7     4      16GB
FB2-JP3s     FastBCS2   Java   Java 8   Linux  s     EC2-Xeon  16     32GB
FB2-JP2sp    FastBCS2   Java   Java 8   OSX    sp    MP-I7     4      16GB
FB2-JP3sp    FastBCS2   Java   Java 8   Linux  sp    EC2-Xeon  16     32GB

Table 4: The six execution contexts. Context: The contexts are encoded as algorithm (FB
or FB2), programming language and virtual machine (R or Java), the computing platform
(P1, P2, or P3), and the execution mode (“s” or “sp”). Algorithm: The algorithm being
implemented. Lang.: The statistical language R, or Java. SDK: The Software Development
Kit used to load, or compile and run, the implementations was RStudio 0.99.441 for R and the
Oracle Java Development Kit 1.8.0 b25-17, respectively. OS: The operating system was either
OSX 10.11 or Linux Ubuntu 14.04.2 LTS. Mode: The execution mode indicates whether the
implementation was run sequentially (s) or as sequential-parallel (sp). Device: The Apple
computer was a MacBook Pro with a 2.2 GHz Intel i7 (MP-I7). The larger multicore system
was an Amazon EC2 c4.8xlarge instance with a 2.9 GHz Intel Xeon E5-2666 v3 (EC2-Xeon).
Cores: The number of cores in the microprocessor. Mem.: The size (GB) of random access
memory available to the CPUs.
performed by all processors. The input for each implementation was a pair of
variables X and Y from a uniform distribution with input sizes that doubled
from n = 250 to 32,000. The execution of an implementation of the FastBCS*
algorithm was performed five times for each pair and the average time in mil-
liseconds recorded. No other applications were running on the platform during
the test. The results are given in [Table 5] and quadratic fits to average runtimes
depicted in [Figure 10], with relative performance given in [Table 6]. Due to
memory and time constraints, we were unable to complete the R implementation
of FastBCS (FB-RP1s) for n > 4, 000.
Table 5: A table of the runtimes of different implementations of FastBCS* for pairs of variables
X and Y generated from a uniform distribution, with sizes doubling from n = 250 to 32,000.
Each cell is the average runtime in milliseconds of five executions. Cells marked with an
asterisk were not completed because of memory and time constraints on the platform used
for RStudio.
The speedups shown in [Table 7] were derived from the runtime values.
Speedup is defined as S(n) = T ∗ (n)/T (n), where T ∗ (n) is the execution time of
a reference context and T (n) the execution time of the context being compared
to it. The use of a sequential implementation as a reference is especially helpful
in determining the benefit of parallelism (Rauber and Rünger (2013), p. 162).
Figure 10: Quadratic fits to average runtimes of FastBCS2.
3.3.5. Discussion
The first two rows of [Tables 5 and 6] show dramatic improvements in
speedup for large n (n = 32,000) from just two changes:
Vector length (n)
Context 250 500 1,000 2,000 4,000 8,000 16,000 32,000
FB-RP1s 0.038 0.024 0.013 0.012 0.009 * * *
FB-JP2s 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP2s 3.2 8.2 9.9 21.9 38.0 78.3 140.4 239.7
FB2-JP3s 5.3 7.9 17.4 36.2 60.2 103.9 204.4 407.0
FB2-JP2sp 0.5 2.5 8.9 27.4 61.8 145.7 298.7 539.9
FB2-JP3sp 0.4 2.1 6.9 22.1 62.3 140.1 351.8 955.1
Table 6: Speedup comparing all contexts using FB-JP2s as the reference implementation.
Cells marked with an asterisk could not be computed because the runtime values were not
generated.
Table 7: A table of speedup comparisons using different references to show: (a) the effect
on speedup of the redesigned algorithm FastBCS2 compared with FastBCS; (b) the effect on
speedup of a 16-core system as compared with a 4-core system for sequential or sequential-
parallel execution as well as factors such as processor clock speeds, hardware architecture,
and cache sizes and speeds; and (c) the effect on speedup of sequential execution as compared
with sequential-parallel execution on the same platform for two different platforms.
1) the use of a static language and its development and runtime environments
(Java, SDK, and JVM) over a dynamic language (R and RStudio), and 2) minor
changes to the algorithm's design based on a frequency analysis and an empirical
study of the algorithm's behavior under randomization of the inputs. Comparing
the doubling ratios of FB-RP1s with FB-JP2s, and of FB-JP2s with FB2-JP2s,
at n = 2,000 and 4,000 shows that most of the improvement to the order of
growth comes from the algorithm's redesign.
Further improvements come from parallelization through the use of the Java
8 parallel streams for larger n. We see this by comparing FB2-JP2s with FB2-
JP2sp and FB2-JP3s with FB2-JP3sp. In [Tables 7c and 8] speedup and the
order of growth begin to improve on the 4-core processor at n = 2, 000, and
on the 16-core processor at n = 8, 000. More powerful hardware improves
the speedup for a sequential implementation (rows FB2-JP2s and FB2-JP3s of
[Table 7b]) but has no apparent effect on the order of growth (rows FB2-JP2s
and FB2-JP3s for n ≥ 2, 000 in [Table 8]).
Algorithm 6 CalculateDoublingRatios(nA, nB)
Input: nA and nB are the lower and upper limits of n.
1: ITERATIONS ← 5
2: i ← 1
3: n ← nA
4: fnc ← a function that invokes an implementation of FastBCS*
5: Ti ← timeFastBCS(fnc, n, ITERATIONS)
6: while n ≤ nB do
7:   i ← i + 1
8:   n ← 2n
9:   Ti ← timeFastBCS(fnc, n, ITERATIONS)
10:  ratios[i] ← Ti/Ti−1
11:  if ratios[i] approaches a limit then
12:    break
13: return ratios
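A runnable Java rendering (ours) of Algorithm 6's measurement loop; the timing helper and the "approaches a limit" test are our own simplifications:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Time an implementation at sizes n, 2n, 4n, ... and report the ratios
    // T(2n)/T(n); lg of the ratio estimates the exponent b of an O(n^b)
    // order of growth, as used in Table 8.
    public class DoublingRatios {
        static double timeMillis(Consumer<Integer> fnc, int n, int iterations) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) fnc.accept(n);
            return (System.nanoTime() - start) / (iterations * 1e6);
        }

        static List<Double> doublingRatios(Consumer<Integer> fnc, int nA, int nB) {
            List<Double> ratios = new ArrayList<>();
            double prev = timeMillis(fnc, nA, 5);
            for (int n = 2 * nA; n <= nB; n *= 2) {
                double t = timeMillis(fnc, n, 5);
                ratios.add(t / prev);   // lg(ratio) estimates the growth exponent
                prev = t;
            }
            return ratios;
        }
    }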
Vector length (n)
Context 250 500 1,000 2,000 4,000 8,000 16,000 32,000
FB-RP1s NA 2.9 3.7 3.1 3.2 * * *
FB-JP2s NA 2.3 2.8 3.0 2.9 3.1 3.0 3.1
FB2-JP2s NA 0.9 2.5 1.9 2.1 2.1 2.1 2.3
FB2-JP3s NA 1.7 1.7 2.0 2.2 2.3 2.0 2.1
FB2-JP2sp NA 0.1 1.0 1.4 1.7 1.9 1.9 2.2
FB2-JP3sp NA 0.0 1.1 1.4 1.4 2.0 1.6 1.6
Table 8: The binary logarithm of the doubling ratio, or lg(T (2n)/T (n)), where T is the
execution time of an implementation of FastBCS* for X and Y from a uniform distribution.
The values of the cells represent the power b = lg(ratio) which is an acceptable approximation
of the order of growth O(nb ) when used to predict performance using the doubling hypothesis.
Cells marked with an asterisk could not be computed because the runtime values were not
generated.
Figure 11: Corresponding population values of Kendall’s τ and Spearman’s ρ from Frank and
Gaussian copulae.
sample size is so large. Similarly, samples from the Gaussian copulae were gen-
erated with fixed values of ρ and the corresponding population τ was estimated
from these large samples. Note that for any copula the population value of
Spearman’s ρ is the same as the population (Pearson) product moment corre-
lation between the uniformly distributed marginal variables.
In the figure, the solid black line is based on points generated from Frank
copulae, and the solid red line is based on points generated from Gaussian
copulae. Dashed lines are inversions of the solid ones, and colors indicate either
ρ versus τ (black) or τ versus ρ (red). Although there exist distributions in
which the relationship between population τ and ρ may be quite different from
that given above, this relationship remains quite stable across the Frank and
Gaussian copulae, as seen from the slight difference between solid and dashed
lines of the same color.
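For the Gaussian copula specifically, the two population coefficients are linked to the correlation parameter ρ of the underlying bivariate normal by Kruskal's closed-form identities, τ = (2/π) arcsin(ρ) and ρS = (6/π) arcsin(ρ/2). The short Java sketch below (ours, offered only as a cross-check of the Gaussian curve in [Figure 11]) tabulates these values:

    // Tabulate tau and Spearman's rhoS from the Gaussian copula parameter rho.
    public class GaussianCopulaRelations {
        public static void main(String[] args) {
            for (int k = 1; k <= 9; k += 2) {
                double rho = k / 10.0;
                double tau = 2.0 / Math.PI * Math.asin(rho);
                double rhoS = 6.0 / Math.PI * Math.asin(rho / 2.0);
                System.out.printf("rho = %.1f  tau = %.3f  rhoS = %.3f%n",
                        rho, tau, rhoS);
            }
        }
    }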
Although a Frank and a Gaussian copula may share the same values of τ and
ρ, the distributions themselves are generally quite distinct. Density contours of
the Frank and Gaussian copulae, both having τ = 0.5 and ρ = 0.7, are depicted
in [Figure 12].
Now consider the performance of TKTP when applied first to various mix-
tures of Frank and independent copulae. This first simulation study follows
a 3 × 3 × 2 complete block design, with three levels of sample sizes (n =
100, 500, 1000), three levels of association strength in the subsample (τ = 0.3,
0.5, 0.7) and two levels of mixing proportion of the associated subsample (p =
0.3, 0.4). For each of the 18 combinations of these parameters, 100,000 samples
are generated from the Frank mixture, and the results of the (α = 0.05; w = 3)
TKTP are summarized in [Figure 13].
Figure 12: Density contours of Frank (left) and Gaussian (right) copulae, both with τ = 0.5
and ρ = 0.7. Contours depict density levels of 0.5 to 5 in steps of 0.05.
Figure 13: Summary of TKTP (α = 0.05, w = 3) simulations under Frank mixtures of copulae.
Increasing the strength of association τ produces a substantial increase in both
mean stopping point and mean coverage for any fixed sample size, which is
captured by the rate of coverage, as given by the slopes of the lines in [Figure
13]. Formally, under any given copula, the rate of coverage is defined by
[Equation (3)], which clearly does not depend on the sample size. [Table 9]
gives the rates of coverage under Frank and Gaussian copulae that are matched
for the same strength of association in the subsample.
Table 9: Rates of Coverage [Equation (3)] under Frank and Gaussian copulas.
The simulations compare coverage rates for TKTP for α in {0.01, 0.05, 0.10}
to see how stable the coverage probability is in this range. It is enough to
consider just Frank copulae with n = 500, mixing proportion = 0.3, and τ in
{0.3, 0.5, 0.7}. The standard errors of these rates are all about 0.1, suggesting
little real difference in the TKTP performance; the comparable Gaussian version
of [Figure 13] is virtually indistinguishable from that figure, and is thus omitted.
Nevertheless, the actual coverage is systematically less in all 18 Gaussian copulae
under study than in their comparable Frank copulae, suggesting that the small
differences indicate a real but slight underperformance of TKTP in the Gaussian
setting.
In all cases, the rate of coverage seems to capture the effectiveness of the
method. From the table, it remains stable under different types of mixtures; and,
from [Figure 13], it is not affected by small changes in the mixing proportion.
The control parameters α and w of the TKTP have different tasks. The
value α changes the distribution of the stopping point, with smaller values of
α inducing stochastically larger distributions of the stopping point. The effect
is much the same as selecting an operating point in a classification method.
The value α = 0.05 gives a rough midrange value, and lowering it slightly will
increase both the stopping point and the coverage probability, but lowering it
too far will greatly increase the probability of wrongly including uncorrelated
samples in the selection set. For large sample sizes n ≥ 500, we recommend
α = 0.05 because these quantiles are well estimated in the simulations and the
stopping point does not change much for other values of α near 0.05. The control
parameter w helps to stabilize the performance of TKTP. It is useful mostly for
small sample sizes near n = 100, as studied here; we recommend w = 3, the
smallest practical smoothing.
Figure 14: Cumulative number of stocks associated with 6-month lagged oil.
The dataset is composed of weekly prices of 26-week lagged S&P 500 stocks
versus oil over a decade from 12/31/2004 to 1/2/2015 (523 time points, 449
stocks with 10 years of data). The initial screen follows a two-step selection
process:
Step 1. Find stocks with partial association over the most weeks. There
were 523 − 26 = 497 time point pairs (oil time, stock time) available. The first
screen was to select stocks associated over at least 60% of these time point pairs.
There were 273 such stocks correlated with 6-month oil over periods of at least
6 years. See [Figure 14] which suggests this cutoff.
Step 2. Find the correlations (both Pearson and Kendall) between oil and
lagged stock prices over the TKTP-selected time point pairs. There are 77 of
these with Pearson correlations over 0.9, listed in [Table 10].
[Figure 15] depicts the full 10-year time courses of the weekly price of the
five stocks with the largest Pearson correlations in [Table 10] relative to the
6-month prior price of oil. These five stocks come from different industries:
PRGO–pharmaceuticals, CRM–software, ARG–chemicals, CTSH–IT services,
SRCL–Commercial Services. This suggests that oil price predicts, at least in
pattern, an aspect of the general economy. Although the huge crash of oil in
2008 was disproportionate, it did predict a downturn in each of the stocks six
months later.
More importantly than general association, the TKTP method helps to identify
the time periods over which association is strongest.
FMC COL TJX CNP BEN XOM SIAL PH NBL XRAY EMC
0.901 0.902 0.902 0.902 0.902 0.902 0.902 0.902 0.903 0.903 0.905
ISRG APH PCAR DHR RRC FTI FAST ROK DRI EL RAI
0.905 0.905 0.906 0.906 0.907 0.907 0.907 0.908 0.908 0.908 0.908
PWR SPG PCLN LH CL AMZN HCP UTX ROST BLK IBM
0.909 0.909 0.910 0.910 0.910 0.911 0.911 0.912 0.913 0.913 0.913
EW SO GWW KO PCP GIS ALTR BBBY CERN INTU DTV
0.914 0.914 0.916 0.916 0.916 0.916 0.916 0.917 0.917 0.917 0.917
FLS MCD XEL WAT CTXS SWK WEC ES AZO CSX CMI
0.918 0.918 0.918 0.918 0.918 0.919 0.919 0.920 0.920 0.921 0.922
FDO PSA DLTR ORCL YUM FFIV RL ACN ED O EMR
0.922 0.922 0.923 0.923 0.924 0.924 0.924 0.926 0.926 0.926 0.928
VTR PX ESRX AAPL RHT VAR SRCL CTSH ARG CRM PRGO
0.928 0.928 0.930 0.930 0.932 0.932 0.933 0.934 0.935 0.936 0.942
Table 10: Pearson correlations for 77 S&P 500 stocks correlated with 6-month prior oil price
over different restricted periods of at least 6-year duration.
Pooling information
from several stocks strengthens the degree of confidence about the time periods
selected, but the pooling needs to be done methodically. The idea is to pool
information from stocks that agree strongly with each other.
A complete-linkage clustering approach is used to find the stocks from which
to pool time-point inclusions. The agreement measure used for pairs of time-
point sets is the Jaccard/Tanimoto coefficient J, defined for sets A and B as

J = |A ∩ B| / |A ∪ B| .
A cluster of 24 of the 273 Step 1-screened stocks has all pairs with J > 0.8.
These 24 stocks, listed in [Table 11], include three of the five stocks from [Figure
15], and 13 of the stocks from [Table 10].
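As an illustration (ours, not the paper's code), the agreement measure for two stocks' TKTP-selected time-point sets:

    import java.util.HashSet;
    import java.util.Set;

    // Jaccard/Tanimoto coefficient J = |A ∩ B| / |A ∪ B| between two
    // sets of selected time-point indices.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);                    // A ∩ B
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);                       // A ∪ B
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }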
Figure 15: Five stock prices associated with 6-month lagged oil: 2005-2014.
Averaged over the 24 stocks, the TKTP correlations with 6-month oil were 0.89
(Pearson) and 0.75 (Kendall). [Figure 16] illustrates, for each week, the number
of stocks having that week included by TKTP. The color coding follows the
rainbow, with red indicating inclusion in all 24 stocks; yellow-green, in 13 stocks;
and blue, in 0 stocks. When plotted under the price of oil, it appears that weeks
tend to be excluded during the most volatile periods for oil prices.
6. Discussion
The computational efficiencies attained by the new algorithms enable a whole
new approach to discovering key subpopulations from big data. Many pairs of
variables may be screened to see if there is a common subset of observations that
supports strong association between the pairs, and thus represents a subpopu-
lation supporting a whole network. This could be a subpopulation of cancers
supporting a gene-network for chemoresistance, or a subpopulation of time pe-
riods supporting an economic network for growth.
Although the methods here have been presented in the classical view of ana-
lyzing a random sample from a population, they also apply to any large dataset.
Figure 16: TKTP-selected weeks for oil price prediction of stock cluster prices.
Inference here is internal to the dataset treated as its own population. Identified
subsets correspond to a real subpopulation, but identifying the subpopulation
externally may not be straightforward. For the stock price example, the apho-
rism, “A rising tide lifts all boats,” seems to apply. A foretelling rise of oil
prices, ignoring brief periods of large volatility, signals a rise in tide, at least
in the harbor where the cluster of two dozen stocks is moored. It is not clear
which stocks will remain in the harbor, or what other, more specific, influences
will be in force.
The decision to focus on ranks rather than on the original numerical values
was motivated both by concerns for inferential robustness and insights from
recent mathematics. When moving between the database and external envi-
ronments, calibration issues often arise, but causal relationships persist in situ.
Thus rank associations found within a database subset are very likely to ap-
pear in corresponding external subpopulations. Among rank-based methods,
the multistage ranking model is very flexible since the number of parameters
increases with sample size.
Recent mathematical methods motivate the use of the multistage ranking
method. In the special case where all the population parameters {θj } are equal,
yielding the Mallows model, Starr (2009) has found a limiting distribution which,
under proper scaling, is a Frank copula. More general parameterizations of the
multistage model appear to converge to mixtures of Mallows models, which
cover a very wide range of ranking distributions. If this is true, the multistage
model for a pair (X, Y ) of variables ranking the n observations will converge to
a mixture of Frank copulae. A “signal” in such a mixture might correspond to a
mixture of components with large values of Kendall's tau, with “noise,” if
present, being represented by a mixture of nearly independent Frank copulae.
Thus, in this conceptual setting, TKTP screens out a nearly independent
component. Doing this on the basis of a single ranking is remarkable; finding
additional pairs of variables with the same limiting distribution, as in the
example of Section 5, adds to the accuracy. See Awasthi et al. (2014) and
Meilă and Chen (2012) for general approaches to mixtures of Mallows models.
References
Aldous, D., Diaconis, P., 1999. Longest increasing subsequences: from patience
sorting to the Baik-Deift-Johansson theorem. Bulletin of the American Math-
ematical Society 36, 413–432.
Awasthi, P., Blum, A., Sheffet, O., Vijayaraghavan, A., 2014. Learning mixtures
of ranking models, in: Advances in Neural Information Processing Systems,
pp. 2609–2617.
Baik, J., Deift, P., Johansson, K., 1999. On the distribution of the length of
the longest increasing subsequence of random permutations. Journal of the
American Mathematical Society 12, 1119–1178.
Bamattre, S., Hu, R., Verducci, J.S., 2015. Nonparametric testing for heteroge-
neous correlation. arXiv preprint arXiv:1504.05392.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2009. Introduction to
Algorithms. 3rd ed., The MIT Press.
Fligner, M.A., Verducci, J.S., 1988. Multistage ranking models. Journal of the
American Statistical Association 83, 892–901.
Liiv, I., 2010. Seriation and matrix reordering methods: An historical overview.
Statistical Analysis and Data Mining: The ASA Data Science Journal 3, 70–
91.
Meilă, M., Chen, H., 2012. Dirichlet process mixtures of generalized Mallows
models. arXiv preprint arXiv:1203.3496.
Nelsen, R.B., 2006. An Introduction to Copulas. 2nd ed., Springer
Science+Business Media, New York.
Sampath, S., Verducci, J.S., 2013. Detecting the end of agreement between
two long ranked lists. Statistical Analysis and Data Mining: The ASA Data
Science Journal 6, 458–471.
Sedgewick, R., Wayne, K., 2011. Algorithms. 4th ed., Addison-Wesley Profes-
sional.
Starr, S., 2009. Thermodynamic limit for the Mallows model on Sn . Journal of
Mathematical Physics 50, 095208:1–15.
Yu, L., Verducci, J.S., Blower, P.E., 2011. The tau-path test for monotone
association in an unspecified population: Application to chemogenomic data
mining. Statistical Methodology 8, 97–111.