
The Top-K Tau-Path Screen for Monotone Association

Srinath Sampath^a,∗, Adriano Caloiaro^b, Wayne Johnson^b, Joseph S. Verducci^c


a Hamilton Capital Management, Columbus, Ohio, USA
b Myatt & Johnson, Inc., Miami Beach, Florida, USA
c The Ohio State University, Columbus, Ohio, USA
arXiv:1509.00549v1 [stat.AP] 2 Sep 2015

Abstract
A pair of variables that tend to rise and fall either together or in opposition are said to be monotonically associated. For certain phenomena, this tendency is causally restricted to a subpopulation, as, for example, an allergic reaction to an irritant. Previously, Yu et al. (2011) devised a method of rearranging observations to test paired data to see if such an association might be present in a subpopulation. However, the computational intensity of the method limited its application to relatively small samples of data, and the test itself only judges whether association is present in some subpopulation; it does not clearly identify the subsample that came from this subpopulation, especially when the whole sample tests positive. The present paper adds a "top-K" feature (Sampath and Verducci (2013)), based on a multistage ranking model, that identifies a concise subsample likely to contain a high proportion of observations from the subpopulation in which the association is supported. Computational improvements incorporated into this top-K tau-path (TKTP) algorithm now allow the method to be extended to thousands of pairs of variables measured on sample sizes in the thousands. A description of the new algorithm, along with measures of computational complexity and practical efficiency, helps to gauge its potential use in different settings. Simulation studies catalog its accuracy in various settings, and an example from finance illustrates its step-by-step use.
Keywords: Kendall's tau, nonparametric correlation, ranking, copula, mixtures of distributions, unsupervised classification, computational complexity

1. Introduction

The emergence of “big data” has provided the opportunity for scientists to
focus on key subpopulations. These are often identified by having higher mean
values on some relevant aspect that is measurable by variables in the database.
Here we change the criterion to having higher correlational values.

∗ Corresponding author. Email address: [email protected] (Srinath Sampath)
There are many potential applications. In cancer research, the mechanisms
whereby some cancers become chemoresistant are inherent in their gene net-
works, not all of which have been identified. One way to discover novel gene
networks is to look for correlations between pairs of genes across many different
types of cancer cells. The subpopulation of cancers that possess this network
will match the subpopulation that supports the correlations. In marketing, a
goal is to identify subpopulations that will be most responsive to a particular
campaign. In finance, association between a company’s earnings and a key com-
modity price may be restricted to certain macroeconomic conditions which are
not known or directly measured. In all these cases, and more, the basic prob-
lem is to identify, as concisely as possible, a highly correlated subsample that
includes most of the members of the subpopulation that supports association
between the variables. This is thus a kind of unsupervised classification, with a
novel criterion.
Previous work on this problem is relatively recent. Yu et al. (2011) developed
the tau-path algorithm for re-sorting a sample so that the Kendall tau coefficient
is (optimally) decreasing. In checking the behavior of the tail of this tau-path,
they devised a test for correlation in a subsample versus the null hypothesis of
independence. A limitation of operating under this null hypothesis is that the
whole population tends to test positive when the supporting subpopulation is
large or when the association is strong, which is not uncommon in datasets with
large sample sizes. Bamattre et al. (2015) discussed some procedures for the
much more difficult problem of testing for heterogeneity of association in which
the distribution under the null hypothesis may have many different forms. A
more modestly-scoped approach, suggested in Sampath and Verducci (2013),
assumes a positive test for overall correlation, and attempts to discover if this
correlation really is supported only within a proper subpopulation. They applied
a multistage model of agreement between rankings to the re-sorted sample to
indicate the point where predictive information stops. This method, called
the top-K tau-path (TKTP), is intended to provide good coverage of the true
underlying supporting population, but this has not been investigated in detail
until now, mainly because the original implementation was not scalable to large
datasets.
The current paper covers two issues not previously resolved. The first is
extending the tau-path method to large samples. The original code was practical
only up to sample sizes n < 100; the new code easily handles n up to 10,000
on a personal computer. This extension to large samples makes it possible to
address the second issue: how to characterize the performance of TKTP beyond
the simple task of testing against independence. Simulation studies demonstrate
the relationship between the size of the selected subsample and the percentage of
the true supporting observations included.
Section 2 provides a thorough description of TKTP. Section 3 details the
design of the new algorithms, including their computational complexity and
runtime efficiency. Performance characteristics of their coverage proportions
appear in Section 4, and a new example of predicting trends in stock prices in
Section 5. The last section of the paper summarizes all findings and discusses
mathematical underpinnings.

2. The Top-K Tau-Path (TKTP) Screen


This section covers Kendall’s τ measure of association, its corresponding
tau-path, and multistage ranking models to show how these fit together to form
the TKTP method of screening for a subpopulation.

2.1. Concordance and Kendall’s τ


A standard approach to quantifying the strength of the relationship between
two variables is to enumerate the number of concordances and discordances
occurring in sample pairs, and then regard the difference between these counts as
a measure of association. More precisely, if X and Y are random variables from
a joint distribution F , with independent observations (X1 , Y1 ) and (X2 , Y2 ), the
probabilities of concordance and discordance are, respectively,
$$p_c = P[(X_1 - X_2)(Y_1 - Y_2) > 0]$$
$$p_d = P[(X_1 - X_2)(Y_1 - Y_2) < 0],$$
and the popular Kendall's τ measure of association is simply
$$\tau = p_c - p_d.$$
Note that τ depends only on the ordinal properties of X and Y , so that in a
sample the data vectors X and Y may be replaced by their ranks with no loss
of information about τ .
An unbiased estimator of τ , the Kendall’s tau coefficient, can be computed
from the random sample {(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn )} by way of the n × n
concordance matrix c = ((c[i,j] )), a tool that captures association between every
sample pair in binary form as
$$c_{[i,j]} = \begin{cases} 1, & \text{if the } (i,j)\text{th pair is concordant} \\ -1, & \text{if the } (i,j)\text{th pair is discordant,} \end{cases}$$
and from which Kendall's tau coefficient, computed as
$$T = \sum_{1 \le i < j \le n} c_{[i,j]} \Big/ \binom{n}{2},$$

is an unbiased estimator of Kendall's τ. If A and D are the numbers of concordant
and discordant sample pairs, respectively, then
$$T = \frac{A - D}{A + D}.$$
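As a concrete illustration (a minimal Java sketch, not drawn from the implementations discussed in Section 3), the concordance matrix and the coefficient T can be computed directly from a small sample; the data in main are the five-observation example used in the Section 3 walkthrough, for which T = −0.4.

public final class KendallTau {

    // c[i][j] = +1 if the (i, j)th pair is concordant, -1 if discordant;
    // the diagonal stays 0, matching the convention above.
    static int[][] concordance(double[] x, double[] y) {
        int n = x.length;
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double s = (x[i] - x[j]) * (y[i] - y[j]);
                c[i][j] = s > 0 ? 1 : s < 0 ? -1 : 0;
            }
        }
        return c;
    }

    // T = (sum of c[i][j] over i < j) / (n choose 2).
    static double tau(int[][] c) {
        int n = c.length;
        long sum = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                sum += c[i][j];
        return sum / (n * (n - 1) / 2.0);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 4, 3, 5};
        double[] y = {4, 3, 1, 5, 2};
        System.out.println(tau(concordance(x, y))); // prints -0.4
    }
}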
With this theoretical framework, Subsection 2.2 delves into the idea of seri-
ation, a common matrix reordering technique, and discusses the approach taken
by Yu et al. (2011) to permute the concordance matrix and thus establish the se-
quentially maximal monotone decreasing tau-path to uncover pockets of strong
association in overall moderately associated sample data.

2.2. Seriation and the Tau-Path
A simple yet powerful clustering technique to discover hidden structure and
underlying patterns in data involves the permutation of multidimensional data
along a single dimension. In archeology, the chronological ordering of similar
relics could yield new insight into the different eras based on the similarity of
adjacent objects. Known as seriation, this technique finds application in a wide
variety of fields of scientific inquiry. Liiv (2010) provides an in-depth history of
this technique and its use in many disciplines. In unsupervised learning, seri-
ation often manifests in the form of matrix reordering, where multidimensional
matrices are permuted based on the magnitude of a particular characteristic,
potentially leading to insights into underlying associations in the data.
In the context of Kendall’s τ , Yu et al. (2011) permuted the sample concor-
dance matrix in a very specific manner described in the next section. By means
of either of two backward conditional search (BCS) algorithms, the Fast Back-
ward Conditional Search (FastBCS) and the Full Backward Conditional Search
(FullBCS) algorithms, they selected increasing subsets of the sample matrix in
such a way that the associated tau coefficients become monotonically decreas-
ing; the tau coefficient corresponding to the 2 × 2 subset is therefore at least
as large as the one corresponding to the 3 × 3 subset, and so on. The resulting
ordering of the sample matrix is referred to as the tau-path.
Using the tau-path statistic, Yu et al. (2011) also propose a test for inde-
pendence between the two variables versus the alternative that the variables
are correlated only in a subpopulation. The test is calibrated by simulating a
large number of independent samples of size n from independent X and Y
populations, establishing the tau-path for each sample, and then calculating the
percentiles of Kendall's tau coefficient at each step along the tau-path; for any
fixed percentile q, the boundary points {q_i | i = 2, . . . , n} are chosen so that no
more than a proportion α of the random tau-paths exceed the boundaries. Note that the same simulation
may be achieved by sampling pairs of independent permutations of {1, . . . , n}.
The tau-path approach thus provides an ordering of the sample bivariate
data along the path of strongest association. So far the focus of this subsection
has been on uncovering positive associations between the two variables; should
a negative association between the variables be sought, this is readily accom-
plished by multiplying the Y ’s by −1 before performing the tau-path analysis.
Another aspect of the ordered data that is worth computing is the stopping
stage or endpoint of association; the stage in the tau-path ordered data where
association ends and randomness sets in. Especially in these times when large
datasets are readily available, and speed of computation and immediacy of re-
sults are important considerations, the need to prune the full dataset down to
the appropriately-sized subset for further analysis is critical. The key notion
here is that as one progresses down the tau-path from most- to least-associated
sample pairs, the two data elements in any pair should provide largely consis-
tent information about their strength of association. When adjacent data pairs
routinely provide inconsistent information about their strength of association,
it is possible that the level of association has deteriorated and randomness has
taken over. In the context of the tau-path ordering, multistage ranking models
provide an appropriate framework for extracting the endpoint of agreement.

2.3. The Multistage Ranking Model and the TKTP Stopping Rule
The primary challenge with traditional models that assess the degree of as-
sociation between two ranked lists is their mathematical complexity. Whereas a
multinomial distribution provides a full characterization of the population, the
large number of ensuing parameters leads to intractable likelihood equations,
even when the number of objects being ranked is small. Fligner and Verducci
(1988) proposed a family of forward multistage ranking models that vastly re-
duce the computational effort and also provide clear insights into the underlying
processes. The multistage ranking framework is outlined here, and the TKTP
stopping rule is then developed.
Given a group of n objects, the relative value of the objects to an assessor—
whether qualitative or numerical—can be expressed by the assignment of ranks
to the objects. Two representations of the assessor’s preferences are popular;
the first is the ranking or permutation, a vector of length n that gives, for each
object on the original list, the assessor’s respective rank, namely the quantity
(1 + the number of other objects that the assessor considers superior). This
representation of the ranked preferences is denoted

π = [π(1), . . . , π(n)].

The second representation, an ordering or inverse permutation of the n objects,
is also a vector of length n, showing the labels of the n objects set in
ranked order. Thus the ordering or inverse permutation corresponding to the
representation π above is

π^{-1}(j) = i if π(i) = j,  i = 1, . . . , n,  j = 1, . . . , n.
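As a small illustration (a sketch of ours, with an illustrative helper name), the two representations convert into one another directly:

// If pi = [3, 1, 2] (object 1 is ranked 3rd, object 2 is ranked 1st, object 3
// is ranked 2nd), then the ordering is piInv = [2, 3, 1]: the 1st-ranked
// object is object 2, the 2nd-ranked is object 3, the 3rd-ranked is object 1.
static int[] invert(int[] pi) {
    int[] inv = new int[pi.length];
    for (int i = 0; i < pi.length; i++) {
        inv[pi[i] - 1] = i + 1;   // pi uses 1-based ranks
    }
    return inv;
}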

Our interest lies in measuring the extent of agreement between the prefer-
ences of two assessors who independently rank the n objects, and in determining
the last stage where agreement still exists and random noise follows, the stop-
ping stage or endpoint of agreement between the two assessors. This is achieved
by anchoring one assessor’s ranking of the objects as the reference ranking or
ground truth, and then examining the stage-wise departures of the second asses-
sor’s ranks from the corresponding reference ranks; the latter is the generated
or observed ranking π. Following the approach given by Fligner and Verducci
(1988), the computations for the first stage and a generic stage j are given below.
The stages are assumed to be independent, and penalties and truncated geo-
metric probabilities are assigned to the second assessor’s ranks commensurate
with their stage-wise deviations from the first assessor’s ranks as follows:
Stage 1: Since all n objects are available, the second assessor selects the
(1 + v)th best object overall, as specified by π^{-1}, and incurs the penalty V_1 = v
with probability
$$P(V_1 = v) = \frac{1 - r_1}{1 - r_1^{n}}\, r_1^{v}, \qquad v = 0, \ldots, n - 1, \quad 0 < r_1 < 1.$$

Stage j (j = 2, . . . , n − 1): For each subsequent stage j, n − j + 1 objects are
still available. Here the second assessor selects the (1 + v)th best available object,
as specified by π^{-1}, and incurs a penalty V_j = v with probability
$$P(V_j = v) = \frac{1 - r_j}{1 - r_j^{\,n-j+1}}\, r_j^{v}, \qquad v = 0, \ldots, n - j, \quad 0 < r_j < 1.$$

The vector {V_1, . . . , V_{n−1}} thus captures the stage-wise deviations between
the two sets of ranks, and is referred to as the discordance or penalty vector
between the ranking schemes. The parameters {r_j} likewise capture the stage-wise
disagreement between the two assessors. Rather than focus on r_j, the transformation
$$\theta_j = -\log r_j, \qquad j = 1, \ldots, n - 1$$
is used, and a higher estimated θ_j reveals closer agreement between the assessors'
ranks at stage j.
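To make the stage-wise model concrete, the following minimal Java sketch evaluates the truncated geometric penalty probability defined above; the parameterization m = n − j (the largest possible penalty at stage j) follows the formulas, while the method name is ours.

// P(V_j = v) = (1 - r) * r^v / (1 - r^(m+1)), for v = 0..m, 0 < r < 1,
// where m = n - j is the largest possible penalty at stage j and
// theta = -log(r) is the agreement parameter discussed above.
static double penaltyProb(int v, int m, double r) {
    return (1 - r) * Math.pow(r, v) / (1 - Math.pow(r, m + 1));
}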
It remains to determine the stopping stage or endpoint K, the last stage
in the list of objects where the second assessor’s rank still shows some form
of agreement with that of the first. Beyond this stage, the second assessor
posts random stage-wise ranks relative to the first, and the previous association
between the assessors’ ranks has faded. Subsection 3.1 of Sampath and Verducci
(2013) provides the moving average maximum likelihood estimator (MAMLE)
for the rj , calculated iteratively over overlapping backward-looking windows of
stages i, j − w + 1 ≤ i ≤ j of fixed width w, till all the stages are exhausted.
The resulting curve {r̂j } and its inverted counterpart {θ̂j } are locally smooth
estimators of the stage-wise agreement between the two assessors.
The rejection region for the MAMLE, which enables the stopping rule, is now
straightforward. Briefly, a large number of simulations are generated from the
multistage model under the assumption that all the θj ’s are 0, and stage-wise
θ̂j ’s are computed using the MAMLE method. For each stage j, the (1 − α)th
quantile q(j) is computed. An estimate of the endpoint K is given by

K̂ = the earliest stage at which θ̂_{K̂+w} > q(K̂), and θ̂_j > q(j) for at most
α percent of the remaining j > K̂ + w.  (1)

Additional diagnostics and guidelines on using the stopping rule are provided
in Subsection 3.2 of Sampath and Verducci (2013). An alternative approach,
based on a data-analytic method, is covered by Hall and Schimek (2012).
The two techniques discussed above accomplish very distinct goals; the tau-
path algorithm organizes bivariate data along the path from strongest to weakest
association, whereas the top-K MAMLE approach detects the endpoint of as-
sociation between two ranked lists. Synthesis of the algorithms is accomplished
by using the tau-path ordering from the first algorithm as the ordering of the
stages of the second.
Here the framework for the multistage model is that the pair (X, Y ) of
variables plays the role of a rater, ranking the observations {1, . . . , n} according
to the tau-path ordering. The actual values of Kendall’s tau coefficient Tk over
the first k stages along the tau-path are used to infer the strength of association
at each stage. Choosing the stages according to the tau-path guarantees that
the estimated θj will be decreasing. This happens because the number of new
discordances introduced at each tau-path stage must be non-decreasing. In
this context, stopping rule (1) identifies observations in the remaining stages to
have (X, Y ) association no greater than noise. Since the ordered data is used as
the input into the top-K algorithm, the computed endpoint gives the stopping
stage where the strongly associated subsample ends. This subsample may be
studied further to try to infer the underlying subpopulation. Section 5 provides
an example where several pairs of variables are used to estimate a common
subsample whose import may be identified by the behavior of an explanatory
variable on this subsample.

3. Algorithms, Computational Complexity and Efficiency of Implementation
The algorithms that underlie the TKTP method were developed to screen
out observations for which a pair (X, Y ) of variables have association no greater
than noise. The remaining observations are then most likely to represent a
subpopulation that supports strong association. For large datasets this compu-
tational task can be daunting. For example, in the analysis of microarrays used
in molecular biology, gene chip platforms support increasing numbers of gene
probes that can be attached to a microarray. In large-scale toxicity studies, hun-
dreds of microarrays may be used to measure the effects of chemical compounds
at various dosage levels over time. In a study measuring the expression levels
of 30,000 genes responding to 3,600 chemical compound treatments, a dataset
of 3,600 observations and 30,000 variables would be generated. In screening for
relationships between pairs of genes, $\binom{30{,}000}{2}$ or 449,985,000 comparisons would
be made. For each gene-gene pair identified, up to n = 3,600 observations could
be combined into $\sum_{k=2}^{n} \binom{n}{k}$ sets of sizes 2 to n in the search for associated
subsets.
For large datasets, it is infeasible to examine computationally all pairs of
variables and the population subsets within each pair. Critical to the design
of the top-K and tau-path algorithms was an understanding of their order of
growth. The variety of software languages and hardware architectures we con-
sidered required care not to bias the specification of the algorithms toward
a particular implementation context since programming languages, compilers,
processor architectures, and memory hierarchies all affect the performance of
any implementation.

Figure 1: The TKTP algorithm.

The TKTP algorithm is the top-level algorithm. A flowchart for this algo-
rithm appears in [Figure 1]. In Subsection 3.1 we provide an overview of the
major steps of the TKTP algorithm and the functions it invokes. Since FastBCS
and its optimized successor FastBCS2 are critical to the runtime performance of
TKTP, these algorithms are described in detail in Subsection 3.1.2 along with
a simple example showing how they work. In Subsection 3.2 we analyze the
computational complexity of the FastBCS* algorithms, and in Subsection 3.3
describe their efficiency and the various strategies we explored for implement-
ing them. The pseudocode used to describe the algorithms mostly follows the
conventions specified in Cormen et al. (2009).

3.1. Overview of Algorithms


3.1.1. The TKTP Wrapper
The input to the TKTP algorithm shown in [Algorithm 1] is a random
sample of observations for variables X and Y from a bivariate distribution. It
invokes four major functions to find a subset of strongly associated observations
within this pair: FastBCS, TaupathMAMLE, GenerateRejectBoundary, and
StoppingPoint:

Algorithm 1 TKTP(X, Y)
Input: A random sample S of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
Output: A possibly empty subset s of observations in S in which Xs and Ys are strongly associated.
Require: length(X) = length(Y)
 1: WINDOW ← 5
    SIGLVL ← 0.05
    NSIM ← 10000
    // Find the permuted order of the observation pairs and the tau-path for X and Y using FastBCS*.
 2: pi, taupath ← FastBCS2(X, Y)
    // Get moving average maximum likelihood estimators (MAMLEs) for X and Y.
 3: θ̂ ← TaupathMAMLE(taupath, WINDOW)
    // Simulate the MAMLE boundary at the (1 − α)th quantile of the null hypothesis.
 4: boundary ← GenerateRejectBoundary(n, WINDOW, NSIM, SIGLVL)
    // Estimate the stopping point K̂.
 5: K̂ ← StoppingPoint(θ̂, boundary, SIGLVL)
 6: return {pi[j] | [j ≥ K̂] ∧ [θ̂_j > q(j)]}, where q(j) is the (1 − α)th quantile of θ̂_j

Line 2. The FastBCS algorithm shown in [Algorithm 2] orders the pairs
from the strongest to least associated and generates the tau-path of Kendall's
tau coefficients that measure the strength of the association for each nested
subset from size 2 to n.
Line 3. The tau-path and window are passed as arguments to the Taupath-
MAMLE algorithm shown in [Algorithm 3] to generate a MAMLE curve for
the X and Y pair. The tau-path is used in the penalty function to calculate
the MAMLE [4] curve. The window width determines the smoothness of the
curve [6–8]. This algorithm is used to generate the reference ranking of the first
assessor’s choices for the X and Y pair of variables, and to simulate the second
assessor’s choices under the null hypothesis.
Line 4. To determine the rejection region under the null hypothesis, the
GenerateRejectBoundary function (not shown) simulates the second assessor’s
choices under the null hypothesis by generating the MAMLE curves for a pair
of random variables of size n for the specified number of iterations and invoking
TaupathMAMLE on each pair. From these MAMLE curves, the stage-wise
(1 − α)th quantiles are calculated and the boundary representing the edge of
the rejection region is returned.
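Since GenerateRejectBoundary is not listed here, the following is a hedged Java sketch of the simulation just described; randomPermutation, fastBCS2Taupath, and taupathMAMLE are assumed helpers standing in for the components above, and the quantile indexing is one reasonable convention among several.

// Simulate NSIM pairs of independent permutations, compute each pair's
// tau-path MAMLE curve, and return the stage-wise (1 - alpha) quantiles
// as the boundary of the rejection region.
static double[] generateRejectBoundary(int n, int window, int nsim, double alpha) {
    double[][] curves = new double[nsim][];
    for (int s = 0; s < nsim; s++) {
        int[] x = randomPermutation(n);            // assumed helper
        int[] y = randomPermutation(n);
        double[] taupath = fastBCS2Taupath(x, y);  // assumed wrapper around FastBCS2
        curves[s] = taupathMAMLE(taupath, window); // assumed wrapper around Algorithm 3
    }
    double[] boundary = new double[curves[0].length];
    for (int j = 0; j < boundary.length; j++) {
        double[] stage = new double[nsim];
        for (int s = 0; s < nsim; s++) stage[s] = curves[s][j];
        java.util.Arrays.sort(stage);
        // (1 - alpha)th empirical quantile of the simulated MAMLEs at stage j.
        boundary[j] = stage[(int) Math.ceil((1 - alpha) * nsim) - 1];
    }
    return boundary;
}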

Algorithm 2 FastBCS(X, Y)
Input: A random sample of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
Output: The final permutation pi and the tau-path statistic derived from the concordance matrix constructed from pi.
Require: n = length(X) = length(Y); n > 1
// The function indexOf(a, v) returns the index of the value v in array a.
 1: C ← the concordance matrix for (X1, Y1), . . . , (Xn, Yn)
    pi ← [1 . . . n]
    i ← n
    tie[1 . . . n] ← ∅
 2: repeat                                        ▷ backward conditional search
 3:   for j ← 1 . . . i do                        ▷ backward elimination
 4:     colsum[j] ← Σ_{u=1}^{i} C[pi[u], pi[j]]
 5:   minsum ← minimum value in colsum
 6:   ties_i ← {pi[l], l ← 1 . . . i | colsum[pi[l]] = minsum}
 7:   if |ties_i| > 1 then
 8:     tie[i] ← ties_i
 9:     r_tie ← randomly select a member of ties_i
10:     l ← indexOf(pi, r_tie)
11:     transpose pi[i] and pi[l]
12:   else
13:     l ← indexOf(pi, ties_i[1])
14:     transpose pi[i] and pi[l]
15:   for k ← n downto (i + 1) do                 ▷ tie logic
16:     if pi[i] ∈ tie[k] then
17:       qi ← [Σ_{u=1}^{i} C[pi[u], pi[i]], Σ_{u=1}^{i+1} C[pi[u], pi[i]], . . . , Σ_{u=1}^{k} C[pi[u], pi[i]]]
18:       qk ← [Σ_{u=1}^{i} C[pi[u], pi[k]], Σ_{u=1}^{i+1} C[pi[u], pi[k]], . . . , Σ_{u=1}^{k} C[pi[u], pi[k]]]
19:       if All(qk[u] ≥ qi[u]) and Any(qk[u] > qi[u]), u = 1 . . . k − i + 1 then
20:         transpose pi[i] and pi[k]
21:         i ← k − 1                             ▷ forward step
22:         tie[m] ← [ ], ∀ m ≤ k
23:         GOTO repeat
24:   i ← i − 1                                   ▷ backward step
25: until (Σ_{j=1}^{i} Σ_{k=1}^{i} C[pi[k], pi[j]]) = i · (i − 1)
26: k ← i
27: T[2] ← · · · ← T[k] ← 1
28: return The final permutation pi and the tau-path statistic {T[2] . . . T[n]}

Algorithm 3 TaupathMAMLE(tauPath, window)
Input:
- The tau-path of a random sample of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
- The size of the window.
Output: An array of MAMLEs.
Require: n = length(tauPath); n > 1
 1: ma.theta ← meanDiff ← [ ]
 2: diffs ← [0]
 3: for i ← 2 . . . n do
 4:   totaldiscord[i] ← ((1 − tauPath[i])/2) ∗ C(i, 2)
 5:   diffs[i] ← totaldiscord[i] − totaldiscord[i − 1]
 6: for i ← 0 . . . (n − window − 1) do
 7:   meanDiff ← diffs[(i + 1) . . . (i + window)]
 8:   ma.theta[i + window + 1] ← theta.scale(meanDiff, i, window, n + 1)
 9: return ma.theta

Line 5. The StoppingPoint algorithm shown in [Algorithm 4] estimates the
stopping stage or endpoint K that indexes the end of agreement between the
reference (XY tau-path) and generated (rejection boundary) rankings, by K̂.

3.1.2. FastBCS and FastBCS2


The Kendall’s tau coefficient of the concordance matrix (Yu et al. (2011); p.
101) of a pair of variables is defined as
$$T = \frac{A - D}{\binom{n}{2}} \tag{2}$$
where $\binom{n}{2}$ is the number of distinct pairs of observations in the sample, and
A and D are the numbers of concordant and discordant pairs of observations,
respectively. The tau-path is a monotonically decreasing sequence of Kendall
tau coefficients. The FastBCS algorithm (Yu et al. (2011); p. 103) generates the
tau-path for a pair of variables X and Y measured on a sample of observations.
We have restated the algorithm in [Algorithm 2] and include a simple example
to help explain it. All numbers in brackets below refer to line numbers in the
FastBCS algorithm.
The backward conditional search [2–25] consists of two main segments known
as backward elimination [3–14] and tie logic [15–23]. The FastBCS incrementally
constructs the permutation sequence pi backward from the set of all observations
Sn at i = n to a subset of two observations S2 at i = 2. The tau-path is
constructed using the permuted order of pi after the main repeat loop exits.
Each iteration i is called a stage.
Prior to the search, the main data structures used during the operations are
initialized [1]. The concordance matrix C of the set of observations for i = n
is calculated using the natural ordering of the variables X and Y provided as

Algorithm 4 StoppingPoint(xyMamles, boundary, significanceLevel)
Input:
- The MAMLEs generated for the X and Y variables.
- The rejection boundary under the null hypothesis.
- The significance level of the rejection boundary.
Output: The estimated value of the stopping point K or 0 if none was found.
Require: n = length(xyMamles) = length(boundary); n > 1
 1: exceed ← [ ]
    j ← 1
 2: for i in 1 . . . n do
 3:   if xyMamles[i] > boundary[i] then
 4:     exceed[j] ← i
 5:     j ← j + 1
 6: numExceed ← length(exceed)
 7: sort(exceed)
 8: for i in 1 . . . length(exceed) do
 9:   tail ← n − exceed[i]
10:   left ← numExceed − i
11:   if left ≤ significanceLevel ∗ tail then
12:     return exceed[i]
13: return 0

input to the algorithm. The values of the permuted index pi used to track
the reordering of the observations at each stage are initialized to the natural
ordering of X and Y . At each stage i, an i × i permuted matrix will be derived
from C using pi. This is shown in the figures below as Ci . The row and column
headings of the permuted concordance matrix are the values indexed by pi and
refer to the observation IDs shown in the headings of C. The tie array maintains
a tieset for each stage. Each element of the tie array is initialized to the empty tieset.
The search begins with backward elimination. The subset Si of observations
pi[1 . . . i] for any stage i is fixed. The goal of backward elimination is to find
and eliminate the observation in the subset pi[1 . . . i] that contributes least to
Kendall’s tau coefficient—or increases most the value of D in Equation (2)—of
the remaining subset of observations Si−1 . Elimination is done by transposing
this observation with the observation at pi[i]. The result of the transposition
guarantees that Ti−1 ≥ Ti in this stage. If pi[i], the observation represented by
the i th stage, is not a member of any prior stages’ tiesets, a backward step is
taken by setting i to i − 1 [24]. These steps are repeated until either Kendall’s
tau coefficient for stage i becomes 1, or i = 2 [25].
A tie occurs at stage i if more than one observation could be eliminated
from Si [7]. The tied observations form a tieset that is associated with stage
i [8]. The observations in tiesets may be reexamined in the tie logic of later
stages [16]. To complete the backward elimination at stage i in the presence of
ties, one observation is arbitrarily chosen and eliminated from Si [9]. Because
the tau-path is constructed in reverse order, the algorithm cannot determine
the effect of future rearrangements on local maximal monotonicity at the time
a selection from the tieset is made. In subsequent stages k < i, the tie logic
will reexamine the choice made in stage i [17–18] and take a forward step [21],
if necessary, to ensure monotonicity in the tau-path. To convey an intuitive
understanding for how the algorithm works, we walk through a simple example.

Initialization:
C ← concordance matrix of X, Y
pi ← [1, 2, 3, 4, 5]
i ← 5
tie[1 . . . 5] ← ∅

C        1   2   3   4   5
1        0  −1  −1   1  −1
2       −1   0  −1   1  −1
3       −1  −1   0  −1   1
4        1   1  −1   0  −1
5       −1  −1   1  −1   0

Figure 2: Example: the initialization of the concordance matrix C calculated for a pair of
variables X and Y with five observations.

Example. Suppose we wish to generate a tau-path of the variables X =
[1, 2, 4, 3, 5] and Y = [4, 3, 1, 5, 2] of a sample of five observations. FastBCS is
invoked with X and Y as arguments. The initialization [1] is shown in [Figure
2]. It initializes the concordance matrix C based on the natural ordering of the
pairs of observations in X and Y . The array pi is set to the natural ordering
1, . . . , 5. The current stage i is set to 5. Finally, each element of array tie
which preserves the ties found at a particular stage i is initialized to the empty
set. [Algorithm 2] begins the backward conditional search [2] and the first two
stages, stages 5 and 4, are shown in [Figure 3].
All five observations from the sample are in subset S5 . To find the ob-
servation to eliminate from S5 , backward elimination begins by calculating the
column sums of each of the observations from 1 to 5 of the permuted concordance
matrix C5 [3–4]. Because no transpositions have taken place, C5 is identical to
C. Four of the columns of C5 sum to values of –2, so we generate the tieset
of observations {1, 2, 3, 5} [6] and save it in tie[5] [8]. Although any of these
four observations could have been randomly chosen for elimination, throughout
this example we will always choose the first observation of the tieset. In stage
5, the observation at pi[1], which contributes least to the Kendall's tau coefficient
of S4, is transposed with pi[5]; pi is thereby reordered so that the observations
from pi[1] to pi[4] will generate T4 ≥ T5. No operations are performed in the tie logic
since there is no k > i. We set i to 4 [24] and proceed in similar fashion at
stage 4 with pi = [5, 2, 3, 4, 1]. Our interest in stage 4 is only in the first four
elements of pi which constitute the subset S4 and the concordance matrix C4 .
Four columns are summed and the tieset {5, 2, 3, 4} is saved in tie[4]. Since item
4 is not a member of any tieset of k > i, there are no prior tiesets to reexamine,
so items 4 and 5 are transposed and i is set to 3. In both stages 4 and 5, only
backward steps have been taken.

Stage 5
Backward Elimination: Calculate the column sums. Create a list of ties:
ties_i ← [1, 2, 3, 5]. Transpose pi[5] and pi[1].
Tie Logic: There are no tiesets to examine in stage 5.
backward step: i ← 4

C5       1   2   3   4   5
1        0  −1  −1   1  −1
2       −1   0  −1   1  −1
3       −1  −1   0  −1   1
4        1   1  −1   0  −1
5       −1  −1   1  −1   0
colsums −2  −2  −2   0  −2
ties_i   1   2   3       5

Stage 4
Backward Elimination: Calculate the column sums. Create a list of ties:
ties_i ← [5, 2, 3, 4]. Transpose pi[4] and pi[1].
Tie Logic: pi[4] = 4 is not a member of any tieset for k > 4.
backward step: i ← 3

C4       5   2   3   4   1
5        0  −1   1  −1  −1
2       −1   0  −1   1  −1
3        1  −1   0  −1  −1
4       −1   1  −1   0   1
1       −1  −1  −1   1   0
colsums −1  −1  −1  −1
ties_i   5   2   3   4

Figure 3: Example: the main steps taken in stage 5 resulting in a backward step to stage 4.

Stage 3 is shown in [Figure 4]. The steps in backward elimination find the sole
observation 3 to contribute the least. No transposition is required. To ensure
that pi[3] = 3 locally optimizes the tau-path, the tie logic [15–23] reexamines
all tiesets from k down to i + 1 in which 3 is a member. Since observation 3 is
a member of the tieset from stage 5, the processing continues by attempting to
determine whether the tau coefficient at stage 3 could have been improved had
the observation at pi[k = 5] been chosen instead. The calculations are shown in
the tie logic of [Figure 4]. In effect, the algorithm compares the cumulative sums
of stage 3 of two different concordance matrices shown as C3 and C′3 in [Figure
4(a) and 4(b)], respectively. C3 represents the permuted order from the choice
that was made in stage 5. C′3 represents the concordance matrix that would
have resulted from choosing the observation pi[k = 5], and is constructed by a
transposition of the observations at pi[3] and pi[5]. The cumulative sum qi is
calculated from C3 and qk from C′3 [17-18]. Two tests must pass [19] if C′3 is to
be considered the better choice: All(qk[u] ≥ qi[u]) and Any(qk[u] > qi[u]) for
u = 3 . . . 5. Both tests succeed and the algorithm continues with C′3 as shown
in [Figure 4(c)].

Stage 3
Backward Elimination: Calculate column sums. No ties to save.
Tie Logic: pi[3] = 3 is a member of the stage 5 tieset. Determine the locally
optimal observation.

C3       4   2   3   5   1      (column i = 3, column k = 5)
4        0   1  −1  −1   1
2        1   0  −1  −1  −1
3       −1  −1   0   1  −1
5       −1  −1   1   0  −1
1        1  −1  −1  −1   0
colsums  0   0  −2
ties_i

(a) Calculate the cumulative sum qi of pi[i = 3].

C′3      4   2   1   5   3      (column k = 3, column i = 5)
4        0   1   1  −1  −1
2        1   0  −1  −1  −1
1       −1  −1   0   1  −1
5       −1  −1  −1   0   1
3        1  −1  −1  −1   0

(b) Calculate the cumulative sum qk of pi[k = 3].

     qi   qk
     −2    0
     −1   −1
     −2   −2

(c) The All(qk ≥ qi) AND Any(qk > qi) test passes.

(d) FastBCS continues with C′3.
Clear the tiesets: tie[1..5] ← ∅
Forward step: i ← 4

Figure 4: Example: the main steps taken in stage 3 that result in a forward step to stage 4.

The final stages, as shown in [Figure 5], follow a series of backward steps
from stage 4 to stage 2 where the search ends [25]. The algorithm outputs pi
and the tau-path statistic calculated from the concordance matrix generated
from pi [28].

Stage 4f
Backward Elimination: There are no ties to save.
Tie Logic: pi[4] = 4 is not a member of any tieset for k > 4.

C4f      4   2   1   5   3
4        0   1   1  −1  −1
2        1   0  −1  −1  −1
1        1  −1   0  −1  −1
5       −1  −1  −1   0   1
3       −1  −1  −1   1   0
colsums  1  −1  −1  −3
ties_i

Stage 3f
Backward Elimination: Transpose pi[3] and pi[2].
Tie Logic: pi[3] = 1 is not a member of any tieset for k > 3.
The until test succeeds. Halt.

C3f      4   2   1   5   3
4        0   1   1  −1  −1
2        1   0  −1  −1  −1
1        1  −1   0  −1  −1
5       −1  −1  −1   0   1
3       −1  −1  −1   1   0
colsums  2   0   0
ties_i       2   1

pi = [4, 1, 2, 5, 3]
tau-path = [1, 1, 0.333, −0.333, −0.4]

Figure 5: Example: the second pass through stages 4f and 3f resulting from a forward step.

FastBCS guarantees only that a sequentially maximal monotone decreasing
path is generated (Yu et al. (2011), p. 102). It does not lead to a unique per-
mutation nor to a unique tau-path since the choice of ties at earlier stages may
limit subsequent maximizations. For example, for the input X = [1, 2, 3, 5, 4]
and Y = [2, 4, 1, 3, 5] in [Figure 5], our implementation of the algorithm will
not find the permutation pi = [1, 2, 5, 3, 4] with a tau-path of [1, 1, 1, 0.333, 0.2].
Instead, it finds the permutation pi = [3, 5, 4, 1, 2] with a corresponding tau-
path of [1, 1, 0.333, 0.333, 0.2]. To find the former tau-path requires a stronger
algorithm such as the FullBCS (Yu et al. (2011), p.103).
The FastBCS2 algorithm shown in [Algorithm 5] is introduced to reduce
the O(n³) running time of FastBCS to O(n²), as evidenced by the performance
times of FastBCS2 illustrated in [Table 5] in Subsection 3.3.4. It has
the same inputs and outputs as well as overall structure, but some of the oper-
ations have been moved, altered, or merged. The differences will be described
in Subsection 3.2.

Algorithm 5 FastBCS2(X, Y)
Input: A random sample of n pairs (X1, Y1), . . . , (Xn, Yn) from a continuous bivariate population.
Output: The final permutation pi and the tau-path statistic derived from the concordance matrix constructed from pi.
Require: n = length(X) = length(Y); n > 1
// The function indexOf(a, v) returns the index of the value v in array a.
 1: C ← the concordance matrix for (X1, Y1), . . . , (Xn, Yn)
    pi ← [1 . . . n]; tie[1 . . . n] ← ∅
    i ← n
 2: for j ← 1 . . . n do
 3:   colsum[j] ← Σ_{u=1}^{n} C[pi[u], pi[j]]
 4: repeat
 5:   ties_i ← {pi[l], l ← 1 . . . i | colsum[pi[l]] = minsum}, where minsum is the minimum value in colsum[1 . . . i]
 6:   if |ties_i| > 1 then
 7:     tie[i] ← ties_i
 8:     r_tie ← randomly select a member of ties_i
 9:     l ← indexOf(pi, r_tie)
10:     transpose pi[i] and pi[l]; transpose colsum[i] and colsum[l]
11:   else
12:     l ← indexOf(pi, ties_i[1])
13:     transpose pi[i] and pi[l]; transpose colsum[i] and colsum[l]
14:   for j ← 1 . . . i do
15:     colsum[j] ← colsum[j] − C[pi[i], pi[j]]
16:   for k ← n downto (i + 1) do
17:     if pi[i] ∈ tie[k] then
18:       qi ← [Σ_{u=1}^{i} C[pi[u], pi[i]], Σ_{u=1}^{i+1} C[pi[u], pi[i]], . . . , Σ_{u=1}^{k} C[pi[u], pi[i]]]
19:       qk ← [Σ_{u=1}^{i} C[pi[u], pi[k]], Σ_{u=1}^{i+1} C[pi[u], pi[k]], . . . , Σ_{u=1}^{k} C[pi[u], pi[k]]]
20:       if All(qk[u] ≥ qi[u]) and Any(qk[u] > qi[u]), u = 1 . . . k − i + 1 then
21:         for j ← i . . . k do
22:           for l ← 1 . . . j do
23:             colsum[l] ← colsum[l] + C[pi[j], pi[l]]
24:         transpose pi[i] and pi[k]; transpose colsum[i] and colsum[k]
25:         i ← k − 1
26:         tie[m] ← [ ], ∀ m ≤ k
27:         for j ← 1 . . . i do
28:           colsum[j] ← colsum[j] − C[pi[i], pi[j]]
29:         GOTO repeat
30:   i ← i − 1
31: until (Σ_{j=1}^{i} Σ_{k=1}^{i} C[pi[k], pi[j]]) = i · (i − 1)
32: k ← i
33: T[2] ← · · · ← T[k] ← 1
34: return The final permutation pi and the tau-path statistic {T[2] . . . T[n]}

3.2. Computational Complexity
The analysis of an algorithm requires computational models of the target
platforms that will be used for execution of the implementations. Our anal-
ysis of FastBCS was based on a random-access machine (RAM) model where
operations are assumed to be performed strictly in sequence in constant time,
and a data parallel model which defines computation as a sequence of instruc-
tions or kernel applied synchronously to sets of processing elements each with
access to private data memory. The programming model for parallelism will
be discussed in more detail in Subsection 3.3.3 in the context of performance
data collected from multicore and manycore devices. No attempts were made
to model memory architectures, although the hierarchies and distribution of
memory play an important role in the runtime cost of implementations of the
FastBCS* algorithms.

3.2.1. Analysis of FastBCS


Our analysis of FastBCS focused on the runtime as determined by the cost
and frequency of critical operations and inner loops. To simplify the analysis,
the FastBCS algorithm was rewritten as procedural pseudocode with a syntax
resembling familiar programming languages such as C, Pascal or Java (Cormen
et al. (2009), pp. 20–22). [Table 1] shows the frequency analysis and order of
growth of critical statements of the repeat loop of the algorithm. The baseline
implementation of FastBCS in Java was instrumented to generate experimental
data for validating order of growth using the doubling ratio (Sedgewick and
Wayne (2011), pp. 192–193) and as a basis for the probabilistic analysis of the
average-case running time. Probes were placed in the FastBCS source code just
before lines 3, 8, 13, 17, 20, and 24 of the algorithm. We profiled the execution
of an implementation of FastBCS using 1,000 samples for each of n = 200 to
2,000 by 200 to determine the frequency of events at the probes, and to collect
statistics on the forward distance, the position of k when a forward step was
taken, the tieset size, and the value of i when the halting condition was met.
The empirical cost models derived from this profile are also shown in [Table 1].
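As a sketch of the doubling-ratio check used to validate the order of growth (helper names are assumptions, not our instrumented harness): a runtime ratio near 8 when n doubles is consistent with O(n³) growth, and a ratio near 4 with O(n²).

// Time one run at n and one at 2n; the ratio approximates 2^b for O(n^b).
static double doublingRatio(int n) {
    long t1 = timeRun(n);
    long t2 = timeRun(2 * n);
    return (double) t2 / t1;
}

static long timeRun(int n) {
    int[] x = randomPermutation(n);   // assumed helper
    int[] y = randomPermutation(n);
    long start = System.nanoTime();
    runFastBCS(x, y);                 // assumed wrapper around the algorithm under test
    return System.nanoTime() - start;
}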
We begin by examining the frequency of the outermost loop. Recall that
the FastBCS algorithm divides into two major sections within the repeat loop
[2–24]. Backward elimination [3–14] performs the search for the observation to
eliminate from the subset at each stage i. The tie logic [15–23] looks forward,
beginning at the end of the tau-path being constructed, to determine if another
member of a tieset created in a previous stage k that includes the observation
pi[i] might improve the Kendall’s tau coefficient for concordance matrix Ci at
stage i.
The backward step [24] is taken unless a forward step [21] is reached. While
each backward step results in the execution of only one stage, each forward
step increases the number of stages to reconsider by k − i − 1, a value we call
the forward distance. The frequency of the repeat loop is E[XR ] = E[XB ] +
E[XF D ] − E[XH ], where E[XB ] is the expected number of backward steps,
E[XF D ] is the expected forward distance and E[XH ] is the expected value of i
when the algorithm terminates.

Line  Statement                                             ~ Frequency               Big-Oh
2     repeat
3       for j ← 1 . . . i do
4         for u ← 1 . . . i do
4a          colsum[j] ← colsum[j] + C[pi[u], pi[j]]         E[X_R] ∗ (i²)             O(n³)
5       minsum ← min value in colsum
6       for l ← 1 . . . i do
6a        if colsum[pi[l]] = minsum
6b          add pi[l] to ties_i                             E[X_R] ∗ Σ_{k=1}^{i} k    O(n²)
7       if |ties_i| > 1
8         tie[i] ← ties_i
9         r_tie ← randomly select an observation
10        l ← indexOf(pi, r_tie)                            E[X_R]/2 ∗ n/2            O(n²)
11        transpose pi[i] and pi[l]
12      else
13        l ← indexOf(pi, ties_i[1])                        E[X_R]/2 ∗ n/2            O(n²)
14        transpose pi[i] and pi[l]
15      for k ← n downto (i + 1) do
16        if pi[i] ∈ tie[k]                                 E[X_R] ∗ Σ_{k=1}^{n−2} k  O(n²)
17          qi ← cumsum(C, pi[i], pi[i], pi[k]), . . .
18          qk ← cumsum(C, pi[k], pi[i], pi[k]), . . .
19          if All(qk ≥ qi) and Any(qk > qi)
20            transpose pi[i] and pi[k]
21            i ← k − 1
22            for m ← 1 . . . k do
22a             tie[m] ← [ ]
23            GOTO repeat
24      i ← i − 1
25    until Σ_{j=1}^{i} Σ_{k=1}^{i} C[pi[k], pi[j]] = i ∗ (i − 1)   E[X_R] ∗ (i²)     O(n³)

Empirical Cost Models                Expected value of:
E[X_R]  = −22.06 + 0.97 ∗ n          number of iterations in repeat loop
E[X_T]  = 0.494 + 0.00001 ∗ n        number of ties found
E[X_M]  = −0.08 + 0.19 ∗ n           number of times observation is member of tieset
E[X_FD] = 0.51 + 0.001 ∗ n           the total forward distance
E[X_H]  = 21.61 + 0.03 ∗ n           value of i when the algorithm halts

Table 1: Frequency analysis of FastBCS.

The number of backward steps is at most n − 1, since the repeat loop will
surely terminate [25] when i = 1. Whether a forward step is taken depends on
the number and size of tiesets created, on whether the observation at stage i
is a member of a tieset generated in a previous stage [7, 16] and on whether a
correction to the tau-path under construction needs to be made [19]. A forward
step erases all tiesets up to stage k [22], and restarts the backward search at
k − 1 [23].
Since backward elimination is performed every iteration and summing the
columns [3–4a] dominates the order of growth with O(n3 ), it was important to
understand the contribution of backward and forward steps to the frequency of

n      NR (line 3)        NT (line 8)   Tieset size   iH       NT/NR
       (avg)     (max)    (avg)         (avg)         (avg)
200    177.44    199      86.92         3.6           22.30    0.49
400    367.22    394      182.59        3.8           32.71    0.50
600    559.76    589      280.69        3.9           40.74    0.50
800    753.51    794      380.03        4.1           47.41    0.50
1000   948.04    1028     480.20        4.1           53.38    0.51
1200   1142.79   1209     580.73        4.2           58.89    0.51
1400   1338.15   1410     680.78        4.3           63.66    0.51
1600   1533.73   1601     782.30        4.3           68.15    0.51
1800   1729.84   1788     883.69        4.4           72.81    0.51
2000   1925.58   2008     986.21        4.4           76.75    0.51

Table 2: Statistics of the count of iterations of the repeat loop (NR), the count of the times
ties were found (NT), the size of a tieset (Tieset size), the value of i when the algorithm halted
(XH), and the proportion of times ties were found within the repeat loop (NT/NR).

the outermost loop. Statistics for the frequency of the loop and tiesets are shown
in [Table 2]. The number of forward steps is negligible; empirically it was at
most 4 for n ≤ 2,000 (Column NFS in [Table 3]). Since the average number
of iterations (Column NR) is less than n, we assumed E[X_R] ≃ n. Although
tiesets are generated about half of the time (Column NT/NR), they are small
in size (Column "Tieset size"), decreasing the likelihood that an observation
associated with stage i will be a member of a previous tieset [17].

n      NR (line 3)   NM (line 17)   NFS (line 20)      NFD               NM/NR
       (avg)         (avg)          (avg)    (max)     (avg)    (max)
200    177.44        39.72          0.13     3         0.7      21       0.22
400    367.22        77.37          0.13     2         0.9      26       0.21
600    559.76        117.30         0.18     2         1.5      31       0.21
800    753.51        157.26         0.17     3         1.9      41       0.21
1000   948.04        196.40         0.20     4         2.4      81       0.21
1200   1142.79       235.18         0.21     3         2.7      64       0.21
1400   1338.15       274.71         0.22     3         2.8      73       0.21
1600   1533.73       315.88         0.20     3         2.9      71       0.21
1800   1729.84       352.78         0.24     3         3.7      59       0.20
2000   1925.58       391.29         0.18     3         3.3      89       0.20

Table 3: Statistics of the count of the iterations of the repeat loop (NR), the number of
times the observation of stage i was a member of a previously saved tieset (NM), the number
of forward steps (NFS), the total forward distance (NFD), and the proportion of times tie
membership was found to the number of iterations of the repeat loop (NM/NR).

In [Table 3], approximately 20% of observations at stage i are members of
a previous tieset (Column NM/NR). The distribution of the reset position
of k for those samples in which a forward step was reached is shown in the
boxplot of [Figure 6]. The cost for repeating stages resulting from a higher k as
compared with a lower k increases quadratically. Similarly, the distribution of
the total forward distance for these samples is shown in [Figure 7]. The increase
in the number of iterations by forward steps is offset by early termination when
permutations result in fully concordant matrices before i = 1. The distributions
of i at termination for different n are shown in [Figure 8].

Figure 6: The distribution of the average kth reset position in a forward step.

Figure 7: The distribution of the total forward distance.

To reduce the order of growth, we chose the operations that contributed
most to the runtime cost and could be adapted to a parallel environment. The
computation of column sums [3–4a] in FastBCS became the focus for FastBCS2,
and the platform for exploring how to improve the runtime cost of TKTP in a
variety of computing environments. The strategies of implementation will be
described in Section 3.3, but we first provide a brief overview of the changes
that were made to FastBCS.

Figure 8: The distribution of the expected value of i when the algorithm halts.

3.2.2. Analysis of FastBCS2


The FastBCS2 algorithm is shown in [Algorithm 5]. The central idea of
FastBCS2 was to minimize the calculations required to produce the column
sums from the permuted concordance matrix Ci at each stage i. The summing
of all columns of the initial concordance matrix C is done once with a cost of
O(n2 ) before the repeat loop is entered and the results are saved in the array
colsum [2–3].
Once the loop is entered, the colsum values requiring changes are updated
whenever a transposition of observations occurs through the permuted index
pi [10, 13, 15, 24, 27–28] or a forward step is taken [21–23]. How the incre-
mental adjustments upward or downward are made to the column sums [14–15,
21–23, 27–28] can be seen in [Figure 9]. The naturally ordered concordance
matrix C is created and the column sums in [Figure 9(a)] are initialized as
{−2, −2, −2, 0, −2}. After the observation 5 is transposed with observation 1
[10], the values of the row labeled 1 are subtracted from the column sums shown
in [Figure 9(a)] leaving the values of the column sums shown in [Figure 9(b)].
The adjustment [15] is made only 5 times. Only the operation that sums the
columns in a forward step [21–23] can be O(n3 ) in the worst-case but, as we
have seen, forward steps are rarely taken. The others [14–15, 27–28] are O(n2 )
for the RAM computational model.
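A minimal Java rendering of the incremental update in [14–15] (a sketch, assuming colsum holds the column sums of the permuted leading i × i block; indices here are 0-based, unlike the pseudocode):

// After the observation eliminated at stage i is moved to position pi[i],
// its row is subtracted from each remaining column sum: an O(i) adjustment
// in place of an O(i^2) re-summation of the permuted block.
static void dropRow(int[][] c, int[] pi, int[] colsum, int i) {
    for (int j = 0; j < i; j++) {
        colsum[j] -= c[pi[i]][pi[j]];
    }
}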
Refactoring and distributing the column sum operations has two advantages.
First, it allows the postponement of operations until they are needed. Second,
it provides an opportunity for an implementation to use source code fragments
as specifications for parallelism for implementations using compilers that can
translate them for data-parallel computing environments. In the example in
[Figure 9], were n ≤ 1, 000 and the implementation running on an accelerated
processing unit (APU) with 1,000 processing elements (PE), the for loop could
theoretically be done in constant time by assigning each PE_i the subtraction

(a) The initialization of colsums for C:

C        1   2   3   4   5
1        0  −1  −1   1  −1
2       −1   0  −1   1  −1
3       −1  −1   0  −1   1
4        1   1  −1   0  −1
5       −1  −1   1  −1   0
colsums −2  −2  −2   0  −2  ←

(b) The colsums for C5 after transposing pi[5] and pi[1] in stage 5:

C5       5   2   3   4   1
5        0  −1   1  −1  −1
2       −1   0  −1   1  −1
3        1  −1   0  −1  −1
4       −1   1  −1   0   1
1       −1  −1  −1   1   0  ←
colsums −1  −1  −1  −1  −2  ←

Figure 9: An example of how FastBCS2 distributes the column sum calculations. (a) The
initial values for the column sums are calculated before the repeat loop [Line 4]. 25 operations
are performed. (b) The values of the row pi[5] = 1 are subtracted from the colsums of C to
produce the colsums of C5 [Lines 14-15]. 5 operations are performed.

operation and the ith elements of the column sum array of C5 and the transposed
row pi[5] = 1. In practice, it is not straightforward. The next section discusses
why.

3.3. Efficiency and Implementation Strategies


Three implementations of the FastBCS and FastBCS2 algorithms were de-
veloped. The two of FastBCS in R (R Core Team (2015)) and Java were used
to explore and cross-validate the algorithm’s implementations. FastBCS2, the
third implementation, was an optimization of FastBCS implemented in Java
and parameterized to execute critical sections in either sequential or sequential-
parallel mode. This subsection provides a brief history of the implementations
of the algorithms, the opportunities for optimization provided by parallelism,
our methods and results, and a discussion of the results.

3.3.1. Early Implementations


The first reference implementation of the FastBCS algorithm was done in the
R programming language to explore the tau-path algorithm. Optimization was
not a goal. However, the need to screen larger datasets required more efficient
implementations. Although the R environment is mostly implemented in C, type
safety checking imposes a performance penalty when data is passed between R
and C. To avoid this, the second implementation of FastBCS was written entirely
in C. Using variables of size n = 512, our benchmarks showed a nearly 30x
improvement in runtime performance over the R implementation. However, we
found the C implementation unable to handle large toxicity microarray datasets
and difficult to profile. To address these limitations and to begin exploring the
parallelism made possible by the introduction of Java 8, we created a faithful
reproduction in Java of the original R implementation which became the baseline
for runtime performance analysis.

3.3.2. The Promise of Parallelism


There are four parallel architecture categories in Flynn’s taxonomy (Rauber
and Rünger (2013), p. 11): Single-Instruction, Single-Data (SISD); Multiple-
Instruction, Multiple-Data (MISD); Single-Instruction, Multiple-Data (SIMD);
and Multiple-Instruction, Multiple-Data (MIMD). Of these, the SIMD (many-
core) and MIMD (multicore) architectures were of interest. In SIMD architec-
tures, all processing elements execute the same instruction synchronously at
each step, but each processing element has access to private data memory. In
MIMD architectures, each processing element has private access to program
and data memory, and each element works independently. For applications
with a high degree of data parallelism, the SIMD approach can be very efficient
and—excluding the broader context of integrating with software applications—
simpler to program than MIMD computers. This is because a single program
flow controls execution groups of processing elements, eliminating the need for
synchronization at the program level (Rauber and Rünger (2013), p. 11). Recent
advances in hardware and software created opportunities to reduce the order
of growth of TKTP by integrating algorithm design with implementations for
data-parallel computational models.
Graphics Processing Units (GPUs) now contain thousands of processing el-
ements. Where GPUs were once devoted to rendering pipelines for realtime
graphics applications, their value to other data-intensive applications is now
recognized. Heterogeneous computing environments are emerging which com-
bine aspects of MIMD and SIMD architectures with a unified memory model.
A unified memory eliminates the need to transfer large blocks of data between
the hierarchies of memory devoted to either CPU or GPU.
While algorithms often cannot be easily parallelized, many contain islands
of computation that would be amenable to parallel execution with the right
kind of programming language and system software support. Three things are
required. First is a parallel programming model that provides an abstraction of
a heterogeneous computing environment to insulate the software developer from
the low-level software and hardware differences of each device. Second, the pro-
gramming language used for the implementation must allow sequential and par-
allel fragments to be expressed uniformly and in a declarative form that allows
late-binding decisions to be made by the virtual machine as to which available
compute device would most efficiently run various elements of the computation.
Third, because software engineering tools are critical for algorithm design and
profiling of the implementation, standards are needed in order for these tools
to be produced. Software technology and standards have advanced far enough
to make it possible to explore implementing algorithms as sequential-parallel
high-level programs as we have done with FastBCS2* implementations.

3.3.3. Methods and Contexts
The OpenCL standard (OpenCL) computing model views a system as a
collection of compute devices such as CPUs or GPUs, each of which may contain
multiple processing elements. OpenCL 1.0, based on the C99 specification, was
a language for describing program fragments called kernels that execute on
compute devices. Although we considered C++/OpenCL 1.0 for specifying
kernels, we chose Java. In Java version 8, Oracle introduced parallel streams,
lambda expressions, and language extensions that provide a declarative style
of programming for parallelism. An experimental Java library, Aparapi (Frost
(2015)), was also being developed at AMD to provide support for GPU paral-
lelization. OpenCL, Java 8, and Aparapi became the foundation for our initial
work on sequential-parallel algorithms.
The first attempt to parallelize FastBCS was based on the Aparapi library.
The sequential-parallel implementation was specified in Java. At runtime in the
Java virtual machine (JVM), the GPU-parallel fragments, compiled into Java
bytecode, were converted by Aparapi into OpenCL and dispatched to the
computing system's GPU device. For the software developer, this eliminated
the need to write and debug the OpenCL code that manages the dispatching
of work items into queues. Our initial attempts at speedup were disappointing.
Because of updates to the critical data structures for each stage of FastBCS2,
the cost of data transfers and Aparapi’s overhead were significant and offset
the gains from GPU computation. We refocused our efforts on what could be
achieved using the extensions to Java 8 in multicore systems.
The gains in FastBCS2 came from redesigning FastBCS: the initial calculation
of the column sums was moved outside the scope of the repeat loop, and only
the updates are distributed across the algorithm. This redesign reduces the
order of growth in the average case from O(n^3) to O(n^2), as evidenced in
[Table 5] of Subsection 3.3.4, and it also made possible the use of the parallel
streams framework in Java 8. The FastBCS2
implementation is parameterized to execute either sequentially or with parallel
fragments on multicore CPUs. In sequential-parallel mode, the parallel streams
are executed. The creation and initialization of the concordance matrix, column
summation, and column sum updates attempt to run on as many cores as are
available.
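
As a rough sketch of how such a fragment looks in this declarative style (the
class and method names and the int[][] layout of the concordance matrix are
illustrative assumptions, not the actual FastBCS2 source):

    import java.util.stream.IntStream;

    // Sketch: column sums of an n x n concordance matrix, distributed
    // across available cores with a Java 8 parallel stream.
    public final class ColumnSums {
        static long[] columnSums(int[][] c) {
            int n = c.length;
            long[] sums = new long[n];
            // Columns are independent, so each can be summed on its own core;
            // writes go to distinct indices, so no synchronization is needed.
            IntStream.range(0, n).parallel().forEach(j -> {
                long s = 0;
                for (int i = 0; i < n; i++) s += c[i][j];
                sums[j] = s;
            });
            return sums;
        }
    }

The parallel() call is the late-binding step: the same pipeline runs sequentially
if the stream is left unparallelized, which is how the "s" and "sp" execution
modes below can share one implementation.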
To separate the effects of algorithm design, programming language, com-
putational platform, and parallelism on runtime performance, the data were
collected in six contexts on three platforms as described in [Table 4]. A context
consisted of an implementation of either the FastBCS or FastBCS2 algorithm
(FB or FB2); the programming language used to author the implementation
and the runtime environment or virtual machine in which it was executed (R
or Java); the hardware and operating system (P1, P2, or P3); and an execution
mode as either sequential (s) or sequential-parallel (sp).

3.3.4. Results
Three performance metrics were generated for each context: runtime, speedup,
and the doubling ratio. The runtime measures the total amount of work
performed by all processors.

Context     Algorithm  Lang.  SDK      OS     Mode  Device    Cores  Mem.
FB-RP1s     FastBCS    R      RStudio  OSX    s     MP-I7     4      16GB
FB-JP2s     FastBCS    Java   Java 8   OSX    s     MP-I7     4      16GB
FB2-JP2s    FastBCS2   Java   Java 8   OSX    s     MP-I7     4      16GB
FB2-JP3s    FastBCS2   Java   Java 8   Linux  s     EC2-Xeon  16     32GB
FB2-JP2sp   FastBCS2   Java   Java 8   OSX    sp    MP-I7     4      16GB
FB2-JP3sp   FastBCS2   Java   Java 8   Linux  sp    EC2-Xeon  16     32GB

Table 4: The six execution contexts. Context: The contexts are encoded as algorithm (FB
or FB2), programming language and virtual machine (R or Java), the computing platform
(P1, P2, or P3), and the execution mode ("s" or "sp"). Algorithm: The algorithm being
implemented. Lang.: The statistical language R, or Java. SDK: The Software Development
Kit used to load, or compile and run, the implementations: RStudio 0.99.441 for R and the
Oracle Java Development Kit 1.8.0 b25-17, respectively. OS: The operating system was either
OSX 10.11 or Linux Ubuntu 14.04.2 LTS. Mode: The execution mode indicates whether the
implementation was run sequentially (s) or as sequential-parallel (sp). Device: The Apple
computer was a MacBook Pro with a 2.2 GHz Intel i7 (MP-I7); the larger multicore system
was an Amazon EC2 c4.8xlarge instance with a 2.9 GHz Intel Xeon E5-2666v3 (EC2-Xeon).
Cores: The number of cores in the microprocessor. Mem.: The size (GB) of random access
memory available to the CPUs.

The input for each implementation was a pair of
variables X and Y from a uniform distribution with input sizes that doubled
from n = 250 to 32,000. The execution of an implementation of the FastBCS*
algorithm was performed five times for each pair and the average time in mil-
liseconds recorded. No other applications were running on the platform during
the test. The results are given in [Table 5] and quadratic fits to average runtimes
depicted in [Figure 10], with relative performance given in [Table 6]. Due to
memory and time constraints, we were unable to complete the runs of the R
implementation of FastBCS (FB-RP1s) for n > 4,000.

Vector length (n)
Context 250 500 1,000 2,000 4,000 8,000 16,000 32,000
FB-RP1s 441 3,358 42,189 341,974 1,325,750 * * *
FB-JP2s 17 82 575 4,685 34,850 303,481 2,368,399 19,709,329
FB2-JP2s 5 10 58 214 916 3,876 16,867 82,237
FB2-JP3s 3 10 33 129 579 2,922 11,586 48,431
FB2-JP2sp 31 33 65 171 564 2,083 7,929 36,506
FB2-JP3sp 38 38 83 212 560 2,167 6,733 20,636

Table 5: Runtimes of implementations of FastBCS* for pairs of variables X and Y of size
n = 250 to 32,000, doubling at each step, generated from a uniform distribution. Each cell
is the average runtime in milliseconds of five executions. Cells marked with an asterisk were
not completed because of memory and time constraints on the platform used for RStudio.

The speedups shown in [Table 7] were derived from the runtime values.
Speedup is defined as S(n) = T ∗ (n)/T (n), where T ∗ (n) is the execution time of
a reference context and T (n) the execution time of the context being compared
to it. The use of a sequential implementation as a reference is especially helpful
in determining the benefit of parallelism (Rauber and Rünger (2013), p. 162).
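For example, at n = 32,000 the speedup of FB2-JP2s relative to FB-JP2s is,
from the runtimes in [Table 5], 19,709,329/82,237 ≈ 239.7, the value reported
in [Table 6].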

Figure 10: Quadratic fits to average runtimes of FastBCS2.

The Java implementation of FastBCS (FB-JP2s) was used as a reference for
comparing speedup across all contexts. Three additional tables using different
references for each comparison are shown in [Table 7] to illustrate the effect on
speedup by the redesign of the algorithm [Table 7(a)], the number of cores in
multicore systems [Table 7(b)], and the benefits of the Java 8 parallel stream
framework [Table 7(c)].
The doubling ratio was generated using an implementation based on the
algorithm shown in [Algorithm 6]. It provides an empirical measurement for
approximating the most common orders of growth. The doubling hypothesis is
that as the ratio T(2n)/T(n) approaches a limit of 2^b, the running time is
approximately a·n^b, and O(n^b) is an acceptable approximation when the ratio
is used to predict performance (Sedgewick and Wayne (2011), p. 193). The
results of the doubling ratio across all platforms are shown in [Table 8].
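For example, for FB-JP2s the ratio from n = 16,000 to n = 32,000 in [Table 5]
is 19,709,329/2,368,399 ≈ 8.3, and lg(8.3) ≈ 3.1, the value reported in
[Table 8], consistent with cubic growth of the original FastBCS.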

3.3.5. Discussion
The first two rows of [Tables 5 and 6] show dramatic improvements in
speedup at large n (n = 32,000) from just two changes: 1) the use of a static
language and its development and runtime environments (Java, SDK, and JVM)
in place of a dynamic language (R and RStudio), and 2) minor changes to the
algorithm's design based on a frequency analysis and an empirical study of the
algorithm's behavior under randomization of the inputs. Comparing the
doubling ratios of FB-RP1s with FB-JP2s, and of FB-JP2s with FB2-JP2s, at
n = 2,000 and 4,000 shows that most of the improvement in the order of growth
comes from the algorithm's redesign.

Vector length (n)
Context 250 500 1,000 2,000 4,000 8,000 16,000 32,000
FB-RP1s 0.038 0.024 0.013 0.012 0.009 * * *
FB-JP2s 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP2s 3.2 8.2 9.9 21.9 38.0 78.3 140.4 239.7
FB2-JP3s 5.3 7.9 17.4 36.2 60.2 103.9 204.4 407.0
FB2-JP2sp 0.5 2.5 8.9 27.4 61.8 145.7 298.7 539.9
FB2-JP3sp 0.4 2.1 6.9 22.1 62.3 140.1 351.8 955.1
Table 6: Speedup comparing all contexts using FB-JP2s as the reference implementation.
Cells marked with an asterisk could not be computed because the runtime values were not
generated.

Vector length (n)
Context 250 500 1,000 2,000 4,000 8,000 16,000 32,000
(a) FB-JP2s 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP2s 3.2 8.2 9.9 21.9 38.0 78.3 140.4 239.7
(b) FB2-JP2s 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP3s 1.6 1.0 1.8 1.7 1.6 1.3 1.5 1.7
FB2-JP2sp 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP3sp 0.8 0.9 0.8 0.8 1.0 1.0 1.2 1.8
(c) FB2-JP2s 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP2sp 0.2 0.3 0.9 1.3 1.6 1.9 2.1 2.3
FB2-JP3s 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
FB2-JP3sp 0.1 0.3 0.4 0.6 1.0 1.3 1.7 2.3

Table 7: A table of speedup comparisons using different references to show: (a) the effect
on speedup of the redesigned algorithm FastBCS2 compared with FastBCS; (b) the effect on
speedup of a 16-core system as compared with a 4-core system for sequential or sequential-
parallel execution as well as factors such as processor clock speeds, hardware architecture,
and cache sizes and speeds; and (c) the effect on speedup of sequential execution as compared
with sequential-parallel execution on the same platform for two different platforms.

Further improvements come from parallelization through the use of the Java
8 parallel streams for larger n. We see this by comparing FB2-JP2s with FB2-
JP2sp, and FB2-JP3s with FB2-JP3sp. In [Tables 7(c) and 8], speedup and the
order of growth begin to improve on the 4-core processor at n = 2,000, and
on the 16-core processor at n = 8,000. More powerful hardware improves
the speedup of a sequential implementation (rows FB2-JP2s and FB2-JP3s of
[Table 7(b)]) but has no apparent effect on the order of growth (rows FB2-JP2s
and FB2-JP3s for n ≥ 2,000 in [Table 8]).

Algorithm 6 CalculateDoublingRatios(nA, nB)
Input: nA and nB are the lower and upper limits of n.
1: ITERATIONS ← 5
2: i ← 1
3: n ← nA
4: fnc ← a function that invokes an implementation of FastBCS*
5: T_i ← timeFastBCS(fnc, n, ITERATIONS)
6: while n ≤ nB do
7:     i ← i + 1
8:     n ← 2n
9:     T_i ← timeFastBCS(fnc, n, ITERATIONS)
10:    ratios[i] ← T_i / T_{i-1}
11:    if ratios[i] approaches a limit then
12:        break
13: return ratios

14: function timeFastBCS(fnc, n, ITERATIONS)
15:    t ← 0
16:    for i ← 1 . . . ITERATIONS do
17:        Generate variables X and Y of size n from a uniform distribution.
18:        t ← t + runtime(fnc(X, Y))
19:    return t/ITERATIONS
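
In Java, Algorithm 6 can be realized directly. The sketch below is one reading
of the pseudocode, not the instrumented harness used for [Table 8]; the
functional interface standing in for a FastBCS* implementation and the
millisecond bookkeeping are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.function.BiConsumer;

    // Doubling-ratio harness in the spirit of Algorithm 6: time an
    // implementation at n, 2n, 4n, ... and record successive runtime ratios.
    public final class DoublingRatios {
        static final int ITERATIONS = 5;
        static final Random RNG = new Random();

        // Mean runtime in milliseconds over ITERATIONS uniform inputs of size n.
        static double timeFastBCS(BiConsumer<double[], double[]> fnc, int n) {
            long total = 0;
            for (int i = 0; i < ITERATIONS; i++) {
                double[] x = RNG.doubles(n).toArray(); // uniform X
                double[] y = RNG.doubles(n).toArray(); // uniform Y
                long start = System.nanoTime();
                fnc.accept(x, y);
                total += System.nanoTime() - start;
            }
            return total / (double) ITERATIONS / 1e6;
        }

        // lg of each returned ratio estimates b in the O(n^b) growth model.
        static List<Double> ratios(BiConsumer<double[], double[]> fnc, int nA, int nB) {
            List<Double> out = new ArrayList<>();
            double prev = timeFastBCS(fnc, nA);
            for (int n = 2 * nA; n <= nB; n *= 2) {
                double t = timeFastBCS(fnc, n);
                out.add(t / prev);
                prev = t;
            }
            return out;
        }
    }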

Since the FastBCS* algorithms operate mostly on homogeneous numeric
data, and a significant proportion of execution time in FastBCS* implementa-
tions is spent on the calculation of concordance matrix column sums, manycore
tions is spent on the calculation of concordance matrix column sums, manycore
systems theoretically could further reduce the order of growth by mapping a
computational kernel onto the data of a column or group of columns, allow-
ing parallel calculation of each column’s sum. In our initial work with GPUs,
however, we found that data transfer cost was a limitation with the current gen-
eration of GPUs. When the GPU does not share memory with the host CPU,
data transfers are required to maintain a coherent state between the CPU and
GPU memories. The data transfer costs overwhelm any gain from parallelism.
The next generation of heterogeneous computing environments will eliminate
the need for data transfers through the inclusion of unified memory among de-
vices.
Standards for heterogeneous computing environments are emerging and the
major object-oriented languages such as Java as well as newer languages such as
Dart or Swift have the language features needed to express data-parallelism. Al-
though advancements in heterogeneous computing environments make parallel
programming increasingly attractive for algorithms like TKTP, the tools for de-
veloping sequential-parallel implementation to run heterogeneously have not yet
reached the maturity needed to allow these methods to be more broadly applied.
We expect this to change within the next couple of years.

Vector length (n)
Context 250 500 1,000 2,000 4,000 8,000 16,000 32,000
FB-RP1s NA 2.9 3.7 3.1 3.2 * * *
FB-JP2s NA 2.3 2.8 3.0 2.9 3.1 3.0 3.1
FB2-JP2s NA 0.9 2.5 1.9 2.1 2.1 2.1 2.3
FB2-JP3s NA 1.7 1.7 2.0 2.2 2.3 2.0 2.1
FB2-JP2sp NA 0.1 1.0 1.4 1.7 1.9 1.9 2.2
FB2-JP3sp NA 0.0 1.1 1.4 1.4 2.0 1.6 1.6

Table 8: The binary logarithm of the doubling ratio, lg(T(2n)/T(n)), where T is the
execution time of an implementation of FastBCS* for X and Y from a uniform distribution.
Each cell gives the power b = lg(ratio), which is an acceptable approximation of the order
of growth O(n^b) when used to predict performance via the doubling hypothesis. Cells
marked with an asterisk could not be computed because the runtime values were not
generated.

Until then, implementations of sequential-parallel algorithms written in high-
level languages can be designed and executed on current-generation multicore
CPUs.

4. Power and Accuracy Simulation Study

In this section, simulation studies are used to exemplify the performance of
TKTP in screening for sample points that come from a population in which the
variables are associated. Since the method is based only on the relative rankings
of values observed for each variable, it is invariant to scale transformations of
the variables, and thus it is sufficient to limit the investigation of properties
to copulae, which have uniform margins. Samples are simulated under three
different types of mixing with the independent copula: 1) Frank, 2) Gaussian,
and 3) a positively associated Frank copula and a negatively associated Gaus-
sian copula. The objectives are to identify a suitable measure of performance of
TKTP in screening for subsamples that support monotonic association of the
variables; to check the robustness of this measure under different distributions
with comparable association; and to provide guidelines for setting the opera-
tional values of α and w to meet coverage requirements. Toward these aims,
it is helpful to provide some pertinent background on the Frank and Gaussian
copula families. For a thorough introduction, see Nelsen (2006).
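For reference, the Frank copula with parameter θ ≠ 0 has the standard form

    C_θ(u, v) = −(1/θ) ln[1 + (e^{−θu} − 1)(e^{−θv} − 1)/(e^{−θ} − 1)],

with positive association for θ > 0 and independence in the limit θ → 0
(Nelsen (2006)).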
Although the Frank copulae are often indexed by τ and the Gaussian copulae
indexed by the population version ρ of Spearman’s correlation coefficient, which
is also the product moment correlation of variables X and Y after their indi-
vidual cumulative distribution function (CDF) transforms, [Figure 11] shows
how either parameter may be used to index either family of copulae. In par-
ticular, the Frank copulae with τ = 0.3, 0.5, or 0.7 have corresponding values of
ρ = 0.45, 0.7 or 0.89, respectively.
Each line in [Figure 11] was formed from 100 summary points, with each
point based on a sample of 100,000 points from one of the copulae. For each
sample generated from a Frank copula with a given τ , a Spearman correlation
coefficient ρ was calculated, which essentially equals the population ρ since the
sample size is so large.

Figure 11: Corresponding population values of Kendall’s τ and Spearman’s ρ from Frank and
Gaussian copulae.

Similarly, samples from the Gaussian copulae were gen-
erated with fixed values of ρ and the corresponding population τ was estimated
from these large samples. Note that for any copula the population value of
Spearman’s ρ is the same as the population (Pearson) product moment corre-
lation between the uniformly distributed marginal variables.
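For the Gaussian family these correspondences have standard closed forms (see
Nelsen (2006)): if r denotes the correlation parameter of the Gaussian copula,
then τ = (2/π) arcsin(r) and ρ = (6/π) arcsin(r/2). For example, r ≈ 0.717
gives ρ = 0.70 and τ ≈ 0.51, matching [Figure 11].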
In the figure, the solid black line is based on points generated from Frank
copulae, and the solid red line is based on points generated from Gaussian
copulae. Dashed lines are inversions of the solid ones, and colors indicate either
ρ versus τ (black) or τ versus ρ (red). Although there exist distributions in
which the relationship between population τ and ρ may be quite different from
that given above, this relationship remains quite stable across the Frank and
Gaussian copulae, as seen from the slight difference between solid and dashed
lines of the same color.
Although a Frank and a Gaussian copula may share the same values of τ and
ρ, the distributions themselves are generally quite distinct. Density contours of
the Frank and Gaussian copulae, both having τ = 0.5 and ρ = 0.7, are depicted
in [Figure 12].
Now consider the performance of TKTP when applied first to various mix-
tures of Frank and independent copulae. This first simulation study follows
a 3 × 3 × 2 complete block design, with three levels of sample sizes (n =
100, 500, 1000), three levels of association strength in the subsample (τ = 0.3,
0.5, 0.7) and two levels of mixing proportion of the associated subsample (p =
0.3, 0.4). For each of the 18 combinations of these parameters, 100,000 samples
are generated from the Frank mixture, and the results of the (α = 0.05; w = 3)
TKTP screen recorded.

Figure 12: Density contours of Frank (left) and Gaussian (right) copulae, both with τ = 0.5
and ρ = 0.7. Contours depict density levels of 0.5 to 5 in steps of 0.05.

[Figure 13] summarizes the major finding.
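
As a concrete illustration of how such a sample can be drawn, the sketch
below uses the standard conditional-inversion sampler for the Frank copula
(see Nelsen (2006)); the class name and the choice θ ≈ 5.74, which corresponds
to τ ≈ 0.5, are illustrative assumptions, not the authors' simulation code.

    import java.util.Random;

    // Sketch: n pairs (u, v) from the mixture p·Frank(theta) + (1 − p)·Independence.
    public final class FrankMixture {
        static double[][] sample(int n, double p, double theta, Random rng) {
            double[][] uv = new double[n][2];
            for (int i = 0; i < n; i++) {
                double u = rng.nextDouble();
                double v;
                if (rng.nextDouble() < p) {
                    // Conditional inverse of C(v | u) for the Frank copula (theta != 0).
                    double w = rng.nextDouble();
                    double eU = Math.exp(-theta * u);
                    v = -Math.log1p(w * (1 - Math.exp(-theta))
                            / (w * (eU - 1) - eU)) / theta;
                } else {
                    v = rng.nextDouble(); // independent component
                }
                uv[i][0] = u;
                uv[i][1] = v;
            }
            return uv;
        }
    }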


In [Figure 13], each point represents the paired means from 100,000 simula-
tions. “Rate” refers to the increase in percent coverage with increase in percent
of sample selected. The plot character represents sample size (circle → n = 100;
triangle → n = 500; plus → n = 1,000). Color represents strength of association
(black → τ = 0.3; red → τ = 0.5; green → τ = 0.7). The size of each symbol
represents the proportion of associated subsample in the mix (small → 30% of
observed pairs are associated; large → 40%).
The stopping point is the point at which the algorithm determines that the
predictive power of the remaining sample has dropped to chance levels. Percent
covered is the percent of observations from the Frank copula that appear before
the stopping point and are thus included in the screen.
Increasing sample size from n = 100 to n = 500 has a big effect on both
stopping point and percent coverage, but differences between these variables
at n = 500 and n = 1, 000 are small. The biggest effect of sample size is on
the distributions of these two variables, which are highly skewed left for small
sample sizes, but approach normality for larger sample sizes.
A larger underlying proportion (0.4 versus 0.3) always increases the mean
stopping proportion of the range for any sample size and any τ, but does not
necessarily increase mean coverage probability. The method has little power to
detect associated subsamples smaller than 2√n − 1.758n^(1/6), the expected
length of the longest increasing subsequence under independence (Baik et al.
(1999), Aldous and Diaconis (1999)), and it is also difficult for the algorithm
to distinguish a large associated subpopulation from the whole population. In
the limited range investigated here, the size of the underlying associated
subpopulation does not have a big effect on the algorithm.
On the other hand, strength of association in the subpopulation has a sub-
stantial effect on the performance of TKTP. Stronger association (increasing τ )

Figure 13: Summary of TKTP (α = 0.05, w = 3) simulations under Frank mixtures of copulae.

produces a substantial increase in both mean stopping point and mean coverage
for any fixed sample size, which is captured by the rate of coverage, as given by
the slopes of the lines in [Figure 13]. Formally, under any given copula,

Rate of coverage = Mean(Number of associated values selected) / Mean(Stopping point),    (3)

which clearly does not depend on the sample size. [Table 9] gives the rates
of coverage under Frank and Gaussian copulae that are matched for the same
strength of association in the subsample.

τ      Frank Rate   Gaussian Rate   ρ
0.3    1.19         1.17            0.45
0.5    1.33         1.30            0.70
0.7    1.49         1.46            0.89

Table 9: Rates of coverage [Equation (3)] under Frank and Gaussian copulae.

The simulations compare coverage rates for TKTP for α in {0.01, 0.05, 0.10}
to see how stable the coverage probability is in this range. It is enough to

consider just Frank copulae with n = 500, mixing proportion = 0.3, and τ in
{0.3, 0.5, 0.7}. The standard errors of these rates are all about 0.1, suggesting
little real difference in the TKTP performance; the comparable Gaussian version
of [Figure 13] is virtually indistinguishable from that figure, and is thus omitted.
Nevertheless, the actual coverage is systematically lower in all 18 Gaussian
copulae under study than in their comparable Frank copulae, suggesting a real
but slight underperformance of TKTP in the Gaussian setting.
In all cases, the rate of coverage seems to capture the effectiveness of the
method. From [Table 9], it remains stable under different types of mixtures,
and, from [Figure 13], it is not affected by small changes in the mixing proportion.
The control parameters α and w of the TKTP have different tasks. The
value α changes the distribution of the stopping point, with smaller values of
α inducing stochastically larger distributions of the stopping point. The effect
is much the same as selecting an operating point in a classification method.
The value α = 0.05 gives a rough midrange value, and lowering it slightly will
increase both the stopping point and the coverage probability, but lowering it
too far will greatly increase the probability of wrongly including uncorrelated
samples in the selection set. For large sample sizes n ≥ 500, we recommend
α = 0.05 because these quantiles are well estimated in the simulations and the
stopping point does not change much for other values of α near 0.05. The control
parameter w helps to stabilize the performance of TKTP. It is useful mostly for
small sample sizes, such as n = 100, the smallest studied here; we recommend
w = 3, the smallest practical smoothing.

5. Application: Long-Term Predictive Power of Oil for S&P 500 Stocks

As an example of how TKTP may be used in large datasets, the weekly
closing prices of stocks currently listed in the Standard & Poor's 500 index and
having a 10-year history are correlated against the corresponding price of oil
six months prior. By eliminating weeks where the overall pattern is broken,
the TKTP method provides not only a robust estimate of correlation, but also
the set of common time points over which predictions are most reliable for a
key cluster of stocks. For most of the stocks in this cluster, oil is not a direct
cause of the performance of these stocks, but simply a reliable indicator of the
likely course of stock prices, at least during the time periods ascertained by the
TKTP. The goal here is not to predict actual stock prices, but to identify likely
periods of a sustained trend.

Figure 14: Cumulative number of stocks associated with 6-month lagged oil.

The dataset is composed of weekly prices of 26-week lagged S&P 500 stocks
versus oil over a decade from 12/31/2004 to 1/2/2015 (523 time points, 449
stocks with 10 years of data). The initial screen follows a two-step selection
process:
Step 1. Find stocks with partial association over the most weeks. There
were 523 − 26 = 497 time point pairs (oil time, stock time) available. The first
screen was to select stocks associated over at least 60% of these time point pairs.
There were 273 such stocks, correlated with 6-month oil over periods of at least
6 years; see [Figure 14], which suggests this cutoff.
Step 2. Find the correlations (both Pearson and Kendall) between oil and
lagged stock prices over the TKTP-selected time point pairs. There are 77 of
these with Pearson correlations over 0.9, listed in [Table 10].
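
Both steps rest on a simple alignment: oil at week t is paired with each stock
at week t + 26, and the correlation is computed over the TKTP-selected time
points. A sketch of this computation follows; the array names and the int[]
of selected indices are illustrative, with the indices assumed to come from the
TKTP screen.

    // Sketch: Pearson correlation of oil against a 26-week-lagged stock series
    // over a selected set of time points t (oil[t] paired with stock[t + lag]).
    public final class LaggedCorrelation {
        static double pearson(double[] oil, double[] stock, int[] selected, int lag) {
            int m = selected.length;
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int t : selected) {
                double x = oil[t], y = stock[t + lag];
                sx += x; sy += y;
                sxx += x * x; syy += y * y; sxy += x * y;
            }
            double cov = sxy - sx * sy / m;    // m times the covariance
            double vx = sxx - sx * sx / m;     // m times the variance of x
            double vy = syy - sy * sy / m;     // m times the variance of y
            return cov / Math.sqrt(vx * vy);   // scale factors cancel
        }
    }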
[Figure 15] depicts the full 10-year time courses of the weekly price of the
five stocks with the largest Pearson correlations in [Table 10] relative to the
6-month prior price of oil. These five stocks come from different industries:
PRGO (pharmaceuticals), CRM (software), ARG (chemicals), CTSH (IT services),
and SRCL (commercial services). This suggests that oil price predicts, at least in
pattern, an aspect of the general economy. Although the huge crash of oil in
2008 was disproportionate, it did predict a downturn in each of the stocks six
months later.

FMC COL TJX CNP BEN XOM SIAL PH NBL XRAY EMC
0.901 0.902 0.902 0.902 0.902 0.902 0.902 0.902 0.903 0.903 0.905
ISRG APH PCAR DHR RRC FTI FAST ROK DRI EL RAI
0.905 0.905 0.906 0.906 0.907 0.907 0.907 0.908 0.908 0.908 0.908
PWR SPG PCLN LH CL AMZN HCP UTX ROST BLK IBM
0.909 0.909 0.910 0.910 0.910 0.911 0.911 0.912 0.913 0.913 0.913
EW SO GWW KO PCP GIS ALTR BBBY CERN INTU DTV
0.914 0.914 0.916 0.916 0.916 0.916 0.916 0.917 0.917 0.917 0.917
FLS MCD XEL WAT CTXS SWK WEC ES AZO CSX CMI
0.918 0.918 0.918 0.918 0.918 0.919 0.919 0.920 0.920 0.921 0.922
FDO PSA DLTR ORCL YUM FFIV RL ACN ED O EMR
0.922 0.922 0.923 0.923 0.924 0.924 0.924 0.926 0.926 0.926 0.928
VTR PX ESRX AAPL RHT VAR SRCL CTSH ARG CRM PRGO
0.928 0.928 0.930 0.930 0.932 0.932 0.933 0.934 0.935 0.936 0.942

Table 10: Pearson correlations for 77 S&P 500 stocks correlated with 6-month prior oil price
over different restricted periods of at least 6-year duration.

More important than general association, the TKTP method helps to iden-
tify the time periods over which association is strongest. Pooling information
from several stocks strengthens the degree of confidence about the time periods
selected, but the pooling needs to be done methodically. The idea is to pool
information from stocks that agree strongly with each other.
A complete-linkage clustering approach is used to find the stocks from which
to pool time-point inclusions. The agreement measure used for pairs of time-
point sets is the Jaccard/Tanimoto coefficient J, defined for sets A and B as

J = |A ∩ B| / |A ∪ B|.

A cluster of 24 of the 273 Step 1-screened stocks has all pairs with J > 0.8.
These 24 stocks, listed in [Table 11], include three of the five stocks from [Figure
15], and 13 of the stocks from [Table 10].
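
A sketch of the agreement computation, with each stock's TKTP-selected
weeks encoded as the set bits of a BitSet (an illustrative representation; any
set type works):

    import java.util.BitSet;

    // Sketch: Jaccard/Tanimoto coefficient between two sets of selected weeks.
    public final class Jaccard {
        static double jaccard(BitSet a, BitSet b) {
            BitSet inter = (BitSet) a.clone();
            inter.and(b);                      // A ∩ B
            BitSet union = (BitSet) a.clone();
            union.or(b);                       // A ∪ B
            return union.isEmpty() ? 1.0
                    : (double) inter.cardinality() / union.cardinality();
        }
    }

Complete linkage at J > 0.8 then requires every pair of stocks within a cluster
to exceed the threshold.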

CAN   ARG   ABC   APH   AZO   BLL
HSIC  DTV   ECL   EL    FIS   FISV
GIS   HIL   REGN  ROP   CRM   SIAL
SRCL  TJX   UNP   UHS   WAT   WEC

Table 11: Cluster of stocks completely linked by J > 0.8.

Figure 15: Five stock prices associated with 6-month lagged oil: 2005-2014.

Averaged over the 24 stocks, the TKTP correlations with 6-month oil were 0.89
(Pearson) and 0.75 (Kendall). [Figure 16] illustrates, for each week, the number of stocks
having that week included by TKTP. The color coding follows the rainbow with
red indicating inclusion in all 24 stocks; yellow-green, in 13 stocks; and blue in
0 stocks. When plotted under the price of oil, it appears that weeks tend to be
excluded during the most volatile periods for oil prices.

6. Discussion
The computational efficiencies attained by the new algorithms enable a whole
new approach to discovering key subpopulations from big data. Many pairs of
variables may be screened to see if there is a common subset of observations that
supports strong association between the pairs, and thus represents a subpopu-
lation supporting a whole network. This could be a subpopulation of cancers
supporting a gene-network for chemoresistance, or a subpopulation of time pe-
riods supporting an economic network for growth.
Although the methods here have been presented in the classical view of ana-
lyzing a random sample from a population, they also apply to any large dataset.

Figure 16: TKTP-selected weeks for oil price prediction of stock cluster prices.

Inference here is internal to the dataset treated as its own population. Identified
subsets correspond to a real subpopulation, but identifying the subpopulation
externally may not be straightforward. For the stock price example, the apho-
rism, “A rising tide lifts all boats,” seems to apply. A foretelling rise of oil
prices, ignoring brief periods of large volatility, signals a rise in tide, at least
in the harbor where the cluster of two dozen stocks is moored. It is not clear
which stocks will remain in the harbor, or what other, more specific, influences
will be in force.
The decision to focus on ranks rather than on the original numerical values
was motivated both by concerns for inferential robustness and insights from
recent mathematics. When moving between the database and external envi-
ronments, calibration issues often arise, but causal relationships persist in situ.
Thus rank associations found within a database subset are very likely to ap-
pear in corresponding external subpopulations. Among rank-based methods,
the multistage ranking model is very flexible since the number of parameters
increases with sample size.
Recent mathematical methods motivate the use of the multistage ranking

method. In the special case where all the population parameters {θ_j} are equal,
yielding the Mallows model, Starr (2009) has found a limiting distribution which,
under proper scaling, is a Frank copula. More general parameterizations of the
multistage model appear to converge to mixtures of Mallows, which covers a very
wide range of ranking distributions. If this is true, the multistage model for a
pair (X, Y ) of variables ranking the n observations will converge to a mixture of
Frank copulae. A “signal” in such a mixture might correspond to a mixture
of components with large values of Kendall's tau, with “noise,” if present, being
represented by a mixture of nearly independent Frank copulae. Thus, in this
conceptual setting, TKTP screens out a nearly independent component. Doing
this on the basis of a single ranking is remarkable; finding additional pairs of
variables with the same limiting distribution, as in the example of Section 5,
adds to the accuracy. See Awasthi et al. (2014) and Meilă and Chen (2012) for
general approaches to mixtures of Mallows models.

References

Aldous, D., Diaconis, P., 1999. Longest increasing subsequences: from patience
sorting to the Baik-Deift-Johansson theorem. Bulletin of the American Math-
ematical Society 36, 413–432.
Awasthi, P., Blum, A., Sheffet, O., Vijayaraghavan, A., 2014. Learning mixtures
of ranking models, in: Advances in Neural Information Processing Systems,
pp. 2609–2617.

Baik, J., Deift, P., Johansson, K., 1999. On the distribution of the length of
the longest increasing subsequence of random permutations. Journal of the
American Mathematical Society 12, 1119–1178.
Bamattre, S., Hu, R., Verducci, J.S., 2015. Nonparametric testing for heteroge-
neous correlation. arXiv preprint arXiv:1504.05392.

Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2009. Introduction to
Algorithms. 3rd ed., The MIT Press.
Fligner, M.A., Verducci, J.S., 1988. Multistage ranking models. Journal of the
American Statistical Association 83, 892–901.

Frost, G., 2015. Aparapi. https://github.com/aparapi/aparapi.


Hall, P., Schimek, M.G., 2012. Moderate-deviation-based inference for random
degeneration in paired rank lists. Journal of the American Statistical Associ-
ation 107, 661–672.

Liiv, I., 2010. Seriation and matrix reordering methods: An historical overview.
Statistical Analysis and Data Mining: The ASA Data Science Journal 3, 70–
91.

Meilă, M., Chen, H., 2012. Dirichlet process mixtures of generalized Mallows
models. arXiv preprint arXiv:1203.3496.
Nelsen, R.B., 2006. An Introduction to Copulas, 2nd ed. Springer Sci-
ence+Business Media, New York.

R Core Team, 2015. R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/.
Rauber, T., Rünger, G., 2013. Parallel programming: For multicore and cluster
systems. Springer Science & Business Media.

Sampath, S., Verducci, J.S., 2013. Detecting the end of agreement between
two long ranked lists. Statistical Analysis and Data Mining: The ASA Data
Science Journal 6, 458–471.
Sedgewick, R., Wayne, K., 2011. Algorithms. 4th ed., Addison-Wesley Profes-
sional.

Starr, S., 2009. Thermodynamic limit for the Mallows model on S_n. Journal of
Mathematical Physics 50, 095208:1–15.
Yu, L., Verducci, J.S., Blower, P.E., 2011. The tau-path test for monotone
association in an unspecified population: Application to chemogenomic data
mining. Statistical Methodology 8, 97–111.
