A Mathematical Programming Approach For Improving The Robustness of LAD Regression

A Mathematical Programming Approach for Improving the
Robustness of LAD Regression

Avi Giloni Bhaskar Sengupta
Sy Syms School of Business Computer Science Department
Room 428 BH Yeshiva University
Yeshiva University
500 W 185 St
New York, NY 10033
E-mail: [email protected]
Jeffrey Simonoff
Leonard N. Stern School of Business
New York University
44 West 4th Street
New York, NY 10012
E-mail: [email protected]
July 23, 2004
Abstract
This paper discusses a novel application of mathematical programming techniques
to a regression problem. While least squares regression techniques have been used for
a long time, it is known that their robustness properties are not desirable. Specifically,
the estimators are known to be too sensitive to data contamination. In this paper we
examine regressions based on Least-sum of Absolute Deviations (LAD) and show that
the robustness of the estimator can be improved significantly through a judicious choice
of weights. The problem of finding optimum weights is formulated as a nonlinear mixed
integer program, which is too difficult to solve exactly in general. We demonstrate
that our problem is equivalent to one similar to the knapsack problem and then solve
it for a special case. We then generalize this solution to general regression designs.
Furthermore, we provide an efficient algorithm to solve the general non-linear, mixed
integer programming problem when the number of predictors is small. We show the
efficacy of the weighted LAD estimator using numerical examples.
Keywords: Algorithms; Breakdown point; Knapsack problem; Nonlinear mixed integer
programming; Robust regression
1
1 Introduction
Consider the well-known statistical linear regression problem. We have n observations on

some “dependent” variable y and some number p ≥ 1 of “independent” variables x1 , . . ., xp,
for each one of which we know n values as well. We denote
   1   1 
y1 x1 · · · x1p x
 ·   ·  
·   · 
   
y=  · , X =  ·
  ·  
= ·
 = (x1, . . . , xp) ,

 ·   · ·   · 
yn x1 · · · xnp
n
xn
where y ∈ Rn is a vector of n observations, x1 , . . ., xp are column vectors with n components,

and x1, . . . , xn are row vectors with p components corresponding to the columns and rows
of the n × p matrix X, respectively. To rule out pathologies we assume throughout that the
rank r(X) of X is full, i.e., that r(X) = p. If the regression model includes an intercept
term, as is typical, x1 is a column of ones, but this is not required.
The statistical linear regression model is
y = Xβ + ε, (1)
where β T = (β1, . . . , βp) is the vector of parameters of the linear model and εT = (ε1 , . . ., εn )
is a vector of n random variables corresponding to the error terms in the asserted relation-
ship. The superscript T denotes “transposition” of a vector or matrix throughout this work.
In the statistical model, the dependent variable y is a random variable for which we obtain
measurements or observations that contain some “noise” or measurement errors that are
captured in the error terms ε.
Although (1) gives the statistical model underlying the regression problem, the numerical
problem faced is slightly different. For this, we write
y = Xβ + r (2)
where given some parameter vector β, the components ri of the vector rT = (r1 , . . ., rn ) are
the residuals that result, given the observations y, a fixed design matrix X, and the chosen
vector β ∈ Rp . It is well-known that when the errors ε are normally (Gaussian) distributed,
2
Pn
the least squares parameter estimator (which minimizes the `2-norm ky−Xβk2 = i=1 (yi −
xi β)2 of the residuals) has many desirable properties, having the minimum variance among
all linear unbiased estimators, and (being the maximum likelihood estimator) achieving the
minimum possible variance for all consistent estimators as the sample size becomes infinite.
Many other regression estimators, in addition to least squares, have been proposed in the
statistical literature. These techniques have been introduced to improve upon least squares
in some way. Among these techniques are those that are robust with respect to outliers,
as it is known that least squares regression estimates are affected by wild observations.
There have been several measures developed within the statistical literature that quantify
the robustness of a regression estimator. In this paper, we focus on the breakdown point
(c.f. [12]) to be formally defined in Section 2.
One of the earliest proposals for estimating regression parameters was regression per-
formed using the `1 -norm, also called Least-sum of Absolute Deviations (LAD). This regres-
sion problem can be solved using linear programming, hence its interest in the operations
research community. LAD regression has become more useful with the advent of interior
point methods for solving linear programs and with the increase in computer processing
speed (Portnoy and Koenker [11]). Furthermore, it is known that LAD regression is more
robust than least squares (c.f. [3]). As far back as the 1960s and 1970s, it was noticed that
empirically, LAD outperformed least squares in the presence of fat tailed data (c.f. [13]).
However, it is only more recently that the robustness properties of LAD regression have
been theoretically determined. For regression problems where there may be outliers in the
dependent variable, LAD regression is a good alternative to least squares, and we show that
judicious choice of weights can improve its robustness properties. Furthermore, LAD can
be utilized to demonstrate that least squares is accurate when indeed it is (if the LAD and
least squares estimates are similar); this can be useful, since the least squares estimator is
more efficient than LAD in the presence of Gaussian errors.
In this paper, we study a mathematical program (nonlinear mixed integer program)
that can be used to improve the robustness of LAD regression. We demonstrate that the
introduction of nonuniform weights can have a positive impact on the robustness properties
3
of LAD regression. We develop an algorithm for determining these weights and demonstrate
the usefulness of our approach through several numerical examples. Specifically, we develop
an algorithm for choosing weights that can significantly improve the robustness properties
of LAD regression. In order to study the weighted LAD regression problem, we use and
apply linear and mixed integer programming techniques. Our studies indicate that weighted
LAD regression should be seriously considered as a regression technique in many regression
and forecasting contexts.
The structure of the paper is as follows. In Section 2, we introduce the LAD regression
problem, summarize some of the pertinent research on LAD regression and its robustness
properties. We show (in Section 3) that the problem of incorporating nonuniform weights
can be formulated as a nonlinear mixed integer program. In Section 4, we demonstrate that
this problem is equivalent to a problem related to the knapsack problem. In Section 5, we
discuss a special case of the weight determination problem for which an optimal solution can
be obtained. Using the insights gained in Sections 3-5, we develop an algorithm (in Section
6) to solve the problem approximately, and demonstrate that the algorithm significantly
improves the robustness of the estimators through several numerical examples in Section 7.
2 LAD Regression and Breakdown
In the case of LAD regression, the general numerical problem (2) takes as the (optimal)
P
parameters β ∈ Rp those that minimize the `1 -norm ky − Xβk1 = ni=1 |yi − xiβ| of the
residuals. It is well-known that this problem can be formulated as the linear programming
(LP) problem
min eTn r+ + eTn r− (3)
such that Xβ + r+ − r− = y
β free, r+ ≥ 0, r− ≥ 0,
where en is the vector with all n components equal to one. In (3) the residuals r of the
general form (2) are simply replaced by a difference r+ − r− of nonnegative variables,
i.e., we require that r+ ≥ 0 and r− ≥ 0, whereas the parameters β ∈ Rp are “free”
4
to assume positive, zero, or negative values. From the properties of linear programming
solution procedures, it follows that for any solution inspected by the simplex algorithm,
either ri+ > 0 or ri− > 0, but not both, thus giving |ri | in the objective function depending
on whether ri > 0 or ri < 0 for any i ∈ N where N = {1, . . ., n}. Every optimal extreme
point solution β ∗ ∈ Rp of the LAD regression problem has the property that there exists a
nonsingular p × p submatrix XB of X such that
β∗ = X−1 B + ∗ − ∗
B y , r = max {0, y − Xβ } , r = − min {0, y − Xβ }
where |B| = p and yB is the subvector of y corresponding to the rows of XB ( c.f. [3]).
The notion of the breakdown point of a regression estimator (due to Hampel [6]) can
be found in [12] and is as follows. Suppose we estimate the regression parameters β by
some technique τ from some data (X, y), yielding the estimate β τ . If we contaminate m
(1 ≤ m < n) rows of the data in a way so that row i is replaced by some arbitrary data

e i , yei , we obtain some new data X,
x e ye . The same technique τ applied to X, e y
e yields

estimates β τ X, e y
e that are different from the original ones. We can use any norm k · k on

Rp to measure the distance β τ X, e y
e − βτ of the respective estimates. If we vary over
all possible choices of contamination then this distance either stays bounded or not. Let

e y
b (m, τ, X, y) = sup βτ X, e − βτ
e y
X,e

be the maximum bias that results when we replace at most m of the original data xi , yi
by arbitrary new ones. The breakdown point of τ is
nm o
α (τ, X, y) = min : b (m, τ, X, y) is infinite ,
1≤m<n n
i.e., we are looking for the minimum number of rows of (X, y) that if replaced by arbitrary
1
new data make the regression technique τ break down. We divide this by n to get n ≤
α (τ, X, y) ≤ 1. In practice, α (τ, X, y) ≤ .5, since otherwise it is impossible to distinguish
between the uncontaminated data and the contaminated data. The breakdown point of
1
LAD as well as least squares regression is n or asymptotically 0, see e.g., [12]. Clearly, the
larger the breakdown point, the more robust is the regression estimator.
5
However, LAD regression is more robust than least squares in the following manner.
The finite sample breakdown point of the LAD regression estimator is the breakdown point
of LAD regression with a fixed design matrix X and contamination restricted only to the
dependent variable y, denoted by α (τ, y|X). The finite sample breakdown point, or con-
ditional breakdown point, was introduced by Donoho and Huber ([1]). The finite sample
breakdown point has been studied by many authors; see, e.g., [2], [7], [4], and [8]. Ellis and
Morgenthaler ([2]) appear to be the first to mention that the introduction of weights can
improve the finite sample breakdown point of LAD regression, but they only show this for
very small data sets. Mizera and Müller ([9]) examine this question in more detail, showing
that the predictors X can be chosen to increase the breakdown point of LAD.
In this paper, we use the notation and framework set forth by Giloni and Padberg ([4]).
Let N = {1, . . ., n} and let U, L, Z be a mutually exclusive three-way partition of N such
that |U ∪ L| = q. Let e be a column vector of ones and let XZ be the submatrix of X
whose row indexes are in Z. Similarly, we define XU and XL . The subscripts of U and
L on e denote a vector of ones of appropriate dimension. Giloni and Padberg ([4]) define
the notion of q-stability of a design matrix as follows. X is q-stable if q ≥ 0 is the largest
integer such that
XZ ξ + η + − η − = 0 , (−eTU XU + eTL XL)ξ + eTZ (η + + η − ) ≤ 0 (4)
ξ 6= 0 , η+ ≥ 0 , η− ≥ 0 (5)
is not solvable for any U, L, Z. They prove that a design matrix X is q-stable if and
q+1
only if α (`1 , y|X) = n . This is in direct contrast to least squares regression where
1
its finite sample breakdown point is n or asymptotically 0. They show that the finite
sample breakdown point of LAD regression can be calculated by the following mixed integer
program M IP 1 (where M > 0 is a sufficiently large constant and ε > 0 is a sufficiently
small constant):
6
n
X
M IP 1 min ui + ` i = q + 1
i=1
such that xi ξ + ηi+ − ηi− + si − ti = 0 for i = 1, . . ., n
si − M ui ≤ 0, ti − M ì ≤ 0 for i = 1, . . . , n
ηi+ + ηi− + M ui + M ì ≤ M for i = 1, . . ., n
ui + ì ≤ 1 for i = 1, . . ., n
Xn
ηi+ + ηi− − si − ti ≤ 0,
i=1
n
X
si + t i ≥ ε
i=1
ξ free, η+ ≥ 0, η − ≥ 0, s ≥ 0, t ≥ 0, ui , ì ∈ {0, 1} for i = 1, . . . , n.
In our case, we are interested in weighted LAD regression. The weighted LAD regression
problem also can be formulated as a linear program, as follows.
n
X
min wi ri+ + ri− (6)
i=1
such that Xβ + r+ − r− = y
β free, r+ ≥ 0, r− ≥ 0.
Here, we assume that the residual associated with observation i is multiplied by some weight,
wi , where we assume that 0 < wi ≤ 1 without restriction of generality. We note that if we
were to set wi = 0, we would essentially “remove” observation i from the data. We do not
permit this, although, if some (optimal) weight is sufficiently near 0, the user can choose to
remove the observation from the data set.

We now note that since ri+ − ri− = yi − xi β , and by the simplex method either

ri+ > 0 or ri− > 0 but not both, then |wi ri+ − ri− | = wi ri+ + ri− . Therefore, if we were

to transform our data by setting x b i β = wi ri+ − ri− . In
b i, ybi = wi xi, yi , then ybi − x
7
such a case, the linear program (6) can be reformulated as
min eTn r+ + eTn r−
such that wi xiβ + r+ − r− = wiyi for i = 1, . . ., n
β free, r+ ≥ 0, r− ≥ 0.
This shows that weighted LAD regression can be treated as LAD regression with suitably
transformed data. Therefore, the problem of determining the breakdown of weighted LAD
regression with known weights corresponds to determining the breakdown of LAD regression
b y
with data (X, b ).
The equivalence of weighted LAD regression and ordinary LAD regression on trans-
formed data is also useful in other contexts. For example, LAD regression can be adapted
to the nonparametric estimation of smooth regression curves (see [14] for a general discus-
sion of nonparametric regression estimation), by local fitting of weighted LAD regressions
([15]; [16]). Formulations of the type in this section allowed Giloni and Simonoff ([5]) to
derive the exact breakdown properties of local LAD regression at any evaluation point, and
to make recommendations concerning the best choice of weight function.
In the next section, we formulate the problem of determining the set of weights that
maximizes the breakdown point of weighted LAD regression.
3 Problem Formulation
The task of determining the weights that maximize the finite sample breakdown point of
LAD regression is a complicated one. If one were to try to solve this problem by brute
force, it would require the inspection of all or a large subset of all vectors (if this is possible)

w ∈ Rn , the transformation of the data described above x b i, ybi = wi xi , yi , and then
the solution of M IP 1 for each case. Instead, we formulate this problem as a nonlinear
mixed integer program. For the solution methods we discuss below, we make the standard
assumption that the design matrix X is in general position, i.e., every p × p submatrix
of X has full rank. We do this in order to simplify the analysis that follows. However,
we note that even in the case where this assumption is violated, the methodology that we
8
develop is still valid. This is quite different from many so-called high breakdown regression
techniques (ones where the breakdown point can be as high as .5), where the value of the
breakdown point of the estimators depends upon whether or not the design matrix is in
general position; see [12], p. 118. The mixed integer program is
n
X
N LM IP max min (ui + ì )
w
i=1
such that wi xi ξ + ηi+ − ηi− + si − ti = 0 for i = 1, . . . , n (7)
si − M ui ≤ 0, ti − M ì ≤ 0 for i = 1, . . . , n (8)
ηi+ + ηi− + M ui + M ì ≤ M for i = 1, . . ., n (9)
ui + ì ≤ 1 for i = 1, . . ., n (10)
Xn
ηi+ + ηi− − si − ti ≤ 0 (11)
i=1
Xn
si + t i ≥ ε (12)
i=1
wi ≤ 1 for i = 1, . . ., n (13)
ξ free, η+ ≥ 0, η − ≥ 0, s ≥ 0, t ≥ 0, ui , ì ∈ {0, 1} for i = 1, . . ., n. (14)
Note that ξj is unrestricted for j = 1, . . ., p. Also note that N LM IP is a nonlinear problem

because of the first term in the left hand side of (7). Since nonlinear mixed integer programs
are extremely difficult to solve we will resort to heuristics for finding good feasible solutions.
In order to do so, we note that if the weights w ∈ Rn are fixed, then N LM IP is just M IP 1
with transformed data. Therefore, if we were to focus on only a specific sets of weights,
we could determine which set is the best by solving the different MIPs and choosing the
set that had the largest objective function value among the MIPs considered. However, as
mentioned by Giloni and Padberg ([4]), M IP 1 can take a long time to solve for large data
sets. We thus provide an alternative way of solving M IP 1 that is quite efficient when p is
small. We note that this alternative way is similar to the proposal of Mizera and Müller
([9]). We provide its statement and proof here, as we feel it is useful in understanding our
framework for selecting weights.
Proposition 1 In order to determine the finite sample breakdown point of LAD regression,
9
n

it is sufficient to consider p−1 candidate solutions for the vector ξ ∈ Rp in (4) and (5) .
Proof. Consider a three-way partition U1 , L1, Z1 of N where U1 ∩ L1 = ∅, Z1 = N − U1 − L1 ,

and q = |U1 ∪ L1 |. Note that (4) has no solution if and only if the objective function value
of the following optimization problem is strictly greater than zero.
X X X
OF = min − xi ξ + xi ξ + ηi+ + ηi−
i∈U1 i∈L1 i∈Z1
+ −
such that XZ1 ξ + η − η = 0 (15)
ξ 6= 0, η + ≥ 0, η− ≥ 0
We note that since this problem includes a constraint of the form ξ 6= 0, its study as a
linear program is more difficult. Therefore, we argue as follows. If we were to assume
that there exist ξ0 ∈ Rp, ξ0 6= 0, such that OF < 0, then consider ξ = ψξ0 where ψ > 0
is some constant. If i ∈ U1 ∪ L1, then the sign of xi ξ is the same as that of xiξ 0. If
i ∈ Z1 , ηi+ + ηi− ≥ |xiξ| because of (15), and since we are minimizing, equality will hold.
Therefore, ξ = ψξ0 results in OF being multiplied by ψ. It follows that OF < 0 (actually
OF → −∞ since we could let ψ → ∞). It can be shown similarly if OF = 0 or OF > 0 that
ξ = ψξ0 does not change the sign of OF . Therefore we set ξj = γ where γ > 0 and without
restriction of generality we let j = 1. This changes the optimization to the following linear
program:
X X X
OF 1 = min − xi ξ + xi ξ + ηi+ + ηi−
i∈U1 i∈L1 i∈Z1
+ −
such that XZ ξ + η − η = 0
ξ1 = γ
ξ f ree, η + ≥ 0, η− ≥ 0
−1
1 0 γ
All basic feasible solutions to this linear program are of the form ξ =
XB 0
where XB is a(p − 1) ×(p) submatrix of XZ with rank p − 1 and we assume that the
1 0
square matrix of order p is of full rank. (Note that if a row of X is of the form
XB
(1 0) then this would only reduce the number of possible basic feasible solutions.) By the
10
fundamental theorem of linear programming (c.f. [10], Theorem 1), a solution with OF 1 ≤ 0
exists if and only if a basic feasible solution exists with OF 1 ≤ 0. Thus if OF 1 ≤ 0 then
OF ≤ 0. We note that if an optimal solution to OF had ξ1 = γ < 0 then it is possible that
OF ≤ 0 when OF 1 > 0 since γ > 0 in the linear program. However, this is no concern to us
since we need to consider all possible three-way partitions and we point out that switching
the roles of U1 and L1 effectively changes the sign of ξ1. Since there are at most n possible
n
rows of X that could be included in XZ1 there are no more than p−1 possible subsets of
XZ for any Z. Therefore, the proposition follows.
4 An equivalent problem
In this section, we show that the problem N LM IP can be reduced to a problem which is
related to the knapsack problem. Consider the following problem:
n
X
EQM IP max min zi (16)
w z,ξ
i=1
n
X n
X
i
such that |wix ξ|zi ≥ 0.5 |wixi ξ| (17)
i=1 i=1
wi ≤ 1 for i = 1, . . . , n (18)
ξ 6= 0, zi ∈ {0, 1} for i = 1, . . . , n. (19)
Imagine a hiker who has available n objects that he can choose from and the weight of the
ith object is |αi |, where αi = wi xi ξ. The hiker has to select a subset of these objects such
that the weight of the subset is at least half of the weight of the entire set. Obviously,
the problem is trivial if the weights |αi| were known. One would order the objects by non-
decreasing weights and then select them one by one until the total weight of the selected
items is greater than or equal to half of the total weight of all of the n objects. The problem is
made difficult by the fact that the weights |αi| are determined by several unknown variables
(ξ and w), and further complications arise due to the max-min nature of the objective
function. Nevertheless, this demonstration of equivalence and the insight that we gain from
understanding the problem in this manner plays an important role in the development of
the algorithm to follow. In the following proposition, we show that the two problems are
11
equivalent in the sense that there is a simple method of constructing the optimal solution
of one when the optimal solution of the other is known.
Proposition 2 The problems N LM IP and EQM IP are equivalent.
Proof. Let us assume that the optimal solution of EQM IP is known. Let Z denote the
subset of indices from the set {1, . . ., n} for which zi = 1. We now construct a solution for
N LM IP from the solution of EQM IP as follows:
If i 6∈ Z and αi < 0, then set ηi+ = −αi , (20)
if i 6∈ Z and αi ≥ 0, then set ηi− = αi , (21)
if i ∈ Z and αi < 0, then set si = −αi and ui = 1 and (22)
if i ∈ Z and αi ≥ 0, then set ti = αi and ì = 1. (23)
Set all other variables in N LM IP to zero. Let us denote the set of indices i satisfying
(20)-(23) by G1 , G2, G3 and G4 respectively. Note that G3 ∪ G4 = Z and G1 ∪ G2 = Z c . Let
P P
α = ni=1 |αi|. Let M = maxi∈Z |αi | and let ε = i∈Z |αi |. We first show that ε > 0. If we
assume the contrary, i.e., ε = 0, then it implies that the left hand side of (17) is zero and
consequently, so is the right hand side of (17) and the optimal value of the objective function
is 0. This means that for each i, either wi = 0 or xiξ = 0. Note that the assumptions that
X is in general position and ξ 6= 0, implies that at least one of the xiξ 6= 0. Then for that
i, we can choose wi = 1 and zi = 1 and show that the optimum value of the objective can
be increased to 1, which in turn implies that the solution could not be optimal. We now
show that the solution constructed for N LM IP is feasible. It is fairly obvious from the
construction of the solution that the constraints (7)-(14) are satisfied, except the inequality
12
in (11). To show that this holds, we note that
n
X X X X X
(ηi+ + ηi− − si − ti ) = ηi+ + ηi− − si − ti (24)
i=1 i∈G1 i∈G2 i∈G3 i∈G4
X X
= |αi | − |αi| (from (20 − 23)) (25)
i6∈Z i∈Z
n
X X
= |αi | − 2 |αi | (26)
i=1 i∈Z
≤ 0 (from (17)). (27)
So far we have shown that the optimal solution for EQM IP is feasible for N LM IP . To
show that this solution is also optimal for N LM IP , we assume the contrary, i.e., assume
that there is a solution to N LM IP which has a higher value of the objective function than
the optimal solution to EQM IP . From this supposed solution of N LM IP , we construct a
new solution for EQM IP by setting zi = ui + ì . It is easy to verify that the new solution is
feasible for EQM IP , by observing the following: (a) An optimal solution to N LM IP has
an equivalent solution, which is also optimal, in which at most one of the four variables ηi+,
ηi− , si and ti for any i can be positive. (b) When ηi+ or ηi− is positive, the corresponding
ui = ì = 0. (c) When si (or ti ) is positive, then ui = 1 (or ì = 1). (d) Using (a), (b)
and (c), one reverses the arguments in (24)-(27) to show that the solution of EQM IP so
obtained satisfies the constraint (17). With this construction, it is now easy to see that
this solution has an optimal value of the objective function that is higher than that of the
assumed optimal solution of EQM IP . This contradiction shows that the optimal solution
of EQM IP must also be an optimal solution of the N LM IP .
To finish the proof, one must now show that a solution of N LM IP has an equivalent
solution which is also an optimal solution of EQM IP . The construction of this is done by
setting zi = ui + ì , and the proof of optimality can be obtained by reversing the arguments
given above.
13
5 The case of uniform design simple linear regression
In this section, we discuss the special case of simple linear regression (p = 2) with a uniform
design, i.e., xi1 = 1 and xi2 = i, for i = 1, . . . , n (there is no loss of generality in taking the
values of xi2 as the integers; any equispaced set of values yields the same weights). We refer
to this problem as the uniform simple linear problem. We show how to obtain an exact
solution for this problem. The problem stated in (16)-(19) without any restriction on wi
now becomes:
n
X
U SLM IP max min zi (28)
w z,ξ
i=1
n
X n
X
such that |wi(ξ1 + iξ2)|zi ≥ 0.5 |wi (ξ1 + iξ2 )| (29)
i=1 i=1
ξ 6= 0, zi ∈ {0, 1} for i = 1, . . . , n. (30)
A crucial result for determining the optimal values of wi and ξ is given by the following
lemma.
Lemma 1 For the problem U SLM IP , there is an optimum solution of wi , which is sym-
metrical, i.e., wi = wn−i for i = 1, . . ., n.
Proof. Define φi = ξ1 + iξ2 and let the optimal values of ξ1 = δ1 and ξ2 = δ2 ; this implies
that φi = δ1 + iδ2 . Construct a new solution by selecting ξ1 = δ1 + nδ2 and ξ2 = −δ2 .
We refer to the function φ for the new solution as φ̄ to distinguish it from the original
solution. Then φ̄i = δ1 + (n − i)δ2 , or φi = φ̄n−i . Note that as far as the regression problem
is concerned, it is as if the two solutions differ only in the arrangements of the entries of
the xξ vector. This means that there is an optimal solution for which wi = wn−i since a
symmetric choice of wi will cater to both choices of φ and φ̄.
Based on the above result, we limit ourselves to symmetric linear functions for wi . More
explicitly, the functions we choose to examine are linear functions of the distance from the
center of the range of x, being of the form:
0
w + iw1 if i ≤ n/2
wi =
w0 + (n − i)w1 if i > n/2.
14
We note that the problem defined in (28)-(30) is invariant in scaling for w and ξ. Therefore,
without loss of generality, we can impose two conditions such as w0 = ξ1 = 1. Thus the
simple linear problem reduces to the problem of finding just two parameters w1 and ξ2.
Clearly, the two unknown quantities also depend on the value of n, the size of the problem.
To remove this dependency, we convert the problem in (28)-(30) to an equivalent continuous
problem in which the variable i is replaced with a variable x, where 0 ≤ x ≤ 1. After solving
this problem by conducting a search over the two unknown parameters, we re-convert the
solution to the discrete case. The solutions obtained are w1 = 4/n and ξ2 = −1.35/n. The
optimal value of the objective function in the continuous version of the problem is 0.3005n.
This shows that the finite sample breakdown point of weighted LAD regression can reach
over 30%. Note that if we simplified the problem by not considering the weights w (i.e.,
wi = 1 for i = 1, . . . , n), then the optimum breakdown for the continuous version of this
problem is 0.25n (c.f. [2]). This implies that for the uniform simple linear problem, the
breakdown can be increased by about 20% by a judicious choice of weights.
The solution to the uniform simple linear problem after normalizing the weights such
that wi ≤ 1 for i = 1, . . . , n is
 4
 1+i( n )

 if i ≤ n/2
 1+b n c 4
2 (n)
wi =

 4
1+(n−i)( n )

 if i > n/2.
1+b n c 4
2 (n)
1
Thus the selected weights range from approximately 3 to 1. Our algorithm for the deter-
mination of general weights given in the next section is based upon this solution for the
uniform simple linear problem.
6 Weights for General Regression Designs
In this section, we describe a general weight-determination algorithm, and describe in more

detail how N LM IP can be solved once the weights are chosen. The key idea in generalizing
the weights is to note that the symmetric linear weights have the property that the weights
decrease linearly with the distance from the “center” of the range of the predictor. The
15
proposed general weights have a corresponding property, decreasing linearly with the sum
of the distances for the predictors from the coordinatewise median of the observation, where
the distances are scaled by the range of each predictor. The median is chosen as the center
of the data since the LAD estimator for univariate location is the median. This yields the
following algorithm for choosing the weights.
Algorithm for choosing weights

Step 1: Let zji = (xij − mini {xij })/ maxi {xij }, j = 1, . . . , p (if x1 is a column of ones, set
z1i = 0 for all i). For j = 1, . . ., p, let mj denote the median of the entries zji for i = 1, . . . , n.
P
Step 2: Let ri = pj=1 |zji − mj |.
Step 3: Let ri = (ri − minν rν )/(maxν rν − minν rν ).
Step 4: Let wi = 1 − 23 ri .
Note that steps 3 and 4 of the algorithm guarantee that these weights are (virtually) identical
to those derived earlier for the uniform simple linear case when the data take that form.
Once the weights are chosen, the results of section 3 provide a way of approximately
solving EQM IP , and hence N LM IP . The details are as follows.
Algorithm for solving EQM IP

Step 1: Let N denote the set {1, · · · , n}. Let Kk denote all of the subsets of N , such that
|Kk | = p − 1 for k = 1, · · · , r. Clearly, r = n!/((p − 1)!(n − p + 1)!). For k = 1, · · · , r, solve
the following system of linear equations for the unknown p×1 vector ξk :
wi xiξ k = 0 for i ∈ Kk
and ξ1k = 1. Note that this system is guaranteed to have a unique solution, based on the
assumptions on xi . Let
αki = wi xi ξk .
Step 2: Reorder the elements |αki | for i = 1, . . . , n in decreasing order so that |αki | now
16
denotes the ith order statistic. Identify the smallest index m∗ (k) satisfying
m∗ (k) n
X X
|αki | ≥ 0.5 |αki |.
i=1 i=1
Step 3: Find m∗ = mink m∗ (k). Let k∗ be the value of k for which this is minimum. The
solution is given by:
∗ ∗
If i > m∗ and αki < 0, then set ηi+ = −αki ,
∗ ∗
if i > m∗ and αki ≥ 0, then set ηi− = αki
∗ ∗
if i ≤ m∗ and αki < 0, then set si = −αki and ui = 1 and
∗ ∗
if i ≤ m∗ and αki ≥ 0, then set ti = αki and ì = 1.
Set all other variables to zero.
7 Examples
In this section we illustrate the benefits of weighting using the proposed algorithm. We
first determine the breakdown points for 21 one- and two-predictor designs that cover a
wide range of possible patterns, all of which are based on n = 500 observations. In the
one predictor case, we generate n observations from either a uniform, exponential (with
mean 1) or normal distribution, as these represent varying degrees of nonuniformity in the
design. We also consider the situation where a certain percentage (either 10% or 20%) of the
observations are replaced by observations placed at roughly four standard deviations away
from the mean of the predictor. The existence of unusual values for the predictors (called
leverage points in the statistics literature) is of particular interest, since it is well-known
that LAD regression is very sensitive to leverage points.
We also examine two-predictor (multiple) regressions by generating two predictors using
the six possible combinations of uniform, exponential, and normal distributions. We also
examine the effects of leverage points, by replacing 20% of the observations with values
placed at roughly four standard deviations away from the mean of the predictor for both
predictors.
17
The results are given in Table 1. For each design the breakdown point (expressed as a
percentage of the total sample size) is given for LAD, weighted LAD based on applying the
algorithm for choosing weights once, and weighted LAD based on applying the algorithm for
choosing weights iteratively. Iterating is done by treating the weighted design WX (where
W is a diagonal matrix of weights) as the current design, and reapplying the algorithm.
At each iteration, the breakdown point is determined, and iteration continues until the
breakdown point stops increasing (the total number of iterations in each case is given in
parentheses in the table). These breakdown points were computed exactly here, but in the
situation where a user is faced with designs too large to determine the breakdown exactly,
we suggest utilizing the heuristic for finding a good upper bound of the breakdown suggested
by Giloni and Padberg ([4]) and then iterating until the upper bound no longer increases
(of course, another possibility would be to just iterate once).
The table makes clear the benefits of weighting. Just one iteration of the weighting
algorithm typically increases the breakdown point 3-5 percentage points, and as much as
6-7 percentage points when there are leverage points. Even more impressively, iterating the
weighting algorithm leads to gains in breakdown of at least 5 percentage points in most
cases, and as much as 10-15 percentage points in many. The weighted LAD estimator is
much more robust than the LAD estimator, and therefore more trustworthy as a routine
regression tool.
We conclude with discussion of a well-known data set from the robust regression liter-
ature, the so-called Stars data set ([12], p. 27). Figure 1 contains three graphs. At the
top left is a scatter plot of the original data, which is a plot of the logarithm of the light
intensity versus the logarithm of the temperature at the surface of 47 stars in the star clus-
ter CYG OB1. The plot (called a Hertzsprung-Russell star chart) also includes the least
squares regression line, the LAD regression line, and the weighted LAD regression line using
the weights defined by our weight selection algorithm, including iteration. There are four
obvious outliers in this data set all with logged temperature approximately equal to 3.5.
These outlying data points are what are referred to as “red giants,” as opposed to the rest
of the stars which are considered to lie in the “main sequence.” It is apparent that the least
18
Design Leverage LAD breakdown WLAD breakdown WLAD breakdown
point point (1 iteration) point (# iterations)
Exponential 14.0% 17.8% 26.6% (4)
Exponential 10% 11.2% 14.6% 27.0% (4)
Exponential 20% 14.6% 17.2% 27.0% (4)
Normal 23.0% 27.0% 30.2% (2)
Normal 10% 14.8% 22.4% 29.6% (2)
Normal 20% 15.0% 20.0% 29.8% (2)
Uniform 24.6% 29.8% 29.8% (1)
Uniform 10% 7.2% 10.4% 27.4% (3)
Uniform 20% 11.8% 14.8% 26.8% (3)
Exponential/Exponential 14.0% 20.6% 23.6% (4)
Exponential/Exponential 20% 13.4% 20.0% 21.6% (4)
Exponential/Normal 14.0% 16.2% 19.6% (4)
Exponential/Normal 20% 13.4% 16.6% 21.2% (2)
Exponential/Uniform 14.0% 17.4% 21.0% (3)
Exponential/Uniform 20% 11.6% 13.8% 21.0% (4)
Normal/Normal 22.8% 25.4% 25.8% (2)
Normal/Normal 20% 13.2% 17.4% 21.4% (2)
Normal/Uniform 23.0% 25.2% 25.2% (1)
Normal/Uniform 20% 11.6% 13.6% 21.6% (3)
Uniform/Uniform 23.4% 25.4% 25.4% (1)
Uniform/Uniform 20% 11.0% 13.2% 20.6% (3)
Table 1: Breakdown points for various designs.
squares line is drawn towards the red giants, which is not surprising, given its breakdown
point of 1/47. By contrast, the weighted LAD line is unaffected by the outliers, and goes
through the main sequence. The WLAD line has breakdown 10/47 if only one iteration of
the weighting algorithm is used, and increases to 13/47 with two iterations.
It is interesting to note, however, that the LAD line is also drawn towards the outliers,
despite the fact that the LAD line has not broken down (its breakdown point is 5/47, so the
four outliers are not enough to break it down). This illustrates that weighting can be helpful
if outliers occur at leverage points, even if the LAD estimator has not broken down. The
top right plot reinforces that LAD has not broken down. In this plot the light intensities
of the four giants have been further contaminated (by adding 10 to the values). The least
19
squares line follows the outliers, but the LAD line is virtually identical to what it was in
the original plot, since it has not broken down (although it still does not follow the main
sequence, as the weighted LAD line still does). In the third plot of the figure, two more
stars have their light intensities contaminated. Since there are now 6 outliers, the LAD line
breaks down, and is as poor as the least squares line, while the weighted LAD line is still
resistant to the outliers.
8 Conclusions and Directions for Future Research
In this paper, we have demonstrated that weighted LAD regression is a regression technique
whose robustness properties can be studied by mathematical programming methods. We
have developed a computationally feasible method for calculating the optimum weights
and have demonstrated that the optimum breakdown can be significantly increased by a
judicious choice of weights. These results leave open the statistical properties of weighted
LAD regression, which would be fruitful topics for further research. These include study
of the asymptotic properties of weighted LAD regression, its small-sample properties, and
investigation of whether nonlinear weights might lead to better performance.
References
[1] Donoho, D.L. and Huber, P.J. 1983. “The Notion of Breakdown Point,” in: P. Bickel, K.
Doksum and J.L. Hodges, eds., A Festschrift for Erich Lehmann, Wadsworth, Belmont,
CA, 157-184.
[2] Ellis, S.P. and Morgenthaler, S. 1992. “Leverage and Breakdown in L1 -Regression,”
Journal of American Statistical Association, 87, 143-148.
[3] Giloni, A. and Padberg, M. 2002. “Alternative Methods of Linear Regression,” Math-
ematical and Computer Modelling, 35, 361-374.
[4] Giloni, A. and Padberg, M. 2004. “The Finite Sample Breakdown Point of `1 -
Regression,” SIAM Journal on Optimization, 14, pp. 1028 - 1042.
20
[5] Giloni, A. and Simonoff, J.S. 2005. “The conditional breakdown properties of robust
local polynomial estimators,” Journal of Nonparametric Statistics, 17, to appear.
[6] Hampel, F.R. 1968. Contributions to the Theory of Robust Estimation. Ph.D. Thesis,
University of California, Berkeley.
[7] He X., Jureckova J., Koenker R., and Portnoy S. 1990. “Tail Behavior of Regression
Estimators and their Breakdown Points,” Econometrica, 58, 1195-1214.
[8] Mizera, I. and Müller, C.H. 1999, “Breakdown points and variation exponents of robust
M-estimators in linear models,” Annals of Statistics, 27, 1164-1177.
[9] Mizera, I. and Müller, C.H. 2001, “The influence of the design on the breakdown
points of `1 -type M-estimators,” in MODA6 – Advances in Model-Oriented Design
and Analysis, A. Atkinson, P. Hackl, and W. Müller, eds., Physica-Verlag, Heidelberg,
193-200.
[10] Padberg, M. 1995. Linear Optimization and Extensions, Springer-Verlag, Berlin.
[11] Portnoy, S. and Koenker, R. 1997, “The Gaussian hare and the Laplacian tortoise:
computability of squared-error versus absolute-error estimators,” Statistical Science,
12, 279-300.
[12] Rousseeuw, P.J. and Leroy, A.M. 1987. Robust Regression and Outlier Detection, Wiley,
New York.
[13] Sharpe, W.F. 1971. “Mean-absolute-deviation characteristic lines for securities and
portfolios,” Management Science, 18, B1-B13.
[14] Simonoff, J.S. 1996. Smoothing Methods in Statistics, Springer-Verlag, New York.
[15] Wang, F.T. and Scott, D.W. 1994. “The L1 method for robust nonparametric regres-
sion,” Journal of the American Statistical Association, 89, 65-76.
[16] Yu, K. and Jones, M.C. (1998), “Local linear quantile regression,” Journal of the
American Statistical Association, 93, 228–237.
21
Figure 1: Stars data and modifications, with three regression lines.
16
6.0
14
Logged light intensity

5.5
12
10
5.0
8
4.5
6
4.0
4
3.6 4.0 4.4 3.6 4.0 4.4
Logged temperature Logged temperature

16
14
Least Squares
12
LAD
Weighted LAD
10
8
6
4
3.6 4.0 4.4
Logged temperature
22

A Mathematical Programming Approach For Improving The Robustness of LAD Regression

Uploaded by

Copyright:

Available Formats

A Mathematical Programming Approach For Improving The Robustness of LAD Regression

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Mathematical Programming Approach For Improving The Robustness of LAD Regression

Uploaded by

Copyright:

Available Formats

A Mathematical Programming Approach for Improving the

Robustness of LAD Regression

July 23, 2004

Consider the well-known statistical linear regression problem. We have n observations on

where y ∈ Rn is a vector of n observations, x1 , . . ., xp are column vectors with n components,

2 LAD Regression and Breakdown

min eTn r+ + eTn r− (3)

XZ ξ + η + − η − = 0 , (−eTU XU + eTL XL)ξ + eTZ (η + + η − ) ≤ 0 (4)

ηi+ + ηi− + M ui + M `i ≤ M for i = 1, . . ., n

min eTn r+ + eTn r−

such that wi xiβ + r+ − r− = wiyi for i = 1, . . ., n

ηi+ + ηi− + M ui + M `i ≤ M for i = 1, . . ., n (9)

ξ free, η+ ≥ 0, η − ≥ 0, s ≥ 0, t ≥ 0, ui , `i ∈ {0, 1} for i = 1, . . ., n. (14)

Note that ξj is unrestricted for j = 1, . . ., p. Also note that N LM IP is a nonlinear problem

Proof. Consider a three-way partition U1 , L1, Z1 of N where U1 ∩ L1 = ∅, Z1 = N − U1 − L1 ,

ξ 6= 0, zi ∈ {0, 1} for i = 1, . . . , n. (19)

Proposition 2 The problems N LM IP and EQM IP are equivalent.

If i 6∈ Z and αi < 0, then set ηi+ = −αi , (20)

if i 6∈ Z and αi ≥ 0, then set ηi− = αi , (21)

if i ∈ Z and αi < 0, then set si = −αi and ui = 1 and (22)

if i ∈ Z and αi ≥ 0, then set ti = αi and `i = 1. (23)

6 Weights for General Regression Designs

In this section, we describe a general weight-determination algorithm, and describe in more

Algorithm for choosing weights

Algorithm for solving EQM IP

Set all other variables to zero.

Table 1: Breakdown points for various designs.

8 Conclusions and Directions for Future Research

[10] Padberg, M. 1995. Linear Optimization and Extensions, Springer-Verlag, Berlin.

Logged light intensity

Logged temperature Logged temperature

3.6 4.0 4.4

You might also like