For the purpose of identification one of the outcomes must be taken as the “baseline”; it is usually
assumed that β0 = 0, in which case
P(yi = k|xi) = exp(xi βk) / ( 1 + ∑_{j=1}^{p} exp(xi βj) )

and

P(yi = 0|xi) = 1 / ( 1 + ∑_{j=1}^{p} exp(xi βj) ).
Listing 38.4 reproduces Table 15.2 in Wooldridge (2002a), based on data on career choice from
Keane and Wolpin (1997). The dependent variable is the occupational status of an individual (0 = in
school; 1 = not in school and not working; 2 = working), and the explanatory variables are education
and work experience (linear and square) plus a “black” binary variable. The full data set is a panel;
here the analysis is confined to a cross-section for 1987.
y∗1,i = ∑_{j=1}^{k1} xij βj + ε1,i ,    y1,i = 1 ⟺ y∗1,i > 0        (38.9)

y∗2,i = ∑_{j=1}^{k2} zij γj + ε2,i ,    y2,i = 1 ⟺ y∗2,i > 0        (38.10)

(ε1,i , ε2,i)′ ∼ N( 0 , [ 1  ρ ; ρ  1 ] )        (38.11)
biprobit y1 y2 X ; Z
Output from estimation includes a Likelihood Ratio test for the hypothesis ρ = 0.¹ This can be
retrieved in the form of a bundle named independence_test under the $model accessor, as in
? eval $model.independence_test
bundle:
dfn = 1
test = 204.066
pvalue = 2.70739e-46
Since biprobit estimates a two-equation system, the $uhat and $yhat accessors provide ma-
trices rather than series as usual. Specifically, $uhat gives a two-column matrix containing the
generalized residuals, while $yhat contains four columns holding the estimated probabilities of
the possible joint outcomes: (y1,i , y2,i ) = (1, 1) in column 1, (y1,i , y2,i ) = (1, 0) in column 2,
(y1,i , y2,i ) = (0, 1) in column 3 and (y1,i , y2,i ) = (0, 0) in column 4.
Gretl provides FE logit via the function package felogit,² and RE probit natively. Provided your dataset
has a panel structure, the latter option can be obtained by adding the --random option to the
probit command:
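For instance, a minimal sketch with hypothetical variable names (the option is spelled as in the text above; some gretl versions document it as --random-effects, so check the probit help for your version):

probit y const x1 x2 --random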
as exemplified in the reprobit.inp sample script. The numerical technique used for this particular
estimator is Gauss-Hermite quadrature, which we’ll now briefly describe. Generalizing equation
(38.5) to a panel context, we get
y∗i,t = ∑_{j=1}^{k} xijt βj + αi + εi,t = zi,t + ωi,t        (38.12)
in which we assume that the individual effect, αi , and the disturbance term, εi,t , are mutually
independent zero-mean Gaussian random variables. The composite error term, ωi,t = αi + εi,t , is
therefore a normal r. v. with mean zero and variance 1 + σα2 . Because of the individual effect, αi ,
observations for the same unit are not independent; the likelihood therefore has to be evaluated
on a per-unit basis, as
ℓi = log P( yi,1 , yi,2 , . . . , yi,T )
and there’s no way to write the above as a product of individual terms.
However, the above probability could be written as a product if we were to treat αi as a constant;
in that case we would have
ℓi |αi = ∑_{t=1}^{T} log Φ[ (2yi,t − 1) ( ∑_{j=1}^{k} xijt βj + αi ) / √(1 + σα²) ]
The technique known as Gauss–Hermite quadrature is simply a way of approximating the above
integral via a sum of carefully chosen terms:3
ℓi ≃ ∑_{k=1}^{m} (ℓi |αi = nk ) wk
where the numbers nk and wk are known as quadrature points and weights, respectively. Of course,
accuracy improves with higher values of m, but so does CPU usage. Note that this technique can
also be used in more general cases by using the quadtable() function and the mle command via
the apparatus described in chapter 26. Here, however, the calculations were hard-coded in C for
maximal speed and efficiency.
Experience shows that a reasonable compromise can be achieved in most cases by choosing m in
the order of 20 or so; gretl uses 32 as a default value, but this can be changed via the --quadpoints
option, as in
2 See http://gretl.sourceforge.net/current_fnfiles/felogit.gfn.
3 Some have suggested using a more refined method called adaptive Gauss-Hermite quadrature; this is not imple-
mented in gretl.
where εi ∼ N(0, σ²). If yi∗ were observable, the model's parameters could be estimated via ordinary
least squares. Suppose, however, that we observe yi , defined as

  yi = a     for yi∗ ≤ a
  yi = yi∗   for a < yi∗ < b        (38.13)
  yi = b     for yi∗ ≥ b
In most cases found in the applied literature, a = 0 and b = ∞, so in practice negative values of yi∗
are not observed and are replaced by zeros.
In this case, regressing yi on the xi 's does not yield consistent estimates of the parameters β,
because the conditional mean E(yi |xi ) is not equal to ∑_{j=1}^{k} xij βj . It can be shown that restricting
the sample to non-zero observations would not yield consistent estimates either. The solution is to
estimate the parameters via maximum likelihood. The syntax is simply
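A minimal sketch with hypothetical variable names:

tobit y const x1 x2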
As usual, progress of the maximization algorithm can be tracked via the --verbose switch, while
$uhat returns the generalized residuals. Note that in this case the generalized residual is defined
as ûi = E(εi |yi = 0) for censored observations, so the familiar equality ûi = yi − ŷi only holds for
uncensored observations, that is, when yi > 0.
An important difference between the Tobit estimator and OLS is that the consequences of non-
normality of the disturbance term are much more severe: non-normality implies inconsistency for
the Tobit estimator. For this reason, the output for the Tobit model includes the Chesher and Irish
(1987) normality test by default.
The general case in which a is nonzero and/or b is finite can be handled by using the options
--llimit and --rlimit. So, for example,
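a hedged sketch with illustrative censoring limits:

tobit y const x1 x2 --llimit=10 --rlimit=100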
yi∗ = xi β + ϵi
but we only know that mi ≤ yi∗ ≤ Mi , where the interval may be left- or right-unbounded (but
not both). If mi = Mi , we effectively observe yi∗ and no information loss occurs. In practice, each
observation belongs to one of four categories:
1. left-unbounded, when mi = −∞,
2. right-unbounded, when Mi = ∞,
3. bounded, when both mi and Mi are finite, with mi < Mi ,
4. point observations, when mi = Mi .
It is interesting to note that this model bears similarities to other models in several special cases:
• When all observations are point observations the model trivially reduces to the ordinary linear
regression model.
• The interval model could be thought of as an ordered probit model (see 38.2) in which the cut
points (the αj coefficients in eq. 38.8) are observed and don’t need to be estimated.
• The Tobit model (see 38.6) is a special case of the interval model in which mi and Mi do not
depend on i, that is, the censoring limits are the same for all observations. As a matter of
fact, gretl’s tobit command is handled internally as a special case of the interval model.
The gretl command intreg estimates interval models by maximum likelihood, assuming normality
of the disturbance term ϵi . Its syntax is
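A minimal sketch (regressor names hypothetical):

intreg minvar maxvar const x1 x2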
where minvar contains the mi series, with NAs for left-unbounded observations, and maxvar con-
tains Mi , with NAs for right-unbounded observations. By default, standard errors are computed
using the negative inverse of the Hessian. If the --robust flag is given, then QML or Huber–White
standard errors are calculated instead. In this case the estimated covariance matrix is a “sandwich”
of the inverse of the estimated Hessian and the outer product of the gradient.
If the model specification contains regressors other than just a constant, the output includes a
chi-square statistic for testing the joint null hypothesis that none of these regressors has any
effect on the outcome. This is a Wald statistic based on the estimated covariance matrix. If you
wish to construct a likelihood ratio test, this is easily done by estimating both the full model
and the null model (containing only the constant), saving the log-likelihood in both cases via the
$lnl accessor, and then referring twice the difference between the two log-likelihoods to the chi-
square distribution with k degrees of freedom, where k is the number of additional regressors (see
the pvalue command in the Gretl Command Reference). Also included is a conditional moment
normality test, similar to those provided for the probit, ordered probit and Tobit models (see
above). An example is contained in the sample script wtp.inp, provided with the gretl distribution.
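A hedged sketch of the likelihood-ratio procedure just described (series names hypothetical; here k = 2 additional regressors):

intreg lo hi const x1 x2     # full model
scalar ll_full = $lnl
intreg lo hi const           # null model: constant only
scalar ll_null = $lnl
scalar LR = 2 * (ll_full - ll_null)
pvalue X 2 LR                # chi-square with k = 2 degrees of freedom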
As with the probit and Tobit models, after a model has been estimated the $uhat accessor re-
turns the generalized residual, which is an estimate of ϵi : more precisely, it equals yi − xi β̂ for
point observations and E(ϵi |mi , Mi , xi ) otherwise. Note that it is possible to compute an unbiased
predictor of yi∗ by adding this estimate to xi β̂. Listing 38.5 shows an example. As a further
similarity with Tobit, the interval regression model may deliver inconsistent estimates if the dis-
turbances are non-normal; hence, the Chesher and Irish (1987) test for normality is included by
default here too.
yi∗ = ∑_{j=1}^{k} xij βj + εi        (38.14)

si∗ = ∑_{j=1}^{p} zij γj + ηi        (38.15)
# estimate ystar
gen_resid = $uhat
yhat = $yhat + gen_resid
corr ystar yhat
sigma = 0.223273
Left-unbounded observations: 0
Right-unbounded observations: 0
Bounded observations: 100
Point observations: 0
...
In this context, the ♦ symbol indicates that for some observations we simply do not have data on
y: yi may be 0, or missing, or anything else. A dummy variable di is normally used to set censored
observations apart.
One of the most popular applications of this model in econometrics is a wage equation coupled
with a labor force participation equation: we only observe the wage for the employed. If yi∗ and si∗
were (conditionally) independent, there would be no reason not to use OLS for estimating equation
(38.14); otherwise, OLS does not yield consistent estimates of the parameters βj .
Since conditional independence between yi∗ and si∗ is equivalent to conditional independence be-
tween εi and ηi , one may model the co-dependence between εi and ηi as
εi = ληi + vi ;
substituting the above expression in (38.14), you obtain the model that is actually estimated:
yi = ∑_{j=1}^{k} xij βj + λ η̂i + vi ,
so the hypothesis that censoring does not matter is equivalent to the hypothesis H0 : λ = 0, which
can be easily tested.
The parameters can be estimated via maximum likelihood under the assumption of joint normality
of εi and ηi ; however, a widely used alternative method yields the so-called Heckit estimator, named
after Heckman (1979). The procedure can be briefly outlined as follows: first, a probit model is fit
on equation (38.15); next, the generalized residuals are inserted in equation (38.14) to correct for
the effect of sample selection.
Gretl provides the heckit command to carry out estimation; its syntax is
heckit y X ; d Z
where y is the dependent variable, X is a list of regressors, d is a dummy variable holding 1 for
uncensored observations and Z is a list of explanatory variables for the censoring equation.
Since in most cases maximum likelihood is the method of choice, by default gretl computes ML
estimates. The 2-step Heckit estimates can be obtained by using the --two-step option. After
estimation, the $uhat accessor contains the generalized residuals. As in the ordinary Tobit model,
the residuals equal the difference between actual and fitted yi only for uncensored observations
(those for which di = 1).
Listing 38.6 shows two estimates from the dataset used in Mroz (1987): the first one replicates
Table 22.7 in Greene (2003),4 while the second one replicates Table 17.1 in Wooldridge (2002a).
P(Y = y) = e^{−λ} λ^y / y! ,    y = 0, 1, 2, . . .
where the single parameter λ is both the mean and the variance of Y . In an econometric context
we generally want to treat λ as specific to the observation, i, and driven by covariates Xi via a
parameter vector β. The standard way of allowing for this is the exponential mean function,
λi ≡ exp(Xi β)
hence leading to
P(Yi = y) = exp(−exp(Xi β)) (exp(Xi β))^y / y!
4 Note that the estimates given by gretl do not coincide with those found in the printed volume. They do, however,
match those found on the errata web page for Greene’s book: http://pages.stern.nyu.edu/~wgreene/Text/Errata/
ERRATA5.htm.
open mroz87.gdt
# Greene’s specification
# Wooldridge’s specification
Maximization of this quantity is quite straightforward, and is carried out in gretl using the syntax
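A minimal sketch with hypothetical variable names:

poisson y const x1 x2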
In some cases, an “offset” variable is needed: the count of occurrences of the outcome of interest
in a given time is assumed to be strictly proportional to the offset variable ti . In the epidemiology
literature, the offset is known as “population at risk”. In this case λ is modeled as
λi = ti exp(Xi β)
The log-likelihood is not greatly complicated thereby. Here’s another way of thinking about the
offset variable: its natural log is just another explanatory variable whose coefficient is constrained
to equal 1.
If an offset variable is needed, it should be specified at the end of the command, separated from
the list of explanatory variables by a semicolon, as in
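For instance, with an offset series t (variable names hypothetical):

poisson y const x1 x2 ; t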
Overdispersion
As mentioned above, in the Poisson model E(Yi |Xi ) = V (Yi |Xi ) = λi , that is, the conditional mean
equals the conditional variance by construction. In many cases this feature is at odds with the data;
the conditional variance is often larger than the mean, a phenomenon known as overdispersion.
The output from the poisson command includes a conditional moment test for overdispersion (as
per Davidson and MacKinnon (2004), section 11.5), which is printed automatically after estimation.
Overdispersion can be attributed to unmodeled heterogeneity between individuals. Two data points
with the same observable characteristics Xi = Xj may differ because of some unobserved scale
factor si ̸= sj so that
E(Yi |Xi , si ) = λi si ̸= λj sj = E(Yi |Xj , sj )
even though λi = λj . In other words, Yi is a Poisson random variable conditional on both Xi and
si , but since si is unobservable, the only thing we can use, P (Yi |Xi ), will not conform to the
Poisson distribution.
It is often assumed that si can be represented as a gamma random variable with mean 1 and
variance α. The parameter α, which measures the degree of heterogeneity between individuals, is
then estimated jointly with the vector β.
In this case, the conditional probability that Yi = y given Xi can be shown to be
P(Yi = y|Xi ) = [ Γ(y + α⁻¹) / ( Γ(α⁻¹) Γ(y + 1) ) ] · [ λi / (λi + α⁻¹) ]^y · [ α⁻¹ / (λi + α⁻¹) ]^{α⁻¹}        (38.17)
which is known as the Negative Binomial Model. The conditional mean is still E(Yi |Xi ) = λi , but the
variance is V (Yi |Xi ) = λi (1 + λi α).
To estimate the Negative Binomial model in gretl, just substitute the keyword negbin for poisson
in the commands shown above.
To be precise, the model 38.17 is that labeled NEGBIN2 by Cameron and Trivedi (1986). There’s
also a lesser-used NEGBIN1 variant, in which the conditional variance is a scalar multiple of the
conditional mean; that is, V (Yi |Xi ) = λi (1 + γ). This can be invoked in gretl by appending the
option --model1 to the negbin command.5
The two accessors $yhat and $uhat return the predicted values and generalized residuals, respec-
tively. Note that $uhat is not equal to the difference between the dependent variable and $yhat.
Examples
Among the sample scripts supplied with gretl you can find camtriv.inp. This exemplifies the
count-data estimators described above, based on a dataset analysed by Cameron and Trivedi (1998).
The gretl package also contains a relevant dataset used by McCullagh and Nelder (1983), namely
mccullagh.gdt, on which the Poisson and Negative Binomial estimators may be tried.
• From engineering, the “time to failure” of electronic or mechanical components: how long do,
say, computer hard drives last until they malfunction?
• From the medical realm: how does a new treatment affect the time from diagnosis of a certain
condition to exit from that condition (where “exit” might mean death or full recovery)?
In each case we may be interested in how the durations are distributed, and how they are affected
by relevant covariates. There are several approaches to this problem; the one we discuss here —
which is currently the only one supported by gretl — is estimation of a parametric model by means
5 The “1” and “2” in these labels indicate the power to which λi is raised in the conditional variance expression.
of Maximum Likelihood. In this approach we hypothesize that the durations follow some definite
probability law and we seek to estimate the parameters of that law, factoring in the influence of
covariates.
We may express the density of the durations as f (t, X, θ), where t is the length of time in the state
in question, X is a matrix of covariates, and θ is a vector of parameters. The likelihood for a sample
of n observations indexed by i is then
L = ∏_{i=1}^{n} f(ti , xi , θ)
Rather than working with the density directly, however, it is standard practice to factor f (·) into
two components, namely a hazard function, λ, and a survivor function, S. The survivor function
gives the probability that a state lasts at least as long as t; it is therefore 1 − F (t, X, θ) where F
is the CDF corresponding to the density f (·). The hazard function addresses this question: given
that a state has persisted as long as t, what is the likelihood that it ends within a short increment
of time beyond t —that is, it ends between t and t + ∆? Taking the limit as ∆ goes to zero, we end
up with the ratio of the density to the survivor function:6
λ(t, X, θ) = f(t, X, θ) / S(t, X, θ)        (38.18)
so the log-likelihood can be written as
ℓ = ∑_{i=1}^{n} log f(ti , xi , θ) = ∑_{i=1}^{n} [ log λ(ti , xi , θ) + log S(ti , xi , θ) ]        (38.19)
One point of interest is the shape of the hazard function, in particular its dependence (or not) on
time since the state began. If λ does not depend on t we say the process in question exhibits du-
ration independence: the probability of exiting the state at any given moment neither increases nor
decreases based simply on how long the state has persisted to date. The alternatives are positive
duration dependence (the likelihood of exiting the state rises, the longer the state has persisted)
or negative duration dependence (exit becomes less likely, the longer it has persisted). Finally, the
behavior of the hazard with respect to time need not be monotonic; some parameterizations allow
for this possibility and some do not.
Since durations are inherently positive the probability distribution used in modeling must respect
this requirement, giving a density of zero for t ≤ 0. Four common candidates are the exponential,
Weibull, log-logistic and log-normal, the Weibull being the most common choice. The table below
displays the density and the hazard function for each of these distributions as they are commonly
parameterized, written as functions of t alone. (φ and Φ denote, respectively, the Gaussian PDF
and CDF.)
The hazard is constant for the exponential distribution. For the Weibull, it is monotone increasing
in t if α > 1, or monotone decreasing for α < 1. (If α = 1 the Weibull collapses to the exponential.)
6 For a fuller discussion see, for example, Davidson and MacKinnon (2004).
The log-logistic and log-normal distributions allow the hazard to vary with t in a non-monotonic
fashion.
Covariates are brought into the picture by allowing them to govern one of the parameters of the
density, so that durations are not identically distributed across cases. For example, when using
the log-normal distribution it is natural to make µ, the expected value of log t, depend on the
covariates, X. This is typically done via a linear index function: µ = Xβ.
Note that the expressions for the log-normal density and hazard contain the term (log t − µ)/σ .
Replacing µ with Xβ this becomes (log t − Xβ)/σ . As in Kalbfleisch and Prentice (2002), we define
a shorthand label for this term:
wi ≡ (log ti − xi β)/σ (38.20)
It turns out that this constitutes a useful simplifying change of variables for all of the distributions
discussed here. The interpretation of the scale factor, σ , in the expression above depends on the
distribution. For the log-normal, σ represents the standard deviation of log t; for the Weibull and
the log-logistic it corresponds to 1/α; and for the exponential it is fixed at unity. For distributions
other than the log-normal, Xβ corresponds to − log γ, or in other words γ = exp(−Xβ).
With this change of variables, the density and survivor functions may be written compactly as
follows (the exponential is the same as the Weibull).
In light of the above we may think of the generic parameter vector θ, as in f (t, X, θ), as composed of
the coefficients on the covariates, β, plus (in all cases but the exponential) the additional parameter
σ.
A complication in estimation of θ is posed by “incomplete spells”. That is, in some cases the state
in question may not have ended at the time the observation is made (e.g. some workers remain
unemployed, some components have not yet failed). If we use ti to denote the time from entering
the state to either (a) exiting the state or (b) the observation window closing, whichever comes first,
then all we know of the “right-censored” cases (b) is that the duration was at least as long as ti .
This can be handled by rewriting the log-likelihood (compare 38.19) as

ℓ = ∑_{i=1}^{n} { δi log S(wi ) + (1 − δi ) [ −log σ + log f(wi ) ] }        (38.21)
where δi equals 1 for censored cases (incomplete spells), and 0 for complete observations. The
rationale for this is that the log-density equals the sum of the log hazard and the log survivor
function, but for the incomplete spells only the survivor function contributes to the likelihood. So
in (38.21) we are adding up the log survivor function alone for the incomplete cases, plus the full
log density for the completed cases.
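The command line that the following description refers to presumably takes the form used in Listing 38.7 below:

duration durat 0 X ; cens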
where durat measures durations, 0 represents the constant (which is required for such models), X
is a named list of regressors, and cens is the censoring dummy.
By default the Weibull distribution is used; you can substitute any of the other three distribu-
tions discussed here by appending one of the option flags --exponential, --loglogistic or
--lognormal.
Interpreting the coefficients in a duration model requires some care, and we will work through
an illustrative case. The example comes from section 20.3 of Wooldridge (2002a) and it concerns
criminal recidivism.7 The data (filename recid.gdt) pertain to a sample of 1,445 convicts released
from prison between July 1, 1977 and June 30, 1978. The dependent variable is the time in months
until they are again arrested. The information was gathered retrospectively by examining records
in April 1984; the maximum possible length of observation is 81 months. Right-censoring is impor-
tant: when the data were compiled about 62 percent had not been rearrested. The dataset contains
several covariates, which are described in the data file; we will focus below on interpretation of the
married variable, a dummy which equals 1 if the respondent was married when imprisoned.
Listing 38.7 shows the gretl commands for Weibull and log-normal models along with most of the
output. Consider first the Weibull scale factor, σ . The estimate is 1.241 with a standard error of
0.048. (We don’t print a z score and p-value for this term since H0 : σ = 0 is not of interest.)
Recall that σ corresponds to 1/α; we can be confident that α is less than 1, so recidivism displays
negative duration dependence. This makes sense: it is plausible that if a past offender manages
to stay out of trouble for an extended period his risk of engaging in crime again diminishes. (The
exponential model would therefore not be appropriate in this case.)
On a priori grounds, however, we may doubt the monotonic decline in hazard that is implied by
the Weibull specification. Even if a person is liable to return to crime, it seems relatively unlikely
that he would do so straight out of prison. In the data, we find that only 2.6 percent of those
followed were rearrested within 3 months. The log-normal specification, which allows the hazard
to rise and then fall, may be more appropriate. Using the duration command again with the same
covariates but the --lognormal flag, we get a log-likelihood of −1597 as against −1633 for the
Weibull, confirming that the log-normal gives a better fit.
Let us now focus on the married coefficient, which is positive in both specifications but larger and
more sharply estimated in the log-normal variant. The first thing is to get the interpretation of the
sign right. Recall that Xβ enters negatively into the intermediate variable w (equation 38.20). The
Weibull hazard is λ(wi ) = ewi , so being married reduces the hazard of re-offending, or in other
words lengthens the expected duration out of prison. The same qualitative interpretation applies
for the log-normal.
To get a better sense of the married effect, it is useful to show its impact on the hazard across time.
We can do this by plotting the hazard for two values of the index function Xβ: in each case the
values of all the covariates other than married are set to their means (or some chosen values) while
married is set first to 0 then to 1. Listing 38.8 provides a script that does this, and the resulting
plots are shown in Figure 38.1. Note that when computing the hazards we need to multiply by the
Jacobian of the transformation from ti to wi = (log ti − xi β)/σ , namely 1/t. Note also that the
estimate of σ is available via the accessor $sigma, but it is also present as the last element in the
coefficient vector obtained via $coeff.
A further difference between the Weibull and log-normal specifications is illustrated in the plots.
The Weibull is an instance of a proportional hazard model. This means that for any sets of values of
the covariates, xi and xj , the ratio of the associated hazards is invariant with respect to duration. In
this example the Weibull hazard for unmarried individuals is always 1.1637 times that for married.
In the log-normal variant, on the other hand, this ratio gradually declines from 1.6703 at one month
to 1.1766 at 100 months.
7 Germán Rodríguez of Princeton University has a page discussing this example and displaying estimates from Stata
at http://data.princeton.edu/pop509/recid1.html.
Partial output:
Model 1: Duration (Weibull), using observations 1-1445
Dependent variable: durat
open recid.gdt -q
# Weibull variant
duration durat 0 X married ; cens
# coefficients on all Xs apart from married
matrix beta_w = $coeff[1:$ncoeff-2]
# married coefficient
scalar mc_w = $coeff[$ncoeff-1]
scalar s_w = $sigma
# Log-normal variant
duration durat 0 X married ; cens --lognormal
matrix beta_n = $coeff[1:$ncoeff-2]
scalar mc_n = $coeff[$ncoeff-1]
scalar s_n = $sigma
list allX = 0 X
# evaluate X\beta at means of all variables except marriage
scalar Xb_w = meanc({allX}) * beta_w
scalar Xb_n = meanc({allX}) * beta_n
# matrices to hold the results (assumed: these declarations are elided from the printed listing)
matrix mat_w = zeros(100, 3)
matrix mat_n = zeros(100, 3)
loop t=1..100
# first column, duration
mat_w[t, 1] = t
mat_n[t, 1] = t
wi_w = (log(t) - Xb_w)/s_w
wi_n = (log(t) - Xb_n)/s_n
# second col: hazard with married = 0
mat_w[t, 2] = (1/t) * exp(wi_w)
mat_n[t, 2] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
wi_w = (log(t) - (Xb_w + mc_w))/s_w
wi_n = (log(t) - (Xb_n + mc_n))/s_n
# third col: hazard with married = 1
mat_w[t, 3] = (1/t) * exp(wi_w)
mat_n[t, 3] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
endloop
[Figure 38.1 contains two panels, Weibull and Log-normal, each plotting the estimated hazard (roughly 0.006 to 0.020) against months since release (0 to 100) for unmarried and married individuals.]
Figure 38.1: Recidivism hazard estimates for married and unmarried ex-convicts
The expression given for the log-logistic mean, however, is valid only for σ < 1; otherwise the
expectation is undefined, a point that is not noted in all software.8
Alternatively, if the --medians option is given, gretl’s duration command will produce conditional
medians as the content of $yhat. For the Weibull the median is exp(Xβ)(log 2)^σ ; for the log-logistic
and log-normal it is just exp(Xβ).
The values we give for the accessor $uhat are generalized (Cox–Snell) residuals, computed as the
integrated hazard function, which equals the negative log of the survivor function: ε̂i = −log S(ti , xi , θ̂).
Under the null of correct specification of the model these generalized residuals should follow the
unit exponential distribution, which has mean and variance both equal to 1 and density exp(−ϵ).
See chapter 18 of Cameron and Trivedi (2005) for further discussion.
8 The predict adjunct to the streg command in Stata 10, for example, gaily produces large negative values for the
Quantile regression
39.1 Introduction
In Ordinary Least Squares (OLS) regression, the fitted values, ŷi = Xi β̂, represent the conditional
mean of the dependent variable—conditional, that is, on the regression function and the values
of the independent variables. In median regression, by contrast and as the name implies, fitted
values represent the conditional median of the dependent variable. It turns out that the principle of
estimation for median regression is easily stated (though not so easily computed), namely, choose
β̂ so as to minimize the sum of absolute residuals. Hence the method is known as Least Absolute
Deviations or LAD. While the OLS problem has a straightforward analytical solution, LAD is a linear
programming problem.
Quantile regression is a generalization of median regression: the regression function predicts the
conditional τ-quantile of the dependent variable — for example the first quartile (τ = .25) or the
ninth decile (τ = .90).
If the classical conditions for the validity of OLS are satisfied — that is, if the error term is indepen-
dently and identically distributed, conditional on X — then quantile regression is redundant: all the
conditional quantiles of the dependent variable will march in lockstep with the conditional mean.
Conversely, if quantile regression reveals that the conditional quantiles behave in a manner quite
distinct from the conditional mean, this suggests that OLS estimation is problematic.
Gretl has offered quantile regression functionality since version 1.7.5 (in addition to basic LAD
regression, which has been available since early in gretl’s history via the lad command).1
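The command's basic form, sketched here from the description that follows, is

quantreg tau reglist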
where
• reglist is a standard gretl regression list (dependent variable followed by regressors, including
the constant if an intercept is wanted); and
• tau is the desired conditional quantile, in the range 0.01 to 0.99, given either as a numerical
value or the name of a pre-defined scalar variable (but see below for a further option).
Estimation is via the Frisch–Newton interior point solver (Portnoy and Koenker, 1997), which is sub-
stantially faster than the “traditional” Barrodale–Roberts (1974) simplex approach for large prob-
lems.
1 We gratefully acknowledge our borrowing from the quantreg package for GNU R (version 4.17). The core of the
package is composed of Fortran code written by Roger Koenker; this is accompanied by various driver and auxiliary
functions written in the R language by Koenker and Martin Mächler. The latter functions have been re-worked in C for
gretl. We have added some guards against potential numerical problems in small samples.
By default, standard errors are computed according to the asymptotic formula given by Koenker
and Bassett (1978). Alternatively, if the --robust option is given, we use the sandwich estimator
developed in Koenker and Zhao (1994).2
When the confidence intervals option is selected, the parameter estimates are calculated using
the Barrodale–Roberts method. This is simply because the Frisch–Newton code does not currently
support the calculation of confidence intervals.
Two further details. First, the mechanisms for generating confidence intervals for quantile esti-
mates require that the model has at least two regressors (including the constant). If the --intervals
option is given for a model containing only one regressor, an error is flagged. Second, when a model
is estimated in this mode, you can retrieve the confidence intervals using the accessor $coeff_ci.
This produces a k × 2 matrix, where k is the number of regressors. The lower bounds are in the
first column, the upper bounds in the second. See also section 39.5 below.
2 These correspond to the iid and nid options in R’s quantreg package, respectively.
[Figure 39.1: the coefficient on income plotted against tau, showing the quantile estimates with 90% band alongside the OLS estimate with 90% band.]
The gretl GUI has an entry for Quantile Regression (under /Model/Robust estimation), and you can
select multiple quantiles there too. In that context, just give space-separated numerical values (as
per the predefined options, shown in a drop-down list).
When you estimate a model in this way most of the standard menu items in the model window
are disabled, but one extra item is available — graphs showing the τ sequence for a given coeffi-
cient in comparison with the OLS coefficient. An example is shown in Figure 39.1. This sort of
graph provides a simple means of judging whether quantile regression is redundant (OLS is fine) or
informative.
In the example shown—based on data on household income and food expenditure gathered by
Ernst Engel (1821–1896)—it seems clear that simple OLS regression is potentially misleading. The
“crossing” of the OLS estimate by the quantile estimates is very marked.
However, it is not always clear what implications should be drawn from this sort of conflict. With
the Engel data there are two issues to consider. First, Engel’s famous “law” claims an income-
elasticity of food consumption that is less than one, and talk of elasticities suggests a logarithmic
formulation of the model. Second, there are two apparently anomalous observations in the data
set: household 105 has the third-highest income but unexpectedly low expenditure on food (as
judged from a simple scatter plot), while household 138 (which also has unexpectedly low food
consumption) has much the highest income, almost twice that of the next highest.
With n = 235 it seems reasonable to consider dropping these observations. If we do so, and adopt
a log–log formulation, we get the plot shown in Figure 39.2. The quantile estimates still cross the
OLS estimate, but the “evidence against OLS” is much less compelling: the 90 percent confidence
bands of the respective estimates overlap at all the quantiles considered.
A script to produce the results discussed above is presented in listing 39.1.
[Figure 39.2: the coefficient on log(income) plotted against tau, showing the quantile estimates with 90% band alongside the OLS estimate with 90% band.]
Figure 39.2: Log–log regression; 2 observations dropped from full Engel data set.
The script saves the two models “as icons”. Double-clicking on a model’s icon opens a window to
display the results, and the Graph menu in this window gives access to a tau-sequence plot.
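A hedged sketch of the sort of command the following sentence refers to (variable names hypothetical; the 0.90 level matches the text):

quantreg 0.25 y 0 x --intervals=0.90
matrix ci = $coeff_ci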
The matrix ci will contain the lower and upper bounds of the (symmetrical) 90 percent confidence
intervals.
To avoid a situation where gretl becomes unresponsive for a very long time we have set the maxi-
mum number of iterations for the Barrodale–Roberts algorithm to the (somewhat arbitrary) value
of 1000. We will experiment further with this, but in the meantime, if you really want to use this
method on a large dataset, and don't mind waiting for the results, you can increase the limit using
the set command with parameter rq_maxiter, as in
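for example (the value is illustrative):

set rq_maxiter 5000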
Nonparametric methods
The main focus of gretl is on parametric estimation, but we offer a selection of nonparametric
methods. The most basic of these are:
• various tests for difference in distribution (Sign test, Wilcoxon rank-sum test, Wilcoxon signed-rank test);
• the runs test for randomness; and
• measures of rank correlation (Spearman's ρ and Kendall's τ).
Details on the above can be found by consulting the help for the commands difftest, runs, corr
and spearman. In the GUI program these items are found under the Tools menu and the Robust
estimation item under the Model menu.
In this chapter we concentrate on two relatively complex methods for nonparametric curve-fitting
and prediction, namely William Cleveland’s “loess” (also known as “lowess”) and the Nadaraya–
Watson estimator.
wk (xi ) = W( h_i^{−1} (xk − xi ) )

where hi is the distance between xi and its r-th nearest neighbor, and W(·) is the tricube function,

  W(x) = (1 − |x|³)³   for |x| < 1
  W(x) = 0             for |x| ≥ 1
The local regression can be made robust via an adjustment based on the residuals, ei = yi − ŷi .
Robustness weights, δk , are defined by
δk = B(ek /6s)
where s is the median of the |ei | and B(·) is the bisquare function,
  B(x) = (1 − x²)²   for |x| < 1
  B(x) = 0           for |x| ≥ 1
An illustration of loess is provided in Listing 40.1: we generate a series that has a deterministic
sine wave component overlaid with noise uniformly distributed on (−1, 1). Loess is then used to
retrieve a good approximation to the sine function. The resulting graph is shown in Figure 40.1.
nulldata 120
series x = index
scalar n = $nobs
series y = sin(2*$pi*x/n) + uniform(-1, 1)
series yh = loess(y, x, 2, 0.75, 0)
gnuplot y yh x --output=display --with-lines=yh
[Figure 40.1: the noisy series y and the loess fit plotted against x.]
The Nadaraya–Watson estimator of the conditional mean m(x) is

m(xi ) = ∑_j yj · Kh (xi − xj ) / ∑_j Kh (xi − xj )

where Kh (·) is the so-called kernel function, which is usually some simple transform of a density
function that depends on a scalar, h, known as the bandwidth. The one used by gretl is

Kh (x) = exp( −x² / 2h )
for |x| < τ and zero otherwise. Larger values of h produce a smoother function. The scalar τ,
known as the trim parameter, is used to prevent numerical problems when the kernel function is
evaluated too far away from zero.
A common variant of Nadaraya–Watson is the so-called “leave-one-out” estimator, which omits the
i-th observation when evaluating m(xi ). The formula therefore becomes
m(xi ) = ∑_{j≠i} yj · Kh (xi − xj ) / ∑_{j≠i} Kh (xi − xj )
This makes the estimator more robust numerically and its usage is often advised for inference
purposes.
The nadarwat() function in gretl takes up to five arguments as follows: the dependent series y,
the independent series x, the bandwidth h, a Boolean switch to turn on “leave-one-out”, and a value
for the trim parameter τ, expressed as a multiple of h. The last three arguments are optional; if
they are omitted the default values are, respectively, an automatic data-determined value for h (see
below), leave-one-out not activated, and τ = 4. The default value of τ offers a relatively safe guard
against numerical problems; in some cases a larger τ may produce more sensible values in regions
of X with sparse support.
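A hedged sketch of typical calls, assuming series y and x are already defined (the explicit bandwidth value is illustrative):

series m0 = nadarwat(y, x)          # automatic bandwidth, no leave-one-out, default trim
series m1 = nadarwat(y, x, 2.0)     # explicit bandwidth
series m2 = nadarwat(y, x, 0, 1)    # automatic bandwidth, leave-one-out enabled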
Choice of bandwidth
As mentioned above, larger values of h lead to a smoother m(·) function; smaller values make
the m(·) function follow the yi values more closely, so that the function appears more “jagged”.
In fact, as h → ∞, m(xi ) → Ȳ ; on the contrary, if h → 0, observations for which xi ̸= X are not
taken into account at all when computing m(X). Also, the statistical properties of m(·) vary with
h: its variance can be shown to be decreasing in h, while its squared bias is increasing in h. It
can be shown that choosing h ∼ n−1/5 minimizes the RMSE, so that value is customarily taken as a
reference point.
If the argument h is omitted or set to 0, gretl uses the following data-determined value:

h = 0.9 · min( s , r/1.349 ) · n^{−1/5}

where s is the sample standard deviation of the explanatory variable and r its interquartile range.
[Figure 40.2: HA plotted against WA, together with three Nadaraya–Watson fits, m0, m1 and m2, corresponding to different bandwidth choices.]
Figure 40.2: Nadaraya–Watson example for several choices of the bandwidth parameter
If you need a point estimate of m(X) for some value of X which is not present among the valid
observations of your dependent variable, you may want to add some “fake” observations to your
dataset in which y is missing and x contains the values you want m(x) evaluated at. For example,
the following script evaluates m(x) at regular intervals between −2.0 and 2.0:
nulldata 120
set seed 120496
x m
MIDAS models
The acronym MIDAS stands for “Mixed Data Sampling”. MIDAS models can essentially be described
as models where one or more independent variables are observed at a higher frequency than the
dependent variable, and possibly an ad-hoc parsimonious parameterization is adopted. See Ghysels
et al., 2004; Ghysels, 2015; Armesto et al., 2010 for a fuller introduction. Naturally, these models
require easy handling of multiple-frequency data. The way this is done in gretl is explained in
Chapter 20; in this chapter, we concentrate on the numerical aspects of estimation.
where τ represents the reference point of the sequence of high-frequency lags in “high-frequency
time”.1 Obvious generalizations of this specification include a higher AR order for y and inclusion
of additional low- and/or high-frequency regressors.
Estimation of (41.1) can be accomplished via OLS. However, it is more common to enforce parsi-
mony by making the individual coefficients on lagged high-frequency terms a function of a relatively
small number of hyperparameters, as in
where W (·) is the weighting function associated with a given parameterization and θ is a k-vector
of hyperparameters, k < p.
This presents a couple of computational questions: how to calculate the per-lag coefficients given
the values of the hyperparameters, and how best to estimate the value of the hyperparameters?
Gretl can handle natively four commonly used parameterizations: normalized exponential Almon,
normalized beta (with or without a zero last coefficient), and plain (non-normalized) Almon poly-
nomial. The Almon variants take one or more parameters (two being a common choice). The beta
variants take either two or three parameters. Full details on the forms taken by the W (·) function
are provided in section 41.3.
All variants are handled by the functions mweights and mgradient, which work as follows (a short usage sketch appears after this list).
• mweights takes three arguments: the number of lags required (p), the k-vector of hyperpa-
rameters (θ), and an integer code or string indicating the method (see Table 41.1). It returns
a p-vector containing the coefficients.
• mgradient takes three arguments, just like mweights. However, this function returns a p × k
matrix holding the (analytical) gradient of the p coefficients or weights with respect to the k
elements of θ.
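A minimal sketch, assuming the string "nealmon" selects the normalized exponential Almon method and using illustrative hyperparameter values:

matrix theta = {1, -0.1}
matrix w = mweights(10, theta, "nealmon")     # 10 x 1 vector of lag coefficients
matrix G = mgradient(10, theta, "nealmon")    # 10 x 2 matrix of derivatives of w w.r.t. theta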
1 For discussion of the placement of this reference point relative to low-frequency time, see section 20.3 above.
In the case of the non-normalized Almon polynomial the γ coefficient in (41.2) is identically 1.0
and is omitted. The "beta1" case is the the same as the two-parameter "beta0" except that θ1 is
constrained to equal 1, leaving θ2 as the only free parameter. Ghysels and Qian (2016) make a case
for use of this particularly parsimonious version.2
An additional function is provided for convenience: it is named mlincomb and it combines mweights
with the lincomb function, which takes a list (of series) argument followed by a vector of coeffi-
cients and produces a series result, namely a linear combination of the elements of the list. If we
have a suitable list X available, we can call mlincomb to construct such a combination directly; this
is equivalent to calling lincomb on X with the coefficient vector produced by mweights, as in the sketch below.
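A hedged sketch, assuming a MIDAS list X and a hyperparameter vector theta are defined and that "nealmon" is the chosen method:

series hf = mlincomb(X, theta, "nealmon")
# equivalent to
series hf = lincomb(X, mweights(nelem(X), theta, "nealmon"))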
The final theta argument is optional in most cases (implying an automatic initialization of the
hyperparameters). If this argument is given it must take one of the following forms:
1. The name of a matrix (vector) holding initial values for the hyperparameters, or a simple
expression which defines a matrix using scalars, such as {1, 5}.
2. The keyword null, indicating that an automatic initialization should be used (as happens
when this argument is omitted).
3. An integer value (in numerical form), indicating how many hyperparameters should be used
(which again calls for automatic initialization).
The third of these forms is required if you want automatic initialization in the Almon polynomial
case, since we need to know how many terms you wish to include. (In the normalized exponential
Almon case we default to the usual two hyperparameters if theta is omitted or given as null.)
The midasreg syntax allows the user to specify multiple high-frequency predictors, if wanted: these
can have different lag specifications, different parameterizations and/or different frequencies.
The options accepted by midasreg include --quiet (suppress printed output), --verbose (show
detail of iterations, if applicable) and --robust (use a HAC estimator of the Newey–West type in
computing standard errors). Two additional specialized options are described below.
Examples of usage
Suppose we have a dependent variable named dy and a MIDAS list named dX, and we wish to run
a MIDAS regression using one lag of the dependent variable and high-frequency lags 1 to 10 of the
series in dX. The following will produce U-MIDAS estimates:
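A hedged sketch, assuming the mds() term takes the MIDAS list, the minimum and maximum high-frequency lags and a type code, with 0 denoting the unrestricted (U-MIDAS) case:

midasreg dy 0 dy(-1) ; mds(dX, 1, 10, 0)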
The next lines will produce estimates for the normalized exponential Almon parameterization with
two coefficients, both initialized to zero:
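A hedged sketch, assuming type code 1 selects the normalized exponential Almon weights:

midasreg dy 0 dy(-1) ; mds(dX, 1, 10, 1, {0,0})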
In the examples above, the required lags will be added to the dataset automatically then deleted
after use. If you are estimating several models using a single set of MIDAS lags it is more efficient to
create the lags once and use the mdsl specifier. For example, the following estimates three variant
parameterizations (exponential Almon, beta with zero last lag, and beta with non-zero last lag) on
the same data:
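A hedged sketch, assuming hflags() (see chapter 20) builds the lag list once and that type codes 1, 2 and 3 select the exponential Almon, beta0 and betan parameterizations respectively:

list dXL = hflags(1, 10, dX)
midasreg dy 0 dy(-1) ; mdsl(dXL, 1)
midasreg dy 0 dy(-1) ; mdsl(dXL, 2)
midasreg dy 0 dy(-1) ; mdsl(dXL, 3)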
Replication exercise
We give a substantive illustration of midasreg in Listing 41.1. This replicates the first practical
example discussed by Ghysels in the user's guide titled MIDAS Matlab Toolbox.3 The dependent
3 See Ghysels (2015). This document announces itself as Version 2.0 of the guide and is dated November 1, 2015.
The example we’re looking at appears on pages 24–26; the associated Matlab code can be found in the program
appADLMIDAS1.m.
variable is the quarterly log-difference of real GDP, named dy in our script. The independent vari-
ables are the first lag of dy and monthly lags 3 to 11 of the monthly log-difference of non-farm
payroll employment (named dXL in our script). Therefore, in this case equation (41.2) becomes
The script exercises all five of the parameterizations mentioned above,4 and in each case the results
of 9 pseudo-out-of-sample forecasts are recorded so that their Root Mean Square Errors can be
compared.
The data file used in the replication, gdp_midas.gdt, was constructed as described in section 20.1
(and as noted there, it is included in the current gretl package). Part of the output from the replica-
tion script is shown in Listing 41.2. The γ coefficient is labeled HF_slope in the gretl output.
For reference, output from Matlab (version R2016a for Linux) is available at http://gretl.sourceforge.
net/midas/matlab_output.txt. For the most part (in respect of regression coefficients and aux-
iliary statistics such as R 2 and forecast RMSEs), gretl’s output agrees with that of Matlab to the
extent that one can reasonably expect on nonlinear problems — that is, to at least 4 significant dig-
its in all but a few instances.5 Standard errors are not quite so close across the two programs,
particularly for the hyperparameters of the beta and exponential Almon functions. We show these
in Table 41.2.
Differences of this order are not unexpected, however, when different methods are used to calcu-
late the covariance matrix for a nonlinear regression. The Matlab standard errors are based on a
numerical approximation to the Hessian at convergence, while those produced by gretl are based
on a Gauss–Newton Regression, as discussed and recommended in Davidson and MacKinnon (2004,
chapter 6).
Underlying methods
The midasreg command calls one of several possible estimation methods in the background, de-
pending on the MIDAS specification(s). As shown in Listing 41.2, this is flagged in a line of output
immediately preceding the “Dependent variable” line. If the only specification type is U-MIDAS,
the method is OLS. Otherwise it is one of three variants of Nonlinear Least Squares.
# estimation sample
smpl 1985:1 2009:1
print "=== normalized beta with zero last lag (beta0) ==="
midasreg dy 0 dy(-1) ; mdsl(dXL, 2, {1,5})
fcast --out-of-sample --static --quiet
FC ~= $fcast
Forecast RMSEs:
umidas 0.5424
beta0 0.5650
betan 0.5210
nealmon 0.5642
almonp 0.5329
• L-BFGS-B with conditional OLS. L-BFGS is a “limited memory” version of the BFGS optimizer
and the trailing “-B” means that it supports bounds on the parameters, which is useful for
reasons given below.
• Golden Section search with conditional OLS. This is a line search method, used only when
there is a just a single hyperparameter to estimate.
Levenberg–Marquardt is the default NLS method, but if the MIDAS specifications include any of
the beta variants or normalized exponential Almon we switch to L-BFGS-B, unless the user gives the
--levenberg option. The ability to set bounds on the hyperparameters via L-BFGS-B is helpful, first
because the beta parameters (other than the third one, if applicable) must be non-negative but also
because one is liable to run into numerical problems (in calculating the weights and/or gradient) if
their values become too extreme. For example, we have found it useful to place bounds of −2 and
+2 on the exponential Almon parameters.
Here’s what we mean by “conditional OLS” in the context of L-BFGS-B and line search: the search
algorithm itself is only responsible for optimizing the MIDAS hyperparameters, and when the algo-
rithm calls for calculation of the sum of squared residuals given a certain hyperparameter vector we
optimize the remaining parameters (coefficients on base-frequency regressors, slopes with respect
to MIDAS terms) via OLS.
Despite the strong evidence for a structural break, in this case the nonlinear estimator appears to
converge successfully. But one might wonder if a shorter estimation period could provide better
out-of-sample forecasts.
Listing 41.3 presents a more ambitious example: we use GSSmin (Golden Section minimizer) to es-
timate a MIDAS model with the “one-parameter beta” specification (that is, the two-parameter beta
with θ1 clamped at 1). Note that while the function named beta1_SSR is specialized to the given
parameterization, midas_GNR is a fairly general means of calculating the Gauss–Newton regression
for an ADL(1) MIDAS model, and it could be generalized further without much difficulty.
Plot of coefficients
At times, it may be useful to plot the “gross” coefficients on the lags of the high-frequency series
in a MIDAS regression—that is, the normalized weights multiplied by the HF_slope coefficient
(the γ in 41.2). After estimation of a MIDAS model in the gretl GUI this is available via the item
MIDAS coefficients under the Graphs menu in the model window. It is also easily generated via
script, since the $model bundle that becomes available following the midasreg command contains
a matrix, midas_coeffs, holding these coefficients. So the following is sufficient to display the
plot:
matrix m = $model.midas_coeffs
plot m
options with-lp fit=none
literal set title "MIDAS coefficients"
literal set ylabel ''
end plot --output=display
Caveat: this feature is at present available only for models with a single MIDAS specification.
wi = f(i, θ) / ∑_{k=1}^{p} f(k, θ)        (41.3)
/* main */
# estimation sample
smpl 1985:1 2009:1
In the normalized exponential Almon case with m parameters the function f (·) is
f(i, θ) = exp( ∑_{j=1}^{m} θj i^j )        (41.4)

so that, with the standard choice of two hyperparameters,

wi = exp(θ1 i + θ2 i²) / ∑_{k=1}^{p} exp(θ1 k + θ2 k²)
wi^(3) = ( wi + θ3 ) / ( 1 + p θ3 )

That is, we add θ3 to each weight then renormalize so that the wi^(3) values again sum to unity.
In Eric Ghysels’ Matlab code the two beta variants are labeled “normalized beta density with a zero
last lag” and “normalized beta density with a non-zero last lag” respectively. Note that while the
two basic beta parameters must be positive, the third additive parameter may be positive, negative
or zero.
Note that no normalization is applied in this case, so no additional coefficient should be placed
before the MIDAS lags term in the context of a regression.
Analytical gradients
Here we set out the expressions for the analytical gradients produced by the mgradient function,
and also used internally by the midasreg command. In these expressions f (i, θ) should be un-
derstood as referring back to the specific forms noted above for the exponential Almon and beta
distributions. The summation ∑_k should be understood as running from 1 to p.
For the normalized exponential Almon case, differentiating (41.3), with f(·) as in (41.4), gives

dwi /dθj = wi ( i^j − ∑_k k^j wk )
Part III
Technical details
Chapter 42
Gretl and ODBC
Gretl provides a method for retrieving data from databases which support the Open Database
Connectivity (ODBC) standard. Most users won’t be interested in this, but there may be some for
whom this feature matters a lot—typically, those who work in an environment where huge data
collections are accessible via a Data Base Management System (DBMS).
In the following section we explain what is needed for ODBC support in gretl. We provide some
background information on how ODBC works in section 42.2, and explain the details of getting gretl
to retrieve data from a database in section 42.3. Section 42.4 provides some example of usage, and
section 42.5 gives some details on the management of ODBC connections.
[Diagram: the gretl client sends an SQL query via ODBC to the DBMS, and the data are returned by the same route.]
For the above mechanism to work, it is necessary that the relevant ODBC software is installed
and working on the client machine (contact your DB administrator for details). At this point, the
database (or databases) that the server provides will be accessible to the client as a data source
with a specific identifier (a Data Source Name or DSN); in most cases, a username and a password
are required to connect to the data source.
Once the connection is established, the user sends a query to ODBC, which contacts the database
manager, collects the results and sends them back to the user. The query is almost invariably
formulated in a special language used for the purpose, namely SQL.1 We will not provide here an
SQL tutorial: there are many such tutorials on the Net; besides, each database manager tends to
support its own SQL dialect so the precise form of an SQL query may vary slightly if the DBMS on
the other end is Oracle, MySQL, PostgreSQL or something else.
Suffice it to say that the main statement for retrieving data is the SELECT statement. Within a DBMS,
data are organized in tables, which are roughly equivalent to spreadsheets. The SELECT statement
returns a subset of a table, which is itself a table. For example, imagine that the database holds a
table called “NatAccounts”, containing the data shown in Table 42.1.
Gretl provides a mechanism for forwarding your query to the DBMS via ODBC and including the
results in your currently open dataset.
42.3 Syntax
At present we do not offer a graphical interface for ODBC import; this must be done via the com-
mand line interface. The two commands used for fetching data via an ODBC connection are open
and data.
The open command is used for connecting to a DBMS: its syntax is
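Based on the description below and the examples in section 42.4, the command presumably takes the form

open dsn=database [user=username] [password=password] --odbc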
The user and password items are optional; the effect of this command is to initiate an ODBC
connection. It is assumed that the machine gretl runs on has a working ODBC client installed.
1 See http://en.wikipedia.org/wiki/SQL.
In order to actually retrieve the data, the data command is used. Its syntax is:
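A hedged sketch of the command's form, pieced together from the description below (the trailing --odbc flag is an assumption and may depend on the gretl version):

data series [obs-format=format-string] query=query-string --odbc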
where:
series is a list of names of gretl series to contain the incoming data, separated by spaces. Note that
these series need not exist prior to the ODBC import.
query-string is a string containing the SQL statement used to extract the data.
There should be no spaces around the equals signs in the obs-format and query fields in the data
command.
The query-string can, in principle, contain any valid SQL statement which results in a table. This
string may be specified directly within the command, as in
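for instance, a sketch consistent with the sentence that follows (the --odbc flag is assumed):

data x query="SELECT foo FROM bar" --odbc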
which will store into the gretl variable x the content of the column foo from the table bar. However,
since in a real-life situation the string containing the SQL statement may be rather long, it may be
best to store it in a string variable. For example:
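A hedged sketch (table and column names hypothetical; the --odbc flag is assumed):

string SqlQry = "SELECT foo1, foo2 FROM bar"
data x y query=SqlQry --odbc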
(The series named index is automatically added to a dataset created via the nulldata command.)
The format specifiers available for use with obs-format are as follows:
In addition the format can include literal characters to be passed through, such as slashes or colons,
to make the resulting string compatible with gretl’s observation identifiers.
For example, consider the following fictitious case: we have a 5-days-per-week dataset, to which we
want to add the stock index for the Verdurian market;2 it so happens that in Verduria Saturdays
are working days but Wednesdays are not. We want a column which does not contain data on
Saturdays, because we wouldn’t know where to put them, but at the same time we want to place
missing values on all the Wednesdays.
In this case, the following syntax could be used
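# the table name AlmeaIndex is illustrative
string Qry = "SELECT year, month, day, VerdSE FROM AlmeaIndex"
data y obs-format="%d-%02d-%02d" query=Qry --odbc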
The column VerdSE holds the data to be fetched, which will go into the gretl series y. The first
three columns are used to construct a string which identifies the day. Daily dates take the form
YYYY-MM-DD in gretl. If a row from the DBMS produces the observation string 2008-04-01 this will
match OK (it’s a Tuesday), but 2008-04-05 will not match since it is a Saturday; the corresponding
row will therefore be discarded. On the other hand, since no string 2008-04-23 will be found in
the data coming from the DBMS (it’s a Wednesday), that entry is left blank in our series y.
42.4 Examples
In the following examples, we will assume that access is available to a database known to ODBC
with the data source name “AWM”, with username “Otto” and password “Bingo”. The database
“AWM” contains quarterly data in two tables (see Tables 42.3 and 42.4):
2 See http://www.almeopedia.com/index.php/Verduria.
The table Consump is the classic “rectangular” dataset; that is, its internal organization is the same
as in a spreadsheet or econometrics package: each row is a data point and each column is a variable.
The structure of the DATA table is different: each record is one figure, stored in the column xval,
and the other fields keep track of which variable it belongs to, for which date.
nulldata 160
setobs 4 1970:1 --time-series
open dsn=AWM user=Otto password=Bingo --odbc
Listing 42.1 shows a query for two series: first we set up an empty quarterly dataset. Then we
connect to the database using the open statement. Once the connection is established we retrieve
two columns from the Consump table. No observation string is required because the data already
have a suitable structure; we need only import the relevant columns.
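The retrieval step of Listing 42.1 is a single data command along these lines (the column names are
illustrative, since the full listing is not reproduced here):
data y_ea y_us query="SELECT C_EA, C_US FROM Consump" --odbc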
In Listing 42.2, by contrast, we make use of the observation string since we are drawing from the
DATA table, which is not rectangular. The SQL statement stored in the string S produces a table with
three columns. The ORDER BY clause ensures that the rows will be in chronological order, although
this is not strictly necessary in this case.
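The core of that listing might look something like this (the name of the column identifying the
variable, and the value it is matched against, are illustrative):
string S = "SELECT year, qtr, xval FROM DATA WHERE varname = 'WLN' ORDER BY year, qtr"
data y obs-format="%d:%d" query=S --odbc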
Listing 42.3 shows what happens if the rows in the outcome from the SELECT statement do not
match the observations in the currently open gretl dataset. The query includes a condition which
filters out all the data from the first quarter. The query result (invisible to the user) would be
something like
+------+------+---------------+
| year | qtr | xval |
+------+------+---------------+
| 1970 | 2 | 7.8705000000 |
| 1970 | 3 | 7.5600000000 |
| 1970 | 4 | 7.1892000000 |
| 1971 | 2 | 5.8679000000 |
| 1971 | 3 | 6.2442000000 |
| 1971 | 4 | 5.9811000000 |
| 1972 | 2 | 4.6883000000 |
| 1972 | 3 | 4.6302000000 |
...
Internally, gretl fills the variable bar with the corresponding value if it finds a match; otherwise, NA
is used. Printing out the variable bar thus produces
Obs bar
1970:1
1970:2 7.8705
1970:3 7.5600
1970:4 7.1892
1971:1
1971:2 5.8679
1971:3 6.2442
1971:4 5.9811
1972:1
1972:2 4.6883
1972:3 4.6302
...
Chapter 43
Gretl and TEX
43.1 Introduction
TEX — initially developed by Donald Knuth of Stanford University and since enhanced by hundreds
of contributors around the world — is the gold standard of scientific typesetting. Gretl provides
various hooks that enable you to preview and print econometric results using the TEX engine, and
to save output in a form suitable for further processing with TEX.
This chapter explains the finer points of gretl’s TEX-related functionality. The next section describes
the relevant menu items; section 43.3 discusses ways of fine-tuning TEX output; and section 43.4
gives some pointers on installing (and learning) TEX if you do not already have it on your computer.
(Just to be clear: TEX is not included with the gretl distribution; it is a separate package, including
several programs and a large number of supporting files.)
Before proceeding, however, it may be useful to set out briefly the stages of production of a final
document using TEX. For the most part you don’t have to worry about these details, since, in regard
to previewing at any rate, gretl handles them for you. But having some grasp of what is going on
behind the scenes will enable you to understand your options better.
The first step is the creation of a plain text “source” file, containing the text or mathematics to
be typeset, interspersed with mark-up that defines how it should be formatted. The second step
is to run the source through a processing engine that does the actual formatting. Typically this is
a program called pdflatex that generates PDF output.1 (In times gone by it was a program called latex
that generated so-called DVI (device-independent) output.)
So gretl calls pdflatex to process the source file. On MS Windows and Mac OS X, gretl expects the
operating system to find the default viewer for PDF output. On GNU/Linux you can specify your
preferred PDF viewer via the menu item “Tools, Preferences, General,” under the “Programs” tab.
\[
\widehat{\mathrm{ENROLL}} = \underset{(0.066022)}{0.241105}
  + \underset{(0.04597)}{0.223530}\,\mathrm{CATHOL}
  - \underset{(0.0027196)}{0.00338200}\,\mathrm{PUPIL}
  - \underset{(0.040706)}{0.152643}\,\mathrm{WHITE}
\]
The distinction between the “Copy” and “Save” options (for both tabular and equation) is twofold.
First, “Copy” puts the TEX source on the clipboard while with “Save” you are prompted for the name
of a file into which the source should be saved. Second, with “Copy” the material is copied as a
“fragment” while with “Save” it is written as a complete file. The point is that a well-formed TEX
source file must have a header that defines the documentclass (article, report, book or whatever)
and tags that say \begin{document} and \end{document}. This material is included when you do
“Save” but not when you do “Copy”, since in the latter case the expectation is that you will paste
the data into an existing TEX source file that already has the relevant apparatus in place.
The items under “Equation options” should be self-explanatory: when printing the model in equa-
tion form, do you want standard errors or t-ratios displayed in parentheses under the parameter
estimates? The default is to show standard errors; if you want t-ratios, select that item.
Other windows
Several other sorts of output windows also have TEX preview, copy and save enabled. In the case of
windows having a graphical toolbar, look for the TEX button. Figure 43.2 shows this icon (second
from the right on the toolbar) along with the dialog that appears when you press the button.
One aspect of gretl’s TEX support that is likely to be particularly useful for publication purposes is
the ability to produce a typeset version of the “model table” (see section 3.4). An example of this is
shown in Table 43.2.
[Table 43.2 about here: a typeset model table of three OLS estimates with dependent variable
ENROLL. The surviving fragment shows the ADMEXP coefficient −0.1551 (0.1342), n = 51 in each
column, R̄² values of 0.4502, 0.4462 and 0.2956, and log-likelihoods ℓ of 96.09, 95.36 and 88.69.]

The default value of the preamble is as follows:

\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{dcolumn,longtable}
\begin{document}
\thispagestyle{empty}
Note that the amsmath and dcolumn packages are required. (For some sorts of output the longtable
package is also needed.) Beyond that you can, for instance, change the type size or the font by
altering the documentclass declaration or including an alternative font package.
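For instance, a customized preamble that switches to 12-point Times fonts might look like this (a
sketch; mathptmx is just one of several possible font packages):
\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{dcolumn,longtable}
\usepackage{mathptmx}
\begin{document}
\thispagestyle{empty}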
In addition, if you wish to typeset gretl output in more than one language, you can set up per-
language preamble files. A “localized” preamble file is identified by a name of the form gretlpre_xx.tex,
where xx is replaced by the first two letters of the current setting of the LANG environment vari-
able. For example, if you are running the program in Polish, using LANG=pl_PL, then gretl will do
the following when writing the preamble for a TEX source file.
1. Look for a file named gretlpre_pl.tex in the gretl user directory. If this is not found, then
2. look for a file named gretlpre.tex in the gretl user directory. If this is not found, then
3. use the default preamble.
Conversely, suppose you usually run gretl in a language other than English, and have a suitable
gretlpre.tex file in place for your native language. If on some occasions you want to produce TEX
output in English, then you could create an additional file gretlpre_en.tex: this file will be used
for the preamble when gretl is run with a language setting of, say, en_US.
Command-line options
After estimating a model via a script, or interactively via the gretl console or the command-line
program gretlcli, you can use the commands tabprint or eqnprint to print the model to
file in tabular format or equation format respectively. These options are explained in the Gretl
Command Reference.
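For example, assuming an --output option for directing the result to a named file (see the Command
Reference for the exact option syntax), a script might run:
ols ENROLL 0 CATHOL PUPIL WHITE
eqnprint --output="enroll_eq.tex"
tabprint --output="enroll_tab.tex"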
If you wish to alter the appearance of gretl’s tabular output for models in the context of the tabprint
command, you can specify a custom row format using the --format flag. The format string must
be enclosed in double quotes and must be tied to the flag with an equals sign. The pattern for the
format string is as follows. There are four fields, representing the coefficient, standard error, t-
ratio and p-value respectively. These fields should be separated by vertical bars; they may contain
a printf-type specification for the formatting of the numeric value in question, or may be left
blank to suppress the printing of that column (subject to the constraint that you can’t leave all the
columns blank). Here are a few examples:
--format="%.4f|%.4f|%.4f|%.4f"
--format="%.4f|%.4f|%.3f|"
--format="%.5f|%.4f||%.4f"
--format="%.8g|%.8g||%.4f"
The first of these specifications prints the values in all columns using 4 decimal places. The second
suppresses the p-value and prints the t-ratio to 3 places. The third omits the t-ratio. The last one
again omits the t, and prints both coefficient and standard error to 8 significant figures.
Once you set a custom format in this way, it is remembered and used for the duration of the gretl
session. To revert to the default formatting you can use the special variant --format=default.
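For instance, to set a format that omits the t-ratio column and later restore the default layout:
tabprint --format="%.5f|%.4f||%.4f"
# later calls to tabprint use this format until it is reset
tabprint --format=default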
Further editing
Once you have pasted gretl’s TEX output into your own document, or saved it to file and opened it
in an editor, you can of course modify the material in any way you wish. In some cases, machine-
generated TEX is hard to understand, but gretl’s output is intended to be human-readable and
-editable. In addition, it does not use any non-standard style packages. Besides the standard LATEX
document classes, the only files needed are, as noted above, the amsmath, dcolumn and longtable
packages. These should be included in any reasonably full TEX implementation.