Stata - Applied Economics

About this introduction to Stata

This introduction provides a stripped-down overview of some of the features, and commands available within Stata, as well as helpful examples of the features in their context.

Examples within this document are given with the included dataset auto.dta which is a set of Consumer Reports data on US Automobiles in 1978. To replicate the examples, load the data using sysuse auto.dta, clear (note this will delete the data currently in your environment).

Using a .do file

Stata supports interactive or .do file command execution. While the interactive console can be convenient for experimentation, it is really bad for creating reproducible work, and easily allows errors to enter into your results unnoticed.

As such always make sure work is written in a .do file.

Commands

Environment setup

cd Sets the working directory, which is the default folder from which all other commands will run. This means that your output (e.g. log files) and input (e.g. data files) will all be read from and written to this location by default.

cd [path]

e.g.

cd "O:\Documents\Applied Economics\Introduction"

use Loads a dataset from your working directory.

use [datafile] *, clear*

e.g.

. use DataUoBSet2.dta, clear

Data management

display Basic way of outputting data or the output of a calculation to the console. When presented with a list of data, the display option will output the first item

display [operation]

e.g. Simple example

. display 1 + 1
2

Concatinating Variables, Strings, and Integers. Note that despite there being 74 values for make and price only the first one is chosen

. display make " $" price
AMC Concord $4099

Displaying function output

. display sqrt(weight)
54.129474

list Outputs each value for a given variable(s) or the whole dataset into the console. By default this is in a paginated form with –More–, which can then be scrolled through by using any key on the keyboard.

list *[variables...]*

e.g. Listing the makes and prices of cars

. list make price

     +----------------------------+
     | make                 price |
     |----------------------------|
  1. | AMC Concord          4,099 |
  2. | AMC Pacer            4,749 |
  3. | AMC Spirit           3,799 |
  4. | Buick Century        4,816 |
  5. | Buick Electra        7,827 |
     |----------------------------|
  6. | Buick LeSabre        5,788 |
...
 27. | Linc. Mark V        13,594 |
--more--

When using this with a .do file having the interactive –more– element is undesirable, so it is recommended to run with set more off environment configuration, where the same command would then be

. set more off

. list make price

     +----------------------------+
     | make                 price |
     |----------------------------|
  1. | AMC Concord          4,099 |
  2. | AMC Pacer            4,749 |
  3. | AMC Spirit           3,799 |
  4. | Buick Century        4,816 |
  5. | Buick Electra        7,827 |
     |----------------------------|
  6. | Buick LeSabre        5,788 |
...

 73. | VW Scirocco          6,850 |
 74. | Volvo 260           11,995 |
     +----------------------------+

generate gen creates new variables, either with an absolute value, or as the output of other variables and functions.

gen [name] = [operation]

e.g. To create a co-efficient of boxiness (or possibly width) for each car

. gen boxiness = length * headroom / displacement

drop removes variables and values from the working environment.

drop [variable] 
drop if [condition]
drop _all

e.g. to remove all the foreign cars from the dataset

. drop if foreign
(22 observations deleted)

Given that foreign is a byte variable, this is a shortcut for saying drop if foreign == 1.

Similarly if we don’t care about the turning circle in any of our calculations we can remove the data from the set by executing

. drop turn

replace allows values to be set for pre-existing variables. Be wary when using this as it will change the original variables irrevocably. Because of this risk it outputs changes that have been made, so that they are not missed.

replace [variable] = [value] *in [index]* *if [condition]*

e.g. to create a dummy variable for heavy cars we would generate the variable first, then replace the blanks

. gen heavy = 1 if weight > 4000
(65 missing values generated)

. replace heavy = 0 if missing(heavy)
(65 real changes made)

or if we know that the price is incorrect for the first car we could

. replace price = 30000 in 1
(1 real change made)

describe will provide information about a given variable, or the dataset as a whole. This will include the data type, any labels and the display format, together with other information about the dataset.

describe *[variables...]*

e.g. to describe the mpg and foreign variables

. describe mpg foreign

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------
mpg             int     %8.0g                 Mileage (mpg)
foreign         byte    %8.0g      origin     Car type

or to get information on the whole dataset

. describe

Contains data from C:\Program Files\Stata14IC\ado\base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2014 17:45
 size:         3,182                          (_dta has notes)
---------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
---------------------------------------------------------------------
Sorted by: foreign


rename simply changes the name of a variable.

rename [variable] [name]

e.g.

. rename rep78 repairRecord

Table outputs

summarize most often used to provide quick details of a given variable


tabulate tab can quickly collect variables, as well as providing pdf/cdf values for a given sample within the dataset. Also possible to use for generating dummy variables from factor variables.

tabulate [variable] *, generate([newVarName])*

tabstat a compact output

tabstat [variable...] *,* *by([variable])* *stat([statnames...])*

e.g. to show prices and fuel efficiency for domestic and foreign cars

. tabstat price mpg, by(foreign)

Summary statistics: mean
  by categories of: foreign (Car type)

 foreign |     price       mpg
---------+--------------------
Domestic |  6072.423  19.82692
 Foreign |  6384.682  24.77273
---------+--------------------
   Total |  6165.257   21.2973
------------------------------

or to look at various properties of the prices

. tabstat price, stat(mean sd q)

    variable |      mean        sd       p25       p50       p75
-------------+--------------------------------------------------
       price |  6165.257  2949.496      4195    5006.5      6342
----------------------------------------------------------------

table is both the most complex and most simple of the table outputs, providing all the functionality of the others, but with far fewer default behaviours.

Interactive Env

browse opens a spreadsheet-like table of all or the selected variables in the Stata Graphical User Interface, according to any filters which may have been imposed.

browse *[variables...]* *if [condition]* *, nolabel*

e.g. to browse the name and mpg of properties for cars worth more than 10,000

. browse make mpg if price > 10000

browse make mpg if price > 10000


edit is like browse, however the table cells are editable. This is not a recommended way of making edits to your data, however any changes made here will show in the console as their equivlent commands.

Graph

Graphing functions all have lots of options, however the examples given here are will provide a very brief examples for each case. See the stata docs for details on things like labeling, colouring or otherwise customizing graph functionality.

histogram draws a histogram to display the density or percentage change for a given value or range. Note the comma in the following

histogram [variable] *if [condition]*, *start([start])* *width([width])* *bin([no_of_bins])* *nodraw saving([filename])*

e.g. to do a default histogram of weight

. histogram weight

histogram weight

to display the mpg values for domestic cars with bars each covering a range of 5

. histogram mpg if foreign == 0, width(5) start(10)
(bin=5, start=10, width=5)

histogram mpg if foreign == 0, width(5) start(10)


scatter creates a basic scatter plot of two datasets

scatter [variable1] [variable2] *[w=[weight_variable]]*

e.g. to create a simple scatter plot of price and weight

. scatter price weight

ScatterPriceWeight

To create a scatter plot of mpg and weight with markers weighted based on the price

. scatter mpg weight [w=price], msymbol(circle_hollow)
(analytic weights assumed)
(analytic weights assumed)
(analytic weights assumed)

scatter mpg weight [w=price], msymbol(circle_hollow)


twoway is the generic form for a two way plot (e.g. scatter, line etc.) and provides the ability to plot multiple lines on the same axis.


graph As with table the graph functionality is the generic form for graphs and can be used to create any graph, and some graphing forms require this. See Stata Documentation for full details.

Execute

regress reg

predict

Test

test

ttest

Survivor Model

stset

streg

sts

Useful features

Maths Macros

+ - / * ^ Operators

log ln

exp

sqrt

sum

min max

floor ceiling round

int

Factor Variables

http://www.stata.com/help.cgi?fvvarlist

xi

Special Variables

egen

local

return r

ereturn

log

capture

using

Flow control

while

for

Labelling

Labelling could have fallen under the main category, however it turns out that it is incredibly powerful within Stata. With support for i18n, labelling data sets, variables and data values, it cannot easily be covered in the same short-format as other commands. label applies and manages labels for datasets, variables, and data values.

For a great guide to this see the Princeton Stata Tutorial Section 2.3