Pig Latin Reference Manual 2

Table of contents
1. Overview
2. Data Types and More
3. Arithmetic Operators and More
4. Relational Operators
5. Diagnostic Operators
6. UDF Statements
7. Eval Functions
1. Overview
Use this manual together with Pig Latin Reference Manual 1.
Also, be sure to review the information in the Pig Cookbook.
1.1. Conventions
Conventions for the syntax and code examples in the Pig Latin Reference Manual are
described here.
Convention: ( )
Description: Parentheses enclose multiple items.
Example: (1, abc, (2,4,6))

Convention: [ ]
Description: Brackets enclose optional items.
Example: [INNER | OUTER]

Convention: UPPERCASE, lowercase
Description: In general, uppercase indicates elements the system supplies; lowercase indicates names (aliases) and data that you supply.
Example: In A = LOAD 'student', the keyword LOAD is supplied by the system; A, f1 are names (aliases); 'student' is data supplied by you.
1.2. Keywords
Pig keywords are listed here.
-- A
-- B
-- C
-- D
-- E
-- F
-- G
generate, group
-- H
help
-- I
-- J
join
-- K
kill
-- L
-- M
-- N
not, null
-- O
-- P
-- Q
quit
-- R
-- S
-- T
-- U
union, using
-- V, W, X, Y, Z
-- Symbols
2. Data Types and More
2.1. Relations, Bags, Tuples, Fields
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database,
where the tuples in the bag correspond to the rows in a table. Unlike a relational table,
however, Pig relations don't require that every tuple contain the same number of fields or that
the fields in the same position (column) have the same type.
Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
2.1.1. Referencing Relations
Relations are referred to by name (or alias). Names are assigned by you as part of the Pig
Latin statement. In this example the name (alias) of the relation is A.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
                                            First Field   Second Field   Third Field
Data type                                   chararray     int            float
Positional notation (generated by system)   $0            $1             $2
Possible name                               name          age            gpa
Field value (for the first tuple)           John          18             4.0
As shown in this example, when you assign names to fields you can still refer to the fields using positional notation. However, for debugging purposes and ease of comprehension, it is better to use field names.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
In this example an error is generated because the requested column ($3) is outside of the
declared schema (positional notation begins with $0). Note that the error is caught before the
statements are executed.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
B = FOREACH A GENERATE $3;
DUMP B;
2009-01-21 23:03:46,715 [main] ERROR org.apache.pig.tools.grunt.GruntParser
- java.io.IOException:
Out of bound access. Trying to access non-existent : 3. Schema {f1:
bytearray,f2: bytearray,f3: bytearray} has 3 column(s).
etc ...
cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));
DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
X = FOREACH A GENERATE t1.t1a,t2.$0;
DUMP X;
(3,4)
(1,3)
(2,9)
Simple Data Types     Description                                  Example
int                   Signed 32-bit integer                        10
long                  Signed 64-bit integer                        Data: 10L or 10l; Display: 10L
float                 32-bit floating point                        Data: 10.5F or 10.5f; Display: 10.5F
double                64-bit floating point                        10.5 or 10.5e2
chararray             Character array (string) in UTF-8 format     hello world
bytearray             Byte array (blob)

Complex Data Types    Description                                  Example
tuple                 An ordered set of fields.                    (19,2)
bag                   A collection of tuples.                      {(19,2), (18,1)}
map                   A set of key value pairs.                    [open#apache]
If a schema is defined as part of a load statement, the load function will attempt to
enforce the schema. If the data does not conform to the schema, the loader will generate a
null value or an error.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
If an explicit cast is not supported, an error will occur. For example, you cannot cast a
chararray to int.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE (int)name;
This will cause an error
If Pig cannot resolve incompatible types through implicit casts, an error will occur. For
example, you cannot add chararray and float (see the Types Table for addition and
subtraction).
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name + gpa;
2.2.2. Tuple
A tuple is an ordered set of fields.
2.2.2.1. Syntax
( field [, field ] )
2.2.2.2. Terms
( )
A tuple is enclosed in parentheses ( ).
field
A piece of data. A field can be any data type (including tuple and bag).
2.2.2.3. Usage
You can think of a tuple as a row with one or more fields, where each field can be any data
type and any field may or may not have data. If a field has no data, then the following
happens:
In a load statement, the loader will inject null into the tuple. The actual value that is
substituted for null is loader specific; for example, PigStorage substitutes an empty field
for null.
In a non-load statement, if a requested field is missing from a tuple, Pig will inject null.
2.2.2.4. Example
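In this example the tuple contains three fields; it reuses the student data from the earlier examples (the exact values here are illustrative).

(John,18,4.0F)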
2.2.3. Bag
A bag is a collection of tuples.
2.2.3.1. Syntax: Inner bag
{ tuple [, tuple ] }
2.2.3.2. Terms
{ }
An inner bag is enclosed in curly brackets { }.
tuple
A tuple.
2.2.3.3. Usage
In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
In this example X is a relation or bag of tuples; the tuples in relation X have two fields, and the second field is an inner bag.

X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
2.2.4. Map
A map is a set of key value pairs.
2.2.4.1. Syntax (<> denotes optional)
[ key#value <, key#value > ]
2.2.4.2. Terms
[ ]
A map is enclosed in straight brackets [ ].
#
Separates the key from the value.
key
A scalar; must be a unique value within the map.
value
Any data type.
2.2.4.3. Usage
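Key values within a map must be unique. For example, a map with two key value pairs (reusing the contact-details map shown in the Nulls section below):

[name#John, phone#5551212]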
2.3. Nulls
In Pig Latin, nulls are implemented using the SQL definition of null as unknown or
non-existent. Nulls can occur naturally in data or can be the result of an operation.
2.3.1. Nulls and Operators
Pig Latin operators interact with nulls as shown in this table.
Operator                                                 Interaction
Comparison operators: ==, !=, >, <, >=, <=               If either subexpression is null, the result is null.
Comparison operator: matches                             If either the string being matched against or the string defining the match is null, the result is null.
Arithmetic operators: +, -, *, /, % modulo, ? bincond    If either subexpression is null, the resulting expression is null.
Null operator: is null                                   If the tested value is null, returns true; otherwise, returns false.
Null operator: is not null                               If the tested value is not null, returns true; otherwise, returns false.
Dereference operators: tuple (.) or map (#)              If the dereferenced tuple or map is null, returns null.
Functions: COUNT, AVG, MIN, MAX, SUM                     These functions ignore nulls.
Function: CONCAT                                         If either subexpression is null, the resulting expression is null.
Function: SIZE                                           If the tested object is null, returns null.
For Boolean sub-expressions, note the results when nulls are used with these operators:

FILTER operator - If a filter expression results in a null value, the filter does not pass it through (if X is null, !X is also null, and the filter will reject both).

Bincond operator - If a Boolean sub-expression results in a null value, the resulting expression is null (see the interactions above for arithmetic operators).
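For example, because a comparison with a null field yields null, both of these filters reject a tuple whose f1 is null (a minimal sketch reusing field names from earlier examples):

X = FILTER A BY f1 == 8;
Y = FILTER A BY NOT (f1 == 8);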
2.3.2. Nulls and Constants
Nulls can be used as constant expressions in place of expressions of any type.
In this example a and null are projected.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a, null;
In this example of an outer join, if the join key is missing from a table it is replaced by null.
A = LOAD 'student' AS (name: chararray, age: int, gpa: float);
B = LOAD 'votertab10k' AS (name: chararray, age: int, registration:
chararray, donation: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)),
FLATTEN((IsEmpty(B) ? null : B));
Like any other expression, null constants can be implicitly or explicitly cast.
In this example both a and null will be implicitly cast to double.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a + null;
In this example both a and null will be cast to int, a implicitly, and null explicitly.
A = LOAD 'data' AS (a, b, c);
B = FOREACH A GENERATE a + (int)null;
Nulls can also be the result of an operation, for example:
Dereferencing a key that does not exist in a map. For example, given a map, info, containing [name#John, phone#5551212], if a user tries to use info#address a null is returned.
Accessing a field that does not exist in a tuple.
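A minimal sketch of the first case (assuming relation A has the map field info described above):

B = FOREACH A GENERATE info#'address';  -- yields null for every tuple, since the key does not exist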
2.4. Constants
Pig provides constant representations for all data types except bytearrays.
Data Type    Constant Example                    Notes
int          19
long         19L
float        19.2F or 1.92e2f
double       19.2 or 1.92e2
chararray    'hello world'
bytearray    Not applicable.
tuple        (19, 2, 1)                          A constant in this form creates a tuple.
bag          { (19, 2), (1, 2) }                 A constant in this form creates a bag.
map          [ 'name' # 'John', 'ext' # 5555 ]   A constant in this form creates a map.
The data type definitions for tuples, bags, and maps apply to constants:
A tuple can contain fields of any data type
A bag is a collection of tuples
A map key must be a scalar; a map value can be any data type
Complex constants (either with or without values) can be used in the same places scalar
constants can be used; that is, in FILTER and GENERATE statements.
A = LOAD 'data' USING MyStorage() AS (T: tuple(name:chararray, age: int));
B = FILTER A BY T == ('john', 25);
D = FOREACH B GENERATE T.name, [25#5.6], {(1, 5, 18)};
2.5. Expressions
In Pig Latin, expressions are language constructs used with the FILTER, FOREACH,
GROUP, and SPLIT operators as well as the eval functions.
Expressions are written in conventional mathematical infix notation and are adapted to the
UTF-8 character set. Depending on the context, expressions can include:
Any Pig data type (simple data types, complex data types)
Any Pig operator (arithmetic, comparison, null, boolean, dereference, sign, and cast)
Any Pig built-in function.
Any user-defined function (UDF) written in Java.
In Pig Latin, an arithmetic expression could look like this:

X = GROUP A BY f2*f3;
A string expression could look like this, where a and b are both chararrays:
X = FOREACH A GENERATE CONCAT(a,b);
In this example, the programmer really wants to count the number of elements in the bag in
the second field: COUNT($1).
2.5.3. Boolean expressions
Boolean expressions can be made up of UDFs that return a boolean value or boolean
operators (see Boolean Operators).
2.5.4. Tuple expressions
Tuple expressions form subexpressions into tuples. The tuple expression has the form
(expression [, expression ]), where expression is a general expression. The simplest tuple
expression is the star expression, which represents all fields.
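For example, a tuple expression can serve as a grouping key; a minimal sketch reusing the field names from earlier examples (the same pattern appears under the GROUP operator below):

A = LOAD 'data' AS (f1:int, f2:int, f3:int);
B = GROUP A BY (f1, f2);  -- the group key is a tuple with two fields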
2.6. Schemas
Schemas enable you to assign names to and declare types for fields. Schemas are optional but
we encourage you to use them whenever possible; type declarations result in better
parse-time error checking and more efficient code execution.
Schemas are defined using the AS keyword with the LOAD, STREAM, and FOREACH
operators. If you define a schema using the LOAD operator, then it is the load function that
enforces the schema (see the LOAD operator and the Pig UDF Manual for more
information).
Note the following:
You can define a schema that includes both the field name and field type.
You can define a schema that includes the field name only; in this case, the field type
defaults to bytearray.
You can choose not to define a schema; in this case, the field is un-named and the field
type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional
notation. If you don't assign a name to a field (the field is un-named) you can only refer to
the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators.
If you don't assign a type to a field, the field defaults to bytearray; you can change the default
type using the cast operators.
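For example, a field with no declared type defaults to bytearray and can be cast when it is used (a minimal sketch):

A = LOAD 'data' AS (f1, f2);           -- f1 and f2 default to bytearray
B = FOREACH A GENERATE (int)f1 + 1;    -- f1 is explicitly cast to int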
2.6.1. Schemas with LOAD and STREAM Statements
With LOAD and STREAM statements, the schema following the AS keyword must be
enclosed in parentheses.
In this example the LOAD statement includes a schema definition for simple data types.
A = LOAD 'data' AS (f1:int, f2:int);
2.6.2. Schemas with FOREACH Statements

With FOREACH statements, the schema following the AS keyword must be enclosed in parentheses when the FLATTEN operator is used; otherwise, the schema should not be enclosed in parentheses.
In this example the FOREACH statement includes a schema for simple data types.
X = FOREACH A GENERATE f1+f2 AS x1:int;
2.6.3.2. Terms
alias
The name assigned to the field.
type
(Optional) The simple data type assigned to the field. If no type is assigned, the field defaults to bytearray.
( , )
The schema is enclosed in parentheses; multiple fields are separated by commas.
2.6.3.3. Examples
In this example both the field names and field types are assigned.

cat student;
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
In this example field "gpa" will default to bytearray because no type is declared.
cat student;
John 18 4.0
Mary 19 3.8
Bill 20 3.9
Joe 18 3.8
A = LOAD 'data' AS (name:chararray, age:int, gpa);
DESCRIBE A;
A: {name: chararray,age: int,gpa: bytearray}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
2.6.5.2. Terms
alias
The name assigned to the tuple.
:tuple
(Optional) The data type, tuple.
( )
The designation for a tuple, a set of parentheses.
alias[:type]
The constituents of the tuple, where the schema definition rules for the corresponding type apply.
2.6.5.3. Examples
In this example the schema defines one tuple. The load statements are equivalent.
cat data;
(3,8,9)
(1,4,7)
(2,5,8)
A = LOAD 'data' AS (T: tuple (f1:int, f2:int, f3:int));
A = LOAD 'data' AS (T: (f1:int, f2:int, f3:int));
DESCRIBE A;
A: {T: (f1: int,f2: int,f3: int)}
DUMP A;
((3,8,9))
((1,4,7))
((2,5,8))
In this example the schema defines two tuples (only part of the output is shown).
((1,4,7),(john,18))
((2,5,8),(joe,18))
2.6.6.2. Terms
alias
The name assigned to the bag.
:bag
(Optional) The data type, bag.
{ }
The designation for a bag, a set of curly brackets.
tuple
A tuple (see Schemas for Tuples).
2.6.6.3. Examples
In this example the schema defines a bag. The two load statements are equivalent.
cat data;
{(3,8,9)}
{(1,4,7)}
{(2,5,8)}
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
A = LOAD 'data' AS (B: {T: (t1:int, t2:int, t3:int)});
DESCRIBE A;
A: {B: {T: (t1: int,t2: int,t3: int)}}
DUMP A;
({(3,8,9)})
({(1,4,7)})
({(2,5,8)})
2.6.7.2. Terms
alias
The name assigned to the map.
:map
(Optional) The data type, map.
[ ]
The designation for a map, a set of straight brackets.
2.6.7.3. Example
In this example the schema defines a map. The load statements are equivalent.
cat data;
[open#apache]
[apache#hadoop]
A = LOAD 'data' AS (M:map []);
A = LOAD 'data' AS (M:[]);
DESCRIBE A;
a: {M: map[ ]}
DUMP A;
([open#apache])
([apache#hadoop])
2.7. Parameter Substitution
2.7.1.3. Terms

pig
Keyword. Note: exec, run, and explain also support parameter substitution.
-param
Flag. Use this option when the parameter is included in the command line.
param_name
The name of the parameter.
param_value
The value of the parameter.
-param_file
Flag. Use this option when the parameter is included in a file.
file_name
The name of a file containing one or more parameters.
-debug
Flag. With this option, the script is run and a fully substituted Pig script is produced in the current working directory named original_script_name.substituted.
-dryrun
Flag. With this option, the script is not run and a fully substituted Pig script is produced in the current working directory named original_script_name.substituted.
script
A Pig script.
%declare
Preprocessor statement included in a Pig script. Use to describe one parameter in terms of other parameters.
%default
Preprocessor statement included in a Pig script. Use to provide a default value for a parameter.
2.7.1.4. Usage
Parameter substitution enables you to write Pig scripts that include parameters and to supply
values for these parameters at run time. For instance, suppose you have a job that needs to
run every day using the current day's data. You can create a Pig script that includes a
parameter for the date. Then, when you run this script you can specify or supply a value for
the date parameter using one of the supported methods.
Specifying Parameters
Suppose we have a data file called 'mydata' and a pig script called 'myscript.pig'.
mydata
1 2 3
4 2 1
8 3 4
myscript.pig
A = LOAD '$data' USING PigStorage() AS (f1:int, f2:int, f3:int);
DUMP A;
In this example the parameter (data) and the parameter value (mydata) are specified in the
command line. If the parameter name in the command line (data) and the parameter name in
the script ($data) do not match, the script will not run. If the value for the parameter (mydata)
is not found, an error is generated.
$ pig -param data=mydata myscript.pig
(1,2,3)
(4,2,1)
(8,3,4)
In this example the parameters and values are passed to the script using the parameter file.
$ pig -param_file myparams script2.pig
In this example the command is executed and its stdout is used as the parameter value.
%declare CMD `generate_date`;
A = LOAD '/data/mydata/$CMD';
B = FILTER A BY $0>'5';
etc...
In this example the parameter (DATE) and value ('20090101') are specified in the Pig script
using the default statement. If a value for DATE is not specified elsewhere, the default value
20090101 is used.
%default DATE '20090101';
A = load '/data/mydata/$DATE';
etc...
In this example the characters (in this case, Joe's URL) can be enclosed in single or double
quotes, and quotes within the sequence of characters can be escaped.
%declare DES 'Joe\'s URL';
A = LOAD 'data' AS (name, description, url);
B = FILTER A BY description == '$DES';
etc...
In this example single word values that don't use special characters (in this case, mydata)
don't have to be enclosed in quotes.
$ pig -param data=mydata myscript.pig
In this example the command is enclosed in back ticks. First, the parameters mycmd and date
are substituted when the declare statement is encountered. Then the resulting command is
executed and its stdout is placed in the path before the load statement is run.
%declare CMD `$mycmd $date`;
A = LOAD '/data/mydata/$CMD';
B = FILTER A BY $0>'5';
etc...
3. Arithmetic Operators and More

3.1. Arithmetic Operators
3.1.1. Description

Operator         Symbol   Notes
addition         +
subtraction      -
multiplication   *
division         /
modulo           %        Returns the remainder of a divided by b. Works with integral numbers (int, long).
bincond          ?:       (condition ? value_if_true : value_if_false) The bincond should be enclosed in parenthesis. The schemas for the two conditional outputs of the bincond should match. Use expressions only (relational operators are not allowed).
3.1.1.1. Examples
In this example the modulo operator is used with fields f1 and f2.
X = FOREACH A GENERATE f1, f2, f1%f2;
DUMP X;
(10,1,0)
(10,3,1)
(10,6,4)
In this example the bincond operator is used with fields f2 and B. The condition is "f2 equals
1"; if the condition is true, return 1; if the condition is false, return the count of the number of
tuples in B.
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B));
DUMP X;
(1,1L)
(3,2L)
(6,3L)
3.1.1.2. Types Table: addition (+) and subtraction (-) operators

Adding or subtracting two values of the same numeric type yields that type (int + int yields int, and so on); mixing numeric types yields the wider of the two (for example, int + long yields long, long + float yields float, float + double yields double). A bytearray combined with a numeric type is cast to that numeric type, and bytearray + bytearray is cast as double. Using a chararray operand is an error, as is using a bag, tuple, or map operand (tuple + tuple is listed as "not yet" supported).

3.1.1.3. Types Table: multiplication (*) and division (/) operators

The same rules apply as for addition and subtraction: numeric operands combine to the wider numeric type, a bytearray operand is cast to the other operand's numeric type (bytearray * bytearray is cast as double), and chararray, bag, tuple, and map operands are errors.

3.1.1.4. Types Table: modulo (%) operator

Modulo is defined for integral types only: int % int yields int; int % long and long % long yield long. A bytearray combined with int or long is cast as int or long, respectively; bytearray % bytearray is an error. All other operand types are errors.
3.2. Comparison Operators
3.2.1. Description

Operator                   Symbol    Notes
equal                      ==
not equal                  !=
less than                  <
greater than               >
less than or equal to      <=
greater than or equal to   >=
pattern matching           matches   Takes an expression on the left and a string constant on the right; regular expressions use the Java format.
3.2.1.4. Types Table: equal (==) and not equal (!=) operators

Two tuples can be compared, yielding boolean (see Note 1), as can two maps (see Note 2); comparing a bag with any type is an error, as is comparing a tuple or map with any other type. Any two numeric types (int, long, float, double) can be compared, yielding boolean; a numeric type compared with bytearray also yields boolean, with the bytearray cast to the numeric type. chararray compared with chararray yields boolean; chararray compared with bytearray yields boolean (the bytearray is cast as chararray); and bytearray compared with bytearray yields boolean.

Note 1: boolean (Tuple A is equal to tuple B if they have the same size s, and for all 0 <= i < s, A[i] == B[i])

Note 2: boolean (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2)
3.2.1.5. Types Table: other comparison operators

The >, >=, <, and <= operators are defined for numeric, chararray, and bytearray operands; using a bag, tuple, or map operand is an error. Any two numeric types can be compared, yielding boolean. A numeric type compared with bytearray yields boolean, with the bytearray cast to the numeric type (int, long, float, or double). chararray compared with chararray yields boolean; chararray compared with bytearray yields boolean (the bytearray is cast as chararray); bytearray compared with bytearray yields boolean.

The matches operator is defined for chararray and bytearray operands only: chararray matches chararray yields boolean, chararray matches bytearray yields boolean (the bytearray is cast as chararray), and bytearray matches bytearray yields boolean.
3.3. Null Operators
3.3.1. Description

Operator      Symbol
is null       is null
is not null   is not null
3.3.1.1. Example
X = FILTER A BY f1 is not null;
3.4. Boolean Operators
3.4.1. Description

Operator   Symbol
AND        and
OR         or
NOT        not
Pig does not support a boolean data type. However, the result of a boolean expression (an
expression that includes boolean and comparison operators) is always of type boolean (true
or false).
3.4.1.1. Example
X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));
3.5. Dereference Operators
3.5.1. Description

Operator            Symbol
tuple dereference   tuple.id or tuple.(id, ...)
bag dereference     bag.id or bag.(id, ...)
map dereference     map#'key'
Suppose we have relation A.

A = LOAD 'data' AS (f1:int, f2:tuple(t1:int,t2:int,t3:int));
DUMP A;
(1,(1,2,3))
(2,(4,5,6))
(3,(7,8,9))
(4,(1,4,7))
(5,(2,5,8))

In this example dereferencing is used to retrieve two fields from tuple f2.
X = FOREACH A GENERATE f2.t1,f2.t3;
DUMP X;
(1,3)
(4,6)
(7,9)
(1,7)
(2,8)
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for
information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
ILLUSTRATE B;
etc
---------------------------------------------------------
| b     | group: int | a: bag({f1: int,f2: int,f3: int}) |
---------------------------------------------------------
In this example dereferencing is used with relation X to project the first field (f1) of each
tuple in the bag (a).
X = FOREACH B GENERATE a.f1;
DUMP X;
({(1)})
({(4),(4)})
({(7)})
({(8),(8)})
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for
information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY (f1,f2);
DUMP B;
((1,2),{(1,2,3)})
((4,2),{(4,2,1)})
((4,3),{(4,3,3)})
((7,2),{(7,2,5)})
((8,3),{(8,3,4)})
((8,4),{(8,4,3)})
ILLUSTRATE B;
etc
-------------------------------------------------------------------------------
| b     | group: tuple({f1: int,f2: int}) | a: bag({f1: int,f2: int,f3: int}) |
-------------------------------------------------------------------------------
|       | (8, 3)                          | {(8, 3, 4), (8, 3, 4)}            |
-------------------------------------------------------------------------------
In this example dereferencing is used to project a field (f1) from a tuple (group) and a field
(f1) from a bag (a).
X = FOREACH B GENERATE group.f1, a.f1;
DUMP X;
(1,{(1)})
(4,{(4)})
(4,{(4)})
(7,{(7)})
(8,{(8)})
(8,{(8)})
3.6. Sign Operators
3.6.1. Description

Operator              Symbol   Notes
positive              +        Has no effect.
negative (negation)   -        Changes the sign of a positive or negative number.
3.6.1.1. Example
A = LOAD 'data' as (x, y, z);
B = FOREACH A GENERATE -x, y;
3.6.1.2. Types Table: negative (-) operator

Type        Result
bag         error
tuple       error
map         error
int         int
long        long
float       float
double      double
chararray   error
bytearray   double (as double)
3.8. Cast Operators
3.8.1. Description

Types Table: casts

A bag can be cast only to bag, a tuple only to tuple, and a map only to map; any other cast involving a complex type is an error. The numeric types (int, long, float, double) can be cast to one another, although downcasts (for example, long to int) may lose data. Casting between chararray and a numeric type is an error in either direction. A bytearray can be cast to any other type, but no type can be cast to bytearray.
3.8.1.1. Syntax
{(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[]) } field
3.8.1.2. Terms
(data_type)
The data type you want to cast to, enclosed in parentheses.
field
The field whose type you want to change, referred to by name (alias) or by positional notation.
3.8.1.3. Usage
Cast operators enable you to cast or convert data from one type to another, as long as
conversion is supported (see the table above). For example, suppose you have an integer
field, myint, which you want to convert to a string. You can cast this field from int to
chararray using (chararray)myint.
Please note the following:
A field can be explicitly cast. Once cast, the field remains that type (it is not
automatically cast back). In this example $0 is explicitly cast to int.
B = FOREACH A GENERATE (int)$0 + 1;
Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless
of underlying data) and $1 is cast to double.
B = FOREACH A GENERATE $0 + 1, $1 + 1.0;
When two bytearrays are used in arithmetic expressions or with built-in aggregate functions (such as SUM) they are implicitly cast to double. If the underlying data is really int or long, you'll get better performance by declaring the type or explicitly casting the data.
Downcasts may cause loss of data. For example casting from long to int may drop bits.
3.8.1.4. Examples

In this example a bytearray (fld in relation A) is cast to type map.

cat data;
[open#apache]
[apache#hadoop]
[hadoop#pig]
[pig#grunt]
A = LOAD 'data' AS fld:bytearray;
DESCRIBE A;
A: {fld: bytearray}
DUMP A;
([open#apache])
([apache#hadoop])
([hadoop#pig])
([pig#grunt])
B = FOREACH A GENERATE (map[])fld;
DESCRIBE B;
B: {map[ ]}
DUMP B;
([open#apache])
([apache#hadoop])
([hadoop#pig])
([pig#grunt])
4. Relational Operators
4.1. COGROUP
COGROUP is the same as GROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved. See GROUP for more information.
4.2. CROSS
Computes the cross product of two or more relations.
4.2.1. Syntax
alias = CROSS alias, alias [, alias ] [PARALLEL n];
4.2.2. Terms
alias
The name of a relation.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
4.2.3. Usage
Use the CROSS operator to compute the cross product (Cartesian product) of two or more
relations.
CROSS is an expensive operation and should be used sparingly.
4.2.4. Example
Suppose we have relations A and B.
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
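In this example the cross product of relations A and B is computed; since a cross product pairs each tuple of A with each tuple of B, the result follows directly from the data above (output ordering is not guaranteed):

X = CROSS A, B;
DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)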
4.3. DISTINCT
Removes duplicate tuples in a relation.
4.3.1. Syntax
alias = DISTINCT alias [PARALLEL n];
4.3.2. Terms
alias
The name of a relation.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
4.3.3. Usage
Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not
preserve the original order of the contents (to eliminate duplicates, Pig must first sort the
data). You cannot use DISTINCT on a subset of fields. To do this, use FOREACH
GENERATE to select the fields, and then use DISTINCT.
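For example, to remove duplicates considering only the first two fields (a minimal sketch using relation A from the example below):

B = FOREACH A GENERATE a1, a2;
X = DISTINCT B;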
4.3.4. Example
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
In this example all duplicate tuples are removed.

X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
4.4. FILTER
Selects tuples from a relation based on some condition.
4.4.1. Syntax
alias = FILTER alias BY expression;
4.4.2. Terms
alias
The name of a relation.
BY
Required keyword.
expression
A boolean expression.
4.4.3. Usage
Use the FILTER operator to work with tuples or rows of data (if you want to work with
columns of data, use the FOREACH GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.
4.4.4. Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, then include the tuple in relation X.
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
In this example the condition states that if the first field equals 8 or if the sum of fields a2 and a3 is not greater than the first field, then include the tuple in relation X.

X = FILTER A BY (a1 == 8) OR (NOT (a2+a3 > a1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
4.5. FOREACH
Generates data transformations based on columns of data.
4.5.1. Syntax
alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];
4.5.2. Terms
alias
The name of a relation.
gen_blk
FOREACH...GENERATE used with a relation (outer bag). Use this syntax:
alias = FOREACH alias GENERATE expression [, expression ...];
nested_gen_blk
FOREACH...GENERATE used with an inner bag. Use this syntax:
alias = FOREACH nested_alias {
   alias = nested_op; [alias = nested_op; ...]
   GENERATE expression [, expression ...];
};
Where:
The nested block is enclosed in opening and closing brackets { }.
The GENERATE keyword must be the last statement within the nested block.
expression
An expression.
nested_alias
The name of the inner bag.
nested_op
Allowed operations include DISTINCT, FILTER, ORDER, and SAMPLE.
AS
Keyword.
schema
A schema using the AS keyword (see Schemas).
4.5.3. Usage
Use the FOREACH GENERATE operation to work with columns of data (if you want to
work with tuples or rows of data, use the FILTER operation).
FOREACH GENERATE works with relations (outer bags) as well as inner bags:
If A is a relation (outer bag), a FOREACH statement could look like this.
X = FOREACH A GENERATE f1;
4.5.4. Examples
Suppose we have relations A, B, and C (see the GROUP operator for information about the
field names in relation C).
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
C = COGROUP A BY a1 inner, B BY b1 inner;
DUMP C;
(1,{(1,2,3)},{(1,3)})
(4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})
(8,{(8,3,4),(8,4,3)},{(8,9)})
ILLUSTRATE C;
etc
-------------------------------------------------------------------------------------
| c     | group: int | a: bag({a1: int,a2: int,a3: int}) | B: bag({b1: int,b2: int}) |
-------------------------------------------------------------------------------------
|       | 1          | {(1, 2, 3)}                       | {(1, 3)}                  |
-------------------------------------------------------------------------------------
In this example the asterisk (*) is used to project all fields from relation A to relation X (relation A and relation X are identical).

X = FOREACH A GENERATE *;
DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example two fields from relation A are projected to form relation X.
X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
In this example an arithmetic expression is used; the sum of fields a1 and a2 is computed for each tuple and named f1.

X = FOREACH A GENERATE a1+a2 AS f1:int;
DUMP X;
(3)
(6)
(11)
(7)
(9)
(12)
Y = FILTER X BY f1 > 10;
DUMP Y;
(11)
(12)
Another FLATTEN example. Note that for the group '4' in C, there are two tuples in each
bag. Thus, when both bags are flattened, the cross product of these tuples is returned; that is,
tuples (4, 2, 6), (4, 3, 6), (4, 2, 9), and (4, 3, 9).
X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
DUMP X;
(1,2,3)
(4,2,6)
(4,2,9)
(4,3,6)
(4,3,9)
(8,3,9)
(8,4,9)
Another FLATTEN example. Here, relations A and B both have a column x. When forming
relation E, you need to use the :: operator to identify which column x to use - either relation
A column x (A::x) or relation B column x (B::x). This example uses relation A column x
(A::x).
A = LOAD 'data1' AS (x, y);
B = LOAD 'data2' AS (x, z);
C = COGROUP A BY x, B BY x;
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
E = GROUP D BY A::x;
In this example we perform two of the operations allowed in a nested block, FILTER and
DISTINCT. Note that the last statement in the nested block must be GENERATE.
X = FOREACH B {
        FA = FILTER A BY outlink == 'www.xyz.org';
        PA = FA.outlink;
        DA = DISTINCT PA;
        GENERATE group, COUNT(DA);
}
DUMP X;
(www.ddd.com,1L)
(www.www.com,1L)
4.6. GROUP
Groups the data in one or more relations. GROUP is the same as COGROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
4.6.1. Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression ] [USING 'collected']
[PARALLEL n];
4.6.2. Terms
alias
The name of a relation.
ALL
Keyword. Use ALL if you want all tuples to go to a single group; for example, when doing aggregates across an entire relation.
BY
Keyword. Use this clause to group the relation by field, tuple, or expression.
expression
A tuple expression. This is the group key or key field. If the result of the tuple expression is a single field, the key will be the value of the first field rather than a tuple with one field.
USING
Keyword.
'collected'
Use the 'collected' clause with the GROUP operation (works with one relation only); the data must be loaded with a loader that guarantees that all data for the same key is delivered to a single map task, such as the Zebra table loader shown in the example below.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
4.6.3. Usage
The GROUP operator groups together tuples that have the same group key (key field). The
key field will be a tuple if the group key has more than one field, otherwise it will be the
same type as that of the group key. The result of a GROUP operation is a relation that
includes one tuple per group. This tuple contains two fields:
The first field is named "group" (do not confuse this with the GROUP operator) and is
the same type as the group key.
The second field takes the name of the original relation and is type bag.
The names of both fields are generated by the system as shown in the example below.
Note that the GROUP (and thus COGROUP) and JOIN operators perform similar functions.
GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
4.6.4. Example
Suppose we have relation A.
A = load 'student' AS (name:chararray,age:int,gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Now, suppose we group relation A on field "age" to form relation B. We can use the DESCRIBE and ILLUSTRATE operators to examine the structure of relation B. Relation B has two fields. The first field is named "group" and is type int, the same as field "age" in relation A. The second field is named "A" after relation A and is type bag.
B = GROUP A BY age;
DESCRIBE B;
B: {group: int, A: {name: chararray,age: int,gpa: float}}
ILLUSTRATE B;
etc
----------------------------------------------------------------------
| B     | group: int | A: bag({name: chararray,age: int,gpa: float}) |
----------------------------------------------------------------------
|       | 18         | {(John, 18, 4.0), (Joe, 18, 3.8)}             |
|       | 20         | {(Bill, 20, 3.9)}                             |
----------------------------------------------------------------------

DUMP B;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Continuing on, as shown in these FOREACH statements, we can refer to the fields in relation
B by names "group" and "A" or by positional notation.
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
(18,2L)
(19,1L)
(20,1L)
C = FOREACH B GENERATE $0, $1.name;
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
4.6.5. Example
Suppose we have relations A and B.

A = LOAD 'data1' AS (owner:chararray, pet:chararray);
DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)

B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
In this example tuples are co-grouped using field owner from relation A and field friend2
from relation B as the key fields. The DESCRIBE operator shows the schema for relation X,
which has two fields, "group" and "A" (see the GROUP operator for information about the
field names).
X = COGROUP A BY owner, B BY friend2;
DESCRIBE X;
X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
Relation X looks like this. A tuple is created for each unique key field. The tuple includes the
key field and two bags. The first bag is the tuples from the first relation with the matching
key field. The second bag is the tuples from the second relation with the matching key field.
DUMP X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
In this example tuples are co-grouped and the INNER keyword is used to ensure that only
bags with at least one tuple are returned.
X = COGROUP A BY owner INNER, B BY friend2 INNER;
DUMP X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
In this example tuples are co-grouped and the INNER keyword is used asymmetrically on
only one of the relations.
X = COGROUP A BY owner, B BY friend2 INNER;
DUMP X;
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
4.6.6. Example
This example shows how to group using multiple keys.
A = LOAD 'allresults' USING PigStorage() AS (tcid:int, tpid:int,
date:chararray, result:chararray, tsid:int, tag:chararray);
B = GROUP A BY (tcid, tpid);
4.6.7. Example
This example shows a map-side group.
register zebra.jar;
A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa', 'sorted');
B = GROUP A BY name USING 'collected';
C = FOREACH B GENERATE group, MAX(A.age), COUNT_STAR(A);
4.7. JOIN (inner)

Performs an inner join of two or more relations based on common field values.

4.7.1. Syntax
alias = JOIN alias BY expression, alias BY expression [, alias BY expression ...] [USING 'replicated' | 'skewed' | 'merge'] [PARALLEL n];
4.7.2. Terms
alias
The name of a relation.
BY
Keyword.
expression
A field expression.
Example: X = JOIN A BY fieldA, B BY fieldB, C BY fieldC;
USING
Keyword.
'replicated'
Use to perform replicated (fragment-replicate) joins.
'skewed'
Use to perform skewed joins.
'merge'
Use to perform merge joins.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
4.7.3. Usage
Use the JOIN operator to perform an inner, equijoin of two or more relations based on common field values. The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note that the JOIN and COGROUP operators perform similar functions. JOIN creates a flat
set of output records while COGROUP creates a nested set of output records.
4.7.4. Example
Suppose we have relations A and B.
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
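In this example relations A and B are joined on their first fields; since an inner join pairs every A tuple with every B tuple that has a matching key, the result follows directly from the data above (output ordering is not guaranteed):

X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)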
4.8. JOIN (outer)

Performs an outer join of two relations based on common field values.

4.8.1. Syntax
alias = JOIN left-alias BY left-alias-column [LEFT | RIGHT | FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'skewed'] [PARALLEL n];
4.8.2. Terms
alias
The name of a relation; applies to both the left and right relations.
alias-column
The name of the join column; applies to both the left and right columns.
BY
Keyword.
LEFT
Left outer join.
RIGHT
Right outer join.
FULL
Full outer join.
OUTER
(Optional) Keyword.
USING
Keyword.
'replicated'
Use to perform replicated joins (supported for left outer joins only).
'skewed'
Use to perform skewed joins.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
4.8.3. Usage
Use the OUTER JOIN operator to perform left, right, or full outer joins. The Pig Latin syntax
closely adheres to the SQL standard. The keyword OUTER is optional for outer joins (the
keywords LEFT, RIGHT and FULL will imply left outer, right outer and full outer joins
respectively when OUTER is omitted).
Please note the following:
Outer joins will only work provided the relations which need to produce nulls (in the case
of non-matching keys) have schemas.
Outer joins will only work for two-way joins; to perform a multi-way outer join, you will
need to perform multiple two-way outer join statements.
4.8.4. Examples
This example shows a left outer join.
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
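Right and full outer joins follow the same pattern; for example (reusing the relations defined above):

C = JOIN A BY $0 RIGHT OUTER, B BY $0;
C = JOIN A BY $0 FULL OUTER, B BY $0;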
4.9. LIMIT
Limits the number of output tuples.
4.9.1. Syntax
alias = LIMIT alias n;
4.9.2. Terms
alias
The name of a relation.
n
The number of output tuples.
4.9.3. Usage
Use the LIMIT operator to limit the number of output tuples. If the specified number of
output tuples is equal to or exceeds the number of tuples in the relation, the output will
include all tuples in the relation.
There is no guarantee which tuples will be returned, and the tuples that are returned can
change from one run to the next. A particular set of tuples can be requested using the
ORDER operator followed by LIMIT.
Note: The LIMIT operator allows Pig to avoid processing all tuples in a relation. In most cases a query that uses LIMIT will run more efficiently than an identical query that does not use LIMIT. It is always a good idea to use LIMIT if you can.
4.9.4. Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example output is limited to 3 tuples. Note that there is no guarantee which three
tuples will be output.
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
In this example the ORDER operator is used to order the tuples and the LIMIT operator is
used to output the first three tuples.

B = ORDER A BY a1 DESC;
X = LIMIT B 3;
DUMP X;
(8,3,4)
(8,4,3)
(7,2,5)
4.10. LOAD
Loads data from the file system.
4.10.1. Syntax
LOAD 'data' [USING function] [AS schema];
4.10.2. Terms
'data'
The name of the file or directory, in single quotes. If you specify a directory name, all the files in the directory are loaded.
USING
Keyword. If the USING clause is omitted, the default load function PigStorage is used.
function
The load function.
AS
Keyword.
schema
A schema using the AS keyword, enclosed in parentheses (see Schemas).
4.10.3. Usage
Use the LOAD operator to load data from the file system.
4.10.4. Examples
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are
newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form
relation A. The two LOAD statements are equivalent. Note that, because no schema is
specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
In this example a schema is specified using the AS keyword. The two LOAD statements are
equivalent. You can use the DESCRIBE and ILLUSTRATE operators to view the schema.
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
DESCRIBE A;
a: {f1: int,f2: int,f3: int}
ILLUSTRATE A;
---------------------------------------------------------
| a     | f1: bytearray | f2: bytearray | f3: bytearray |
---------------------------------------------------------
|       | 4             | 2             | 1             |
---------------------------------------------------------
---------------------------------------
| a     | f1: int | f2: int | f3: int |
---------------------------------------
|       | 4       | 2       | 1       |
---------------------------------------
For examples of how to specify more complex schemas for use with the LOAD operator, see
Schemas for Complex Data Types and Schemas for Multiple Types.
4.11. ORDER
Sorts a relation based on one or more fields.
4.11.1. Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] ] }
[PARALLEL n];
4.11.2. Terms
alias
The name of a relation.
BY
Required keyword.
*
The designator for a tuple (all fields).
field_alias
A field in the relation.
ASC
Sort in ascending order.
DESC
Sort in descending order.
PARALLEL n
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).
4.11.3. Usage
In Pig, relations are unordered (see Relations, Bags, Tuples, and Fields):
If you order relation A to produce relation X (X = ORDER A BY * DESC;) relations A
and X still contain the same thing.
If you retrieve the contents of relation X (DUMP X;) they are guaranteed to be in the
order you specified (descending).
However, if you further process relation X (Y = FILTER X BY $0 > 1;) there is no
guarantee that the contents will be processed in the order you originally specified
(descending).
4.11.4. Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
4.12. SAMPLE
Selects a random sample of data based on the specified sample size.
4.12.1. Syntax
SAMPLE alias size;
4.12.2. Terms
alias
The name of a relation.
size
Sample size, a value between 0 and 1 (for example, enter 0.1 for 10%).
4.12.3. Usage
Use the SAMPLE operator to select a random data sample with the stated sample size. SAMPLE is a probabilistic operator; there is no guarantee that the exact same number of tuples will be returned for a particular sample size each time the operator is used.
4.12.4. Example
In this example relation X will contain 1% of the data in relation A.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE A 0.01;
4.13. SPLIT
Partitions a relation into two or more relations.
4.13.1. Syntax
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ];
4.13.2. Terms
alias
The name of a relation.
INTO
Required keyword.
IF
Required keyword.
expression
An expression.
4.13.3. Usage
Use the SPLIT operator to partition the contents of a relation into two or more relations based
on some expression. Depending on the conditions stated in the expression:
A tuple may be assigned to more than one relation.
A tuple may not be assigned to any relation.
4.13.4. Example
In this example relation A is split into three relations, X, Y, and Z.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
4.14. STORE
Stores or saves results to the file system.
4.14.1. Syntax
STORE alias INTO 'directory' [USING function];
4.14.2. Terms
alias
The name of a relation.
INTO
Required keyword.
'directory'
The name of the storage directory, in quotes. If the directory already exists, the STORE operation will fail. The output data files, named part-nnnnn, are written to this directory.
USING
Keyword. If the USING clause is omitted, the default store function PigStorage is used.
function
The store function.
4.14.3. Usage
Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to
the file system. Use STORE for production scripts and batch mode processing.
Note: To debug scripts during development, you can use DUMP to check intermediate
results.
4.14.4. Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field
delimiter.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
In this example, the CONCAT function is used to format the data before it is stored.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = FOREACH A GENERATE CONCAT('a:',(chararray)a1), CONCAT('b:',(chararray)a2), CONCAT('c:',(chararray)a3);
DUMP B;
(a:1,b:2,c:3)
(a:4,b:2,c:1)
(a:8,b:3,c:4)
(a:4,b:3,c:3)
(a:7,b:2,c:5)
(a:8,b:4,c:3)
STORE B INTO 'myoutput' using PigStorage(',');
CAT myoutput;
a:1,b:2,c:3
a:4,b:2,c:1
a:8,b:3,c:4
a:4,b:3,c:3
a:7,b:2,c:5
a:8,b:4,c:3
4.15. STREAM
Sends data to an external script or program.
4.15.1. Syntax
alias = STREAM alias [, alias ] THROUGH {'command' | cmd_alias } [AS schema] ;
4.15.2. Terms
alias
The name of a relation.
THROUGH
Keyword.
'command'
A command, including the arguments, enclosed in back ticks (where a command is anything that can be executed).
cmd_alias
The name of a command created using the DEFINE operator.
AS
Keyword.
schema
A schema using the AS keyword, enclosed in parentheses (see Schemas).
4.15.3. Usage
Use the STREAM operator to send data through an external script or program. Multiple
stream operators can appear in the same Pig script. The stream operators can be adjacent to
each other or have other operations in between.
When used with a cmd_alias, a stream statement could look like this, where mycmd is the
defined alias.
A = LOAD 'data';
DEFINE mycmd `stream.pl -n 5`;
B = STREAM A THROUGH mycmd;
In this example the data is grouped before it is streamed.

A = LOAD 'data';
B = GROUP A BY $1;
C = FOREACH B GENERATE FLATTEN(A);
D = STREAM C THROUGH `stream.pl`;
4.16. UNION
Computes the union of two or more relations.
4.16.1. Syntax
alias = UNION alias, alias [, alias ];
4.16.2. Terms
alias
The name of a relation.
4.16.3. Usage
Use the UNION operator to merge the contents of two or more relations. The UNION
operator:
Does not preserve the order of tuples. Both the input and output relations are interpreted
as unordered bags of tuples.
Does not ensure (as databases do) that all tuples adhere to the same schema or that they
have the same number of fields. In a typical scenario, however, this should be the case;
therefore, it is the user's responsibility to either (1) ensure that the tuples in the input
relations have the same schema or (2) be able to process varying tuples in the output
relation.
Does not eliminate duplicate tuples.
4.16.4. Example
In this example the union of relation A and B is computed.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
B = LOAD 'data' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
5. Diagnostic Operators
5.1. DESCRIBE
Returns the schema of an alias.
5.1.1. Syntax
DESCRIBE alias;
5.1.2. Terms
alias
The name of a relation.
5.1.3. Usage
Use the DESCRIBE operator to review the schema of a particular alias.
5.1.4. Example
In this example a schema is specified using the AS clause. If all data conforms to the schema,
Pig will use the assigned types.
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY name matches 'J.+';
C = GROUP B BY name;
D = FOREACH C GENERATE COUNT(B.age);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DESCRIBE B;
B: {name: chararray,age: int,gpa: float}
DESCRIBE C;
C: {group: chararray,B: {name: chararray,age: int,gpa: float}}
DESCRIBE D;
D: {long}
In this example no schema is specified. All fields default to type bytearray or long (see Data
Types).
a = LOAD 'student';
b = FILTER a BY $0 matches 'J.+';
c = GROUP b BY $0;
d = FOREACH c GENERATE COUNT(b.$1);
DESCRIBE a;
Schema for a unknown.
DESCRIBE b;
2008-12-05 01:17:15,316 [main] WARN org.apache.pig.PigServer - bytearray
is implicitly cast to chararray under LORegexp Operator
Schema for b unknown.
DESCRIBE c;
2008-12-05 01:17:23,343 [main] WARN org.apache.pig.PigServer - bytearray
is implicitly cast to chararray under LORegexp Operator
Schema for c unknown.
5.2. DUMP
Dumps or displays results to screen.
5.2.1. Syntax
DUMP alias;
5.2.2. Terms
alias
The name of a relation.
5.2.3. Usage
Use the DUMP operator to run (execute) Pig Latin statements and display the results to your
screen. DUMP is meant for interactive mode; statements are executed immediately and the
results are not saved (persisted). You can use DUMP as a debugging device to make sure that
the results you are expecting are actually generated.
Note that production scripts should not use DUMP as it will disable multi-query
optimizations and is likely to slow down execution (see Store vs. Dump).
5.2.4. Example
In this example a dump is performed after each statement.
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.7F)
(Bill,20,3.9F)
(Joe,22,3.8F)
(Jill,20,4.0F)
B = FILTER A BY name matches 'J.+';
DUMP B;
(John,18,4.0F)
(Joe,22,3.8F)
(Jill,20,4.0F)
5.3. EXPLAIN
Displays execution plans.
5.3.1. Syntax
EXPLAIN [-script pigscript] [-out path] [-brief] [-dot] [-param param_name = param_value] [-param_file file_name] alias;
5.3.2. Terms
-script
Use to specify a Pig script.
-out
Use to specify the output path (directory); the default is to print the plans to stdout.
-brief
Does not expand nested plans (presenting a smaller graph for overview).
-dot
Outputs a format that can be passed to the dot utility for graphical display; the default is text format.
-param param_name = param_value
See Parameter Substitution.
-param_file file_name
See Parameter Substitution.
alias
The name of a relation.
5.3.3. Usage
Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans
that are used to compute the specified relationship.
If no script is given:
The logical plan shows a pipeline of operators to be executed to build the relation. Type
checking and backend-independent optimizations (such as applying filters early on) also
apply.
The physical plan shows how the logical operators are translated to backend-specific
physical operators. Some backend optimizations also apply.
The map reduce plan shows how the physical operators are grouped into map reduce
jobs.
If a script without an alias is specified, it will output the entire execution graph (logical,
physical, or map reduce).
If a script with a alias is specified, it will output the plan for the given alias.
5.3.4. Example
In this example the EXPLAIN operator produces all three plans. (Note that only a portion of
the output is shown in this example.)
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY name;
C = FOREACH B GENERATE COUNT(A.age);
EXPLAIN C;
-----------------------------------------------
Logical Plan:
-----------------------------------------------
Store xxx-Fri Dec 05 19:42:29 UTC 2008-23 Schema: {long} Type: Unknown
|
|---ForEach xxx-Fri Dec 05 19:42:29 UTC 2008-15 Schema: {long} Type: bag
etc ...

-----------------------------------------------
Physical Plan:
-----------------------------------------------
Store(fakefile:org.apache.pig.builtin.PigStorage) - xxx-Fri Dec 05 19:42:29 UTC 2008-40
|
|---New For Each(false)[bag] - xxx-Fri Dec 05 19:42:29 UTC 2008-39
|
etc ...
5.4. ILLUSTRATE
(Note! This feature is NOT maintained at the moment. We are looking for someone to adopt
it.)
Displays a step-by-step execution of a sequence of statements.
5.4.1. Syntax
ILLUSTRATE alias;
5.4.2. Terms
alias
The name of a relation.
5.4.3. Usage
Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig
Latin statements:
The data load statement must include a schema.
The Pig Latin statement used to form the relation that is used with the ILLUSTRATE
command cannot include the map data type, the LIMIT and SPLIT operators, or nested
FOREACH statements.
ILLUSTRATE accesses the ExampleGenerator algorithm which can select an appropriate
and concise set of example data automatically. It does a better job than random sampling
would do; for example, random sampling suffers from the drawback that selective operations
such as filters or joins can eliminate all the sampled data, giving you empty results which
will not help with debugging.
With the ILLUSTRATE operator you can test your programs on small datasets and get faster
turnaround times. The ExampleGenerator algorithm uses Pig's Local mode (rather than
Hadoop mode) which means that illustrative example data is generated in near real-time.
Relation X can be used with the ILLUSTRATE operator.
X = FOREACH A GENERATE f1;
ILLUSTRATE X;
5.4.4. Example
In this example we count the number of sites a user has visited since 12/1/08. The
ILLUSTRATE statement will show how the results for num_user_visits are derived.
visits = LOAD 'visits' AS (user:chararray, url:chararray, timestamp:chararray);
DUMP visits;
(Amy,cnn.com,20080218)
(Fred,harvard.edu,20081204)
(Amy,bbc.com,20081205)
(Fred,stanford.edu,20081206)
recent_visits = FILTER visits BY timestamp >= '20081201';
user_visits = GROUP recent_visits BY user;
num_user_visits = FOREACH user_visits GENERATE COUNT(recent_visits);
DUMP num_user_visits;
(1L)
(2L)
ILLUSTRATE num_user_visits;
------------------------------------------------------------------------
| visits  | user: bytearray  | url: bytearray  | timestamp: bytearray  |
------------------------------------------------------------------------
|         | Amy              | cnn.com         | 20080218              |
|         | Fred             | harvard.edu     | 20081204              |
|         | Amy              | bbc.com         | 20081205              |
|         | Fred             | stanford.edu    | 20081206              |
------------------------------------------------------------------------
------------------------------------------------------------------------
| visits  | user: chararray  | url: chararray  | timestamp: chararray  |
------------------------------------------------------------------------
|         | Amy              | cnn.com         | 20080218              |
|         | Fred             | harvard.edu     | 20081204              |
|         | Amy              | bbc.com         | 20081205              |
|         | Fred             | stanford.edu    | 20081206              |
------------------------------------------------------------------------
-------------------------------------------------------------------------------
| recent_visits  | user: chararray  | url: chararray  | timestamp: chararray |
-------------------------------------------------------------------------------
|                | Fred             | harvard.edu     | 20081204             |
|                | Amy              | bbc.com         | 20081205             |
|                | Fred             | stanford.edu    | 20081206             |
-------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| user_visits  | group: chararray  | recent_visits: bag({user: chararray,url: chararray,timestamp: chararray}) |
----------------------------------------------------------------------------------------------------------------
|              | Amy               | {(Amy, bbc.com, 20081205)}                                                |
|              | Fred              | {(Fred, harvard.edu, 20081204), (Fred, stanford.edu, 20081206)}           |
----------------------------------------------------------------------------------------------------------------
-------------------------------
| num_user_visits  | long    |
-------------------------------
|                  | 1       |
|                  | 2       |
-------------------------------
6. UDF Statements
6.1. DEFINE
Assigns an alias to a UDF function or a streaming command.
6.1.1. Syntax
DEFINE alias {function | [`command` [input] [output] [ship] [cache]] };
6.1.2. Terms
alias
The name for the function or command.
function
For use with functions. The name of the function.
`command`
For use with commands. A command, including the arguments, enclosed in back ticks.
input
For use with commands. INPUT( {stdin | 'path'} [USING serializer] [, {stdin | 'path'} [USING serializer] ...] ), where 'path' is a file path enclosed in single quotes and serializer is the function that converts tuples to stream format (PigStreaming is the default).
output
For use with commands. OUTPUT( {stdout | stderr | 'path'} [USING deserializer] [, ...] ), where deserializer is the function that converts stream format back to tuples (PigStreaming is the default).
ship
For use with commands. SHIP('path' [, 'path' ...]), where SHIP is a keyword and 'path' is a file path, enclosed in single quotes.
cache
For use with commands. CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file' ...]), where 'dfs_path#dfs_file' is a path/file on the distributed file system, enclosed in single quotes.
6.1.3. Usage
Use the DEFINE statement to assign a name (alias) to a UDF function or to a streaming
command.
Use DEFINE to specify a UDF function when:
The function has a long package name that you don't want to include in a script,
especially if you call the function several times in that script.
The constructor for the function takes string parameters. If you need to use different constructor parameters for different calls to the function, you will need to create multiple defines, one for each parameter set (see the sketch after this list).
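For example, two defines with different constructor parameters might look like this (hypothetical aliases, reusing the MyEvalfunc constructor from the example at the end of this section):

DEFINE myFuncFoo myfunc.MyEvalfunc('foo');
DEFINE myFuncBar myfunc.MyEvalfunc('bar');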
Use DEFINE to specify a streaming command when:
The streaming command specification is complex.
The streaming command specification requires additional parameters (input, output, and
so on).
6.1.3.1. About Input and Output
Serialization is needed to convert data from tuples to a format that can be processed by the
streaming application. Deserialization is needed to convert the output from the streaming
application back into tuples. PigStreaming is the default serialization/deserialization function.
Streaming uses the same default format as PigStorage to serialize/deserialize the data. If you want to explicitly specify a format, you can do it as shown below (see more examples in the Examples: Input/Output section).
If you need an alternative format, you will need to create a custom serializer/deserializer by
implementing the following interfaces.
interface PigToStream {

    /**
     * Given a tuple, produce an array of bytes to be passed to the streaming
     * executable.
     */
    public byte[] serialize(Tuple t) throws IOException;
}

interface StreamToPig {

    /**
     * Given a byte array from a streaming executable, produce a tuple.
     */
    public Tuple deserialize(byte[] bytes) throws IOException;

    /**
     * This will be called on the front end during planning and not on the back
     * end during execution.
     *
     * @return the {@link LoadCaster} associated with this object.
     * @throws IOException if there is an exception during LoadCaster
     */
    public LoadCaster getLoadCaster() throws IOException;
}
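A custom serializer, then, is just a class implementing the interface above. The sketch below is illustrative only (hypothetical class name, tab-delimited output), not part of the Pig API beyond the interface it implements:

import java.io.IOException;
import org.apache.pig.data.Tuple;

// Assumes the PigToStream interface shown above is on the classpath.
public class MyTabSerializer implements PigToStream {
    // Serialize each tuple as a tab-delimited, newline-terminated line.
    public byte[] serialize(Tuple t) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) sb.append('\t');
            sb.append(t.get(i));  // field's string form; nulls print as "null"
        }
        sb.append('\n');
        return sb.toString().getBytes("UTF-8");
    }
}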
Use the ship option to send streaming binary and supporting files, if any, from the client node
to the compute nodes. Pig does not automatically ship dependencies; it is your responsibility
to explicitly specify all the dependencies and to make sure that the software the processing
relies on (for instance, perl or python) is installed on the cluster. Supporting files are shipped
to the task's current working directory and only relative paths should be specified. Any
pre-installed binaries should be specified in the PATH.
Only files, not directories, can be specified with the ship option. One way to work around
this limitation is to tar all the dependencies into a tar file that accurately reflects the structure
needed on the compute nodes, then have a wrapper for your script that un-tars the
dependencies prior to execution.

Note the following:

1. It is safe only to ship files to be executed from the current working directory on the task on the cluster.
2. Shipping files to relative paths or absolute paths is undefined and mostly will fail since you may not have permissions to read/write/execute from arbitrary paths on the actual clusters.
6.1.3.3. About Cache
The ship option works with binaries, jars, and small datasets. However, loading larger
datasets at run time for every execution can severely impact performance. Instead, use the
cache option to access large files already moved to and available on the compute nodes. Only
files, not directories, can be specified with the cache option.
6.1.3.4. About Auto-Ship
If the ship and cache options are not specified, Pig will attempt to auto-ship the binary in the
following way:
If the first word of the streaming command is perl or python, Pig assumes that the binary is the first non-quoted string it encounters that does not start with a dash.
Otherwise, Pig will attempt to ship the first string from the command line as long as it
does not come from /bin, /usr/bin, /usr/local/bin. Pig will determine this
by scanning the path if an absolute path is provided or by executing which. The paths
can be made configurable using the set stream.skippath option (you can use multiple set
commands to specify more than one path to skip).
If you don't supply a DEFINE for a given streaming command, then auto-shipping is turned
off.
Note the following:
1. If Pig determines that it needs to auto-ship an absolute path it will not ship it at all since
there is no way to ship files to the necessary location (lack of permissions and so on).
OP = stream IP through '/a/b/c/script';
or
OP = stream IP through 'perl /a/b/c/script.pl';
2. Pig will not auto-ship files in the following system directories (this is determined by executing the 'which <file>' command).
/bin /usr/bin /usr/local/bin /sbin /usr/sbin /usr/local/sbin
3. To auto-ship, the file in question should be present in the PATH. So if the file is in the
current working directory then the current working directory should be in the PATH.
6.1.4. Examples: Input/Output
In this example PigStreaming is the default serialization/deserialization function. The tuples
from relation A are converted to tab-delimited lines that are passed to the script.
X = STREAM A THROUGH 'stream.pl';
In this example user-defined serialization/deserialization functions are used with the script.
DEFINE Y 'stream.pl' INPUT(stdin USING MySerializer) OUTPUT (stdout USING
MyDeserializer);
X = STREAM A THROUGH Y;
In this example cache is used to specify a file located on the cluster compute nodes.
DEFINE Y 'stream.pl data.gz' SHIP('/work/stream.pl')
CACHE('/input/data.gz#data.gz');
X = STREAM A THROUGH Y;
In this example a function is defined for use with the FOREACH GENERATE operator.
REGISTER /src/myfunc.jar
DEFINE myFunc myfunc.MyEvalfunc('foo');
A = LOAD 'students';
B = FOREACH A GENERATE myFunc($0);
6.4. REGISTER
Registers a JAR file so that the UDFs in the file can be used.
6.4.1. Syntax
REGISTER alias;
6.4.2. Terms
alias
6.4.3. Usage
Use the REGISTER statement inside a Pig script to specify the path of a Java JAR file
containing UDFs.
You can register additional files (to use with your Pig script) via the command line using the
-Dpig.additional.jars option.
For more information about UDFs, see the User Defined Function Guide. Note that Pig
currently only supports functions written in Java.
6.4.4. Example
In this example REGISTER states that myfunc.jar is located in the /src directory.
/src $ java -jar pig.jar
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
In this example additional jar files are registered via the command line.
pig -Dpig.additional.jars=my.jar:your.jar script.pig
7. Eval Functions
7.1. AVG
Computes the average of the numeric values in a single-column bag.
7.1.1. Syntax
AVG(expression)
7.1.2. Terms
expression
7.1.3. Usage
Use the AVG function to compute the average of the numeric values in a single-column bag.
AVG requires a preceding GROUP ALL statement for global averages and a GROUP BY
statement for group averages.
The AVG function ignores NULL values.
7.1.4. Example
In this example the average GPA for each student is computed (see the GROUP operators for
information about the field names in relation B).
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
AVG input type -> result type:
int       -> long
long      -> long
float     -> double
double    -> double
chararray -> error
bytearray -> cast as double
7.2. CONCAT
Concatenates two fields of type chararray or two fields of type bytearray.
7.2.1. Syntax
CONCAT (expression, expression)
7.2.2. Terms
expression
7.2.3. Usage
Use the CONCAT function to concatenate two elements. The data type of the two elements
must be the same, either chararray or bytearray.
7.2.4. Example
In this example fields f2 and f3 are concatenated.
A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
X = FOREACH A GENERATE CONCAT(f2,f3);
DUMP X;
(opensource)
(mapreduce)
(piglatin)
CONCAT input types (expression1, expression2) -> result type:
chararray, chararray -> chararray
bytearray, bytearray -> bytearray
chararray, bytearray -> cast as chararray
7.3. COUNT
Computes the number of elements in a bag.
7.3.1. Syntax
COUNT(expression)
7.3.2. Terms
expression
7.3.3. Usage
Use the COUNT function to compute the number of elements in a bag. COUNT requires a
preceding GROUP ALL statement for global counts and a GROUP BY statement for group
counts.
The COUNT function ignores NULL values. If you want to include NULL values in the
count computation, use COUNT_STAR.
Note: You cannot use the tuple designator (*) with COUNT; that is, COUNT(*) will not
work.
7.3.4. Example
In this example the tuples in the bag are counted (see the GROUP operator for information
about the field names in relation B).
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
X = FOREACH B GENERATE COUNT(A);
DUMP X;
(1L)
(2L)
(1L)
(2L)
COUNT input type -> result type:
int       -> long
long      -> long
float     -> long
double    -> long
chararray -> long
bytearray -> long
7.4. COUNT_STAR
Computes the number of elements in a bag.
7.4.1. Syntax
COUNT_STAR(expression)
7.4.2. Terms
expression
7.4.3. Usage
Use the COUNT_STAR function to compute the number of elements in a bag.
COUNT_STAR requires a preceding GROUP ALL statement for global counts and a
GROUP BY statement for group counts.
COUNT_STAR includes NULL values in the count computation (unlike COUNT, which
ignores NULL values).
7.4.4. Example
In this example COUNT_STAR is used to count the tuples in a bag, as sketched below.
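A minimal sketch, reusing relations A and B from the COUNT example above:
X = FOREACH B GENERATE COUNT_STAR(A);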
7.5. DIFF
Compares two fields in a tuple.
7.5.1. Syntax
DIFF (expression, expression)
7.5.2. Terms
expression
7.5.3. Usage
The DIFF function compares two fields in a tuple. If the field values match, null is returned.
If the field values do not match, the non-matching elements are returned.
7.5.4. Example
In this example the two fields are bags. DIFF compares the tuples in each bag.
A = LOAD 'bag_data' AS
(B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
X = FOREACH A GENERATE DIFF(B1,B2);
DUMP X;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})
7.6. IsEmpty
Checks if a bag or map is empty.
7.6.1. Syntax
IsEmpty(expression)
7.6.2. Terms
expression
7.6.3. Usage
The IsEmpty function checks if a bag or map is empty (has no data). The function can be
used to filter data.
7.6.4. Example
In this example all students with an SSN but no name are located.
SSN = load 'ssn.txt' using PigStorage() as (ssn:long);
SSN_NAME = load 'students.txt' using PigStorage() as (ssn:long,
name:chararray);
-- do a left outer join of SSN with SSN_NAME
X = cogroup SSN by ssn inner, SSN_NAME by ssn;
-- only keep those ssn's for which there is no name
Y = filter X by IsEmpty(SSN_NAME);
7.7. MAX
Computes the maximum of the numeric values or chararrays in a single-column bag. MAX
requires a preceding GROUP ALL statement for global maximums and a GROUP BY
statement for group maximums.
7.7.1. Syntax
MAX(expression)
7.7.2. Terms
expression
7.7.3. Usage
Use the MAX function to compute the maximum of the numeric values or chararrays in a
single-column bag.
7.7.4. Example
In this example the maximum GPA for all terms is computed for each student (see the
GROUP operator for information about the field names in relation B).
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
MAX input type -> result type:
int       -> int
long      -> long
float     -> float
double    -> double
chararray -> chararray
bytearray -> cast as double
7.8. MIN
Computes the minimum of the numeric values or chararrays in a single-column bag. MIN
requires a preceding GROUP ALL statement for global minimums and a GROUP BY
statement for group minimums.
7.8.1. Syntax
MIN(expression)
7.8.2. Terms
expression
7.8.3. Usage
Use the MIN function to compute the minimum of a set of numeric values or chararrays in a
single-column bag.
7.8.4. Example
In this example the minimum GPA for all terms is computed for each student (see the
GROUP operator for information about the field names in relation B).
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MIN(A.gpa);
DUMP X;
(John,3.7F)
(Mary,3.8F)
MIN input type -> result type:
int       -> int
long      -> long
float     -> float
double    -> double
chararray -> chararray
bytearray -> cast as double
7.9. SIZE
Computes the number of elements based on any Pig data type.
7.9.1. Syntax
SIZE(expression)
7.9.2. Terms
expression
7.9.3. Usage
Use the SIZE function to compute the number of elements based on the data type (see the
Types Tables below). SIZE includes NULL values in the size computation. SIZE is not
algebraic.
7.9.4. Example
In this example the number of characters in the first field is computed.
A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
X = FOREACH A GENERATE SIZE(f1);
DUMP X;
(6L)
(6L)
(3L)
SIZE input type -> result:
int       -> returns 1
long      -> returns 1
float     -> returns 1
double    -> returns 1
chararray -> returns the number of characters in the array
bytearray -> returns the number of bytes in the array
tuple     -> returns the number of fields in the tuple
bag       -> returns the number of tuples in the bag
map       -> returns the number of key/value pairs in the map
7.10. SUM
Computes the sum of the numeric values in a single-column bag. SUM requires a preceding
GROUP ALL statement for global sums and a GROUP BY statement for group sums.
7.10.1. Syntax
SUM(expression)
7.10.2. Terms
expression
7.10.3. Usage
Use the SUM function to compute the sum of a set of numeric values in a single-column bag.
7.10.4. Example
In this example the number of pets each person owns is computed (see the GROUP operator
for information about the field names in relation B); a sketch is shown below.
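A minimal sketch of such a script (the field names owner, pet_type, and pet_num are
illustrative):
A = LOAD 'data' AS (owner:chararray, pet_type:chararray, pet_num:int);
B = GROUP A BY owner;
X = FOREACH B GENERATE group, SUM(A.pet_num);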
SUM input type -> result type:
int       -> long
long      -> long
float     -> double
double    -> double
chararray -> error
bytearray -> cast as double
7.11. TOKENIZE
Splits a string and outputs a bag of words.
7.11.1. Syntax
TOKENIZE(expression)
7.11.2. Terms
expression
7.11.3. Usage
Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag
of words (each word in a single tuple). The following characters are considered to be word
separators: space, double quote ("), comma (,), parentheses (()), star (*).
7.11.4. Example
In this example the strings in each row are split.
A = LOAD 'data' AS (f1:chararray);
DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)
X = FOREACH A GENERATE TOKENIZE(f1);
DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
8. Load/Store Functions
Load/Store functions determine how data goes into Pig and comes out of Pig. Pig provides a
set of built-in load/store functions, described in the sections below. You can also write your
own load/store functions (see the Pig UDF Manual).
8.1. Handling Compression
To work with bzip compressed files, the input/output files must have a .bz or .bz2
extension. Because the compression is block-oriented, bzipped files can be split across
multiple maps.
A = LOAD 'myinput.bz';
STORE A INTO 'myoutput.bz';
Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT
concatenated files, that is, files produced by running the Unix cat command over several
individually compressed files. If you use concatenated gzip or bzip files with your Pig jobs,
you will NOT see a failure but the results will be INCORRECT.
8.2. BinStorage
Loads and stores data in machine-readable format.
8.2.1. Syntax
BinStorage()
8.2.2. Terms
none
no parameters
8.2.3. Usage
BinStorage works with data that is represented on disk in machine-readable format.
BinStorage does NOT support compression.
BinStorage is used internally by Pig to store the temporary data that is created between
multiple map/reduce jobs.
8.2.4. Example
In this example BinStorage is used with the LOAD and STORE functions.
A = LOAD 'data' USING BinStorage();
STORE X INTO 'output' USING BinStorage();
8.3. PigStorage
Loads and stores data in UTF-8 format.
8.3.1. Syntax
PigStorage(field_delimiter)
8.3.2. Terms
field_delimiter
Parameter. The default field delimiter is the tab character ('\t'). You can specify other
characters as field delimiters; however, be sure to enclose the characters in single quotes.
8.3.3. Usage
PigStorage is the default function for the LOAD and STORE operators and works with both
simple and complex data types.
PigStorage supports structured text files (in human-readable UTF-8 format). PigStorage also
supports compression.
Load statements: PigStorage expects data to be formatted using field delimiters, either the
tab character ('\t') or another specified character.
Store statements: PigStorage outputs data using field delimiters, either the tab character
('\t') or another specified character, and the line feed record delimiter ('\n').
Field delimiters: For load and store statements the default field delimiter is the tab character
('\t'). You can use other characters as field delimiters, but separators such as ^A or Ctrl-A
should be represented in Unicode (\u0001) using UTF-16 encoding (see Wikipedia ASCII,
Unicode, and UTF-16).
Record delimiters: For load statements Pig interprets the line feed ('\n'), carriage return
('\r' or Ctrl-M), and combined CR + LF ('\r\n') characters as record delimiters (do not use
these characters as field delimiters). For store statements Pig uses the line feed ('\n') character
as the record delimiter.
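For example, a load statement that uses Ctrl-A as the field delimiter might look like this
(the file name and schema are illustrative):
A = LOAD 'data' USING PigStorage('\u0001') AS (f1:chararray, f2:chararray);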
8.3.4. Example
In this example PigStorage expects the input to contain tab-separated fields and
newline-separated records. The two statements are equivalent because tab is the default field
delimiter.
A = LOAD 'student' USING PigStorage('\t') AS (name: chararray, age:int,
gpa: float);
A = LOAD 'student' AS (name: chararray, age:int, gpa: float);
In this example PigStorage stores the contents of X into files with fields that are delimited
with an asterisk ( * ). The STORE function specifies that the files will be located in a
directory named output and that the files will be named part-nnnnn (for example,
part-00000).
STORE X INTO 'output' USING PigStorage('*');
8.4. PigDump
Stores data in UTF-8 format.
8.4.1. Syntax
PigDump()
8.4.2. Terms
none
no parameters
8.4.3. Usage
PigDump stores data as tuples in human-readable UTF-8 format.
8.4.4. Example
In this example PigDump is used with the STORE function.
STORE X INTO 'output' USING PigDump();
8.5. TextLoader
Loads unstructured data in UTF-8 format.
8.5.1. Syntax
TextLoader()
8.5.2. Terms
none
no parameters
8.5.3. Usage
TextLoader works with unstructured data in UTF-8 format. Each resulting tuple contains a
single field with one line of input text. TextLoader also supports compression.
Currently, TextLoader support for compression is limited.
TextLoader cannot be used to store data.
8.5.4. Example
In this example TextLoader is used with the LOAD function.
A = LOAD 'data' USING TextLoader();
9. Shell Commands
9.1. fs
Invokes any FSShell command from within a Pig script or the Grunt shell.
9.1.1. Syntax
fs subcommand subcommand_parameters
9.1.2. Terms
subcommand
subcommand_parameters
9.1.3. Usage
Use the fs command to invoke any FSShell command from within a Pig script or Grunt shell.
The fs command greatly extends the set of supported file system commands and the
capabilities supported for existing commands such as ls, which now supports globbing. For a
complete list of FSShell commands, see the HDFS File System Shell Guide.
9.1.4. Examples
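A few representative invocations (the paths are illustrative):
grunt> fs -ls
grunt> fs -mkdir /tmp/mydir
grunt> fs -rmr /tmp/mydir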
10. File Commands
10.1. cat
Prints the content of one or more files to the screen.
10.1.1. Syntax
cat path [ path ]
10.1.2. Terms
path
10.1.3. Usage
The cat command is similar to the Unix cat command. If multiple files are specified, content
from all files is concatenated together. If multiple directories are specified, content from all
files in all directories is concatenated together.
10.1.4. Example
In this example the students file in the data directory is printed.
grunt> cat data/students;
joe smith
john adams
anne white
10.2. cd
Changes the current directory to another directory.
10.2.1. Syntax
cd [dir]
10.2.2. Terms
dir
10.2.3. Usage
The cd command is similar to the Unix cd command and can be used to navigate the file
system. If a directory is specified, this directory is made your current working directory and
all other operations happen relative to this directory. If no directory is specified, your home
directory (/user/NAME) becomes the current working directory.
10.2.4. Example
In this example we move to the /data directory.
grunt> cd /data
10.3. copyFromLocal
Copies a file or directory from the local file system to HDFS.
10.3.1. Syntax
copyFromLocal src_path dst_path
10.3.2. Terms
src_path
dst_path
10.3.3. Usage
The copyFromLocal command enables you to copy a file or a directory from the local file
system to the Hadoop Distributed File System (HDFS). If a directory is specified, it is
recursively copied over. Dot "." can be used to specify that the new file/directory should be
created in the current working directory and retain the name of the source file/directory.
10.3.4. Example
In this example a file (students) and a directory (/data/tests) are copied from the local file
system to HDFS.
grunt> copyFromLocal /data/students students
grunt> ls students
/data/students <r 3> 8270
grunt> copyFromLocal /data/tests new_test
grunt> ls new_test
/data/new_test/test1.data <r 3> 664
/data/new_test/test2.data <r 3> 344
/data/new_test/more_data
10.4. copyToLocal
Copies a file or directory from HDFS to a local file system.
10.4.1. Syntax
copyToLocal src_path dst_path
10.4.2. Terms
src_path
dst_path
10.4.3. Usage
The copyToLocal command enables you to copy a file or a directory from the Hadoop Distributed
File System (HDFS) to a local file system. If a directory is specified, it is recursively copied
over. Dot "." can be used to specify that the new file/directory should be created in the
current working directory (directory from which the script was executed or grunt shell
started) and retain the name of the source file/directory.
10.4.4. Example
In this example two files are copied from HDFS to the local file system.
grunt> copyToLocal students /data
grunt> copyToLocal data /data/mydata
10.5. cp
Copies a file or directory within HDFS.
10.5.1. Syntax
cp src_path dst_path
10.5.2. Terms
src_path
dst_path
10.5.3. Usage
The cp command is similar to the Unix cp command and enables you to copy files or
directories within HDFS. If a directory is specified, it is recursively copied over. Dot "." can be
used to specify that the new file/directory should be created in the current working directory
and retain the name of the source file/directory.
10.5.4. Example
In this example a file (students) is copied to another file (students_save).
grunt> cp students students_save
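To illustrate the dot convention described above (the source path is illustrative), this copies
/data/students into the current working directory under the same name.
grunt> cp /data/students .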
10.6. ls
Lists the contents of a directory.
10.6.1. Syntax
ls [path]
10.6.2. Terms
path
10.6.3. Usage
The ls command is similar to the Unix ls command and enables you to list the contents of a
directory. If a path is specified, the command lists the content of the specified directory.
Otherwise, the content of the current working directory is listed.
10.6.4. Example
In this example the contents of the data directory are listed.
grunt> ls /data
/data/DDLs <dir>
/data/count <dir>
/data/data <dir>
/data/schema <dir>
10.7. mkdir
Creates a new directory.
10.7.1. Syntax
mkdir path
10.7.2. Terms
path
10.7.3. Usage
The mkdir command is similar to the Unix mkdir command and enables you to create a new
directory. If you specify a directory or path that does not exist, it will be created.
10.7.4. Example
In this example a directory and a subdirectory are created; a minimal sketch is shown below.
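A minimal sketch (the names are illustrative); because mkdir creates any missing path
components, a single command creates both levels:
grunt> mkdir data/mysubdir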
10.8. mv
Moves a file or directory within the Hadoop Distributed File System (HDFS).
10.8.1. Syntax
mv src_path dst_path
10.8.2. Terms
src_path
dst_path
10.8.3. Usage
The mv command is identical to the cp command (which copies files or directories
within HDFS) except that it deletes the source file or directory as soon as it is copied.
If a directory is specified, it is recursively moved. Dot "." can be used to specify that the new
file/directory should be created in the current working directory and retain the name of the
source file/directory.
10.8.4. Example
In this example the output directory is copied to output2 and then deleted.
grunt> mv output output2
grunt> ls output
File or directory output does not exist.
grunt> ls output2
/data/output2/map-000000<r 3>  508844
/data/output2/output3          <dir>
/data/output2/part-00000<r 3>  0
10.9. pwd
Prints the name of the current working directory.
10.9.1. Syntax
pwd
10.9.2. Terms
none
no parameters
10.9.3. Usage
The pwd command is identical to the Unix pwd command: it prints the name of the current
working directory.
10.9.4. Example
In this example the name of the current working directory is /data.
grunt> pwd
/data
10.10. rm
Removes one or more files or directories.
10.10.1. Syntax
rm path [path]
10.10.2. Terms
path
10.10.3. Usage
The rm command is similar to the Unix rm command and enables you to remove one or more
files or directories.
Note: This command recursively removes a directory even if it is not empty; it does not ask
for confirmation, and the removed data is not recoverable.
10.10.4. Example
In this example files are removed.
grunt> rm /data/students
grunt> rm students students_save
10.11. rmf
Forcibly removes one or more files or directories.
10.11.1. Syntax
rmf path [path ]
10.11.2. Terms
path
10.11.3. Usage
The rmf command is similar to the Unix rm -f command and enables you to forcibly remove
one or more files or directories.
Note: This command recursively removes a directory even if it is not empty; it does not ask
for confirmation, and the removed data is not recoverable.
10.11.4. Example
In this example files are forcibly removed.
grunt> rmf /data/students
grunt> rmf students students_save
11. Utility Commands
11.1. exec
Runs a Pig script.
11.1.1. Syntax
exec [-param param_name = param_value] [-param_file file_name] script
11.1.2. Terms
-param param_name = param_value
-param_file file_name
script
11.1.3. Usage
Use the exec command to run a Pig script with no interaction between the script and the
Grunt shell (batch mode). Aliases defined in the script are not available to the shell; however,
the files produced as the output of the script and stored on the system are visible after the
script is run. Aliases defined via the shell are not available to the script.
With the exec command, store statements will not trigger execution; rather, the entire script
is parsed before execution starts. Unlike the run command, exec does not change the
command history or remember the handles used inside the script. Exec without any
parameters can be used in scripts to force execution up to the point in the script where the
exec occurs.
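A sketch of that pattern (the aliases and file names are illustrative); the parameterless exec
forces the statements above it to execute before the rest of the script continues:
A = LOAD 'data' AS (f1:int);
B = FILTER A BY f1 > 0;
STORE B INTO 'positives';
exec
C = LOAD 'positives' AS (f1:int);
DUMP C;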
For comparison, see the run command. Both the exec and run commands are useful for
debugging because you can modify a Pig script in an editor and then rerun the script in the
Grunt shell without leaving the shell. Also, both commands promote Pig script modularity as
they allow you to reuse existing components.
11.1.4. Examples
In this example the script is displayed and run.
grunt> cat myscript.pig
a = LOAD 'student' AS (name, age, gpa);
b = LIMIT a 3;
DUMP b;
grunt> exec myscript.pig
(alice,20,2.47)
(luke,18,4.00)
(holly,24,3.27)
11.2. help
Prints a list of Pig commands.
11.2.1. Syntax
help
11.2.2. Terms
none
no parameters
11.2.3. Usage
The help command prints a list of Pig commands.
11.2.4. Example
In this example the help command prints a list of Pig commands.
grunt> help
Commands:
<pig latin statement>;
store <alias> into <filename> [using <functionSpec>]
dump <alias>
etc
11.3. kill
Kills a job.
11.3.1. Syntax
kill jobid
11.3.2. Terms
jobid
11.3.3. Usage
The kill command enables you to kill a job based on a job id.
11.3.4. Example
In this example the job with id job_0001 is killed.
grunt> kill job_0001
11.4. quit
Quits from the Pig grunt shell.
11.4.1. Syntax
quit
11.4.2. Terms
none
no parameters
11.4.3. Usage
The quit command enables you to quit or exit the Pig grunt shell.
11.4.4. Example
In this example the quit command exits the Pig grunt shell.
grunt> quit
11.5. run
Runs a Pig script.
11.5.1. Syntax
run [-param param_name = param_value] [-param_file file_name] script
11.5.2. Terms
-param param_name = param_value
-param_file file_name
script
11.5.3. Usage
Use the run command to run a Pig script that can interact with the Grunt shell (interactive
mode). The script has access to aliases defined externally via the Grunt shell. The Grunt shell
has access to aliases defined within the script. All commands from the script are visible in the
command history.
With the run command, every store triggers execution. The statements from the script are put
into the command history and all the aliases defined in the script can be referenced in
subsequent statements after the run command has completed. Issuing a run command on the
grunt command line has basically the same effect as typing the statements manually.
For comparison, see the exec command. Both the run and exec commands are useful for
debugging because you can modify a Pig script in an editor and then rerun the script in the
Grunt shell without leaving the shell. Also, both commands promote Pig script modularity as
they allow you to reuse existing components.
11.5.4. Example
In this example the script interacts with the results of commands issued via the Grunt shell.
grunt> cat myscript.pig
b = ORDER a BY name;
c = LIMIT b 10;
grunt> a = LOAD 'student' AS (name, age, gpa);
grunt> run myscript.pig
grunt> d = LIMIT c 3;
grunt> DUMP d;
(alice,20,2.47)
(alice,27,1.95)
(alice,36,2.27)
11.6. set
Assigns values to keys used in Pig.
11.6.1. Syntax
set key 'value'
11.6.2. Terms
key
value
11.6.3. Usage
The set command enables you to assign values to keys, as shown in the table.
Key              Value                                    Description
default_parallel a whole number                           sets the number of reducers for all MapReduce jobs generated by Pig
debug            on/off                                   turns debug-level logging on or off
job.name         a single-quoted string                   sets a user-specified name for the job
job.priority     very_low, low, normal, high, very_high   sets the priority of the job
stream.skippath  a single-quoted path                     for streaming, specifies a path from which data should not be shipped (see About Auto-Ship)
11.6.4. Example
In this example debug is set on, the job is assigned a name, and the number of reducers is set
to 100.
grunt> set debug 'on'
grunt> set job.name 'my job'
grunt> set default_parallel 100
In this example default_parallel is set in the Pig script; all MapReduce jobs that get launched
will use 20 reducers.
SET DEFAULT_PARALLEL 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();