Pig

Pig compiles its scripts into MapReduce jobs, so the programmer can concentrate on the data flow and not worry about writing MapReduce code.

Well-known users of Pig: 1. Yahoo 2. Twitter

Dataflow Language

log = LOAD 'excite-small.log' AS (user, time, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
DUMP cntd;
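To make the data flow of the LOAD / GROUP / FOREACH / DUMP script concrete, here is a minimal Python sketch that mimics each step on an in-memory list. The three sample rows are made up for illustration (they are not the contents of excite-small.log), and the sort is only there to make the printed order repeatable; Pig itself gives no ordering guarantee here.

```python
from collections import defaultdict

# Hypothetical rows standing in for the log file: (user, time, query).
log = [
    ("u1", "970916105432", "pig latin"),
    ("u2", "970916105500", "hadoop"),
    ("u1", "970916105601", "grunt shell"),
]

# grpd = GROUP log BY user;
# Each group key maps to a bag holding the complete input tuples.
grpd = defaultdict(list)
for row in log:
    grpd[row[0]].append(row)

# cntd = FOREACH grpd GENERATE group, COUNT(log);
# Sorted only so that the output order is deterministic in this sketch.
cntd = sorted((user, len(bag)) for user, bag in grpd.items())

# DUMP cntd;
for record in cntd:
    print(record)
```

The point of the sketch is that GROUP keeps whole tuples inside each bag, and COUNT then simply measures the bag.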

Running Pig

To enter the Grunt shell in local mode:

pig -x local

To enter the Grunt shell in Hadoop (MapReduce) mode:

pig -x mapreduce

Managing Grunt Shell

In addition to running Pig Latin statements (which we’ll look at in a later section), the Grunt shell supports some basic utility commands. Typing help prints a help screen listing these commands. You exit the Grunt shell with quit. You can stop a Hadoop job with the kill command followed by the Hadoop job ID. Some Pig parameters are set with the set command. For example,

grunt> set debug on

grunt> set job.name 'my job'

Grunt Shell Commands
 * Utility commands || help, quit, kill jobid, set debug [on|off], set job.name 'jobname' ||
 * File commands || cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf, exec, run ||

Data Read / Write Operators in Pig Latin
 * LOAD || alias = LOAD 'file' [USING function] [AS schema]; ||
 * || Load data from a file into a relation. Uses the PigStorage load function as default unless specified otherwise with the USING option. The data can be given a schema using the AS option. ||
 * LIMIT || alias = LIMIT alias n; ||
 * || Limit the number of tuples to n. When used right after alias was processed by an ORDER operator, LIMIT returns the first n tuples. Otherwise there’s no guarantee which tuples are returned. The LIMIT operator defies categorization because it’s certainly not a read/write operator but it’s not a true relational operator either. We include it here for the practical reason that a reader looking up the DUMP operator, explained later, will remember to use the LIMIT operator right before it. ||
 * DUMP || DUMP alias; ||
 * || Display the content of a relation. Use mainly for debugging. The relation should be small enough for printing on screen. You can apply the LIMIT operation on an alias to make sure it’s small enough for display. ||
 * STORE || STORE alias INTO 'directory' [USING function]; ||
 * || Store data from a relation into a directory. The directory must not exist when this command is executed. Pig will create the directory and store the relation in files named part-nnnnn in it. Uses the PigStorage store function as default unless specified otherwise with the USING option. ||

Diagnostic Operators in Pig Latin
 * DESCRIBE || DESCRIBE alias; ||
 * || Display the schema of a relation. ||
 * EXPLAIN || EXPLAIN [-out path] [-brief] [-dot] [-param ...] [-param_file ...] alias; ||
 * || Display the execution plan used to compute a relation. When used with a script name, for example, EXPLAIN myscript.pig, it will show the execution plan of the script. ||
 * ILLUSTRATE || ILLUSTRATE alias; ||
 * || Display step-by-step how data is transformed, starting with a load command, to arrive at the resulting relation. To keep the display and processing manageable, only a (not completely random) sample of the input data is used to simulate the execution. ||
 * || In the unfortunate case where none of Pig’s initial sample will survive the script to generate meaningful data, Pig will “fake” some similar initial data that will survive to generate data for alias. For example, consider these operations: ||
 * || A = LOAD 'student.data' as (name, age); ||
 * || B = FILTER A by age > 18; ||
 * || ILLUSTRATE B; ||
 * || If every tuple Pig samples for A happens to have age less than or equal to 18, B is empty and not much is “illustrated.” Instead Pig will construct for A some tuples with age greater than 18. This way B won’t be an empty relation and users can see how the script works. ||
 * || In order for ILLUSTRATE to work, the load command in the first step must include a schema. The subsequent transformations must not include the LIMIT or SPLIT operators, or the nested FOREACH operator, or the use of the map data type (to be explained in section 10.5.1). ||

Data Types in Pig Latin
 * int || Signed 32-bit integer ||
 * long || Signed 64-bit integer ||
 * float || 32-bit floating point ||
 * double || 64-bit floating point ||
 * chararray || Character array (string) in Unicode UTF-8 ||
 * bytearray || Byte array (binary object) ||

Complex Data Types
 * Tuple || (12.5,hello world,-2) ||
 * || A tuple is an ordered set of fields. It’s most often used as a row in a relation. It’s represented by fields separated by commas, all enclosed by parentheses. ||
 * Bag || {(12.5,hello world,-2),(2.87,bye world,10)} ||
 * || A bag is an unordered collection of tuples. A relation is a special kind of bag, sometimes called an outer bag. An inner bag is a bag that is a field within some complex type. ||
 * || A bag is represented by tuples separated by commas, all enclosed by curly brackets. ||
 * || Tuples in a bag aren’t required to have the same schema or even have the same number of fields. It’s a good idea to keep them consistent though, unless you’re handling semistructured or unstructured data. ||
 * Map || [key#value] ||
 * || A map is a set of key/value pairs. Keys must be unique and be a string (chararray). The value can be any type. ||
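The three complex types and their written notation can be modelled with ordinary Python containers. This is only a sketch for building intuition: the helper name to_pig_text is hypothetical, a Python list stands in for a bag even though a real bag is unordered, and the rendering follows the notation described in these notes, not Pig's actual serialization code.

```python
# A tuple: an ordered set of fields.
row = (12.5, "hello world", -2)

# A bag: a collection of tuples (a list is used as a stand-in here).
bag = [(12.5, "hello world", -2), (2.87, "bye world", 10)]

# A map: string keys, values of any type.
mapping = {"key": "value"}


def to_pig_text(value):
    """Render a value in the parentheses / curly-bracket / key#value notation."""
    if isinstance(value, tuple):
        return "(" + ",".join(to_pig_text(v) for v in value) + ")"
    if isinstance(value, list):
        return "{" + ",".join(to_pig_text(v) for v in value) + "}"
    if isinstance(value, dict):
        pairs = (f"{k}#{to_pig_text(v)}" for k, v in value.items())
        return "[" + ",".join(pairs) + "]"
    return str(value)


print(to_pig_text(row))
print(to_pig_text(bag))
print(to_pig_text(mapping))
```

Because the helper is recursive, nesting (for example a bag stored as a field of a tuple, the inner bag case) renders without extra code.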

Expressions in Pig Latin

 * Constant || 12, 19.2, 'hello world' || Constant values such as 19 or 'hello world'. Numeric values without a decimal point are treated as int unless l or L is appended to the number to make it a long. Numeric values with a decimal point are treated as double unless f or F is appended to the number to make it a float. ||
 * Basic arithmetic || +, -, *, / || Plus, minus, multiply, and divide. ||
 * Sign || +x, -x || Negation (-) changes the sign of a number. ||
 * Cast || (t)x || Convert the value of x into type t. ||
 * Modulo || x % y || The remainder of x divided by y. ||
 * Conditional || (x ? y : z) || Returns y if x is true, z otherwise. This expression must be enclosed in parentheses. ||
 * Comparison || ==, !=, <, >, <=, >= || Equal to, not equal to, less than, greater than, less than or equal to, greater than or equal to. ||
 * Pattern matching || x matches regex || Regular expression matching of string x. Uses Java’s regular expression format (under the java.util.regex.Pattern class) to specify regex. ||
 * Null || x is null, x is not null || Check if x is null (or not). ||
 * Boolean || x and y, x or y, not x || Boolean and, or, not. ||
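Two of the expressions above are easy to misread, so here is a small Python sketch of their behavior. It assumes that matches must cover the whole string (as Java's Pattern.matches does) and models that with re.fullmatch; Python's regex dialect is close to Java's for simple patterns but not identical. The function names are hypothetical.

```python
import re


def matches(x, regex):
    """Model of `x matches regex`: the pattern must match the entire string."""
    return re.fullmatch(regex, x) is not None


def conditional(x, y, z):
    """Model of `(x ? y : z)`: y when x is true, z otherwise."""
    return y if x else z


# A pattern that only describes part of the string does not match.
print(matches("apache.org", r".*\.org"))
print(matches("apache.org", r"apache"))

# Label a hypothetical age field using the conditional expression.
print(conditional(21 > 18, "adult", "minor"))
```

In practice this means a substring search has to be written with leading and trailing `.*` in the pattern.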

Functions in Pig Latin

 * AVG || Calculate the average of numeric values in a single-column bag. ||
 * CONCAT || Concatenate two strings (chararray) or two bytearrays. ||
 * COUNT || Calculate the number of tuples in a bag. See SIZE for other data types. ||
 * DIFF || Compare two fields in a tuple. If the two fields are bags, it will return tuples that are in one bag but not the other. If the two fields are values, it will emit tuples where the values don’t match. ||
 * MAX || Calculate the maximum value in a single-column bag. The column must be a numeric type or a chararray. ||
 * MIN || Calculate the minimum value in a single-column bag. The column must be a numeric type or a chararray. ||
 * SIZE || Calculate the number of elements. For a bag it counts the number of tuples. For a tuple it counts the number of elements. For a chararray it counts the number of characters. For a bytearray it counts the number of bytes. For numeric scalars it always returns 1. ||
 * SUM || Calculate the sum of numeric values in a single-column bag. ||
 * TOKENIZE || Split a string (chararray) into a bag of words (each word is a tuple in the bag). Word separators are space, double quote ("), comma, parentheses, and asterisk (*). ||
 * IsEmpty || Check if a bag or map is empty. ||
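The descriptions of TOKENIZE and SIZE can be turned into a short Python sketch. It follows only the rules stated in the table (the listed separator characters, and SIZE returning 1 for numeric scalars); the function names are hypothetical and Pig's real implementations may differ in edge cases such as null input.

```python
import re


def tokenize(text):
    """Split on space, double quote, comma, parentheses and asterisk.

    Returns a bag (list) in which every word is a one-field tuple.
    """
    words = re.split(r'[ ",()*]+', text)
    return [(word,) for word in words if word]


def size(value):
    """Number of elements: 1 for numeric scalars, the length otherwise."""
    if isinstance(value, (int, float)):
        return 1
    return len(value)


bag = tokenize('hello world, "pig" (latin)*')
print(bag)
print(size(bag), size(bag[0]), size("hello"), size(b"ab"), size(42))
```

Note that COUNT on the resulting bag and SIZE on it give the same number here, since both count the tuples in the bag.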

Relational Operators in Pig Latin

 * FILTER || alias = FILTER alias BY expression; ||
 * || Selects tuples based on Boolean expression. Used to select tuples that you want or remove tuples that you don’t want. ||
 * DISTINCT || alias = DISTINCT alias [PARALLEL n]; ||
 * || Remove duplicate tuples. ||
 * SAMPLE || alias = SAMPLE alias factor; ||
 * || Randomly sample a relation. The sampling factor is given in factor. For example, a 1% sample of data in relation large_data is ||
 * || small_data = SAMPLE large_data 0.01; ||
 * || The operation is probabilistic in such a way that the size of small_data will not be exactly 1% of large_data, and there’s no guarantee the operation will return the same number of tuples each time. ||
 * FOREACH || alias = FOREACH alias GENERATE expression [,expression ...] [AS schema]; ||
 * || Loop through each tuple and generate new tuple(s). Usually applied to transform columns of data, such as adding or deleting fields. ||
 * || One can optionally specify a schema for the output relation; for example, naming new fields. ||

 * FOREACH (nested) || alias = FOREACH nested_alias { alias = nested_op; [alias = nested_op; ...] GENERATE expression [, expression ...]; }; ||
 * || Loop through each tuple in nested_alias and generate new tuple(s). At least one of the fields of nested_alias should be a bag. DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed operations in nested_op to operate on the inner bag(s). ||
 * JOIN || alias = JOIN alias BY field_alias, alias BY field_alias [, alias BY field_alias …] [USING 'replicated'] [PARALLEL n]; ||
 * || Compute inner join of two or more relations based on common field values. When using the replicated option, Pig stores all relations after the first one in memory for faster processing. You have to ensure that all those smaller relations together are indeed small enough to fit in memory. ||
 * || Under JOIN, when the input relations are flat, the output relation is also flat. In addition, the number of fields in the output relation is the sum of the number of fields in the input relations, and the output relation’s schema is a concatenation of the input relations’ schemas. ||
 * || JOIN has two more options for the USING clause besides 'replicated': 'skewed' and 'merge'. ||
 * GROUP || alias = GROUP alias { ALL | BY { [field_alias [, field_alias]] | * | [expression] } } [PARALLEL n]; ||
 * || Within a single relation, group together tuples with the same group key. Usually the group key is one or more fields, but it can also be the entire tuple (*) or an expression. One can also use GROUP alias ALL to group all tuples into one group. ||
 * || The output relation has two fields with autogenerated names. The first field is always named “group” and it has the same type as the group key. The second field takes the name of the input relation and is a bag type. The schema for the bag is the same as the schema for the input relation. ||
 * COGROUP || alias = COGROUP alias BY field_alias [INNER | OUTER], alias BY field_alias [INNER | OUTER] [PARALLEL n]; ||
 * || Group tuples from two or more relations, based on common group values. The output relation will have a tuple for each unique group value. Each tuple will have the group value as its first field. The second field is a bag containing tuples from the first input relation with matching group value. Likewise, the third field is a bag containing the matching tuples from the second input relation. ||
 * || Under the default OUTER semantics, all group values appearing in any input relation are represented in the output relation. If an input relation doesn’t have any tuple with a particular group value, it will have an empty bag in the corresponding output tuple. If the INNER option is set for a relation, then only group values that exist in that input relation are allowed in the output relation. There can’t be an empty bag for that relation in the output. ||
 * || You can group on multiple fields. For this, you have to specify the fields in a comma-separated list enclosed by parentheses for field_alias. ||
 * || COGROUP (with INNER) and JOIN are similar except that COGROUP generates nested output tuples. ||
 * CROSS || alias = CROSS alias, alias [, alias …] [PARALLEL n]; ||
 * || Compute the (flat) cross-product of two or more relations. This is an expensive operation and you should avoid it as far as possible. ||
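The difference between the nested output of COGROUP and the flat output of JOIN is easier to see on a tiny example. The following Python sketch models both on two hypothetical relations, grouping or joining on the first field; it illustrates the semantics described in the table and says nothing about how Pig executes them as MapReduce jobs.

```python
from collections import defaultdict

# Hypothetical input relations; field 0 is the group / join key.
a = [("alice", 1), ("bob", 2)]
b = [("alice", "x"), ("alice", "y"), ("carol", "z")]


def cogroup(left, right, key=0):
    """Default OUTER semantics: every key seen in either input appears once."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[key]][0].append(t)
    for t in right:
        groups[t[key]][1].append(t)
    # Sorted by key only to make the output order repeatable.
    return sorted((k, bags[0], bags[1]) for k, bags in groups.items())


def inner_join(left, right, key=0):
    """Flat output: the fields of both matching tuples are concatenated."""
    return [l + r for l in left for r in right if l[key] == r[key]]


for record in cogroup(a, b):
    print(record)
for record in inner_join(a, b):
    print(record)
```

In the COGROUP output, bob and carol each carry one empty bag, which is exactly what the INNER option would rule out. Every joined tuple has 2 + 2 = 4 fields, matching the rule that the output field count is the sum of the input field counts.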

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Programming using Java

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">*/
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Pig Latin type || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Java class ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Bytearray || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">DataByteArray ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Chararray || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">String ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Int || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Integer ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Long || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Long ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Float || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Float ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Double || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Double ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Tuple || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Tuple ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Bag || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">DataBag ||
 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Map || <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Map<Object, Object> ||

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">log = LOAD 'excite-small.log' AS (user, time, query);

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">lmt = LIMIT log 4; -- Only show 4 tuples

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">DUMP lmt;


 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">End of program


 * <span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Parameter substitution

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">When you write a reusable script, it’s generally parameterized such that you can vary its operation for each run. For example, the script may take the file paths of its input and output from the user each time. Pig supports parameter substitution to allow the user to specify such information at runtime. It denotes such parameters by the $ prefix within the script. For example, the following script displays a user-specified number of tuples from a user-specified log file:

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">log = LOAD '$input' AS (user, time, query);

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">lmt = LIMIT log $size;

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">DUMP lmt;

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">The parameters in this script are $input and $size. If you run this script using the pig command, you specify the parameters using the -param name=value argument.

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">pig -param input=excite-small.log -param size=4 Myscript.pig

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Note that you don’t need the $ prefix in the arguments. You can enclose a parameter value in single or double quotes if it has multiple words. A useful technique is to use Unix commands to generate the parameter values, particularly for dates. This is accomplished through Unix’s command substitution, which executes commands enclosed in backticks (`).

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">pig -param input=web-`date +%y-%m-%d`.log -param size=4 Myscript.pig

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">By doing this, the input file for Myscript.pig will be based on the date the script is run. For example, the input file will be web-09-07-29.log if the script is run on July 29, 2009.
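
The date-based file name can be previewed in the shell without starting Pig. The sketch below assumes a POSIX shell and uses the $(...) form, which behaves like backticks. The variable name input is chosen here only for illustration, and the pig command line is printed rather than executed.

```shell
# Command substitution: the shell runs `date` first and splices its output
# into the string, so Pig would only ever see the finished file name.
input="web-$(date +%y-%m-%d).log"   # $(...) is equivalent to backticks

# Print the command line that would be run (pig itself is not invoked here).
echo "pig -param input=$input -param size=4 Myscript.pig"
```

Because the shell performs the substitution, the value depends on the day the command is typed, not on anything inside the script.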

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">If you have to specify many parameters, it may be more convenient to put them in a file and tell Pig to execute the script using parameter substitution based on that file. For example, we can create a file Myparams.txt with the following content:

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;"># Comments in a parameter file start with hash

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">input=excite-small.log

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">size=4

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">The parameter file is passed to the pig command with the -param_file filename argument.

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">pig -param_file Myparams.txt Myscript.pig
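
As a rough sketch of preparing such a file from the shell, the following writes Myparams.txt with a here-document and lists the lines that carry parameter definitions. The scratch directory is an assumption made so the example does not overwrite anything, and the pig command is only printed.

```shell
# Work in a scratch directory (an assumption, so nothing existing is overwritten).
dir=$(mktemp -d)

# Write the parameter file from the text; the quoted 'EOF' keeps the shell
# from expanding anything inside the here-document.
cat > "$dir/Myparams.txt" <<'EOF'
# Comments in a parameter file start with hash
input=excite-small.log
size=4
EOF

# Show only the name=value lines, skipping the comment line.
params=$(grep -v '^#' "$dir/Myparams.txt")
echo "$params"

# The file would then be passed to Pig (printed here, not executed).
echo "pig -param_file $dir/Myparams.txt Myscript.pig"
```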

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">You can specify multiple parameter files as well as mix parameter files with direct specification of parameters at the command line using -param. If you define a parameter multiple times, the last definition takes precedence. When in doubt about what parameter values a script ends up using, you can run the pig command with the -debug option. This tells Pig to run the script and also output a file named original_script_name.substituted that has the original script but with all the parameters fully substituted. Executing pig with the -dryrun option outputs the same file but doesn’t execute the script.
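
To get a feel for what a fully substituted script looks like, the textual replacement can be imitated with sed. This is only a stand-in, under the assumption that substitution amounts to plain text replacement of $input and $size; it is not Pig's own mechanism, and the scratch directory is hypothetical.

```shell
# Scratch directory (an assumption) holding the parameterized script from the text.
dir=$(mktemp -d)
cat > "$dir/Myscript.pig" <<'EOF'
log = LOAD '$input' AS (user, time, query);
lmt = LIMIT log $size;
DUMP lmt;
EOF

# Replace each $name with its value. The single quotes stop the shell from
# expanding $input and $size itself; \$ matches a literal dollar sign in sed.
substituted=$(sed -e 's/\$input/excite-small.log/g' \
                  -e 's/\$size/4/g' "$dir/Myscript.pig")
echo "$substituted"
```

The printed statements read the same as the script would with input=excite-small.log and size=4 filled in.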

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">The exec and run commands allow you to run Pig Latin scripts from within the Grunt shell, and they support parameter substitution using the same -param and -param_file arguments; for example:

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">grunt> exec -param input=excite-small.log -param size=4 Myscript.pig

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">However, parameter substitution in exec and run doesn’t support Unix commands, and there’s no debug or dryrun option.

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">Multiquery execution

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">In the Grunt shell, a DUMP or STORE operation processes all previous statements needed for the result. On the other hand, Pig optimizes and processes an entire Pig script as a whole. This difference has no effect if your script has only one DUMP or STORE command at the end. If your script has multiple DUMP/STORE commands, Pig’s multiquery execution improves efficiency by avoiding redundant evaluations. For example, let’s say you have a script that stores intermediate data:

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">a = LOAD ...

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">b = some transformation of a

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">STORE b ...

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">c = some further transformation of b

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">STORE c ...

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">If you enter the statements in Grunt, where there’s no multiquery execution, it will generate a chain of jobs on the STORE b command to compute b. On encountering STORE c, Grunt will run another chain of jobs to compute c, but this time it will evaluate both a and b again! You can manually avoid this reevaluation by inserting a b = LOAD ... statement right after STORE b, to force the computation of c to use the saved value of b. This works on the assumption that the stored value of b has not been modified, because Grunt, by itself, has no way of knowing.
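
The manual workaround can be sketched as a concrete sequence of statements. The FILTER and LIMIT steps, the output paths b_out and c_out, and the file name reload.pig are hypothetical placeholders for the elided parts. The here-document just writes the statements to a file so they can be reviewed or pasted into Grunt.

```shell
# Scratch directory (an assumption) for the illustrative statement list.
dir=$(mktemp -d)

# Pig Latin statements; every concrete name below is hypothetical.
cat > "$dir/reload.pig" <<'EOF'
a = LOAD 'excite-small.log' AS (user, time, query);
b = FILTER a BY query IS NOT NULL;        -- hypothetical transformation of a
STORE b INTO 'b_out';                     -- hypothetical output path
b = LOAD 'b_out' AS (user, time, query);  -- reload the saved b
c = LIMIT b 4;                            -- hypothetical further transformation
STORE c INTO 'c_out';                     -- hypothetical output path
EOF

cat "$dir/reload.pig"
```

With the reload in place, c is derived from the stored copy of b, so the second STORE does not need to recompute a and b.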

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">On the other hand, if you run all the statements as a script, multiquery execution can optimize the execution by intelligently handling intermediate data. Pig compiles all the statements together and can locate the dependency and redundancy. Multiquery execution is enabled by default and usually has no effect on the computed results. But multiquery execution can fail if there are data dependencies that Pig is not aware of. This is quite rare but can happen with, for example, UDFs. Consider this script:

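<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">STORE a INTO 'out1';

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">b = LOAD ...

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">c = FOREACH b GENERATE MYUDF($0,'out1');

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">STORE c INTO 'out2';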

<span style="font-family: Verdana,Geneva,sans-serif; font-size: 90%;">If the custom function MYUDF accesses a through the file out1, the Pig compiler has no way of knowing that. Not seeing the dependency, the Pig compiler may erroneously think it OK to evaluate b and c before evaluating a. To disable multiquery execution, run the pig command with the -M or -no_multiquery option.
