Pig Latin


Pig's language, Pig Latin, is used to express data flows. Using Pig Latin, we run queries against our input data and get the required answer.

Let's understand the Pig Latin operators using the sample CSV file below, which contains employee id, employee name, place and age:

[Image: Emp, the sample employee CSV file]

Launch Pig as explained in the previous post, in either local or MapReduce mode.

LOAD:

Load is used to load data from the filesystem. The command below loads the data; EMP is the alias for the relation:

EMP = load 'Employee.csv' using PigStorage(',') AS (EMP_ID:int,EMP_NAME:chararray,PLACE:chararray,AGE:int);

PigStorage is Pig's built-in load/store function, which reads or writes a relation using a field delimiter. The default delimiter is the tab character. Because we are using a CSV file, I have passed ',' as the parameter to PigStorage.
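As a rough analogy only, the effect of loading with a delimiter and an AS schema can be sketched in plain Python. This is not how Pig is implemented; the helper name parse_line and the two sample rows are made up for illustration, and the handling of bad values (turning them into None, similar to a Pig null) is a simplification.

```python
# Rough Python analogy of:
#   load 'Employee.csv' using PigStorage(',') AS (EMP_ID:int, EMP_NAME:chararray, ...)
# The schema pairs each field name with the Python type used to cast it.
SCHEMA = [("EMP_ID", int), ("EMP_NAME", str), ("PLACE", str), ("AGE", int)]


def parse_line(line, delimiter=","):
    """Split one text line on the delimiter and cast each field per SCHEMA.

    A field that is missing or cannot be cast becomes None
    (a simplified stand-in for a null value).
    """
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for position, (_name, cast) in enumerate(SCHEMA):
        try:
            row.append(cast(fields[position]))
        except (IndexError, ValueError):
            row.append(None)
    return tuple(row)


# Hypothetical sample rows, not taken from the original Employee.csv.
lines = ["1,Alice,Hyderabad,30", "2,Bob,Chennai,abc"]
EMP = [parse_line(line) for line in lines]
```

Passing a different delimiter, for example parse_line(text, "\t"), mirrors using PigStorage with its default tab delimiter.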

DESCRIBE:

Describe prints a relation's schema. For instance, describe EMP prints:

grunt> describe EMP
EMP: {EMP_ID: int,EMP_NAME: chararray,PLACE: chararray,AGE: int}

DUMP:

Dump is a diagnostic operator that prints the contents of a relation to the console. For example, DUMP EMP:

[Image: output of dump EMP]

FOREACH:

Foreach adds fields to or removes fields from a relation. Let's see how the foreach..generate operator works. Say we are only interested in the employee name from the relation EMP. The command below keeps the employee name and drops the other fields:

EMP_NAME = foreach EMP generate EMP_NAME;

dump EMP_NAME prints the result below:

[Image: output of dump EMP_NAME]

Let's look at a use case where we have to find the total amount spent by each employee on a shopping website. We will use the sample Transactions.csv file below:

EMP_ID,PROD_ID,AMOUNT,DATE
1,10,100.50,1/5/2015
2,20,50,2/5/2015
1,20,30,2/4/2015
1,20,45.50,3/3/2015
3,20,25,2/5/2015
4,10,30,5/2/2015
2,25,2,15/2/2015

Create a relation to load the Transactions.csv file:

TRAN = load 'Transactions.csv' using PigStorage(',') as (EMP_ID:int,PROD_ID:int,AMOUNT:float,date:chararray);

Now we have to group the transactions by employee id, for which we can use Pig's GROUP operator, and then add up the amounts in each group with SUM:

GRO = GROUP TRAN BY EMP_ID;
grunt> describe GRO
GRO: {group: int,TRAN: {(EMP_ID: int,PROD_ID: int,AMOUNT: float,date: chararray)}}
grunt> RES = foreach GRO generate group,SUM(TRAN.AMOUNT) AS TOT_AMT;

dump RES; produces the results below:

(1,176.0)
(2,52.0)
(3,25.0)
(4,30.0)
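To make the two steps concrete, here is a minimal Python sketch of what GROUP ... BY followed by foreach ... generate group, SUM(...) computes on the sample transactions. It only illustrates the semantics, not Pig's execution model; the function name group_by_emp is made up for this example.

```python
from collections import defaultdict

# The rows of the sample Transactions.csv shown above:
# (EMP_ID, PROD_ID, AMOUNT, date)
TRAN = [
    (1, 10, 100.50, "1/5/2015"),
    (2, 20, 50.0, "2/5/2015"),
    (1, 20, 30.0, "2/4/2015"),
    (1, 20, 45.50, "3/3/2015"),
    (3, 20, 25.0, "2/5/2015"),
    (4, 10, 30.0, "5/2/2015"),
    (2, 25, 2.0, "15/2/2015"),
]


def group_by_emp(tran):
    """Analogy of GROUP TRAN BY EMP_ID: map each key to the bag of its rows."""
    groups = defaultdict(list)
    for row in tran:
        groups[row[0]].append(row)
    return dict(groups)


GRO = group_by_emp(TRAN)

# Analogy of: foreach GRO generate group, SUM(TRAN.AMOUNT) AS TOT_AMT;
RES = sorted((emp_id, sum(row[2] for row in rows)) for emp_id, rows in GRO.items())

for record in RES:
    print(record)
```

Each entry of GRO holds the full transaction rows for one employee, which matches the nested bag shown in the describe GRO output.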

JOIN:

If we want to display the employee name as well, we can JOIN the EMP relation with the RES relation:

grunt> J = JOIN EMP by EMP_ID,RES BY group;
grunt> describe J
J: {EMP::EMP_ID: int,EMP::EMP_NAME: chararray,EMP::PLACE: chararray,EMP::AGE: int,RES::group: int,RES::TOT_AMT: double}
grunt> FIN_RES = foreach J generate EMP_ID,EMP_NAME,TOT_AMT;

dump FIN_RES; prints the results along with the employee name:
[Image: output of dump FIN_RES]
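The join and the projection that follows it can be sketched in Python as below. This is an illustration of inner-join semantics only: the employee names, places and ages are hypothetical stand-ins (the original Employee.csv contents are not reproduced here), and inner_join is a helper written for this example.

```python
# Hypothetical employee rows: (EMP_ID, EMP_NAME, PLACE, AGE).
# Employee 5 has no transactions, so an inner join drops that row.
EMP = [
    (1, "Alice", "Hyderabad", 30),
    (2, "Bob", "Chennai", 28),
    (3, "Carol", "Pune", 41),
    (4, "Dan", "Delhi", 35),
    (5, "Eve", "Mumbai", 26),
]

# Totals per employee as printed by dump RES: (group, TOT_AMT).
RES = [(1, 176.0), (2, 52.0), (3, 25.0), (4, 30.0)]


def inner_join(left, right, left_key, right_key):
    """Pair every left row with every right row that has the same key value."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[right_key], []).append(right_row)
    return [
        left_row + right_row
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


# Analogy of: J = JOIN EMP by EMP_ID, RES BY group;
J = inner_join(EMP, RES, 0, 0)

# Analogy of: FIN_RES = foreach J generate EMP_ID, EMP_NAME, TOT_AMT;
# Joined row layout: (EMP_ID, EMP_NAME, PLACE, AGE, group, TOT_AMT)
FIN_RES = [(row[0], row[1], row[5]) for row in J]
```

The joined rows carry the fields of both relations side by side, which is why describe J lists them with the EMP:: and RES:: prefixes.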

STORE:

The dump operator only displays the relation on the console; to store it in the Hadoop file system we have to use the store operator.

STORE FIN_RES INTO '<output path>' using PigStorage(',');

Pig Latin supports many operators for grouping and joining data. GROUP and JOIN, discussed above, are only two of them; other operators such as COGROUP and CROSS can be used depending on our requirements.
