
PIG UDF using Python



As discussed in the previous post, it is sometimes necessary to write user defined functions (UDFs) to extend the Pig Latin functionality. In this post let's discuss writing a UDF using Python. UDFs can also be written in Java, Ruby and JavaScript.

Two steps are involved here:

  • Write the required Python script
  • Register the Python script in Pig and use the Python function

Let's take a use case where a company needs to calculate tax at 10% of each employee's salary. In the below data set the last column is the employee salary on which we have to calculate the tax.

cat EMP_SAL.csv
1,Ravi,10.0
2,Prashant,120.5
3,Preeti,240.0
4,Swati,80.0

Tax.py:

@outputSchema("sal:float")
def tax(num):
    # add 10% tax to the salary, e.g. tax(10.0) -> 11.0
    return num + (10 * num) / 100

@outputSchema is a decorator: it defines the output schema in a format that Pig understands and is able to parse. Other decorators like outputSchemaFunction and schemaFunction are also supported.
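As a quick illustration, schemaFunction ties a UDF to a separate function that computes the output schema from the input schema; a minimal sketch along the lines of the square example in the Pig documentation (the function names here are illustrative):

@outputSchemaFunction("squareSchema")
def square(num):
    # output type follows the input type: square(int) -> int, square(double) -> double
    return num * num

@schemaFunction("squareSchema")
def squareSchema(input):
    # return the input schema unchanged
    return input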

PIG script:

REGISTER 'Tax.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
A = load 'EMP_SAL.csv' using PigStorage(',') as (EMP_ID:int,NAME:chararray,SALARY:float);

Tax = foreach A generate EMP_ID,NAME,myfuncs.tax(SALARY);

grunt> describe Tax
Tax: {EMP_ID: int,NAME: chararray,sal: float}

Note the alias name 'sal' in the above Tax schema, which we defined in the @outputSchema decorator.

The final output showing the tax of each employee:

[Screenshot: output of dump Tax]

Note: Make sure PIG_CLASSPATH includes the Jython library from the Pig installation's lib directory so that it is loaded by Pig.
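For example (the jar name and version below are assumptions; use whatever Jython jar ships in your Pig lib directory):

export PIG_CLASSPATH=$PIG_PREFIX/lib/jython-standalone-2.5.3.jar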

——————————————————————————————————————————-

Let's take a more complex scenario where we have to parse an Apache access log with the below format:

127.0.0.1 - - [07/Mar/2015:16:10:02 -0800] "GET /xxx/test.jpg HTTP/1.1" 400 6291

Here we have to extract the IP (127.0.0.1), date (07/Mar/2015), time (16:10:02), timezone (-0800) and Apache status code (400). It is very easy to parse this log with a Python script, which we can then use in our Pig script.

Python code:

import re

@outputSchema('output_field_name:tuple(IP:chararray,date:chararray,time:chararray,timezone:chararray,statuscode:int)')
def output(line):
    # groups: 1=IP, 2=date, 3=time, 4=timezone, 5=status code
    match = re.match(r'(.*)\s\S\s\S\s\[([\w/]+):([\w:]+)\s([-\w]+)\]\s"\S+\s[\w/.]+\s[\w/."]+\s(\d+).*', line)
    return (match.group(1), match.group(2), match.group(3),
            match.group(4), int(match.group(5)))


Pig Script:

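A minimal version of the script, assuming the UDF above is saved as apache_log.py and the log file is named access.log (both names are placeholders):

REGISTER 'apache_log.py' using org.apache.pig.scripting.jython.JythonScriptEngine as logfuncs;
A = load 'access.log' as (line:chararray);
B = foreach A generate logfuncs.output(line);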

dump B will give the below output:

[Screenshot: output of dump B, e.g. ((127.0.0.1,07/Mar/2015,16:10:02,-0800,400)) for the sample line above]

Pig Latin


Pig's language, Pig Latin, is used to express data flows. Using Pig Latin we pose queries against our input data and get the required answers.

Let's understand the Pig Latin operators using the below sample csv file, which has employee id, employee name, place and age:

[Screenshot: sample Employee.csv with columns EMP_ID, EMP_NAME, PLACE and AGE]

Launch PIG, as explained in the previous post, in either local or MapReduce mode.

LOAD:

LOAD is used to load data from the file system. The below command will load the data; EMP is the alias for the relation:

EMP = load 'Employee.csv' using PigStorage(',') AS (EMP_ID:int,EMP_NAME:chararray,PLACE:chararray,AGE:int);

PigStorage is the Pig function used to store or load a relation using a field delimiter; the default is the tab character. As we are using a csv file, I have given ',' as the parameter to PigStorage.
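For a tab-delimited file the delimiter argument can simply be omitted (Employee.tsv below is a hypothetical file):

EMP_TAB = load 'Employee.tsv' using PigStorage() AS (EMP_ID:int,EMP_NAME:chararray,PLACE:chararray,AGE:int);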

DESCRIBE:

The DESCRIBE operator prints a relation's schema. For instance, describe EMP will print:

grunt> describe EMP
EMP: {EMP_ID: int,EMP_NAME: chararray,PLACE: chararray,AGE: int}

DUMP:

DUMP is a diagnostic tool that prints a relation to the console, which is helpful to see the output of a relation. For example, DUMP EMP:

[Screenshot: output of DUMP EMP]

FOREACH:

FOREACH adds or removes fields to or from a relation. Now let's see how the foreach..generate operator works. Let's say we are only interested in the employee name from the relation EMP. The below command will remove all fields except the employee name:

EMP_NAME = foreach EMP generate EMP_NAME;

dump EMP_NAME will print the below result:

[Screenshot: output of dump EMP_NAME]

Let's discuss a use case where we have to find the total amount spent by each individual employee on a shopping website. Let's use the below sample Transactions.csv file:

EMP_ID,PROD_ID,AMOUNT,DATE
1,10,100.50,1/5/2015
2,20,50,2/5/2015
1,20,30,2/4/2015
1,20,45.50,3/3/2015
3,20,25,2/5/2015
4,10,30,5/2/2015
2,25,2,15/2/2015

Create a relation to load the Transactions.csv file:

TRAN = load 'Transactions.csv' using PigStorage(',') as (EMP_ID:int,PROD_ID:int,AMOUNT:float,date:chararray);

Now we have to group the transactions based on the employee id; for this we can use the Pig GROUP operator:

GRO = GROUP TRAN BY EMP_ID;
grunt> describe GRO
GRO: {group: int,TRAN: {(EMP_ID: int,PROD_ID: int,AMOUNT: float,date: chararray)}}
grunt> RES = foreach GRO generate group,SUM(TRAN.AMOUNT) AS TOT_AMT;

dump RES; will produce the below results:

(1,176.0)
(2,52.0)
(3,25.0)
(4,30.0)
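For example, employee 1 has three transactions, and 100.50 + 30 + 45.50 = 176.0.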

JOIN:

If we want to display the employee name as well, we can JOIN the EMP relation with the RES relation.

grunt> J = JOIN EMP by EMP_ID,RES BY group;
grunt> describe J
J: {EMP::EMP_ID: int,EMP::EMP_NAME: chararray,EMP::PLACE: chararray,EMP::AGE: int,RES::group: int,RES::TOT_AMT: double}
grunt> FIN_RES = foreach J generate EMP_ID,EMP_NAME,TOT_AMT;

dump FIN_RES; will print the results along with the employee name:

[Screenshot: output of dump FIN_RES]

STORE:

The DUMP operator only displays the relation's output on the console; to store it in the Hadoop file system we have to use the STORE operator.

STORE FIN_RES INTO '<output_dir>' USING PigStorage(',');
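From the grunt shell you can then inspect the stored result (<output_dir> is whatever path you chose above; the part file name depends on the job):

grunt> fs -ls <output_dir>
grunt> fs -cat <output_dir>/part-r-00000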

Pig Latin supports many operators for grouping and joining data; JOIN is only one such operator, which we discussed above. We have other operators like COGROUP, CROSS and so on, which can be used based on our requirements.
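As a quick illustration, COGROUP groups two relations by a key in a single step; a sketch using the relations defined above:

COG = COGROUP EMP BY EMP_ID, TRAN BY EMP_ID;

Each resulting tuple holds the key, a bag of the matching EMP tuples and a bag of the matching TRAN tuples.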

Apache PIG Introduction


What is PIG?

Pig is a high-level procedural language platform developed to simplify querying large data sets; it is used to analyze large amounts of data by representing them as data flows.

The language for this platform is called Pig Latin. A Pig Latin program is made up of a series of operations or transformations, which are applied to the input data to produce the desired output.

Pig Latin was written to give developers and analysts a higher-level language for writing MapReduce jobs. If you have an edge case where you need something not available in Pig, you can write a UDF (user defined function) to create that functionality. UDFs can be written in Java, Python, Ruby, or even JavaScript.

Background:

Pig was originally developed by the Yahoo! Research team but has since moved into the Apache Software Foundation.

Like actual pigs, who eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name!

Installing PIG:

Download the latest Pig from the official Apache website 'http://www.apache.org/dyn/closer.cgi/pig' and untar it on your Hadoop client machine. Update your .bashrc with the following:

export PIG_PREFIX=<PIG installation directory path>
export PATH=$PATH:$PIG_PREFIX/bin

Please check that the Pig version supports your Hadoop installation (some Pig versions only support Hadoop 1.x or 2.x). This information can be found on the Apache Pig website.
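After sourcing .bashrc you can verify the setup with:

pig -version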

 


Running PIG

PIG can be executed mainly in 2 modes: local and MapReduce mode. If you have Tez or Spark installed in your setup, PIG can also be executed on them (the PIG version should support Tez/Spark as an execution engine, so check for this before proceeding further).

To launch the PIG grunt shell in the terminal:

'pig -x local' will execute PIG in local mode, i.e. PIG will use the local filesystem.

'pig -x mapreduce' or 'pig' will execute PIG in MapReduce mode and use the Hadoop filesystem.
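'pig -x tez' or 'pig -x spark' will execute PIG on Tez or Spark respectively (Tez support was added in Pig 0.14 and Spark in Pig 0.17, so check your version).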

PIG operators:

Some of the basic PIG operators:

  • LOAD – Loads data from the file system
  • DESCRIBE – Prints a relation's schema
  • FILTER – Filters out unwanted data
  • FOREACH…GENERATE – Adds or removes fields from a relation
  • DUMP – Prints a relation to the console

The result of any operator in Pig Latin is a relation. To understand what a relation is, one needs to know about bags, tuples and fields.

Field: a single piece of data. For instance, Emp Id is a field in the below dataset:

Emp Id,Name
1,Sireesh
2,Kumar

Tuple: a tuple is just like a row in a database table. With respect to Pig we can say it's an ordered set of fields. For instance, here

(1,Sireesh)

is a tuple.

Bag: a collection of tuples. For example,

{(1,Sireesh), (2,Kumar)}
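This matches the GROUP example from the Pig Latin post above: in GRO: {group: int,TRAN: {(EMP_ID: int,PROD_ID: int,AMOUNT: float,date: chararray)}}, the inner TRAN is a bag containing one tuple per matching transaction.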

Please refer to my next post, which explains loading and transforming data using Pig Latin.