PIG UDF using Python

As discussed in the previous post, it is sometimes necessary to write user-defined functions (UDFs) to extend Pig Latin's functionality. This post covers writing a UDF in Python. UDFs can also be written in Java, Ruby, and JavaScript.

Two steps are involved here:

  • Write the required Python script
  • Register the Python script in Pig and call the Python function

Let's take a use case in which a company needs to calculate tax at 10% of each employee's salary. In the data set below, the last column is the employee's salary, for which we have to find the tax.

cat EMP_SAL.csv
1,Ravi,10.0
2,Prashant,120.5
3,Preeti,240.0
4,Swati,80.0

Tax.py:

@outputSchema("sal:float")
def tax(num):
    # Tax is 10% of the salary passed in from Pig
    return (10 * num) / 100.0

@outputSchema is a decorator: it defines the output schema in a format that Pig understands and is able to parse. Other decorators, such as outputSchemaFunction and schemaFunction, are also supported.
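Because the decorator is supplied by Pig's Jython engine when the script is registered, Tax.py cannot be run as-is under a plain Python interpreter. A minimal sketch for checking the function locally, assuming a stand-in outputSchema that only records the schema string (the stand-in and its attribute name are hypothetical, not Pig's implementation) and assuming the tax is 10% of the salary:

```python
def outputSchema(schema):
    """Hypothetical stand-in for Pig's decorator: it only stores the schema string."""
    def wrap(func):
        func.outputSchema = schema  # attribute name chosen for illustration
        return func
    return wrap


@outputSchema("sal:float")
def tax(num):
    # Assumption: tax is 10% of the salary.
    return (10 * num) / 100.0


if __name__ == "__main__":
    # Salaries taken from EMP_SAL.csv
    for salary in (10.0, 120.5, 240.0, 80.0):
        print(salary, tax(salary))
    print(tax.outputSchema)
```

The stand-in keeps the function body identical to what Pig would register, so the arithmetic can be verified before involving a cluster.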

PIG script:

REGISTER 'Tax.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;

A = load 'EMP_SAL.csv' using PigStorage(',') as (EMP_ID:int,NAME:chararray,SALARY:float);

Tax = foreach A generate EMP_ID,NAME,myfuncs.tax(SALARY);

describe Tax;
Tax: {EMP_ID: int,NAME: chararray,sal: float}

Note the alias 'sal' in the 'Tax' schema above, which we defined in the @outputSchema decorator.

The final output showing the tax of each employee:

[Image: UDF output]
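To sanity-check the expected values without a Pig installation, the load and foreach steps can be emulated in plain Python. This is only a sketch: it assumes the tax is 10% of the salary, the helper name apply_tax is hypothetical, and Pig's own float formatting may differ from Python's.

```python
import csv
import io

# Contents of EMP_SAL.csv from the post, inlined so the sketch is self-contained
EMP_SAL = """1,Ravi,10.0
2,Prashant,120.5
3,Preeti,240.0
4,Swati,80.0
"""


def tax(num):
    # Assumption: tax is 10% of the salary.
    return (10 * num) / 100.0


def apply_tax(text):
    """Mimic: foreach A generate EMP_ID, NAME, myfuncs.tax(SALARY)."""
    rows = []
    for emp_id, name, salary in csv.reader(io.StringIO(text)):
        rows.append((int(emp_id), name, tax(float(salary))))
    return rows


if __name__ == "__main__":
    for row in apply_tax(EMP_SAL):
        print(row)
```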

Note: Make sure PIG_CLASSPATH is exported and points to the lib directory of the Pig installation, so that Pig loads the Jython library file.

——————————————————————————————————————————-

Let's take a more complex scenario in which we have to parse an Apache log in the following format:

127.0.0.1 - - [07/Mar/2015:16:10:02 -0800] "GET /xxx/test.jpg HTTP/1.1" 400 6291

Here we have to extract the IP address (127.0.0.1), date (07/Mar/2015), time (16:10:02), timezone (-0800), and Apache status code (400). The log is easy to parse with a Python script, which we can then use in our Pig script.

Python code:

import re

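@outputSchema('output_field_name:tuple(IP:chararray,date:chararray,time:chararray,timezone:chararray,statuscode:int)')
def output(line):
    # Capture groups, in order: IP, date, time, timezone, status code
    match = re.match(r'(.*)\s\S\s\S\s\[([\w/]+):([\w:]+)\s([-\w]+)\]\s"[\S]+\s[\w/.]+\s[\w/."]+\s([\d]+).*', line)
    if match is None:
        return None  # line does not follow the expected log format
    # statuscode is declared as int in the schema, so convert it
    return (match.group(1), match.group(2), match.group(3), match.group(4), int(match.group(5)))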
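The pattern can be tried against the sample log line in plain Python before registering the script in Pig. In this sketch the names LOG_PATTERN and parse_line are illustrative, and the brackets are written as \[ and \], which match the same characters as the [[] and []] classes:

```python
import re

# Same pattern as the UDF, split over two raw-string literals for readability
LOG_PATTERN = re.compile(
    r'(.*)\s\S\s\S\s\[([\w/]+):([\w:]+)\s([-\w]+)\]\s'
    r'"[\S]+\s[\w/.]+\s[\w/."]+\s([\d]+).*'
)


def parse_line(line):
    """Return (ip, date, time, timezone, status), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, date, time, timezone, status = match.groups()
    return (ip, date, time, timezone, int(status))


SAMPLE = '127.0.0.1 - - [07/Mar/2015:16:10:02 -0800] "GET /xxx/test.jpg HTTP/1.1" 400 6291'

if __name__ == "__main__":
    print(parse_line(SAMPLE))
    # → ('127.0.0.1', '07/Mar/2015', '16:10:02', '-0800', 400)
```

Returning None for a line that does not match avoids an AttributeError on match.group when the log contains malformed entries.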

Pig Script:

[Image: apa_python (Pig script)]

Running dump B gives the following output:

[Image: dump_python (output of dump B)]
