Apache PIG Introduction

What is PIG?

Pig is a high-level platform, built around a procedural language, that was developed to simplify querying large data sets. It analyzes large amounts of data by representing the processing as data flows.

The language for this platform is called Pig Latin. A Pig Latin program is made up of a series of operations or transformations, which are applied to the input data to produce the desired output.

Pig Latin was written to give developers and analysts a higher-level language for writing MapReduce jobs. If you have an edge case where you need something that is not available in Pig, you can write a UDF (user-defined function) to create that functionality. UDFs can be written in Java, Python, Ruby, or even JavaScript.
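
As a rough illustration of the UDF idea, here is a minimal sketch of a Python UDF. The function name to_upper, the file name myudfs.py and the alias myudfs are hypothetical. The outputSchema decorator is normally made available by Pig when the script is registered through Jython; the fallback definition below is only an assumption-level stand-in so that the file also runs under plain Python.

```python
# myudfs.py (hypothetical file name)
#
# Possible use from Pig Latin, assuming a relation emps with a name field:
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   upper = FOREACH emps GENERATE myudfs.to_upper(name);

try:
    outputSchema  # provided by Pig when the script is registered via Jython
except NameError:
    # Stand-in for running outside Pig: just record the schema on the function.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("name_upper:chararray")
def to_upper(name):
    """Return the upper-cased name, or None when the input is null."""
    if name is None:
        return None
    return name.upper()
```

The null check matters because Pig passes missing values to a UDF as None, and calling upper() on None would fail.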

Background:

Pig was originally developed by the Yahoo Research team and has since moved to the Apache Software Foundation.

Like actual pigs, which eat almost anything, the Pig programming language is designed to handle any kind of data, hence the name!

Installing PIG:

Download the latest Pig release from the official Apache website ‘http://www.apache.org/dyn/closer.cgi/pig’ and untar it on your Hadoop client machine. Update your .bashrc with the following:

export PIG_PREFIX=/path/to/pig   # replace with your Pig installation directory
export PATH=$PATH:$PIG_PREFIX/bin

Check which Pig version supports your Hadoop installation (some Pig versions support only Hadoop 1.x or only 2.x). This information is available on the Apache Pig website.

Click here for complete information on installation.

Running PIG

Pig runs mainly in two modes: local and MapReduce. If Tez or Spark is installed in your setup, Pig can also use it as the execution engine. Check first that your Pig version supports Tez or Spark as an execution engine.

To launch the Pig Grunt shell from the terminal:

‘pig -x local’ runs Pig in local mode, i.e. Pig uses the local filesystem.

‘pig -x mapreduce’, or simply ‘pig’, runs Pig in MapReduce mode and uses the Hadoop filesystem.

PIG operators:

Some of the basic PIG operators:

  • LOAD – Loads data from the file system
  • DESCRIBE – Prints a relation’s schema
  • FILTER – Keeps the tuples that satisfy a condition, filtering out unwanted data
  • FOREACH…GENERATE – Adds or removes fields from a relation
  • DUMP – Prints a relation to the console
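
To give a feel for how these operators chain into a data flow, the sketch below mimics LOAD, FILTER, FOREACH…GENERATE and DUMP in plain Python on a two-row employee sample. It is only an analogy, not how Pig executes anything. The helper names (load, filter_by, foreach_generate) and the in-memory CSV text are made up for illustration. Roughly equivalent Pig Latin statements appear in the comments, with a hypothetical file name emp.csv.

```python
import csv
import io

# Stand-in for a file on disk (hypothetical sample data).
EMP_CSV = "1,Sireesh\n2,Kumar\n"


def load(text):
    # emps = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray);
    return [(int(row[0]), row[1]) for row in csv.reader(io.StringIO(text))]


def filter_by(relation, predicate):
    # kept = FILTER emps BY id > 1;
    return [t for t in relation if predicate(t)]


def foreach_generate(relation, projection):
    # names = FOREACH kept GENERATE name;
    return [projection(t) for t in relation]


emps = load(EMP_CSV)
kept = filter_by(emps, lambda t: t[0] > 1)
names = foreach_generate(kept, lambda t: (t[1],))

# DUMP names;
for t in names:
    print(t)
```

Each step takes a relation and returns a new one, which is the property that lets Pig Latin statements be chained one after another.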

The result of any operator in Pig Latin is a relation. A relation is a bag of tuples, so to understand it one needs to know what a field, a tuple and a bag are.

Field: A field is a piece of data. For instance, Emp Id is a field in the dataset below:

Emp Id,Name
1,Sireesh
2,Kumar

Tuple: A tuple is just like a row in a database table. In Pig terms, it is an ordered set of fields. For instance, here

(1,Sireesh)

is a tuple.

Bag: A bag is a collection of tuples. For example,

{(1,Sireesh), (2,Kumar)}
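
The three terms can be pictured with ordinary Python values. This is a loose analogy rather than Pig's actual implementation: a Python list stands in for a bag, and the variable names are made up for illustration.

```python
# A field is a single piece of data.
emp_id = 1
name = "Sireesh"

# A tuple is an ordered set of fields, like one row of a table.
row = (emp_id, name)

# A bag is a collection of tuples. A list is used as a stand-in here
# because, like a Pig bag, it may hold duplicate tuples.
bag = [(1, "Sireesh"), (2, "Kumar")]

# Fields can be addressed by position, much like $0 and $1 in Pig Latin.
all_names = [t[1] for t in bag]
print(all_names)
```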

Please refer to my next post, which explains loading and transforming data using Pig Latin.
