# Apache Pig in Practice 1

I have written many Pig scripts over the past few months and worked out some tricks with my buddies. I hope they help someone.

This first article covers a few interesting topics to get you ready for the Pig rush in the later ones.

## IDE & Environment

### Vim

I use Vim for most scripting languages, and these are my favourite plugins for writing Pig:

• Pig Syntax Highlight. Last updated in June 2014; supports Pig 0.12.
• YouCompleteMe. The best auto-complete plugin ever. If you don’t use a Mac, Supertab is also a reasonable choice.
• Tabularize. Aligns Pig code and keeps it clean. The most common usage is :Tab/AS to align a FOREACH ... GENERATE clause.

To debug faster, I like to run Pig with a keyboard shortcut. My approach is simple: add a mapping to .vimrc for a quick run with F5.

Revise run_pig.sh as you like. The general idea is to cut down on redundant work and typos. It has a basic form and a variant with default local debug settings.
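The contents of run_pig.sh are not shown here, so the following is only a sketch of the same idea, written in Python rather than shell. The function name `build_pig_command`, the script name `wordcount.pig`, and the `input` parameter are all made up for illustration. The only Pig options used are `-x local` (local mode) and `-param name=value`.

```python
def build_pig_command(script, local=False, params=None):
    """Assemble the argument list for one `pig` invocation (hypothetical helper)."""
    cmd = ["pig"]
    if local:
        # -x local runs against the local file system instead of the cluster
        cmd += ["-x", "local"]
    for name, value in sorted((params or {}).items()):
        # -param supplies a value for a $name placeholder in the script
        cmd += ["-param", "{}={}".format(name, value)]
    cmd.append(script)
    return cmd


# Basic form: just run the script.
print(" ".join(build_pig_command("wordcount.pig")))

# Local debug form: local mode plus a small sample input (assumed path).
print(" ".join(build_pig_command("wordcount.pig", local=True,
                                 params={"input": "data/sample.tsv"})))
```

Keeping the flags in one wrapper means the F5 mapping only has to pass the current file name, which is where most of the typos come from.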

## Modularize your Pig code using Macros

### Understanding Macros

We have mixed feelings about macros, but on balance I still love them.

Macros help you organize and reuse your code. A macro in Pig is much like a macro in C: both work by substitution. Suppose you need to run the same series of operations on three datasets. Refactor it into a macro, and you will thank yourself later.

Understanding the $ sign is important when using macros. The $ prefix marks the variables that get replaced when the macro is expanded.
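To make the substitution idea concrete, here is a toy model in Python. It uses `string.Template`, which happens to share the `$name` placeholder syntax. This is not how Pig implements macros, and real expansion does more than this. The macro name `keep_active`, its parameters, and the aliases are all hypothetical.

```python
from string import Template

# The body of a hypothetical Pig macro, roughly:
#   DEFINE keep_active(source, min_clicks) RETURNS result { ... };
# $source, $min_clicks and $result are placeholders; `filtered` is an
# ordinary alias that lives inside the macro body.
MACRO_BODY = Template(
    "filtered = FILTER $source BY clicks >= $min_clicks;\n"
    "$result = FOREACH filtered GENERATE user, clicks;"
)


def expand(source, min_clicks, result):
    """Replace every $placeholder with the caller's alias or value."""
    return MACRO_BODY.substitute(
        source=source, min_clicks=min_clicks, result=result
    )


# Roughly what `active_users = keep_active(web_logs, 10);` would turn into.
print(expand("web_logs", 10, "active_users"))
```

After expansion no `$` is left: the caller's own alias names appear directly in the statements, which is why a macro can affect aliases outside itself.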

Easy.

However, you’d better not modify an input alias within a macro. Macros are just substitution, so every change to an input variable is global.

You can also return multiple datasets from a macro.

### What’s not so cool

One reason we love macros less is that once a macro has been expanded into the script, error messages become harder to read and the root cause harder to find, because the line numbers refer to the reassembled Pig script. Still, macros are a great tool and a must-have skill for anyone using Pig.

# INPUT and OUTPUT

## INPUT

• You can load almost anything: HDFS / Avro / Protobuf / Hive / Elasticsearch / MongoDB.

## OUTPUT

• Of course, there are matching storage functions for the persistence / serialization tools above.
• MultiStorage lets you store data hierarchically, which means you can partition the results as you store them. It is an absolute must-know feature.

# UDF

Once you know that you can write UDFs in Python/Ruby/JS, I suppose nobody will reach for Java in the common cases.
Python UDF

# Unit test

## PigUnit

Write unit tests to be a good citizen. Of course, Pig scripts can and should be unit-tested. PigUnit itself is driven from Java. However, the docs are limited and you might run into a lot of trouble.

## Unit test a Python UDF

When you use the native unittest package to test the Python script, the outputSchema decorator will complain, because it is only defined when the script runs under Pig. One fix is to add Pig support to the Python script; the other is to disable the outputSchema decorator. Here we show the second trick: put this snippet at the top of the UDF.

This block exists so that a UDF carrying the outputSchema decorator can be tested. __name__ is set to ‘__lib__’ when the script is called by Pig, so the block has no effect when the script runs as a Pig UDF.
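A minimal sketch of that trick might look like the following. It assumes, as described above, that Pig's Jython engine sets `__name__` to `'__lib__'` and supplies `outputSchema` itself. The UDF `normalize_word` and its test case are made up for illustration. In a real project the tests would normally live in a separate module that imports the UDF file.

```python
import unittest

# Assumption: under Pig, __name__ == '__lib__' and outputSchema is already
# provided, so this stand-in is only defined when running outside Pig.
if __name__ != "__lib__":
    def outputSchema(schema):
        """No-op stand-in: ignore the schema string, return the function as is."""
        def decorator(func):
            return func
        return decorator


@outputSchema("word:chararray")
def normalize_word(word):
    """Hypothetical UDF: trim and lower-case a word, passing nulls through."""
    if word is None:
        return None
    return word.strip().lower()


class NormalizeWordTest(unittest.TestCase):
    def test_trims_and_lowercases(self):
        self.assertEqual(normalize_word("  Pig "), "pig")

    def test_passes_none_through(self):
        self.assertIsNone(normalize_word(None))
```

Because the stand-in returns the function unchanged, the tests exercise exactly the code Pig will call. Run them with `python -m unittest` against the file.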

# References

• Comparing Pig Latin and SQL for Constructing Data Processing Pipelines, by Alan Gates, Pig architect at Yahoo.
• Programming Pig, also by Alan Gates.
• Pig Design Pattern
