Apache Pig in Practice 1


I have written many Pig scripts over the past few months and worked out some tricks with my buddies. I hope they help someone.

This first article covers a few interesting topics to get you prepared for the later Pig rush.

IDE & Environment

Vim

I use Vim for most scripting languages, and these are my favourite plugins for writing Pig:

  • Pig syntax highlighting. Last updated in June 2014; supports Pig 0.12.
  • YouCompleteMe. The best auto-completion plugin ever. If you don’t use a Mac, Supertab is also a reasonable choice.
  • Tabularize. Aligns code and keeps Pig snippets tidy. The most common usage is :Tab/AS to align the AS keywords in a FOREACH ... GENERATE clause.

To debug more efficiently, I like to run Pig with a shortcut. Here is my simple approach: add the following to .vimrc to run the current file quickly with F5.

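" Press F5 to save the current file and run it according to its filetype.
map <F5> :call Compile_Run()<CR>
function! Compile_Run()
  if &filetype == "coffee"
    :w
    !coffee % 2>&1
  elseif &filetype == "cpp"
    :w
    " Run the binary only if compilation succeeds.
    !g++ -g -o %< % && ./%<
  elseif &filetype == "python"
    :w
    !python %
  elseif &filetype == "pig"
    :w
    !./run_pig.sh %
  endif
endfunction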

Revise run_pig.sh as you like. The general idea is to reduce repetitive work and typos.

The basic form would be:

pig -x local $1

or with default local debug settings

input=./input.txt
output=./output
rm -rf ${output?}
pig -x local -Dinput=${input} -Doutput=${output} $1

Modularize your Pig code using macros

Understanding macros

We have mixed feelings about macros, but on balance I still like them.

Macros help you organize and reuse your code. A macro in Pig is much like a macro in C: it works by substitution. Suppose you want to run the same series of operations on 3 datasets. Refactor it into a macro, and you will thank yourself later.

Understanding the $ sign is important when using macros. The $ prefix marks the names that will be substituted.

number1.txt
1
2
3
4
5
filter.marco
DEFINE filter_small_number (events, threshold) RETURNS filtered_events {
    $filtered_events = FILTER $events BY a > $threshold;
};
IMPORT 'filter.marco';

events = LOAD './data/number1.txt' AS (a:int);

big_number = filter_small_number(events, 2);

DUMP big_number;
(3)
(4)
(5)

Easy.

However, you had better not reassign an input inside a macro. Parameters are just substitution, so every change to an input variable is global:

IMPORT 'filter.marco';

events = LOAD './data/number1.txt' AS (a:int);

big_number = filter_small_number(events, 2);

DUMP big_number;
DUMP events;
DEFINE filter_small_number (events, threshold) RETURNS filtered_events {
    $filtered_events = FILTER $events BY a > $threshold;
    $events = FILTER $events BY a == 4;
};
big_number
(3)
(4)
(5)

events
(4)

You can also return multiple data sets from a macro:

DEFINE split_events (events, threshold) RETURNS big, small {
    $big = FILTER $events BY a >= $threshold;
    $small = FILTER $events BY a < $threshold;
};

events = LOAD './data/number1.txt' AS (a:int);

big_num, small_num = split_events(events, 3);

DUMP big_num;
DUMP small_num;

big
(3)
(4)
(5)
small
(1)
(2)

What’s not so cool

One reason we love macros less is that once a macro is expanded into the Pig script, error messages become harder to read and to trace back to the root cause, because the line numbers refer to the reassembled script. However, macros are still a great tool and a must-have skill for using Pig.

INPUT and OUTPUT

INPUT

  • You can load almost anything: HDFS / Avro / Protobuf / Hive / Elasticsearch / MongoDB.

OUTPUT

  • Of course, there are matching storage functions for the persistence / serialization tools above.
  • MultiStorage can store data hierarchically, which means you can partition the result while storing it. An absolute must-know feature.

Third-party Pig libraries

UDF

Once you know you can write UDFs in Python/Ruby/JS, I suppose nobody will reach for Java in common cases.
Python UDF

Unit test

PigUnit

Write unit tests to be a good citizen. Of course, Pig can and should be unit-tested. PigUnit tests are written in Java. However, the docs are limited and you might run into a lot of trouble.

Unit test a python UDF

When you test the Python script with the native unittest package, the outputSchema decorator causes an error, because it is not defined outside Pig. One fix is to add Pig support to the Python script; the other is to disable the outputSchema decorator. Here we show the second trick: put this snippet at the top of the UDF.

if __name__ != '__lib__':
    # Outside Pig: define outputSchema as a decorator that does nothing.
    def outputSchema(dont_care):
        def wrapper(func):
            def inner(*args, **kwargs):
                return func(*args, **kwargs)
            return inner
        return wrapper

This block lets you test a UDF that carries the outputSchema decorator. __name__ is set to '__lib__' when the script is called by Pig, so the block has no effect when the script runs as a Pig UDF.
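
To see how the shim and a test fit together, here is a minimal sketch. The UDF to_upper, its schema string, the test class, and the run_tests helper are all hypothetical names made up for illustration. In a real project the tests would more likely live in their own file and import the UDF module.

```python
import unittest

if __name__ != '__lib__':
    # Not running under Pig: define a stand-in decorator that ignores the schema.
    def outputSchema(dont_care):
        def wrapper(func):
            def inner(*args, **kwargs):
                return func(*args, **kwargs)
            return inner
        return wrapper


# Hypothetical UDF, used only for illustration.
@outputSchema('word:chararray')
def to_upper(word):
    """Return the upper-cased word, passing None through unchanged."""
    if word is None:
        return None
    return word.upper()


class ToUpperTest(unittest.TestCase):
    def test_upper_cases_text(self):
        self.assertEqual(to_upper('pig'), 'PIG')

    def test_none_passes_through(self):
        self.assertIsNone(to_upper(None))


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(ToUpperTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == '__main__':
    run_tests()
```

Running the file directly with python executes both tests. Under Pig neither the shim nor the test run is triggered, because __name__ is '__lib__' there.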

