I have written many Pig scripts over the past few months and explored some tricks with my buddies. I hope they can help someone.
Let's focus on some interesting topics in this first article and get prepared for the later Pig rush.
IDE & Environment
Vim
I use Vim to write most scripting languages, and these are my favourite plugins for writing Pig:
- Pig Syntax Highlight. Last updated in Jun 2014; supports Pig 0.12.
- YouCompleteMe. The best auto-complete plugin ever. If you don't use a Mac, Supertab is also a reasonable choice.
- Tabularize. Aligns code and keeps Pig snippets tidy. The most common usage is :Tab/AS to align a FOREACH ... GENERATE clause.
To debug faster, I like to run Pig with a shortcut. Here is my simple approach: add the following to .vimrc for a quick run with F5.
map <F5> :call Compile_Run()<CR>
function! Compile_Run()
  " Save the buffer, then run it with the tool that matches its filetype.
  if &filetype=="coffee"
    :w
    !coffee % 2>&1
  elseif &filetype=="cpp"
    :w
    !g++ -g -o %< %; ./%<
  elseif &filetype=="python"
    :w
    !python %
  elseif &filetype=="pig"
    :w
    !./run_pig.sh %
  endif
endfunction
Revise run_pig.sh as you like. The general idea is to reduce redundant work and typos.
The basic form would be:
pig -x local $1
or, with default local debug settings:
input=./input.txt
output=./output
rm -rf ${output?}
pig -x local -Dinput=${input} -Doutput=${output} $1
Modularize your Pig code using macros
Understanding macros
We have mixed feelings about macros, but I still love them.
Macros help you organize and reuse your code. A macro in Pig is much like a macro in C: it performs substitution. Imagine you want to run the same series of operations on 3 datasets... Try refactoring it into a macro, and you will absolutely thank yourself later.
Understanding the $ sign is important when using macros. $ marks the variables to be replaced.
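To get a feel for what "substitution" means here, this is a rough analogy in Python using string.Template. It is only a sketch of the idea, not how Pig expands macros internally; the macro body and the alias names (numbers, big_number) are made up for illustration.

```python
from string import Template

# Hypothetical macro body. $filtered_events, $events and $threshold play the
# role of the return alias and the parameters of a Pig macro.
MACRO_BODY = Template("$filtered_events = FILTER $events BY a >= $threshold;")


def expand(body, **bindings):
    """Replace every $name in the body with the text bound to it."""
    return body.substitute(**bindings)


# "Calling" the macro just pastes the caller's names into the body.
expanded = expand(
    MACRO_BODY,
    filtered_events="big_number",
    events="numbers",
    threshold="3",
)
print(expanded)
```

The printed statement contains no $ at all: each placeholder has been swapped for the name the caller passed in.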
1
DEFINE filter_small_number (events, threshold) RETURNS filtered_events {
IMPORT 'filter.marco';
(3)
Easy.
However, you'd better not modify an input inside a macro. Macros are just substitution, so every change to an input variable is global.
IMPORT 'filter.marco';
DEFINE filter_small_number (events, threshold) RETURNS filtered_events {
big_number
You can also return multiple datasets from a macro:
DEFINE split_events (events, threshold) RETURNS big, small {
$big = FILTER $events BY a >= $threshold;
$small = FILTER $events BY a < $threshold;
};
events = LOAD './data/number1.txt' AS (a:int);
big_num, small_num = split_events(events, 3);
DUMP big_num;
DUMP small_num;
big
What’s not so cool
One reason we love macros less is that once a macro is expanded into the Pig script, error messages become harder to read and trace back to the root cause, because the line numbers refer to the reassembled Pig script. However, macros are still a great tool and a must-have skill for using Pig.
INPUT and OUTPUT
INPUT
- You can load almost anything: HDFS / Avro / Protobuf / Hive / Elasticsearch / MongoDB.
OUTPUT
- Of course, there are matching storage functions for the persistence / serialization tools above.
- MultiStorage helps you store data hierarchically, which means you can partition the result while storing it. An absolute must-know feature.
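To picture what partitioning the result looks like, here is a small Python model of the idea: rows are grouped under one directory per value of a chosen field. It is only an illustration and does not call MultiStorage or reproduce its exact file naming; the rows and the output root are made up.

```python
from collections import defaultdict


def partition_rows(rows, key_index, root="output"):
    """Group tab-separated lines under <root>/<key>, keyed by one field."""
    partitions = defaultdict(list)
    for row in rows:
        key = str(row[key_index])
        line = "\t".join(str(field) for field in row)
        partitions[f"{root}/{key}"].append(line)
    return dict(partitions)


# Hypothetical sample rows: (date, event, count).
rows = [
    ("2014-06-01", "click", 3),
    ("2014-06-01", "view", 7),
    ("2014-06-02", "click", 1),
]

# Partition on the first field, so each date gets its own directory.
layout = partition_rows(rows, key_index=0)
for directory in sorted(layout):
    print(directory, layout[directory])
```

Each date ends up in its own directory, so a later job can load a single day without scanning everything.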
Third party Pig library
- piggybank
- DataFu from LinkedIn
- Elephant Bird from Twitter
- HCatalog
UDF
Once you know you can write UDFs in Python/Ruby/JS, I suppose nobody will reach for Java in common cases.
Python UDF
Unit test
PigUnit
Write unit tests to be a good man. Of course, Pig can and should be unit-tested. The PigUnit backbone is supported in Java. However, the docs are limited and you might run into a lot of trouble.
Unit test a python UDF
When you use the native unittest package to test the Python script, outputSchema will complain. One way around this is to add Pig support to the Python script; the other is to disable the outputSchema annotation. Here we show the second trick: put this snippet at the top of the UDF.
if __name__ != '__lib__':
This block makes it possible to test a UDF that carries the outputSchema annotation. __name__ is set to '__lib__' when the script is called by Pig, so the block has no effect when the script runs as a Pig UDF.
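Here is a minimal sketch of how the whole trick could look. The no-op outputSchema stand-in is my assumption about what goes inside the guard, and double_it is a hypothetical UDF used only for the example; the '__lib__' check is taken from the snippet above.

```python
import unittest

# Assumption: under Pig, __name__ is '__lib__' and outputSchema is provided
# by Pig itself, so this stand-in is defined only when running outside Pig.
if __name__ != '__lib__':
    def outputSchema(schema):
        """No-op replacement: return the decorated function unchanged."""
        def decorator(func):
            return func
        return decorator


@outputSchema('doubled:int')
def double_it(value):
    """Hypothetical UDF: double an int, passing nulls through."""
    if value is None:
        return None
    return value * 2


class DoubleItTest(unittest.TestCase):
    def test_doubles_value(self):
        self.assertEqual(double_it(3), 6)

    def test_passes_null_through(self):
        self.assertIsNone(double_it(None))


def run_tests():
    """Run the test case and report whether every test passed."""
    suite = unittest.TestLoader().loadTestsFromTestCase(DoubleItTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

In a real project the test class would more likely live in its own file and import the UDF module; it is kept in one place here only so the example is self-contained.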
References
Comparing Pig Latin and SQL for Constructing Data Processing Pipelines, by Alan Gates, Pig architect at Yahoo.
Programming Pig, also by Alan Gates.
Pig Design Pattern