java - Cascading GroupBy example


I have a Cascading job. I've never had any experience with distributed systems before, so I'm having trouble understanding how to make it work.

I have a config file that has a bunch of buckets:

bucket {
  bucket_name: "x"
  input_path: "s3://..."
  key_column: 1
  value_column: 2
  multivalue: false
  default_value:
  type_column: int
}
...

Basically, I have to use it to collect a bunch of files (each of them a TSV table that maps URL keys to a value) and group them by URL.

So basically, this is how the outline looks:

a --> |group |
b --> |by    |--> output
c --> |url   |

I was wondering if the following logic is right: 1) I need to create a tap for each of the buckets, i.e.

Tap inputTap = new GlobHfs(new TextLine(), bucket.getInputPath());

2) I need to create an Each pipe out of each of these pipes (this is the part I am unsure about: do I need an Each pipe, and what should the filter/function be?). Right now, I have created an Each pipe that splits the lines on tabs.

RegexSplitGenerator splitter = new RegexSplitGenerator("\t");
Pipe tokenizedPipe = new Each(bucket.getBucketName(), new Fields("line"), splitter);

3) Create a GroupBy pipe that combines all of these tokenized pipes together. I'm not precisely sure how to force the GroupBy pipe to select the key columns, but the technique I'm using right now is:

Pipe finalPipe = new GroupBy("output pipe", inputPipes, groupFields);

So is this the correct logic to approach this problem? Or are any of the steps redundant or incorrect?

Thank you!

Your approach looks good to me. Step 2 can be skipped if you split the records when the tap reads the input files.

Tap inputTap = new Hfs(new TextDelimited("\t"), inputPath);

The "Cascading for the Impatient" tutorial covers how to implement what you want to achieve; it's worth taking a look at it.
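To make the expected result concrete, here is a rough plain-Java sketch (no Cascading involved) of what "merge several TSV sources and group by URL" should produce. It is only meant for checking the output of the real flow against small inputs. The class name `UrlGroupingSketch`, the method `groupByUrl`, the bucket names and the sample rows are all made up for illustration, and it assumes the key is the first tab-separated column and the value the second, which may not match how `key_column` and `value_column` are indexed in the config.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Cascading: mimics the result of
// merging several tab-separated sources and grouping them by URL.
class UrlGroupingSketch {

    /**
     * bucketLines maps a bucket name to its raw TSV lines.
     * Assumption: column 0 is the URL key, column 1 is the value.
     */
    public static Map<String, List<String>> groupByUrl(Map<String, List<String>> bucketLines) {
        // TreeMap keeps the URLs sorted, similar to how grouped keys arrive in order.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> bucket : bucketLines.entrySet()) {
            for (String line : bucket.getValue()) {
                String[] cols = line.split("\t", -1);
                if (cols.length < 2) {
                    continue; // skip lines that do not have both a key and a value
                }
                // Tag each value with the bucket it came from.
                grouped.computeIfAbsent(cols[0], k -> new ArrayList<>())
                       .add(bucket.getKey() + "=" + cols[1]);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        buckets.put("a", Arrays.asList("http://x.example\t1", "http://y.example\t2"));
        buckets.put("b", Arrays.asList("http://x.example\t10"));
        groupByUrl(buckets).forEach((url, values) -> System.out.println(url + " -> " + values));
    }
}
```

In the real flow the same grouping is done by the GroupBy pipe across all input pipes; this sketch is just a single-machine reference to compare against.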