I have a Cascading job to write. I've never had any experience with distributed systems before, and I'm having trouble understanding how to make this work.
I have a config file that describes a bunch of buckets:

    bucket {
      bucket_name: "x"
      input_path: "s3://..."
      key_column: 1
      value_column: 2
      multivalue: false
      default_value:
      type_column: int
    }
    ...
Basically, I have to use it to collect a bunch of files (each of them a TSV table that maps URL keys to values) and group them by URL.
So basically, this is how the outline looks:

    a --> | group |
    b --> | by    | --> output
    c --> | url   |
I'm wondering if the following logic is right: 1) I need to create a tap for each of the buckets, i.e.
    Tap inputTap = new GlobHfs(new TextLine(), bucket.getInputPath());
2) I need to create an Each pipe out of each of these taps (this is the part I'm unsure about: what should each pipe do, and should it be a filter or a function?). Right now, for each bucket I have created a pipe that splits lines on tabs:
    RegexSplitGenerator splitter = new RegexSplitGenerator("\t");
    Pipe tokenizedPipe = new Each(bucket.getBucketName(), new Fields("line"), splitter);
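Outside of Cascading, the per-record work of step 2 amounts to splitting a tab-delimited line and picking out the key and value columns named in the bucket config. A minimal plain-Java sketch (the `keyColumn`/`valueColumn` parameters mirror the config's 1-based `key_column`/`value_column` fields; the class and method names are hypothetical):

```java
// Sketch of the per-line split that step 2 performs on each record.
public class LineSplitSketch {

    // Split a TSV line into its columns; -1 keeps trailing empty columns.
    static String[] splitLine(String line) {
        return line.split("\t", -1);
    }

    // Select the key and value columns (1-based, as in the bucket config).
    static String[] keyValue(String line, int keyColumn, int valueColumn) {
        String[] cols = splitLine(line);
        return new String[] { cols[keyColumn - 1], cols[valueColumn - 1] };
    }

    public static void main(String[] args) {
        String[] kv = keyValue("http://example.com\t42\textra", 1, 2);
        System.out.println(kv[0] + " -> " + kv[1]); // http://example.com -> 42
    }
}
```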
3) I create a GroupBy pipe that combines all of these tokenized pipes together. I'm not precisely sure how to force the GroupBy pipe to select the key columns; the technique I'm using right now is:
    Pipe finalPipe = new GroupBy("output pipe", inputPipes, groupFields);
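For intuition, what the GroupBy in step 3 should produce is the (key, value) records from all the buckets, collected under their shared URL key. A plain-Java sketch of that result, assuming the records have already been reduced to key/value pairs (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the result step 3 should produce: records from several
// buckets, grouped by their URL key.
public class GroupBySketch {

    public static Map<String, List<String>> groupByUrl(List<String[]> records) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : records) {
            // kv[0] is the URL key, kv[1] is the value from some bucket.
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] { "http://a", "1" },  // from bucket x
            new String[] { "http://b", "2" },  // from bucket y
            new String[] { "http://a", "3" }); // from bucket y
        System.out.println(groupByUrl(records));
        // {http://a=[1, 3], http://b=[2]}
    }
}
```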
So is this logic the correct approach to the problem? Or are some of the steps redundant or incorrect?

Thank you!
Your thinking looks right to me. Step 2 can be skipped if you split the records while the tap is reading the input files:
    Tap inputTap = new Hfs(new TextDelimited(new Fields("key", "value"), "\t"), inputPath);
The Cascading for the Impatient tutorial covers how to implement what you want to achieve; it's worth taking a look at it.