I have a Cascading job to write. I've never had any experience with distributed systems before, and I'm having trouble understanding how to make this work.
I have a config file that describes a bunch of buckets:

    bucket {
      bucket_name: "x"
      input_path: "s3://..."
      key_column: 1
      value_column: 2
      multivalue: false
      default_value:
      type_column: int
    }
    ...
Basically, I have to use it to collect a bunch of files (each of them a TSV table that maps URL keys to values) and group them by URL.
So basically, this is how the outline looks:

    a --> | group |
    b --> | by    | --> output
    c --> | url   |
I'm wondering if the following logic is right: 1) I need to create a tap for each of the buckets, i.e.
    Tap inputTap = new GlobHfs(new TextLine(), bucket.getInputPath());
2) I need to create an Each pipe out of each of these taps (this is the part I'm unsure about: what should each pipe do, and should it be a filter or a function?). Right now, for each bucket I have created a pipe that splits lines on tabs:
    RegexSplitGenerator splitter = new RegexSplitGenerator("\t");
    Pipe tokenizedPipe = new Each(bucket.getBucketName(), new Fields("line"), splitter);
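Outside of Cascading, the per-record work of step 2 amounts to splitting a tab-delimited line and picking out the key and value columns named in the bucket config. A minimal plain-Java sketch (the `keyColumn`/`valueColumn` parameters mirror the config's 1-based `key_column`/`value_column` fields; the class and method names are hypothetical):

```java
// Sketch of the per-line split that step 2 performs on each record.
public class LineSplitSketch {

    // Split a TSV line into its columns; -1 keeps trailing empty columns.
    static String[] splitLine(String line) {
        return line.split("\t", -1);
    }

    // Select the key and value columns (1-based, as in the bucket config).
    static String[] keyValue(String line, int keyColumn, int valueColumn) {
        String[] cols = splitLine(line);
        return new String[] { cols[keyColumn - 1], cols[valueColumn - 1] };
    }

    public static void main(String[] args) {
        String[] kv = keyValue("http://example.com\t42\textra", 1, 2);
        System.out.println(kv[0] + " -> " + kv[1]); // http://example.com -> 42
    }
}
```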
3) I create a GroupBy pipe that combines all of these tokenized pipes together. I'm not precisely sure how to force the GroupBy pipe to select the key columns; the technique I'm using right now is:
    Pipe finalPipe = new GroupBy("output pipe", inputPipes, groupFields);
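For intuition, what the GroupBy in step 3 should produce is the (key, value) records from all the buckets, collected under their shared URL key. A plain-Java sketch of that result, assuming the records have already been reduced to key/value pairs (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the result step 3 should produce: records from several
// buckets, grouped by their URL key.
public class GroupBySketch {

    public static Map<String, List<String>> groupByUrl(List<String[]> records) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : records) {
            // kv[0] is the URL key, kv[1] is the value from some bucket.
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] { "http://a", "1" },  // from bucket x
            new String[] { "http://b", "2" },  // from bucket y
            new String[] { "http://a", "3" }); // from bucket y
        System.out.println(groupByUrl(records));
        // {http://a=[1, 3], http://b=[2]}
    }
}
```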
So is this logic the correct approach to the problem? Or are some of the steps redundant or incorrect?

Thank you!
Your thinking looks right to me. Step 2 can be skipped if you split the records while the tap is reading the input files:
    Tap inputTap = new Hfs(new TextDelimited(new Fields("key", "value"), "\t"), inputPath);
The Cascading for the Impatient tutorial covers how to implement what you want to achieve; it's worth taking a look at it.