java - Cloud Dataflow: reading entire text files rather than line by line


I'm looking for a way to read entire files, so that every file is read in its entirety into a single String. I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json and have a ParDo process each and every file in full.

What's the best approach to this?

I am going to give the most generally useful answer, though there are special cases [1] where you might do something different.

I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(&lt;source&gt;). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader does the actual reading.
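As a rough sketch of how this gets used (WholeFileSource is a hypothetical name for the subclass described below; the package names assume the Dataflow Java SDK, so adjust to your version):

    // Assumes an existing Pipeline p and the usual SDK imports
    // (com.google.cloud.dataflow.sdk.io.Read, ...values.PCollection).
    // WholeFileSource is the hypothetical FileBasedSource<String> subclass
    // sketched after the list below; each element is one whole file's contents.
    PCollection<String> wholeFiles =
        p.apply(Read.from(new WholeFileSource("gs://my_bucket/*/*.json")));
    // A downstream ParDo then sees exactly one element per file.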

I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs (a sketch of such a source follows the list):

  • FileBasedSource#isSplittable() you will want to override and return false. This indicates that there is no intra-file splitting.
  • FileBasedSource#createForSubrangeOfFile(String, long, long) you will override to return a sub-source for just the file specified.
  • FileBasedSource#createSingleFileReader() you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
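Putting the source-side overrides together, a sketch could look like this. The class name, constructors, and exact modifiers/signatures are my approximation and should be checked against the javadoc for your SDK version; WholeFileReader is the reader sketched after the next list:

    import com.google.cloud.dataflow.sdk.coders.Coder;
    import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
    import com.google.cloud.dataflow.sdk.io.FileBasedSource;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;

    // Hypothetical source: emits the entire contents of each matched file as one String.
    class WholeFileSource extends FileBasedSource<String> {

      // File-pattern form, e.g. "gs://my_bucket/*/*.json".
      // The min bundle size matters little here since we never split within a file.
      public WholeFileSource(String fileOrPatternSpec) {
        super(fileOrPatternSpec, 1L);
      }

      // Single-file form, used by createForSubrangeOfFile() below.
      private WholeFileSource(String fileName, long start, long end) {
        super(fileName, 1L, start, end);
      }

      @Override
      public boolean isSplittable() {
        // No intra-file splitting: each file is read as a single record.
        return false;
      }

      @Override
      public FileBasedSource<String> createForSubrangeOfFile(String fileName, long start, long end) {
        return new WholeFileSource(fileName, start, end);
      }

      @Override
      public FileBasedReader<String> createSingleFileReader(PipelineOptions options) {
        return new WholeFileReader(this);
      }

      @Override
      public Coder<String> getDefaultOutputCoder() {
        return StringUtf8Coder.of();
      }
    }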

To implement the reader (again, a sketch follows this list):

  • FileBasedReader#startReading(...) you override to do nothing; the framework will have already opened the file for you, and it will close it.
  • FileBasedReader#readNextRecord() you override to read the entire file as a single element.
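A matching reader sketch is below. The exact set of methods you must implement (for example getCurrentOffset(), and possibly isAtSplitPoint()) varies by SDK version, so treat the javadoc as authoritative:

    import com.google.cloud.dataflow.sdk.io.FileBasedSource;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.ReadableByteChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.NoSuchElementException;

    // Hypothetical reader: produces exactly one record per file, namely the
    // whole contents as a String.
    class WholeFileReader extends FileBasedSource.FileBasedReader<String> {

      private ReadableByteChannel channel;
      private String current;

      public WholeFileReader(FileBasedSource<String> source) {
        super(source);
      }

      @Override
      protected void startReading(ReadableByteChannel channel) {
        // The framework has already opened the file; just hold on to the channel.
        this.channel = channel;
      }

      @Override
      protected boolean readNextRecord() throws IOException {
        if (current != null) {
          // The single record (the whole file) has already been produced.
          return false;
        }
        InputStream in = Channels.newInputStream(channel);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
        }
        current = new String(out.toByteArray(), StandardCharsets.UTF_8);
        return true;
      }

      @Override
      public String getCurrent() throws NoSuchElementException {
        if (current == null) {
          throw new NoSuchElementException();
        }
        return current;
      }

      @Override
      protected long getCurrentOffset() {
        // Only one record per file, and it starts at the beginning.
        return 0;
      }
    }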

[1] One example of an easy special case is when you have a small number of files, you can expand them prior to job submission, and they all take about the same amount of time to process. Then you can just use Create.of(expand(&lt;glob&gt;)) followed by a ParDo(&lt;read the file&gt;).
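As a sketch of that special case (expand(...) stands for whatever you use in the main program to list the files matching the glob before submission, and readWholeFile(...) is a placeholder for however you fetch one file's contents, e.g. via a GCS client library):

    import com.google.cloud.dataflow.sdk.transforms.Create;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    import java.io.IOException;
    import java.util.List;

    // expand(...) is assumed to return the concrete file names matching the glob,
    // computed in the main program before the job is submitted.
    List<String> files = expand("gs://my_bucket/*/*.json");

    p.apply(Create.of(files))
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) throws IOException {
         // readWholeFile(...) is a placeholder helper that reads one file into a String.
         c.output(readWholeFile(c.element()));
       }
     }));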