Processing Multiple files

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

premupdate
Participant
Posts: 47
Joined: Thu Oct 04, 2007 3:37 am
Location: chennai

Processing Multiple files

Post by premupdate »

Hello All,

There is an existing sequencer which processes some N number of files and produces N number of outputs daily. The number of source files is dynamic. It passes through a loop one file at a time, producing one output file per iteration. Need help on the following two concerns:

1) This process is consuming a lot of time. It has to be optimized.
2) Whenever this sequence gets aborted in between, the loop starts again from the first file and runs through to the Nth file. A restartability feature has to be included in the loop.

Appreciate your help on this.
Cheers,
prem
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

1) Do all the files need to be processed individually?
2) Checkpoints in the Sequence should solve that problem.
-craig

"You can never have too many knives" -- Logan Nine Fingers
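Outside of the Sequence checkpoint feature itself, the restart idea can be sketched generically: record each file as it completes, and skip already-completed files on a rerun. A minimal Python sketch, not DataStage code; the checkpoint file path and the `process` callback are illustrative assumptions:

```python
import os

def process_files(files, checkpoint_path, process):
    """Process each file once; skip files already recorded in the checkpoint."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = set(line.strip() for line in fh)
    for name in files:
        if name in done:
            continue  # already handled in an earlier (aborted) run
        process(name)
        # Record success only after the file completes, so an abort
        # mid-file causes that file to be reprocessed on restart.
        with open(checkpoint_path, "a") as fh:
            fh.write(name + "\n")
```

On a rerun after an abort, only the files not yet listed in the checkpoint file are processed, which is the same behavior Sequence checkpoints give you inside DataStage.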
premupdate
Participant
Posts: 47
Joined: Thu Oct 04, 2007 3:37 am
Location: chennai

Post by premupdate »

Craig,

1) Yes, the files have to be processed individually.
2) Thanks, I will check this and get back on the outcome.
Cheers,
prem
prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

If you are on an MPP or grid system and the source file is fixed-width, you can improve performance by setting 'Read from multiple nodes'; it enables the file to be read from more than one node. Otherwise you can increase the 'Number of readers per node' setting; it increases the instances of the read operator, but overall the file will still be read on a single node and a single CPU.
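The reason both options require a fixed-width file is that each reader can compute its own byte offset and seek straight to its share of the records. A generic sketch of that split, not DataStage code; the record width and reader count are illustrative assumptions:

```python
def split_fixed_width(file_size, record_width, num_readers):
    """Divide a fixed-width file into per-reader (byte_offset, record_count) ranges.

    With fixed-width records, offset = record_index * record_width, so
    readers never need to scan for delimiters to find their start point.
    """
    total_records = file_size // record_width
    base, extra = divmod(total_records, num_readers)
    ranges, start = [], 0
    for r in range(num_readers):
        count = base + (1 if r < extra else 0)  # spread the remainder
        ranges.append((start * record_width, count))
        start += count
    return ranges
```

Each reader then seeks to its byte offset and reads its record count independently, which is why the stage can fan the work out across readers or nodes.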
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could use a design with more than one loop, the loops executing concurrently to process distinct subsets of files.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
premupdate
Participant
Posts: 47
Joined: Thu Oct 04, 2007 3:37 am
Location: chennai

Post by premupdate »

Ray,

As the number of source files is dynamic (sometimes 1 file, sometimes 60), how can I decide on the additional loop design?

Also, the existing process includes fetching data from a Teradata table (as a reference link). If I use multiple loops, the same select query will be executed multiple times against the table at the same instant.
Does this lock the table from being read?
Cheers,
prem
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you have two loops, use "odds and evens". More generically, use Mod() of each filename's line number in the list and the number of loops.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
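Ray's Mod() split can be sketched as follows: assign each filename to a loop based on its position in the list modulo the number of loops. A hedged Python sketch, not DataStage code; the filenames and loop count are illustrative:

```python
def partition_files(files, num_loops):
    """Split a file list into num_loops subsets by line number modulo num_loops."""
    subsets = [[] for _ in range(num_loops)]
    for line_no, name in enumerate(files, start=1):
        # Mod() of the 1-based line number and the loop count picks the subset;
        # with num_loops=2 this is exactly the "odds and evens" split.
        subsets[(line_no - 1) % num_loops].append(name)
    return subsets
```

Because the split depends only on position and loop count, it works no matter how many files arrive on a given day; a loop that receives an empty subset simply has nothing to iterate over.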
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Or figure out what part of the file load is "consuming a lot of time" and optimize it. Then perhaps start some form of multi-looping if it needs further optimization.
-craig

"You can never have too many knives" -- Logan Nine Fingers