
Processing Multiple Files

Posted: Thu Nov 14, 2013 10:32 am
by premupdate
Hello All,

There is an existing sequencer which processes some N number of files and produces N output files daily. The number of source files is dynamic. It loops through the files one by one, producing one output file at a time. I need help with the following two concerns:

1) This process is consuming a lot of time and has to be optimized.
2) Whenever this sequence aborts partway through, the loop restarts from the first file and runs through to the Nth file again. A restartability feature has to be added to the loop so it resumes from the point of failure.

Appreciate your help on this.

Posted: Thu Nov 14, 2013 10:40 am
by chulett
1) Do all the files need to be processed individually?
2) Checkpoints in the Sequence should solve that problem.
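Sequence checkpoints are a built-in DataStage feature, but the underlying idea can be sketched generically: record each file as it completes, and skip already-completed files on restart. A minimal Python sketch of that pattern, assuming a hypothetical `process_file` job invocation and checkpoint file name:

```python
import os

CHECKPOINT = "processed.log"  # hypothetical checkpoint file


def load_checkpoint():
    """Return the set of files already processed in a previous run."""
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return set(line.strip() for line in f)


def process_file(name):
    # Placeholder for the real per-file job run.
    print("processing", name)


def run(files):
    done = load_checkpoint()
    for name in files:
        if name in done:
            continue  # already completed before the abort; skip it
        process_file(name)
        # Record success only after the file finishes, so a crash
        # mid-file causes that file to be reprocessed on restart.
        with open(CHECKPOINT, "a") as f:
            f.write(name + "\n")


run(["a.dat", "b.dat", "c.dat"])
```

On a rerun after an abort, any file already listed in the checkpoint file is skipped, which is the behaviour the Sequence checkpoint option gives you inside DataStage.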

Posted: Thu Nov 14, 2013 10:48 am
by premupdate
Craig,

1) Yes, the files have to be processed individually.
2) Thanks, I will check this and get back with the outcome.

Posted: Thu Nov 14, 2013 2:03 pm
by prasson_ibm
If you are on an MPP or grid system and the source file is fixed-width, you can improve performance by setting 'Read From Multiple Nodes'; it enables the file to be read from more than one node. Otherwise you can increase the 'Number of Readers Per Node' setting, which increases the instances of the read operator, but the file will still be read on a single node and a single CPU.

Posted: Thu Nov 14, 2013 2:30 pm
by ray.wurlod
You could use a design with more than one loop, the loops executing concurrently to process distinct subsets of files.

Posted: Thu Nov 14, 2013 7:49 pm
by premupdate
Ray,

As the number of source files is dynamic (sometimes 1 file, sometimes 60), how can I decide how many additional loops to design for?

Also, the existing process fetches data from a Teradata table (as a reference link). If I use multiple loops, the same SELECT query will be executed against the table multiple times at the same instant.
Would this lock the table for reading?

Posted: Thu Nov 14, 2013 9:50 pm
by ray.wurlod
If you have two loops, use "odds and evens". More generically, take Mod() of each filename's line number in the list by the number of loops, and assign each remainder value to one loop.
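The Mod() split above can be sketched as: the file at position i in the list goes to loop i mod k. A small Python illustration (the file names and loop count are hypothetical):

```python
def split_for_loops(filenames, num_loops):
    """Assign the file at position i to loop (i % num_loops)."""
    subsets = [[] for _ in range(num_loops)]
    for i, name in enumerate(filenames):
        subsets[i % num_loops].append(name)
    return subsets


files = ["f1.dat", "f2.dat", "f3.dat", "f4.dat", "f5.dat"]
# With two loops this is exactly the "odds and evens" split.
print(split_for_loops(files, 2))
```

Because the split is by position rather than by any property of the files, it stays balanced however many files arrive on a given day, which addresses the dynamic file count.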

Posted: Thu Nov 14, 2013 10:51 pm
by chulett
Or figure out what part of the file load is "consuming a lot of time" and optimize it. Then perhaps start some form of multi-looping if it needs further optimization.