Processing Multiple files

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

premupdate
Participant
Posts: 47
Joined: Thu Oct 04, 2007 3:37 am
Location: chennai

Processing Multiple files

Post by premupdate »

Hello All,

There is an existing sequencer which processes some N number of files and produces N number of outputs daily. The number of source files is dynamic. It passes through a loop one file at a time, producing one output file per iteration. Need help on the following two concerns:

1) This process is consuming a lot of time. It has to be optimized.
2) Whenever this sequence gets aborted in between, the loop starts again from the first file and runs through to the Nth file. A restartability feature has to be included in the loop.

Appreciate your help on this.
Cheers,
prem
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

1) Do all the files need to be processed individually?
2) Checkpoints in the Sequence should solve that problem.
-craig

"You can never have too many knives" -- Logan Nine Fingers
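Outside of the Sequence checkpoint feature itself, the restart idea can be sketched generically: record each file as it completes, and skip already-completed files on a rerun. A minimal Python sketch, not DataStage code; the checkpoint file path and the `process` callback are illustrative assumptions:

```python
import os

def process_files(files, checkpoint_path, process):
    """Process each file once; skip files already recorded in the checkpoint."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = set(line.strip() for line in fh)
    for name in files:
        if name in done:
            continue  # already handled in an earlier (aborted) run
        process(name)
        # Record success only after the file completes, so an abort
        # mid-file causes that file to be reprocessed on restart.
        with open(checkpoint_path, "a") as fh:
            fh.write(name + "\n")
```

On a rerun after an abort, only the files not yet listed in the checkpoint file are processed, which is the same behavior Sequence checkpoints give you inside DataStage.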
premupdate
Participant
Posts: 47
Joined: Thu Oct 04, 2007 3:37 am
Location: chennai

Post by premupdate »

Craig,

1) Yes, the files have to be processed individually.
2) Thanks, I will check this and get back on the outcome.
Cheers,
prem
prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

If you are on an MPP or grid system and the source file is fixed-width, you can improve performance by setting 'Read from multiple nodes'; it enables the file to be read from more than one node. Otherwise you can increase the 'Number of readers per node' setting; it increases the instances of the read operator, but overall the file will still be read on a single node and a single CPU.
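The reason both options require a fixed-width file is that each reader can compute its own byte offset and seek straight to its share of the records. A generic sketch of that split, not DataStage code; the record width and reader count are illustrative assumptions:

```python
def split_fixed_width(file_size, record_width, num_readers):
    """Divide a fixed-width file into per-reader (byte_offset, record_count) ranges.

    With fixed-width records, offset = record_index * record_width, so
    readers never need to scan for delimiters to find their start point.
    """
    total_records = file_size // record_width
    base, extra = divmod(total_records, num_readers)
    ranges, start = [], 0
    for r in range(num_readers):
        count = base + (1 if r < extra else 0)  # spread the remainder
        ranges.append((start * record_width, count))
        start += count
    return ranges
```

Each reader then seeks to its byte offset and reads its record count independently, which is why the stage can fan the work out across readers or nodes.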
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could use a design with more than one loop, the loops executing concurrently to process distinct subsets of files.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
premupdate
Participant
Posts: 47
Joined: Thu Oct 04, 2007 3:37 am
Location: chennai

Post by premupdate »

Ray,

As the number of source files is dynamic (sometimes 1 file, sometimes 60), how can I decide on the additional loop design?

Also, the existing process includes fetching data from a Teradata table (as a reference link). If I use multiple loops, the same select query will be executed multiple times against the table at the same instant.
Does this lock the table from being read?
Cheers,
prem
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you have two loops, use "odds and evens". More generically, use Mod() of each filename's line number in the list and the number of loops.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
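Ray's Mod() split can be sketched as follows: assign each filename to a loop based on its position in the list modulo the number of loops. A hedged Python sketch, not DataStage code; the filenames and loop count are illustrative:

```python
def partition_files(files, num_loops):
    """Split a file list into num_loops subsets by line number modulo num_loops."""
    subsets = [[] for _ in range(num_loops)]
    for line_no, name in enumerate(files, start=1):
        # Mod() of the 1-based line number and the loop count picks the subset;
        # with num_loops=2 this is exactly the "odds and evens" split.
        subsets[(line_no - 1) % num_loops].append(name)
    return subsets
```

Because the split depends only on position and loop count, it works no matter how many files arrive on a given day; a loop that receives an empty subset simply has nothing to iterate over.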
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Or figure out what part of the file load is "consuming a lot of time" and optimize it. Then perhaps start some form of multi-looping if it needs further optimization.
-craig

"You can never have too many knives" -- Logan Nine Fingers