OK, here we go.
SCENARIO
I have 30 extract jobs, all run in parallel from one sequence (seq_ext), pulling data from the source database (1 cluster). Each extract job extracts at most 0.5 million records.
I need to use the same extract jobs on 9 other clusters (i.e. 10 clusters altogether).
I plan to have an enterprise scheduler kick off 10 instances of the sequence seq_ext simultaneously, each instance calling the 30 extract jobs against one cluster.
Each instance of seq_ext will create a flag file upon completion (so the 10 instances will create 10 flag files altogether).
The enterprise scheduler will keep checking for all 10 flag files. Once they are all present, it will kick off the main sequence for processing.
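To make the flag-file check concrete, here is a minimal sketch of the polling logic in Python. It is only an illustration, not what the enterprise scheduler actually does: the flag file names (seq_ext.InstanceN.done), the timeout and the poll interval are all hypothetical.

```python
import time
from pathlib import Path


def wait_for_flags(flag_dir, expected, timeout_s=3600.0, poll_s=30.0):
    """Return True once every expected flag file exists, False on timeout."""
    deadline = time.monotonic() + timeout_s
    paths = [Path(flag_dir) / name for name in expected]
    while True:
        # Re-check all flags on every pass; a flag is just an empty marker file.
        missing = [p for p in paths if not p.exists()]
        if not missing:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Hypothetical naming: one flag file per seq_ext instance (10 in total).
FLAGS = [f"seq_ext.Instance{i}.done" for i in range(1, 11)]
```

Calling wait_for_flags("/some/flag/dir", FLAGS) would block until all 10 flags exist or the timeout expires. The flags would have to be deleted before the next run so that stale files do not release the main sequence early.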
For processing I will have to combine the data from all clusters. For example, take Extract_JobA.
All the files obtained from
Extract_JobA.Instance1,.....Extract_JobA.Instance10
have to be combined into one file, and I plan to use the cat command to do that.
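For comparison, the concatenation step can be sketched in Python as below. It does the same thing as cat with the 10 files as arguments: it streams each file into the target in order, so memory use stays constant regardless of file size. The directory layout and the helper names (instance_files, concat_files) are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def instance_files(job, directory, instances=10):
    # Hypothetical layout: <directory>/<job>.Instance<N>, N = 1..instances
    return [Path(directory) / f"{job}.Instance{i}" for i in range(1, instances + 1)]


def concat_files(parts, target):
    """Append each part to target in order, copying in 1 MiB chunks."""
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Streams the bytes through; no file is loaded fully into memory.
                shutil.copyfileobj(src, out, length=1024 * 1024)
```

As with cat, nothing is inserted between the files, so each input has to end with a newline or the last record of one file will run into the first record of the next.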
The questions below are holding me back from making a decision. Please try to answer the first question in as much detail as possible. I would really appreciate the help.
QUESTIONS
Q1) i) Is it a good idea to start all 10 instances at the same time?
(Considering that these are only extract jobs with no hash files, no sorts and no aggregators, just
Source -> Transformer -> Sequential File
)
ii) Or should I stick to 2 or 3 instances at a time?
iii) How do memory and disk space play a role in deciding this?
iv) Is there any way to calculate a rough estimate of how many instances can be started at the same time?
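For reference, the worst-case numbers behind Q1 can be worked out as in the sketch below. It only sizes the load and does not answer how many instances the server can handle. The instance, job and record counts come from the scenario above; the average record size of 200 bytes is purely an assumption and should be replaced with a measured value.

```python
def worst_case(instances=10, jobs_per_instance=30,
               max_records_per_job=500_000, avg_record_bytes=200):
    """Upper bounds for the scenario; avg_record_bytes is an assumed value."""
    concurrent_jobs = instances * jobs_per_instance
    total_records = concurrent_jobs * max_records_per_job
    extract_bytes = total_records * avg_record_bytes
    return {
        "concurrent_jobs": concurrent_jobs,
        "total_records": total_records,
        "extract_bytes": extract_bytes,
        # Concatenating writes a second copy unless the parts are deleted as they are merged.
        "peak_bytes_with_merged_copy": 2 * extract_bytes,
    }
```

With the figures in this post, that is 300 jobs running at once and at most 150 million records across all extract files.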
Q2) Is it a good idea to have the enterprise scheduler look for those 10 flag files, or should a wait stage be used in the main sequence for processing?
Q3) Will the cat command work well for concatenating 10 files of up to 0.5 million records each?
Thanks
Awaiting replies