Design of job question

Post questions here relating to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

denzilsyb
Participant
Posts: 186
Joined: Mon Sep 22, 2003 7:38 am
Location: South Africa
Contact:

Design of job question

Post by denzilsyb »

Hi guys

If I have a CFF source file as stream input to a Transformer (TFM) and 8 destination SEQ files based on a constraint in the TFM, how would performance differ if I rewrote the job so that the source file streams into TFM1, which has three outputs: two constrained on the column (to SEQ files) and one [Reject] output? I continue this process until I have 4 TFM stages, each (except the last) with two SEQ destinations and a third [Reject] output.

I could of course add IPC stages between all reads and writes, which may improve performance, but I am more concerned about how the design of a job that does the above affects performance. I suppose I am looking for best practice.
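
For clarity, by "constrained on the column" I mean plain link constraints in the TFM; the link and column names below are placeholders for the real ones:

Code:

 * TFM1 output link constraints (DS BASIC expressions)
 * "in" and BRANCH_CODE are placeholder link/column names
 SEQ1 link:  in.BRANCH_CODE = "01"
 SEQ2 link:  in.BRANCH_CODE = "02"
 TFM2 link:  marked [Reject], so it catches every row the other constraints drop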

So, from:

Code:


 CFF ---- TFM ---- SEQ1
              ---- SEQ2
              ---- SEQ3
              ---- SEQ4
              ---- SEQ5
              ---- SEQ6
              ---- SEQ7
              ---- SEQ8
                  
to:

Code:


 CFF ---- TFM1 ---- SEQ1
               ---- SEQ2
               ---- TFM2
                     ---- SEQ3
                     ---- SEQ4
                     ---- TFM3
                           ---- SEQ5
                           ---- SEQ6
                           ---- TFM4
                                 ---- SEQ7
                                 ---- SEQ8
                  
There is of course the alternative of splitting the job into 8 individual jobs, each job reading the same CFF source file and writing to one SEQ file. The overhead would then be many jobs reading the same source file, as opposed to the above method, which is a relatively complex job compared to one of these simple jobs.

To the eye, a simple job is a lot more appealing than a complex design of links and stages.

The number of rows in the source file is 40 million. I must add that the CFF has 19 columns, of which I am only interested in 14. The first TFM omits the columns I don't require in the process.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Hi

It is always good practice to cut the large block (40 million rows) into 10 blocks (4 million each), or you could make it 20 smaller blocks, depending on the machine and the other jobs running.

Create a multi-instance job which reads from the sequential file and writes to 8 sequential files based on the constraint. Run the multi-instance job once for each block you decide to break the data into. Once all the instances have finished, merge all the files together.
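
A sketch of the constraint that does the splitting, assuming each instance is passed two job parameters (called PartCount and PartNum here; the names are up to you):

Code:

 * Add this condition to each output link constraint of the Transformer.
 * PartCount = total number of instances, PartNum = this instance (1..PartCount)
 Mod(@INROWNUM, PartCount) = PartNum - 1

Every instance reads the whole file but keeps only its share of the rows. The merge at the end can be a simple cat of the instance output files, for example from an after-job ExecSH call.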

Thanks
Siva
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The multi-instance job is a good idea; on Unix you will get better performance from multiple jobs than from one job with multiple streams.

The CFF stage has a "Selection Criteria" tab that you might be able to use to partition the data; it depends on your data. This might be quicker than partitioning the data in a transformer.

You can also omit unnecessary fields in the CFF stage so they never reach the transformer, to save a bit more time.
denzilsyb
Participant
Posts: 186
Joined: Mon Sep 22, 2003 7:38 am
Location: South Africa
Contact:

Post by denzilsyb »

rasi wrote: Create a multi-instance job which reads from the sequential file and writes to 8 sequential files based on the constraint.
Siva, Vincent. I am going to give this a shot and will post the results. This is much better than creating 8 individual jobs!
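
For the record, the job control I have in mind to kick off the instances is roughly this; the job name LoadCFF and the parameter names are placeholders:

Code:

 * DS BASIC job control sketch: start N instances of a multi-instance job,
 * then wait for them all to finish before merging the output files.
 NumParts = 10
 Dim hJob(NumParts)
 For I = 1 To NumParts
    hJob(I) = DSAttachJob("LoadCFF." : I, DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob(I), "PartCount", NumParts)
    ErrCode = DSSetParam(hJob(I), "PartNum", I)
    ErrCode = DSRunJob(hJob(I), DSJ.RUNNORMAL)
 Next I
 For I = 1 To NumParts
    ErrCode = DSWaitForJob(hJob(I))
 Next I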
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson