Design of job question

Post questions here relating to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

denzilsyb
Participant
Posts: 186
Joined: Mon Sep 22, 2003 7:38 am
Location: South Africa
Contact:

Design of job question

Post by denzilsyb »

Hi guys

If I have a CFF source file as stream input to a Transformer (TFM) and 8 destination SEQ files based on a constraint in the TFM, how would performance differ if I rewrote the job so that the source file streams into TFM1, which has three outputs: two constrained on the column (to SEQ files) and one [Reject] output? I continue this process until I have 4 TFM stages, each (except the last) with two SEQ destinations and a third [Reject] output.

I could of course add IPC stages between all reads and writes, which may improve performance, but I am more concerned about how the design of a job that does the above affects performance. I suppose I am looking for best practice.
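
For clarity, by "constrained on the column" I mean plain link constraints in the TFM; the link and column names below are placeholders for the real ones:

Code:

 * TFM1 output link constraints (DS BASIC expressions)
 * "in" and BRANCH_CODE are placeholder link/column names
 SEQ1 link:  in.BRANCH_CODE = "01"
 SEQ2 link:  in.BRANCH_CODE = "02"
 TFM2 link:  marked [Reject], so it catches every row the other constraints drop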

So, from:

Code:


 CFF ---- TFM ---- SEQ1
              ---- SEQ2
              ---- SEQ3
              ---- SEQ4
              ---- SEQ5
              ---- SEQ6
              ---- SEQ7
              ---- SEQ8
                  
to:

Code:


 CFF ---- TFM1 ---- SEQ1
               ---- SEQ2
               ---- TFM2
                     ---- SEQ3
                     ---- SEQ4
                     ---- TFM3
                           ---- SEQ5
                           ---- SEQ6
                           ---- TFM4
                                 ---- SEQ7
                                 ---- SEQ8
                  
There is of course the alternative of splitting the job into 8 individual jobs, each job reading the same CFF source file and writing to one SEQ file. The overhead would then be many jobs reading the same source file, as opposed to the above method, which is a relatively complex job compared to one of these simple jobs.

To the eye, a simple job is a lot more appealing than a complex design of links and stages.

The number of rows in the source file is 40 million. I must add that the CFF has 19 columns, of which I am only interested in 14. The first TFM omits the columns I don't require in the process.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Hi

It is always good practice to cut the large block (40 million rows) into 10 blocks (4 million each), or you could make it 20 smaller blocks, depending on the machine and the other jobs running.

Create a multi-instance job which reads from the sequential file and writes to 8 sequential files based on the constraint. Run the multi-instance job once for each block you decide to break the data into. Once all the instances have finished, merge all the files together.
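
A sketch of the constraint that does the splitting, assuming each instance is passed two job parameters (called PartCount and PartNum here; the names are up to you):

Code:

 * Add this condition to each output link constraint of the Transformer.
 * PartCount = total number of instances, PartNum = this instance (1..PartCount)
 Mod(@INROWNUM, PartCount) = PartNum - 1

Every instance reads the whole file but keeps only its share of the rows. The merge at the end can be a simple cat of the instance output files, for example from an after-job ExecSH call.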

Thanks
Siva
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The multi-instance job is a good idea; on Unix you will get better performance from multiple jobs than from one job with multiple streams.

The CFF stage has a "Selection Criteria" tab that you might be able to use to partition the data; it depends on your data. This might be quicker than partitioning the data in a transformer.

You can also omit unnecessary fields in the CFF stage so they never reach the transformer, to save a bit more time.
denzilsyb
Participant
Posts: 186
Joined: Mon Sep 22, 2003 7:38 am
Location: South Africa
Contact:

Post by denzilsyb »

rasi wrote: Create a multi-instance job which reads from the sequential file and writes to 8 sequential files based on the constraint.
Siva, Vincent. I am going to give this a shot and will post the results. This is much better than creating 8 individual jobs!
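
For the record, the job control I have in mind to kick off the instances is roughly this; the job name LoadCFF and the parameter names are placeholders:

Code:

 * DS BASIC job control sketch: start N instances of a multi-instance job,
 * then wait for them all to finish before merging the output files.
 NumParts = 10
 Dim hJob(NumParts)
 For I = 1 To NumParts
    hJob(I) = DSAttachJob("LoadCFF." : I, DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob(I), "PartCount", NumParts)
    ErrCode = DSSetParam(hJob(I), "PartNum", I)
    ErrCode = DSRunJob(hJob(I), DSJ.RUNNORMAL)
 Next I
 For I = 1 To NumParts
    ErrCode = DSWaitForJob(hJob(I))
 Next I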
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson