Improve Seq Stage Performance

Posted: Wed May 11, 2005 10:43 am
by rcil
Hello All,

I have a total of three dsjobs. The first two are extracts from the database, each joining 5 tables for a total of 40 columns. In the third job I sort and concatenate those two tab-delimited output files using ExecSH as a before-job routine, and then in the job itself I split the data into four different files based on simple constraints, with simple derivations in each. The concatenated file contains 24 million records.
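
The before-job ExecSH step might look something like this (a minimal sketch; the file names are placeholders, not the actual job's paths, and here the sample data is generated inline just to make the script self-contained):

```shell
#!/bin/sh
# Two small tab-delimited sample files standing in for the real
# 40-column database extracts (placeholder names and data).
printf '3\tc\n1\ta\n' > extract1.txt
printf '4\td\n2\tb\n' > extract2.txt

# Sort each extract on its first column, then concatenate the two
# sorted files into the single input for the third job.
sort -k1,1 extract1.txt > extract1.sorted
sort -k1,1 extract2.txt > extract2.sorted
cat extract1.sorted extract2.sorted > combined.txt
```

With real 24-million-row files, the sort itself stays fast (as noted below), because `sort` and `cat` stream through the data in a single pass each.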

The first two extracts run at 3000 to 4000 rows per second, and in the third job the sort command takes only a couple of minutes, but the job itself processes about 325 rows per second, which takes hours to complete.

Is there a way to improve the performance of the Sequential File stage in the way it pulls the records?

thanks

Posted: Wed May 11, 2005 10:57 am
by kcbland
Your problem is not the sequential stage. Your problem is that you have a single-threaded job design that will only use a single cpu to do its work. If you have 6 cpus, you could use 6 instances of your third job to each handle 1/6th of the source data. In theory, you would scale your throughput up to being done in 1/6th the time.

This method of partitioning source data and using multiple instances to divide and conquer has been discussed ad nauseam on this forum. I'll post a link to similar discussions.

viewtopic.php?t=86907
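
As a sketch of that partitioning step (illustrative only; the round-robin modulo trick and the file names are my own, not from this thread), the combined file can be dealt out into N chunks, one per job instance:

```shell
#!/bin/sh
# Sample input standing in for the 24-million-row combined file.
seq 1 100 > combined.txt

# Deal rows round-robin into 4 chunks: chunk.0 .. chunk.3.
# Each chunk would then feed one instance of the third job,
# e.g. via "dsjob -run project MyJob.1" (invocation syntax from memory).
awk -v n=4 '{ print > ("chunk." (NR % n)) }' combined.txt
```

With one instance per CPU, each instance handles only its own quarter of the rows, which is where the near-linear speedup comes from.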

Re: Improve Seq Stage Performance

Posted: Wed May 11, 2005 11:56 am
by rcil
Thank you for the input. The hashed file size limit is 2GB, and in the UAT environment I have 24 million records; it could be more in production. Can a hashed file handle something this big?

thanks

Re: Improve Seq Stage Performance

Posted: Wed May 11, 2005 1:46 pm
by Neoyip
If the hashed file will exceed 2GB, create it manually with HFC (the Hashed File Calculator).
rcil wrote:Thank you for the input. The hashed file size limit is 2GB, and in the UAT environment I have 24 million records; it could be more in production. Can a hashed file handle something this big?

thanks
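
For example (syntax from memory rather than from this thread; the file name and modulus are placeholders that HFC would calculate for you), a 64-bit hashed file can be created at the TCL prompt, or an existing one converted in place:

```
CREATE.FILE BigHash DYNAMIC MINIMUM.MODULUS 2000003 64BIT
RESIZE BigHash * * * 64BIT
```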

Re: Improve Seq Stage Performance

Posted: Wed May 11, 2005 10:17 pm
by kcbland
rcil wrote:Thank you for the input. The hashed file size limit is 2GB, and in the UAT environment I have 24 million records; it could be more in production. Can a hashed file handle something this big?

thanks
Without a doubt, spooling data into a hash file has significant overhead as compared to a sequential file. You may consider reading this post viewtopic.php?t=85364 to learn more about hash files and when/how to use them.

Regarding your question about size: no, a default 32BIT hashed file will not hold 24 million rows if every row averages 100 characters of data. You would need 64BIT hashed files, but I would avoid that approach here altogether.

The method I described to you is the one that lets you use more cpus and balance your efforts across multiple cpus.

Posted: Wed May 11, 2005 11:55 pm
by ray.wurlod
The Sequential File stage is the fastest of the passive stages. It has clever mechanisms built in, such as look-ahead and buffering.

On the down side, you can't begin reading from a single sequential file until you've finished writing to it. You can with a hashed file, but it may not be appropriate to do so; this would depend on your design requirements.

On most operating systems there is no effective limit to the size of a sequential file (you may have to enable large sizes, for example by increasing your ulimit and the operating system's maximum file size).
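
Checking the current per-process file-size limit is a one-liner (a trivial sketch; raising it may require root privileges or OS-level configuration, depending on the platform):

```shell
#!/bin/sh
# Report the per-process file-size limit; "unlimited" means no cap
# beyond the filesystem's own maximum file size.
limit=$(ulimit -f)
echo "file size limit: $limit"
```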

While HFC will not create 64-bit-enabled hashed files, it will generate the commands for doing so.