Hi.
I am trying to improve the performance of a job and have run out of ideas to try. I'm hoping one of you might have a suggestion.
Details:
I have a data file with about 10 million rows. The key is identifier number plus occurrence number; that is, there can be multiple occurrences for any given identifier number. The file is semi-sorted: if an identifier number has multiple rows, the occurrence numbers increase down the file, but those rows may be 500 or N rows apart.
Only the row with the maximum occurrence number has to be loaded into the database.
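To make the requirement concrete, here is a minimal Python sketch with hypothetical sample data (the identifiers, occurrence numbers, and payloads are made up): keep only the row carrying the highest occurrence number for each identifier.

```python
# Hypothetical sample rows: (identifier, occurrence, payload).
# Occurrences for a given identifier increase down the file but
# may be separated by other identifiers, as described above.
rows = [
    ("A1", 1, "first"),
    ("B7", 1, "only"),
    ("A1", 2, "latest"),
]

# Keep only the max-occurrence row per identifier.
latest = {}
for ident, occ, payload in rows:
    if ident not in latest or occ > latest[ident][0]:
        latest[ident] = (occ, payload)

print(sorted(latest.items()))
# [('A1', (2, 'latest')), ('B7', (1, 'only'))]
```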
Approach:
Job 1. Build a hash file with the key and occurrence number.
Job 2. Run the sequential file and the hash file through a transformer and look up on the identifier/occurrence number.
Cons: the hash file is large, with about 7 million rows.
Any pointers are appreciated. Thanks.
NP
increase performance
Take your incoming data stream and sort it by ascending identifier and descending occurrence.
Process the sorted file through a transformer with a stage variable called LastIdentifier that holds the previous row's identifier. In the constraint, put "In.IDENTIFIER <> LastIdentifier" and output that row.
You don't require a hashed file for this.
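The sort-then-compare approach above can be sketched in Python (hypothetical data; the stage variable and constraint names follow the post): sort by identifier ascending and occurrence descending, then output a row only when its identifier differs from the previous one.

```python
# Hypothetical sample rows: (identifier, occurrence, payload).
rows = [
    ("A1", 1, "first"),
    ("B7", 1, "only"),
    ("A1", 2, "latest"),
]

# Sort by identifier ascending, occurrence descending, so the
# max-occurrence row is the first row of each identifier group.
rows.sort(key=lambda r: (r[0], -r[1]))

output = []
last_identifier = None  # plays the role of the LastIdentifier stage variable
for ident, occ, payload in rows:
    if ident != last_identifier:  # the In.IDENTIFIER <> LastIdentifier constraint
        output.append((ident, occ, payload))
    last_identifier = ident

print(output)
# [('A1', 2, 'latest'), ('B7', 1, 'only')]
```

No hash file is involved: a single pass over the sorted stream is enough, exactly as the reply says.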
If the data is sorted in ascending order, and you're starting with 10M rows and will only end up with 7M when done, your 30% of repeated rows is almost negligible. Just process all of the rows, writing to a hash file using the unique identifier (but not the occurrence number) as the key. The last row written under a repeated key is the one that remains in the hash file. 70% of the time you will not repeat a key.
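This "last write wins" behavior can be sketched with a Python dict standing in for the hash file (hypothetical sample data): because occurrences increase down the file, the final overwrite per key is always the max-occurrence row.

```python
# Hypothetical sample rows in file order: (identifier, occurrence, payload).
rows_in_file_order = [
    ("A1", 1, "first"),
    ("B7", 1, "only"),
    ("A1", 2, "latest"),
]

# Key on identifier only; a later occurrence simply overwrites
# the earlier one, mimicking a hash file write on a repeated key.
hash_file = {}
for ident, occ, payload in rows_in_file_order:
    hash_file[ident] = (occ, payload)

print(sorted(hash_file.items()))
# [('A1', (2, 'latest')), ('B7', (1, 'only'))]
```

No lookup is needed before the write; the ordering of the input does the deduplication for you.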
To tune the process, use write-delay caching, and set the modulo high enough to avoid constant dynamic resizing. Also, use multiple job instances to simultaneously read and transform your data, writing into the same hash file. Use a MOD(your unique identifier, NumberOfInstances) = JobInstanceNumber - 1 constraint in a transformer reading directly from the sequential file. If you run 10 copies of this job, each copy takes 1 row out of 10, offset by its instance number. Use 10 job calls to run the instances, passing in NumberOfInstances=10 and JobInstanceNumber=1 through 10. If you have fewer CPUs, cap the number of instances at that; if you have more, crank it up. This constraint keeps all repeated occurrences of a unique identifier processing in the same job, so your sorted data doesn't get scattered to other jobs and processed out of order.
When the jobs are done, use another job to spool the hash file to a sequential file. That should be really really fast.
This design keeps your transformer from single-threading, because you'll be using as many CPUs as your system allows. The extra processing is negligible, and any sorting is redundant because your data is already sufficiently sorted. You also avoid a hash lookup to determine whether the current row is the winner, which is a wasted lookup 70% of the time but something you'd otherwise have to do 100% of the time.
By the way, this partitioning and multiple job instance technique is the same thing that PX does automatically with partitioning and parallel nodes.
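The MOD-based partitioning can be sketched in Python (hypothetical numeric identifiers; the constraint assumes the identifier is numeric, as MOD requires): each instance filters the stream to its own slice, and every occurrence of a given identifier lands in the same instance.

```python
def rows_for_instance(rows, number_of_instances, job_instance_number):
    """Apply the constraint MOD(identifier, NumberOfInstances) = JobInstanceNumber - 1."""
    return [r for r in rows
            if r[0] % number_of_instances == job_instance_number - 1]

# Hypothetical rows: (numeric identifier, occurrence).
rows = [(10, 1), (10, 2), (11, 1), (12, 1)]

n = 3  # NumberOfInstances
partitions = [rows_for_instance(rows, n, i) for i in range(1, n + 1)]

print(partitions)
# [[(12, 1)], [(10, 1), (10, 2)], [(11, 1)]]
```

Note that both occurrences of identifier 10 go to instance 2, so each job still sees its identifiers in file order and the last-write-wins trick above remains correct per partition.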
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle