increase performance

Post questions here related to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

nachos
Participant
Posts: 6
Joined: Fri May 13, 2005 12:46 pm

increase performance

Post by nachos »

Hi.

I am trying to increase the performance of a job and have run out of ideas to try. Hoping that one of you might have a suggestion.

Details:

I have a data file with about 10M rows. The key is identifier number plus occurrence number; that is, there can be multiple occurrences for any given identifier number. The data file is semi-sorted, meaning that if an identifier number has multiple rows, the occurrence number increases down the file, but those occurrences might be 500 or N rows apart.

Only the row that corresponds to the maximum occurrence number has to be loaded into the database.

Approach:

Job 1. Build a hash file with the key and occurrence number.
Job 2. Run the sequential file and the hash file through a transformer and look up on the identifier/occurrence number.

Cons: the hash file is large, with about 7M rows.
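
For clarity, here is a rough sketch of that two-pass logic in Python (not DataStage; the file name, delimiter, and field positions are only assumptions for illustration):

# Pass 1 (the "hash file" build): remember the max occurrence per identifier.
max_occurrence = {}
with open("input.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("|")
        ident, occ = fields[0], int(fields[1])
        if occ > max_occurrence.get(ident, -1):
            max_occurrence[ident] = occ

# Pass 2 (the "lookup"): keep only the row whose occurrence equals the max.
with open("input.txt") as f, open("output.txt", "w") as out:
    for line in f:
        fields = line.rstrip("\n").split("|")
        ident, occ = fields[0], int(fields[1])
        if occ == max_occurrence[ident]:
            out.write(line)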


Any pointers are appreciated. Thanks.


NP
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hi,

You can use the HFC (Hash File Calculator) utility to determine the appropriate sizing for the hash file.


Let me know how your performance was affected.

Thanks!
Naveen.
nachos
Participant
Posts: 6
Joined: Fri May 13, 2005 12:46 pm

Post by nachos »

Naveen,

I have used the HFC and sized the hash file appropriately, tried cache on/off, and tried modifying the hash file parameters, but none of these helped much.

Nach
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Take your incoming data stream and sort it by ascending identifier and descending occurrence.

Process the sorted file through a transformer with a stage variable called LastIdentifier holding the previous row's identifier. In the constraint, put "In.IDENTIFIER <> LastIdentifier" and output that row.

You don't require a hashed file for this.
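
For illustration only, the same single-pass logic in Python (the file name, delimiter, and field layout are assumptions; the input is assumed already sorted by identifier ascending, occurrence descending):

last_identifier = None   # plays the role of the LastIdentifier stage variable
with open("sorted.txt") as f, open("output.txt", "w") as out:
    for line in f:
        ident = line.split("|", 1)[0]
        if ident != last_identifier:   # the In.IDENTIFIER <> LastIdentifier constraint
            out.write(line)            # first row per identifier = highest occurrence
        last_identifier = ident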
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If the data is sorted in ascending order, and you're starting with 10M rows and will only end up with 7M when done, your 30% of repeated rows is almost negligible. Just process all of the rows, writing to a hash file using the unique identifier (but not the occurrence number) as the key. The last row written under a repeated key is the one that remains in the hash file. 70% of the time you will not repeat a key.
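
As a rough illustration of the last-write-wins idea, in Python rather than DataStage (a dict stands in for the hashed file; the file name, delimiter, and field positions are assumptions):

latest_row = {}
with open("input.txt") as f:
    for line in f:
        ident = line.split("|", 1)[0]
        latest_row[ident] = line   # each repeat of a key overwrites the earlier row
# Because occurrences increase down the file within an identifier, the row that
# remains per key is the one with the highest occurrence.
with open("output.txt", "w") as out:
    out.writelines(latest_row.values())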

To tune the process, use write-delay caching, and set the modulo high enough to avoid constant dynamic resizing upward. Also, use multiple job instances to simultaneously read and transform your data, writing into the same hash file. Use a MOD(your unique identifier, NumberOfInstances) = JobInstanceNumber - 1 constraint in a transformer reading directly from the sequential file. If you run 10 copies of this job, each job will take 1 out of every 10 rows, offset by its instance number. Use 10 job calls to run the instances, passing in parameters NumberOfInstances=10 and JobInstanceNumber=1 to 10. If you have fewer CPUs, cap the number of instances at that; if you've got more, crank it up. This constraint keeps all repeated occurrences of a given unique identifier processing in the same job, so your sorted data doesn't get scattered across jobs and processed out of order. A quick sketch of that constraint is below.
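
A Python sketch of the constraint, assuming the unique identifier is numeric:

def belongs_to_instance(identifier, number_of_instances, job_instance_number):
    # Equivalent of MOD(identifier, NumberOfInstances) = JobInstanceNumber - 1:
    # each instance keeps only its slice of the key space, and all occurrences
    # of a given identifier stay with the same instance.
    return identifier % number_of_instances == job_instance_number - 1

# e.g. with 10 instances, instance 3 keeps identifiers 2, 12, 22, ...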

When the jobs are done, use another job to spool the hash file to a sequential file. That should be really fast.

This design technique keeps your transformer from single-threading, because you'll be using as many CPUs as your system allows. The extra processing is negligible, since any sorting is redundant given that your data is already sufficiently sorted. You also avoid a hash lookup to determine whether the current row is the winner, which is a wasted lookup 70% of the time but something you would otherwise have to do 100% of the time.


By the way, this partitioning and multiple job instance technique is the same thing that PX does automatically with partitioning and parallel nodes.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
nachos
Participant
Posts: 6
Joined: Fri May 13, 2005 12:46 pm

Post by nachos »

Arnd,

The sort stage, just as is, runs at about 1,000 rows/sec, which is a lot slower than writing to the hash file with cache (16,000 rows/sec); subsequently reading from the hash file was about 4,000 rows/sec.

Kenneth,

Great new approaches to try regarding multiple jobs.


Thanks to Ken, Arnd, and Naveen for your replies.
tcj
Premium Member
Posts: 98
Joined: Tue Sep 07, 2004 6:57 pm
Location: QLD, Australia

Post by tcj »

If you have a lot of big text files that you want sorted, then I suggest having a look at SyncSort.

We implemented it on the last DataStage project I was working on. The speed at which this program sorts large text files is amazing.
manojmathai
Participant
Posts: 23
Joined: Mon Jul 04, 2005 6:25 am

Post by manojmathai »

Hi

Try using Unix sort instead of DataStage sorting.
Unix sort is much faster than the DataStage Sort stage.
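
For example, sorting by identifier ascending and occurrence descending with the OS sort utility might look like this (a sketch only, shown wrapped in Python; the file names, delimiter, and field positions are assumptions):

import subprocess

# Sort a pipe-delimited file by field 1 (identifier, numeric ascending)
# and field 2 (occurrence, numeric descending) using the OS sort command.
subprocess.run(
    ["sort", "-t", "|", "-k1,1n", "-k2,2nr", "-o", "sorted.txt", "input.txt"],
    check=True,
)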

Thanks
Manoj.