Hi.
I am trying to improve the performance of a job and have run out of ideas to try. I'm hoping one of you might have a suggestion.
Details:
I have a data file with about 10 million rows. The key is identifier number plus occurrence number; that is, there can be multiple occurrences for any given identifier number. The file is semi-sorted: if an identifier number has multiple rows, the occurrence numbers increase down the file, but those rows may be 500 or N rows apart.
Only the row with the maximum occurrence number has to be loaded into the database.
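To make the requirement concrete, here is a minimal Python sketch with hypothetical sample data (the identifiers, occurrence numbers, and payloads are made up): keep only the row carrying the highest occurrence number for each identifier.

```python
# Hypothetical sample rows: (identifier, occurrence, payload).
# Occurrences for a given identifier increase down the file but
# may be separated by other identifiers, as described above.
rows = [
    ("A1", 1, "first"),
    ("B7", 1, "only"),
    ("A1", 2, "latest"),
]

# Keep only the max-occurrence row per identifier.
latest = {}
for ident, occ, payload in rows:
    if ident not in latest or occ > latest[ident][0]:
        latest[ident] = (occ, payload)

print(sorted(latest.items()))
# [('A1', (2, 'latest')), ('B7', (1, 'only'))]
```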
Approach:
Job 1. Build a hash file with the key and occurrence number.
Job 2. Run the sequential file and the hash file through a transformer and look up on the identifier/occurrence number.
Cons: the hash file is large, with about 7 million rows.
Any pointers are appreciated. Thanks.
NP
increase performance
Take your incoming data stream and sort it by ascending identifier and descending occurrence.
Process the sorted file through a transformer with a stage variable called LastIdentifier that holds the previous row's identifier. In the constraint, put "In.IDENTIFIER <> LastIdentifier" and output that row.
You don't require a hashed file for this.
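The sort-then-compare approach above can be sketched in Python (hypothetical data; the stage variable and constraint names follow the post): sort by identifier ascending and occurrence descending, then output a row only when its identifier differs from the previous one.

```python
# Hypothetical sample rows: (identifier, occurrence, payload).
rows = [
    ("A1", 1, "first"),
    ("B7", 1, "only"),
    ("A1", 2, "latest"),
]

# Sort by identifier ascending, occurrence descending, so the
# max-occurrence row is the first row of each identifier group.
rows.sort(key=lambda r: (r[0], -r[1]))

output = []
last_identifier = None  # plays the role of the LastIdentifier stage variable
for ident, occ, payload in rows:
    if ident != last_identifier:  # the In.IDENTIFIER <> LastIdentifier constraint
        output.append((ident, occ, payload))
    last_identifier = ident

print(output)
# [('A1', 2, 'latest'), ('B7', 1, 'only')]
```

No hash file is involved: a single pass over the sorted stream is enough, exactly as the reply says.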
If the data is sorted in ascending order, and you're starting with 10M rows and will only end up with 7M when done, your 30% of repeated rows is almost negligible. Just process all of the rows, writing to a hash file using the unique identifier (but not the occurrence number) as the key. The last row written under a repeated key is the one that remains in the hash file. 70% of the time you will not repeat a key.
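This "last write wins" behavior can be sketched with a Python dict standing in for the hash file (hypothetical sample data): because occurrences increase down the file, the final overwrite per key is always the max-occurrence row.

```python
# Hypothetical sample rows in file order: (identifier, occurrence, payload).
rows_in_file_order = [
    ("A1", 1, "first"),
    ("B7", 1, "only"),
    ("A1", 2, "latest"),
]

# Key on identifier only; a later occurrence simply overwrites
# the earlier one, mimicking a hash file write on a repeated key.
hash_file = {}
for ident, occ, payload in rows_in_file_order:
    hash_file[ident] = (occ, payload)

print(sorted(hash_file.items()))
# [('A1', (2, 'latest')), ('B7', (1, 'only'))]
```

No lookup is needed before the write; the ordering of the input does the deduplication for you.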
To tune the process, use write-delay caching, and set the modulo high enough to avoid constant dynamic resizing. Also, use multiple job instances to simultaneously read and transform your data, writing into the same hash file. Use a MOD(your unique identifier, NumberOfInstances) = JobInstanceNumber - 1 constraint in a transformer reading directly from the sequential file. If you run 10 copies of this job, each copy takes 1 row out of 10, offset by its instance number. Use 10 job calls to run the instances, passing in NumberOfInstances=10 and JobInstanceNumber=1 through 10. If you have fewer CPUs, cap the number of instances at that; if you have more, crank it up. This constraint keeps all repeated occurrences of a unique identifier processing in the same job, so your sorted data doesn't get scattered to other jobs and processed out of order.
When the jobs are done, use another job to spool the hash file to a sequential file. That should be really really fast.
This design keeps your transformer from single-threading, because you'll be using as many CPUs as your system allows. The extra processing is negligible, and any sorting is redundant because your data is already sufficiently sorted. You also avoid a hash lookup to determine whether the current row is the winner, which is a wasted lookup 70% of the time but something you'd otherwise have to do 100% of the time.
By the way, this partitioning and multiple job instance technique is the same thing that PX does automatically with partitioning and parallel nodes.
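The MOD-based partitioning can be sketched in Python (hypothetical numeric identifiers; the constraint assumes the identifier is numeric, as MOD requires): each instance filters the stream to its own slice, and every occurrence of a given identifier lands in the same instance.

```python
def rows_for_instance(rows, number_of_instances, job_instance_number):
    """Apply the constraint MOD(identifier, NumberOfInstances) = JobInstanceNumber - 1."""
    return [r for r in rows
            if r[0] % number_of_instances == job_instance_number - 1]

# Hypothetical rows: (numeric identifier, occurrence).
rows = [(10, 1), (10, 2), (11, 1), (12, 1)]

n = 3  # NumberOfInstances
partitions = [rows_for_instance(rows, n, i) for i in range(1, n + 1)]

print(partitions)
# [[(12, 1)], [(10, 1), (10, 2)], [(11, 1)]]
```

Note that both occurrences of identifier 10 go to instance 2, so each job still sees its identifiers in file order and the last-write-wins trick above remains correct per partition.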
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle