
keymgt slow performance?

Posted: Wed Mar 29, 2006 7:21 pm
by rachelsu
In developing an upsert job, I tried the KeyMgt functions to assign surrogate keys to a sequential file, outputting the SKs and the rest of the data to another sequential file, then loading that into an Oracle DB. Surprisingly, I found that KeyMgtGetNextValue gives 24 rows/sec and KeyMgtGetNextValueConcurrent gives 12 rows/sec throughput :evil: . Is this normal in DataStage Server 7? I'm new to DS. Any advice on how to get the best performance on SK generation is much appreciated! Should we leave this to the DB? Other ETL tools I've used generate SKs much, much faster. Am I using this incorrectly, or should I use other functions in DS? TIA.

Posted: Wed Mar 29, 2006 9:50 pm
by rasi
Rachel

Welcome to DSXchange

KeyMgtGetNextValue and KeyMgtGetNextValueConcurrent are the same, except that the latter is used when you have multiple jobs running at the same time getting keys. First, if you don't think that will ever happen, you will be better off using KeyMgtGetNextValue. Second, you cannot measure performance like this. Performance depends on many factors: your job design, the number of jobs running at the same time, network, CPU, hard disk, etc.
The efficient way to implement this is to use the database's own functionality to create surrogate keys, or to create a table that holds the current surrogate key; in your job, read that table and use a stage variable with @OUTROWNUM to add an incrementing number to the surrogate key column. After the job finishes, update the table with the new maximum surrogate key.
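To illustrate the second approach, here is a rough, untested sketch (the names are made up for illustration): read the current maximum key into a job parameter or stage variable at the start of the run, then derive the key in the Transformer.

[code]
* svSeed: stage variable holding the maximum key read from the seed
* table (or hashed file) at the start of the run, e.g. via a job
* parameter MaxKeySeed:
*    svSeed = MaxKeySeed

* Derivation of the surrogate key column on the output link:
svSeed + @OUTROWNUM
[/code]

Since @OUTROWNUM counts the rows written to the output link, each row gets the next number after the seed.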

Thank you!

Posted: Wed Mar 29, 2006 10:04 pm
by rachelsu
Hi Siva, thanks very much for this! I had similar findings after a few more tries :mrgreen:

The KeyMgt functions were identified as the blocker after adding debug parameters and disabling other transforms. I was pretty shocked to find that; initially I suspected CRC32 or the stage variable calculations.

[quote="rasi"]Second you cannot measure the performance like this. Performance depends on many factors your job design, number of jobs running at the same time, network, cpu hard disk etc.
[/quote]

Posted: Wed Mar 29, 2006 10:34 pm
by balajisr
The KeyMgt functions use COMMON blocks. I read that it is better to avoid COMMON blocks. Does this also imply that it is better to avoid the KeyMgt functions and instead use @OUTROWNUM or some other alternative?

Posted: Wed Mar 29, 2006 10:45 pm
by rachelsu
Where did you read that? Can you provide the link here? :?:

From what I've tested, I would say KeyMgt is much slower than @OUTROWNUM or DB key generation functions...

[quote="balajisr"]Keymgt functions uses common blocks. I read that it is better to avoid common blocks. Does this also imply that it is better to avoid keymgt functions and rather user @OUTROWNUM or any other alternative?[/quote]

Posted: Wed Mar 29, 2006 10:50 pm
by balajisr
rachelsu wrote:Where did you read that? Can you provide the link here? :?:

From what I've tested, I would say KeyMgt is much slower than @OUTROWNUM or DB key generation functions...
balajisr wrote:The KeyMgt functions use COMMON blocks. I read that it is better to avoid COMMON blocks. Does this also imply that it is better to avoid the KeyMgt functions and instead use @OUTROWNUM or some other alternative?
Page 2-12 of the Server Job Developer's Guide, which talks about improving performance using in-process and inter-process row buffering.

I agree with you that KeyMgt, as far as I know, is slower than the other alternatives.

Posted: Wed Mar 29, 2006 10:53 pm
by rachelsu
:oops: I didn't go through it in detail. Thanks for the reminder!

Posted: Thu Mar 30, 2006 1:02 am
by ArndW
Rachel,

try writing a job that reads a sequential file, uses the key management routines, and writes to a sequential file (or to a sequential file in /dev/null) to see the actual speed on your system. The speed should not be 24 rows per second, even on an overloaded system. Most systems should give results over 10K rows/sec at a minimum.

COMMON blocks are very efficient, and I am certain that balajisr's comment about avoiding them is mistaken. The COMMON block used in these routines is there to avoid having to re-open the hashed file on each call, which would make the routine perform very slowly indeed.

As rasi has already mentioned, using one call to get a seed value and then using the counters @INROWNUM or @OUTROWNUM to provide the increments is a very fast way of doing this; but since the hashed file's maximum value is not being updated, it needs to be corrected after a complete run and also handled in case of an abort. If you are writing to a database, you can always do your SQL's equivalent of getting the maximum value of the key field prior to the run and use that as the seed.
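To make the correction step concrete, here is a rough, untested sketch of an after-job subroutine body; the file, stage, and link names are invented for illustration, and the seed used for the run is assumed to be passed in through InputArg.

[code]
* After-job subroutine body (InputArg, ErrorCode supplied by DataStage).
$INCLUDE DSINCLUDE JOBCONTROL.H

* Rows written by the Transformer to the target link (names made up):
RowsOut = DSGetLinkInfo(DSJ.ME, 'xfmAssignKeys', 'lnkToTarget', DSJ.LINKROWCOUNT)

* New maximum = seed used for this run + rows written:
NewMax = InputArg + RowsOut

* Write it back to a one-record-per-key hashed file so the next run
* can seed from it:
Open 'KeySeedFile' To SeedFile Then
   Write NewMax On SeedFile, 'MY_DIMENSION'
End Else
   Call DSLogWarn('Cannot open KeySeedFile', 'UpdateKeySeed')
   ErrorCode = 1   ;* non-zero makes the job abort
End
[/code]

If the job aborts you would skip this step and instead re-derive the seed from the target table (the max-value query mentioned above) before the next run.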

Posted: Thu Mar 30, 2006 1:37 am
by balajisr
ArndW wrote:Rachel,

try writing a job that reads a sequential file, uses the key management routines, and writes to a sequential file (or to a sequential file in /dev/null) to see the actual speed on your system. The speed should not be 24 rows per second, even on an overloaded system. Most systems should give results over 10K rows/sec at a minimum.

COMMON blocks are very efficient, and I am certain that balajisr's comment about avoiding them is mistaken. The COMMON block used in these routines is there to avoid having to re-open the hashed file on each call, which would make the routine perform very slowly indeed.

As rasi has already mentioned, using one call to get a seed value and then using the counters @INROWNUM or @OUTROWNUM to provide the increments is a very fast way of doing this; but since the hashed file's maximum value is not being updated, it needs to be corrected after a complete run and also handled in case of an abort. If you are writing to a database, you can always do your SQL's equivalent of getting the maximum value of the key field prior to the run and use that as the seed.
Thanks for the clarification, Arnd.

I arrived at the above conclusion only on the basis of the following statement on page 2-12 of the Server Job Developer's Guide:
[quote]You cannot use inter-process row buffering if your job uses common blocks in transform functions to pass data between stages. This is not a recommended practice and it is advisable to redesign the job to use row buffering rather than common blocks.[/quote]
What does this statement mean exactly? :? Why does the manual say that using COMMON blocks is not recommended? Is it only when we use inter-process buffering that we should avoid COMMON blocks? Is this not a major restriction on using COMMON blocks?

Posted: Thu Mar 30, 2006 1:44 am
by ArndW
balajisr,

when you have inter-process row buffering turned on, the transform stages will be executing in different processes and any COMMON block information will not be shared between them; this is why that page recommends not using COMMON blocks to share data between stages. If you use a COMMON block to store persistent information between calls in the same process (i.e. within a single Transformer stage) then you will have no problems at all. It is only when a call in Transform2 tries to read a value set by Transform1 that it will not work with inter-process buffering turned on.
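A small, untested sketch of the safe pattern (all names invented): the COMMON block only caches state for calls made within the same routine in the same process, so inter-process row buffering does not affect it.

[code]
* Body of a hypothetical transform function with one argument, SeedArg.
* The COMMON block persists between calls within one process only.
COMMON /KeyCacheSketch/ Initialised, Counter

If NOT(Initialised) Then
   Counter = SeedArg        ;* first call in this process: start from the seed
   Initialised = @TRUE
End

Counter = Counter + 1
Ans = Counter               ;* value returned by the transform function
[/code]

The problem the manual describes only arises if a routine in a second Transformer tries to read /KeyCacheSketch/ after it was populated by the first Transformer; with inter-process buffering on, the two stages run in separate processes and each sees its own copy of the block.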

Posted: Thu Mar 30, 2006 6:59 pm
by rachelsu
I tried a simple server job

SEQ FILE A --> TRANSFORM --> SEQ FILE B

The TRANSFORM contains a direct map between the A and B columns, except that the last numeric column is set to use KeyMgtGetNextValue('atest') instead of the input value.

The speed I'm getting is 3-4 rows/sec with a stop limit of 5k rows...

Am I doing something wrong, or is our server not working properly? For other simple transforms on the same box I'm getting a few thousand rows/sec throughput...
ArndW wrote:try writing a job that reads a sequential file, uses the key management routines, and writes to a sequential file (or to a sequential file in /dev/null) to see the actual speed on your system. The speed should not be 24 rows per second, even on an overloaded system. Most systems should give results over 10K rows/sec at a minimum.

Posted: Thu Mar 30, 2006 7:09 pm
by rasi
Rachel

3 to 4 rows per second is not normal; I suspect something is wrong. From sequential file to sequential file, without any complexity inside your Transformer, it should manage at least a few thousand records per second. By the way, how many columns are there in your sequential file?

Try to run this test for at least 100,000 records to get the average rows per second.

Posted: Thu Mar 30, 2006 8:35 pm
by rachelsu
18 columns in total (each a varchar of approximately 4-20 characters). Once I remove the KeyMgt transform, it runs at a few thousand rows/sec, hence my initial question about KeyMgt performance...
rasi wrote:3 to 4 rows per second is not normal; I suspect something is wrong. From sequential file to sequential file, without any complexity inside your Transformer, it should manage at least a few thousand records per second. By the way, how many columns are there in your sequential file?

Try to run this test for at least 100,000 records to get the average rows per second.

Posted: Fri Mar 31, 2006 1:06 am
by ArndW
rachelsu,

it should not be that slow. Perhaps there is something wrong with the file itself. I'm not at a DataStage machine right now, but I seem to recall that this is not a dynamic file (it doesn't need to be, as it holds only a couple of records). I'll post the file name once I get to a machine (you can find the name if you look at the source program); then it would be good for you to declare that file on the Designer canvas with one column and do a View Data on it. It should contain only one record per key that you are tracking, and each record will have the value of the next key.

Posted: Fri Mar 31, 2006 1:14 am
by ray.wurlod
Has some kind soul modified your KeyMgt routine? (There was one group of consultants running around Australia and New Zealand doing that for some years; they rebadged the SDK routines as their own and managed to ruin more than a few of them in the process.)