Page 1 of 1

In-memory lookup

Posted: Fri Aug 14, 2009 11:26 am
by ivannavi
I hardly ever do server jobs. But here it is. To make it easier I will throw some statements/conclusions of mine and ask for someone to confirm whether the following are true or false:

1) Using Oracle OCI stage that reads from a very small reference table as a source for the reference link to a transformer:
Even if I edit the OCI query not to use bind variables forwarded from the transformer (primary link), it will do roundtrips to the database issuing the same sql over and over for each row coming from the primary link.(true/false)?

2) The same thing described in 1) goes for ODBC and DRS stage. (true/false)?

3) This does not happen in parallel jobs with normal (not sparse) lookup mode (I have never had performance issues). (true/false)?

4) The only way to do what is called a normal lookup (as opposed to sparse lookup in parallel jobs) is to use hashed files for reference input. (true/false)?

Posted: Fri Aug 14, 2009 12:38 pm
by kduke
When is the interview? We don't answer interview questions.

Posted: Sat Aug 15, 2009 11:31 am
by ivannavi
There really is not any interview. I'm on the same site I've been for the last four years. No need to be rude. I believe none of the employers ever heard of something called DataStage, let alone asking something about it in a job interview. I hoped I could ask a question here about features I have never used before. And I did my homework before I asked. Thus the multiple choice questions that I put much effort into. 99$ to get an insult from a DS guru?

Posted: Sat Aug 15, 2009 12:43 pm
by kduke
Sorry if I came off as rude but the goal of this forum is to educate.

1) Using Oracle OCI stage that reads from a very small reference table as a source for the reference link to a transformer:
Even if I edit the OCI query not to use bind variables forwarded from the transformer (primary link), it will do roundtrips to the database issuing the same sql over and over for each row coming from the primary link.(true/false)?

TRUE in server. Lookups in PX are in memory. To do a in memory lookup use a hashed file.

2) The same thing described in 1) goes for ODBC and DRS stage. (true/false)?

TRUE. Same as above.

3) This does not happen in parallel jobs with normal (not sparse) lookup mode (I have never had performance issues). (true/false)?

TRUE does not hit database. Sparse is same as server. Hits the database if not found in previous lookup.

4) The only way to do what is called a normal lookup (as opposed to sparse lookup in parallel jobs) is to use hashed files for reference input. (true/false)?

FALSE. No such thing as a hashed file in PX. A PX lookup is in memory.

Posted: Sat Aug 15, 2009 2:36 pm
by chulett
Just to help out a little bit, based on my (admittedly) limited PX knowledge:

1) Right, true. As noted, the Server equivalent of a 'in memory lookup' would be a cached hashed file reference lookup.

2) True, any DB stage when used as a lookup will make 'round trips' to the database across the network with each row... even when you don't have a Key expression.

3) True, but as I understand it there is an initial trip to the database to get all of the records needed by the lookup which are stored locally in a 'lookup file set'. Not sure how much of it after that is 'in memory' however.

4) Depends on what a 'normal' lookup is. If that means pull all of the reference data into a local structure, then I guess it would be true. False if you consider a local UV lookup a 'normal' lookup. I'm assuming you are equating a DB lookup in Server to a sparse lookup in PX.

Posted: Sun Aug 16, 2009 7:06 am
by chulett
Right, but I was specifically wondering what was being considered a 'normal' lookup in Server as opposed to PX.

As to your doubts, I've never seen any evidence that suggests that's how a database lookup works in Server. AFAIK it will always ping the database, even if the key value is the same as the last one. The assumption is the last set of values are cached in memory and I don't believe that to be the case for Server. PX perhaps, as it is more savy in ways of that nature, but not Server.

Posted: Sun Aug 16, 2009 9:43 am
by eostic
I've been told by various dedicated experts of the rdbms databases that we support (db2, oracle, etc.) that the rdbms' themselves have algorithms that are smart enough to do a degree of caching, but in Server there is nothing in particular that keeps the Stage from going back to the database each time.....and indeed, there are times when this is absolutely critical, because you want to ensure, as in a realtime Job, that you will get any "just committed" rows.

Ernie

Posted: Sun Aug 16, 2009 5:07 pm
by ray.wurlod
To all of that I must add that DataStage uses "prepared SQL" so the original assertion, that
it will do roundtrips to the database issuing the same sql over and over for each row coming from the primary link
is actually false. The SQL Statement is sent when the stage starts, and the database server runs it through its optimizer and whatever else it needs to do - it may even build an in-memory table containing the query results (all keys) at that point. In a reference input, DataStage sends the key value (or an array of them) and the database server responds with the corresponding row (or an array of them). If no row matches, either the database server or (if not) the DataStage stage code generates a row of nulls.

Posted: Mon Aug 17, 2009 2:24 am
by ivannavi
I appreciate all your replies and hereby pronounce this topic resolved.