Range Lookup

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Range Lookup

Post by varshanswamy »

I have 2 database stages
First database stage contains a number field called IP_NUMBER.

Second database stage contains fields called

IP_FROM
IP_TO
COUNTRY_NAME
COUNTRY_CODE
DOMAIN_NAME

I have done a sparse lookup and a query in the database as follows
that is to obtain COUNTRY_CDOE, COUNTRY_NAME, DOMAIN_NAME based on the IP_NUMBER.

SELECT
COUNTRY_CODE,
COUNTRY_NAME,
DOMAIN_NAME
FROM
IP_LOOKUP
WHERE
ORCHESTRATE.IP_NUMBER
BETWEEN
IP_FROM
AND
IP_TO

The problem is since it is a sparse lookup, it is taking a very long time. Even though the table has been indexed, it is executing at the rate of one record per second, there are 1 million records, to be processed, so it is really very slow. Please suggest on this.
sudharsanan
Participant
Posts: 64
Joined: Fri Jul 16, 2004 7:53 am

Post by sudharsanan »

Does COUNTRY_CDOE, COUNTRY_NAME, DOMAIN_NAME and IP_NUMBER are also keys in this table or not..or IP_NUMBER is the only key.. are you using the concept of surrogate keys in this table..give us more info or the datamodel so that we can find why this is taking more time to execute...
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

The problem is that there are no keys here, There are no surrogate keys too,
I have fiven the structure of the 2 tables , based on the IP_NUMBER from one table I need to lookup in another table if it is present in the range and derive the rest of the values of COUNTRY_CODE, COUNTRY_NAME and DOMAIN_NAME. The values of IP_FROM and IP_TO are also not constant.
so this is the problem
sudharsanan
Participant
Posts: 64
Joined: Fri Jul 16, 2004 7:53 am

Post by sudharsanan »

I think since you done have a key in the second lookup it is taking more time to retrive the result.. can you try using a sort stage to sort the incoming data from the database.. this might retrive the data quickly..i am guessing :wink: ..just give a try and let us know what has happened...
aparnanarale
Participant
Posts: 6
Joined: Mon Feb 07, 2005 11:54 pm

Post by aparnanarale »

There is not much difference even if I sort the data, so please let me know what to do
HSBCdev
Premium Member
Premium Member
Posts: 141
Joined: Tue Mar 16, 2004 8:22 am
Location: HSBC - UK and India
Contact:

Re: Range Lookup

Post by HSBCdev »

I want to ask this question too, but I use 2 Dataset stages instead.

So how can I build up a DS job that can get the same result as the SQL statement.

Thanks a lot !
varshanswamy wrote:I have 2 database stages
First database stage contains a number field called IP_NUMBER.

Second database stage contains fields called

IP_FROM
IP_TO
COUNTRY_NAME
COUNTRY_CODE
DOMAIN_NAME

I have done a sparse lookup and a query in the database as follows
that is to obtain COUNTRY_CDOE, COUNTRY_NAME, DOMAIN_NAME based on the IP_NUMBER.

SELECT
COUNTRY_CODE,
COUNTRY_NAME,
DOMAIN_NAME
FROM
IP_LOOKUP
WHERE
ORCHESTRATE.IP_NUMBER
BETWEEN
IP_FROM
AND
IP_TO

The problem is since it is a sparse lookup, it is taking a very long time. Even though the table has been indexed, it is executing at the rate of one record per second, there are 1 million records, to be processed, so it is really very slow. Please suggest on this.
Gazelle
Premium Member
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Our friends at Ascential Support wrote a cusom stage for us... the problem is that it "collects" the entire million rows before doing the range lookup.

The rather inelegant solution we implemented was to join against all reference rows (I think it's called a cartesian product) and then have another stage that filters out only those rows where IP_NUMBER >= IP_FROM and IP_NUMBER <= IP_TO.

To do the join, we added a "dummy_key" column to each side of the join (and set its value to "1").

It's a pretty data-heavy solution, though, so I don't know whether it will really help with the execution time.

I hope it helps.

- g
"If at first you don't succeed, use a bigger hammer." :)
HSBCdev
Premium Member
Premium Member
Posts: 141
Joined: Tue Mar 16, 2004 8:22 am
Location: HSBC - UK and India
Contact:

Post by HSBCdev »

Gazelle, in fact, the method you mention we have already used before, but the performance is really poor.

How about your custom stage ?

Or any other solution ?
Gazelle wrote:Our friends at Ascential Support wrote a cusom stage for us... the problem is that it "collects" the entire million rows before doing the range lookup.

The rather inelegant solution we implemented was to join against all reference rows (I think it's called a cartesian product) and then have another stage that filters out only those rows where IP_NUMBER >= IP_FROM and IP_NUMBER <= IP_TO.

To do the join, we added a "dummy_key" column to each side of the join (and set its value to "1").

It's a pretty data-heavy solution, though, so I don't know whether it will really help with the execution time.

I hope it helps.

- g
"If at first you don't succeed, use a bigger hammer." :)
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Re: Range Lookup

Post by varshanswamy »

I have implemented the same by writing a server job and calling it in the form of a shared container in the parallel job. I still need to check on performance. In the server job, I have put the records of the IP_LOOKUP table into a hash file, and later in another job I have used the Universe Stage to do the lookup based on the range IP_NUMBER > IP_FROM and IP_NUMBER < IP_TO. But I would appreciate if anyone could give me a solution to this in PX itself.
Gazelle
Premium Member
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

It is much the same in PX. We just don't use a hash file.
On the incoming stream, we add a column called "dummy_key" and give it a value of "1".
On the reference stream, we also add a column called "dummy_key" and give it a value of "1".
A "Join" is done on the key "dummy_key", which pairs each rows from the input stream with all rows on the reference stream.
The joined rows then go to a "Transformer" stage where we only let through rows where IP_NUMBER >= IP_FROM AND IP_NUMBER <= IP_TO.

I can't give much info about the Custom Stage, other than:
I am having a bit of trouble getting it to work; it can only use one key (and it's max and min value); it needs the partioning set to "Entire".
I can't see that it would provide much of a performance improvement (since I believe it collects all the incoming rows before it starts processing), but it sure looks a lot more elegant!
Speak to Ascential Support for more info.

- g
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You said you have 2 database stages. Does it imply that the 2 tables reside in separate db or in the same db in different tables?

If they reside on the same db, you may be better-off using an SQL join instead of a lookup. You can use outer join if you wish to include non-matching rows.
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

I want to know one thing how is it possible to filter based on IP_NUMBER >= IP_FROM AND IP_NUMBER <= IP_TO in transformer do you mean
adding it in the constraint. Because I would be having a cartesian product now of the records when I join using a "dummy_key".
Gazelle wrote:It is much the same in PX. We just don't use a hash file.
On the incoming stream, we add a column called "dummy_key" and give it a value of "1".
On the reference stream, we also add a column called "dummy_key" and give it a value of "1".
A "Join" is done on the key "dummy_key", which pairs each rows from the input stream with all rows on the reference stream.
The joined rows then go to a "Transformer" stage where we only let through rows where IP_NUMBER >= IP_FROM AND IP_NUMBER <= IP_TO.

I can't give much info about the Custom Stage, other than:
I am having a bit of trouble getting it to work; it can only use one key (and it's max and min value); it needs the partioning set to "Entire".
I can't see that it would provide much of a performance improvement (since I believe it collects all the incoming rows before it starts processing), but it sure looks a lot more elegant!
Speak to Ascential Support for more info.

- g
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

I have a single database, the problem why I cant join is there are no keys. But I will still try the outer join.
Sainath.Srinivasan wrote:You said you have 2 database stages. Does it imply that the 2 tables reside in separate db or in the same db in different tables?

If they reside on the same db, you may be better-off using an SQL join instead of a lookup. You can use outer join if you wish to include non-matching rows.
Gazelle
Premium Member
Premium Member
Posts: 108
Joined: Mon Nov 24, 2003 11:36 pm
Location: Australia (Melbourne)

Post by Gazelle »

Yes, the filtering was done in the Transformer stage as a "constraint".
HSBCdev
Premium Member
Premium Member
Posts: 141
Joined: Tue Mar 16, 2004 8:22 am
Location: HSBC - UK and India
Contact:

Post by HSBCdev »

I think after join stage use Filter stage is much better ( Transformer 's performance is not good).

Besides, Gazelle, Would you please send me your custom stage (source code), I would like to have a look and study.

Many many thanks !


Gazelle wrote:Yes, the filtering was done in the Transformer stage as a "constraint".
Post Reply