Suggestions needed to improve the performance (hashed file)

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Suggestions needed to improve the performance (hashed file)

Post by kris007 »

Hi All,

I have a job designed to do incremental loading based upon the last updated date in the target table. I stage all the incremental data in a sequential file and look it up against a hashed file containing all the key values of the target table; based upon whether the key is found or not found, I do my inserts and updates.

Now, my problem is that the target table has around 80 million records, so I need to load 80 million keys into a hashed file for lookup purposes. Firstly, a 32-bit hashed file is not big enough to hold this data, so I created a 64-bit hashed file. Now it is taking very long to load all these keys into the hashed file, around 100-120 minutes. I would like to know a better way to do this. My incremental data is always around 1 million rows, so basically I have to wait until my hashed file is loaded with all 80 million key column values and then just look up those 1 million source rows to see whether they exist in the target or not.

I somehow have a feeling that there should be a better way to do this. Any guidance would be of great help.
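For reference, the staging extract is driven by the last updated date, something along these lines (the table and column names here are only placeholders, not my real schema; #LAST_LOAD_TS# stands for a job parameter holding the previous run's high-water timestamp):

Code:

-- Incremental extract into the staging sequential file.
-- SOURCE_TABLE, SRC_KEY, SRC_COL1 and LAST_UPDATED are placeholder names.
SELECT  SRC_KEY,
        SRC_COL1,
        LAST_UPDATED
FROM    SOURCE_TABLE
WHERE   LAST_UPDATED > '#LAST_LOAD_TS#'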

Thanks
Kris.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The problem is not so much hashed file performance as it is job design. Can you not flag changed records in your table so that, instead of reloading 80 million rows every time you run this job, you just refresh the values that have changed?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Take your delta data set, scan it and build a distinct list of natural keys. Bulk load those keys into a work table in your target database. Inner join that work table to your target table and spool the rows (natural and surrogate keys, or ideally the entire row) out to your hashed file. The hashed file is then created fresh on each run and contains only the minimum necessary rows.

Why maintain a complete copy of your target tables in your ETL tool? Restartability, corruption, etc. are all issues with your design. We talk a lot about this method on the forum; you would enjoy reading those discussions. There are refinements, such as loading two hashed files: one of natural-to-surrogate key cross-references, and the other a full complement of columns for the row. This way you can either reference just the keys during foreign key substitution, or have the full detail. It also leads to discussions about rapid comparison of current to new rows using CRC32 methods.
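A rough SQL sketch of what I mean (all table and column names are assumptions, not your actual schema):

Code:

-- Step 1 (not shown): bulk load the distinct natural keys from the
-- 1-million-row delta into a work table, here called WORK_DELTA_KEYS.

-- Step 2: inner join the work table to the target and spool only the
-- matching rows out to the lookup hashed file.
SELECT  t.NATURAL_KEY,
        t.SURROGATE_KEY          -- or t.* for the full row
FROM    TARGET_TABLE t
INNER JOIN WORK_DELTA_KEYS w
        ON w.NATURAL_KEY = t.NATURAL_KEY

The hashed file then never holds more than the delta's worth of rows, about 1 million in your case, instead of all 80 million.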
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
raoraghunandan
Charter Member
Posts: 19
Joined: Sun Jul 20, 2003 4:29 am

Re: Suggestions needed to improve the performance (hashed file)

Post by raoraghunandan »

kris007 wrote: Hi All,

I have a job designed to do incremental loading based upon the last updated date in the target table. I stage all the incremental data in a sequential file and look it up against a hashed file containing all the key values of the target table; based upon whether the key is found or not found, I do my inserts and updates. Now, my problem is that the target table has around 80 million records, so I need to load 80 million keys into a hashed file for lookup purposes. Firstly, a 32-bit hashed file is not big enough to hold this data, so I created a 64-bit hashed file. Now it is taking very long to load all these keys into the hashed file, around 100-120 minutes. I would like to know a better way to do this. My incremental data is always around 1 million rows, so basically I have to wait until my hashed file is loaded with all 80 million key column values and then just look up those 1 million source rows to see whether they exist in the target or not.
I somehow have a feeling that there should be a better way to do this. Any guidance would be of great help.

Thanks
Kris.
You have 80 million key values!!! Are you sure? Can you briefly explain your target table design?

Thanks,
Raghu
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

What I meant was, there are 80 million rows in my target table, each row identified by a distinct key. Is that what you wanted to know?

Kris
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Even at 10,000 rows per second, 80 million rows will take (80,000,000 / 10,000) = 8,000 seconds, or over two hours. You may need to manage expectations.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

You are right, Ray. The rows were being processed at 14,000 rows/sec, and the time it takes is still not what I am looking for, hence I was looking for a better way to design the job. Ken's idea is appealing, so I am in the process of trying it out.
raoraghunandan
Charter Member
Posts: 19
Joined: Sun Jul 20, 2003 4:29 am

Post by raoraghunandan »

kris007 wrote:What I meant was, there are 80 million rows in my target table, each row identified by a distinct key. Is that what you wanted to know?

Kris
If you have 80 million unique records in your target table, then:
- Timestamp your target table.
- Incrementally load your hashed file, as sketched below. Assuming 1 million records per load, this hashed file load should be much quicker. Is this not possible?
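A sketch of the extract that would feed the hashed file on each run (the names and the #LAST_LOAD_TS# job parameter are assumptions):

Code:

-- Pull only the target rows touched since the last run and write them to
-- the lookup hashed file with "clear file before writing" switched off,
-- so existing keys are upserted rather than reloaded from scratch.
SELECT  NATURAL_KEY,
        SURROGATE_KEY
FROM    TARGET_TABLE
WHERE   LAST_UPDATED > '#LAST_LOAD_TS#'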

-Raghu
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

I do have a timestamp field in my target table, but I didn't quite follow how I would incrementally load the hashed file.

Based upon the timestamp in my target table, I pull only those records from my source which have changed since the target table was last loaded. Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is this second part where I have the problem. Please let me know if you have any other ideas, as I didn't quite follow your post.

Thanks
Kris.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kris007 wrote: Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is this second part where I have the problem. Please let me know if you have any other ideas, as I didn't quite follow your post.
Kris, that's basically what Ken explained in his post.
-craig

"You can never have too many knives" -- Logan Nine Fingers
raoraghunandan
Charter Member
Posts: 19
Joined: Sun Jul 20, 2003 4:29 am

Post by raoraghunandan »

kris007 wrote: I do have a timestamp field in my target table, but I didn't quite follow how I would incrementally load the hashed file.

Based upon the timestamp in my target table, I pull only those records from my source which have changed since the target table was last loaded. Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is this second part where I have the problem. Please let me know if you have any other ideas, as I didn't quite follow your post.

Thanks
Kris.
When you say "...I need to load this incremental data into the target table based upon whether the keys are found or not found...",

do you do only inserts, or are there updates too? I.e., if the key is found, do you also update?

-Raghu
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Exactly, I also do updates.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

chulett wrote: Kris, that's basically what Ken explained in his post.
Hey Craig, I am doing exactly what Ken suggested. But Raghu seems to have come up with a different idea, so I was just trying to find out clearly what he is suggesting. It wouldn't hurt to know if there is another way, right? :wink:

Kris
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sure! I just think that at the end of the day they'll be basically the same. No harm in checking it out, of course.
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Another way, IMHO, is to use several DB stages as your source and give each a key window in its WHERE clause, for example:
- in the first stage, WHERE key <= 1000000
- in the second stage, WHERE key > 1000000 AND key <= 2000000
- and so on.
Collect the output via a Link Collector and load it into the hashed file. For 80 million records you will see performance improve tremendously, provided you have multiple CPUs.
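For example, the user-defined SQL in the parallel source stages might look like this (TARGET_TABLE, KEY_COL and the one-million-row window size are just for illustration):

Code:

-- Stage 1
SELECT KEY_COL FROM TARGET_TABLE WHERE KEY_COL <= 1000000
-- Stage 2
SELECT KEY_COL FROM TARGET_TABLE WHERE KEY_COL > 1000000 AND KEY_COL <= 2000000
-- Stage 3
SELECT KEY_COL FROM TARGET_TABLE WHERE KEY_COL > 2000000 AND KEY_COL <= 3000000
-- ...and so on; each output link feeds a Link Collector, which writes
-- to the hashed file.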
Try it out, maybe it will work
:wink:
Regards,
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.