
Suggestions needed to improve the performance (hashed file)

Posted: Wed May 10, 2006 8:35 am
by kris007
Hi All,

I have a job designed to do incremental loading based upon the last updated date in the target table. I stage all the incremental data in a sequential file and look it up against a hashed file containing all the key values of the target table; based upon key found or not found, I do my inserts and updates.

My problem is that the target table has around 80 million records, so I need to load those 80 million keys into a hashed file for lookup purposes. Firstly, a 32-bit hashed file is not big enough to hold this data, so I created a 64-bit hashed file. Now it is taking very long to load all these keys into the hashed file, around 100-120 minutes. I would like to know a better way to do this. My incremental data is always around 1 million rows, so basically I have to wait until my hashed file is loaded with all 80 million key values and then look up just those 1 million source rows to see whether they exist in the target or not.

I have a feeling that there should be a better way to do this. Any guidance would be of great help.
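Roughly, the two extracts behind this design look like the following (the table and column names here are placeholders, not my actual schema):

-- Delta extract: source rows changed since the target was last loaded
-- (this is what gets staged to the sequential file)
SELECT *
FROM   source_table
WHERE  last_updated_date > :last_target_load_date;

-- Key extract: every key in the target table, loaded into the hashed
-- file for the found/not-found lookup; this is the 80-million-row step
SELECT target_key
FROM   target_table;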

Thanks
Kris.

Posted: Wed May 10, 2006 8:59 am
by ArndW
The problem is not so much hashed file performance as it is job design. Can you not flag changed records in your table so that, instead of reloading 80 million rows each and every time you run this job, you just update those values that have changed?

Posted: Wed May 10, 2006 9:02 am
by kcbland
Take your delta data set, scan it, and build a distinct list of natural keys. Bulk load those keys into a work table in your target database. Inner join that work table to your target table and spool the rows (natural and surrogate keys, or ideally the entire row) out to your hashed file. The hashed file will be freshly created on each run and will contain only the minimum necessary rows.

Why maintain a complete copy of your target tables in your ETL tool? Restartability, corruption, etc. are all issues with your design. We talk a lot about this method on the forum; you might enjoy reading those discussions. There are refinements, such as loading two hashed files: one of natural-to-surrogate key cross-references, and the other a full complement of columns for the row. That way you can either reference just the keys during foreign key substitution or have the full detail. It also leads to discussions about rapid current-to-new row comparisons using CRC32 methods.
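In SQL terms, a rough sketch of the idea (the work table and column names are placeholders, not specific to your schema):

-- 1. Work table holding the distinct natural keys from the ~1 million
--    row delta; bulk load those keys into it.
CREATE TABLE work_delta_keys
(
    natural_key  VARCHAR(50) NOT NULL
);

-- 2. Inner join the work table to the target and spool out only the
--    matching rows to build the lookup hashed file.
SELECT t.natural_key,
       t.surrogate_key              -- or t.* for a full-row hashed file
FROM   target_table t
INNER JOIN work_delta_keys w
        ON w.natural_key = t.natural_key;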

Re: Suggestions needed to improve the performance (hashed file)

Posted: Thu May 11, 2006 9:30 am
by raoraghunandan
kris007 wrote: ...the target table has around 80 million records, so I need to load these 80 million keys into a hashed file for lookup purposes...
You have 80 million key values?! Are you sure? Can you explain your target table design in brief?

Thanks,
Raghu

Posted: Thu May 11, 2006 10:13 am
by kris007
What I meant was, there are 80 million rows in my target table, each row identified by a distinct key. Is that what you wanted to know?

Kris

Posted: Thu May 11, 2006 3:01 pm
by ray.wurlod
Even at 10,000 rows per second, 80 million rows will take 80,000,000 / 10,000 = 8,000 seconds, which is over two hours. You may need to manage expectations.

Posted: Thu May 11, 2006 3:11 pm
by kris007
You are right, Ray. The rows were being processed at 14,000 rows/sec, and the time it takes is still more than I can live with, hence I was looking for a better way to design the job. Ken's idea is appealing, so I am in the process of trying it out.

Posted: Fri May 12, 2006 6:12 am
by raoraghunandan
kris007 wrote: What I meant was, there are 80 million rows in my target table, each row identified by a distinct key. Is that what you wanted to know?

Kris
If you have 80 million unique records in your target table, then:
- Timestamp your target table.
- Incrementally load your hashed file. Assuming about 1 million records per load, the hashed file load should be much quicker, along the lines of the sketch below. Is this not possible?
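A minimal sketch of the kind of key extract I mean, assuming a last_updated_date column on the target and a saved refresh timestamp (both names are placeholders):

-- Refresh the lookup hashed file with only the keys that have changed
-- since the last refresh, instead of reloading all 80 million.
SELECT target_key
FROM   target_table
WHERE  last_updated_date > :last_hashed_file_refresh;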

-Raghu

Posted: Fri May 12, 2006 7:37 am
by kris007
I do have a timestamp field in my target table, but I didn't quite follow how I would incrementally load the hashed file.

Based upon the timestamp in my target table, I pull only those records from my source which have changed since the target table was last loaded. Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is in this second part that I have a problem. Please let me know if you have any ideas, as I didn't quite follow your post.

Thanks
Kris.

Posted: Fri May 12, 2006 7:41 am
by chulett
kris007 wrote: Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is in this second part that I have a problem.
Kris, that's basically what Ken explained in his post.

Posted: Fri May 12, 2006 7:54 am
by raoraghunandan
When you say "...I need to load this incremental data into the target table based upon whether the keys are found or not found...", do you do only inserts, or are there updates too? That is, if the key is found, do you also update?

-Raghu

Posted: Fri May 12, 2006 8:20 am
by kris007
Exactly, I also do updates.

Posted: Fri May 12, 2006 9:01 am
by kris007
chulett wrote: Kris, that's basically what Ken explained in his post.
Hey Craig, I am doing exactly what Ken suggested, but Raghu seems to have come up with a different idea. I was just trying to find out clearly what he is suggesting, in case it helps. It wouldn't hurt to know if there is another way, right? :wink:

Kris

Posted: Fri May 12, 2006 9:14 am
by chulett
Sure! I just think that at the end of the day they'll basically be the same. No harm in checking it out, of course.

Posted: Fri May 12, 2006 9:14 am
by DSguru2B
Another way, IMHO, is to use several DB stages as your source, each with a key window in its WHERE clause. For example:
- in the first stage, WHERE key <= 1000000
- in the second stage, WHERE key > 1000000 AND key <= 2000000
- and so on.
Collect the output via a Link Collector and load it into the hashed file. For 80 million records you will see your performance boost tremendously, provided you have multiple CPUs.
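Spelled out as the actual source queries (the key column name and the range boundaries are only illustrative):

-- Stage 1
SELECT target_key FROM target_table WHERE target_key <= 1000000;

-- Stage 2
SELECT target_key FROM target_table WHERE target_key > 1000000 AND target_key <= 2000000;

-- ...and so on, with each output link feeding the Link Collector that
-- writes the combined stream to the hashed file.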
Try it out, maybe it will work
:wink:
Regards,