Suggestions needed to improve the performance (hashed file)
Moderators: chulett, rschirm, roy
Hi All,
I have a job designed to do incremental loading based upon the last-updated date in the target table. I stage all the incremental data in a sequential file and look it up against a hashed file containing all the key values of the target table; based upon whether the key is found or not found, I do my inserts and updates. Now, my problem is that the target table has around 80 million records, so I need to load these 80 million keys into a hashed file for lookup purposes. Firstly, a 32-bit hashed file is not big enough to hold this data, so I created a 64-bit hashed file. Now it's taking very long to load all these keys into the hashed file, around 100-120 minutes. I would like to know a better way to do this. My incremental data is always around 1 million rows. So, basically, I have to wait until my hashed file is loaded with all the key column values (80 million rows) and then just look up those 1 million rows from the source and see if they exist in the target or not.
I somehow have a feeling that there should be a better way to do this. Any guidance would be of great help.
Thanks
Kris.
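For illustration, the found/not-found routing described above can be sketched in Python, with a set standing in for the hashed file of target-table keys (names and values are made up for the example):

```python
# Simulate the hashed-file lookup: route delta rows to inserts or updates.
# The set stands in for the hashed file holding the target table's keys.
target_keys = {101, 102, 103}          # keys already present in the target

delta = [
    {"key": 102, "value": "updated"},  # key found   -> update
    {"key": 200, "value": "new"},      # key missing -> insert
]

inserts, updates = [], []
for row in delta:
    (updates if row["key"] in target_keys else inserts).append(row)
```

The pain point in the thread is building `target_keys` when the target holds 80 million rows, not the routing itself.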
The problem is not so much hashed file performance as it is job design. Can you not flag changed records in your table so that instead of reloading 80 million rows each and every time you run this job you will just update those values that have changed?
Take your delta data set, scan it, and build a distinct list of natural keys. Bulk load those keys into a work table in your target database. Inner join that work table to your target table and spool the rows (natural and surrogate keys, or ideally the entire row) out to your hashed file. The hashed file will be created fresh on each run and will contain only the minimum necessary rows.
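This approach can be sketched with SQLite standing in for the target database (table and column names here are illustrative, not from the original job):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Target table: 80M rows in practice; a handful here.
cur.execute("CREATE TABLE target (natural_key INTEGER PRIMARY KEY, payload TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

# Step 1: distinct natural keys from the delta, bulk-loaded into a work table.
delta_keys = [2, 2, 4, 9]  # 9 is new, not yet in the target
cur.execute("CREATE TABLE work_keys (natural_key INTEGER PRIMARY KEY)")
cur.executemany("INSERT OR IGNORE INTO work_keys VALUES (?)",
                [(k,) for k in delta_keys])

# Step 2: the inner join spools only the matching target rows -- this
# (not all 80M rows) is what gets written to the hashed file.
rows = cur.execute("""
    SELECT t.natural_key, t.payload
    FROM target t JOIN work_keys w ON t.natural_key = w.natural_key
""").fetchall()

hashed_file = dict(rows)   # stands in for the lookup hashed file
```

Only the keys that actually appear in the delta end up in the lookup, so its size is bounded by the ~1 million delta rows rather than the 80 million target rows.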
Why maintain a complete copy of your target tables in your ETL tool? Restartability, corruption, etc. are all issues with your design. We talk a lot about this method on the forum; you would enjoy reading those discussions. There are refinements, such as loading two hashed files: one of natural-to-surrogate key cross-references, and the other a full complement of columns for the row. This way you can either reference just the keys during foreign-key substitution or have the full detail. It also leads to discussion about rapid comparison of current rows to new rows using CRC32 methods.
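The CRC32 comparison mentioned above can be sketched like this: checksum the concatenated non-key columns of each row and compare checksums instead of comparing column by column. This is a sketch only; CRC32 collisions are possible, so it trades a tiny false-match risk for speed:

```python
import zlib

def row_crc(row, key_cols):
    # Checksum of all non-key columns, in a fixed column order.
    payload = "|".join(str(v) for k, v in sorted(row.items()) if k not in key_cols)
    return zlib.crc32(payload.encode("utf-8"))

current          = {"id": 7, "name": "Acme", "city": "Sydney"}
incoming_same    = {"id": 7, "name": "Acme", "city": "Sydney"}
incoming_changed = {"id": 7, "name": "Acme", "city": "Perth"}

unchanged = row_crc(current, {"id"}) == row_crc(incoming_same, {"id"})
changed   = row_crc(current, {"id"}) != row_crc(incoming_changed, {"id"})
```

Storing the CRC alongside the keys in the hashed file means a changed row is detected with one integer comparison rather than comparing every column.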
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Re: Suggestions needed to improve the performance (hashed file)
You have 80 million key values?! Are you sure? Can you explain your target table design in brief?
Thanks,
Raghu
kris007 wrote: What I meant was, there are 80 million rows in my target table, each row identified by a distinct key. Is that what you wanted to know?
Kris
If you have 80 million unique records in your target table, then:
- Time-stamp your target table.
- Incrementally load your hashed file. Assuming 1 million records per load, the hashed file load should be much quicker. Is this not possible?
-Raghu
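This suggestion, sketched in Python: instead of rebuilding the 80-million-key lookup from scratch each run, keep the hashed file between runs and merge in only the keys written during the last load (a sketch with assumed values; the dict stands in for the persistent hashed file):

```python
# Persistent lookup of target keys, kept between runs (stands in for
# the hashed file on disk).
lookup = {1: "2024-01-01", 2: "2024-01-01"}   # key -> last-updated stamp

# After each load, merge in only the keys just written (the ~1M delta)
# instead of re-extracting all 80M keys from the target table.
delta_loaded = {2: "2024-02-01", 3: "2024-02-01"}
lookup.update(delta_loaded)
```

The trade-off, raised elsewhere in the thread, is that a persistent copy of the target keys in the ETL tool brings restartability and corruption concerns.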
I do have a timestamp field in my target table, but I didn't quite follow you on how to incrementally load the hashed file.
Based upon the timestamp in my target table, I pull only those records from my source which have changed since the target table was last loaded. Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is in this second part where I had a problem. Please let me know your ideas, as I didn't seem to follow your post.
Thanks
Kris.
Kris, that's basically what Ken explained in his post.
-craig
"You can never have too many knives" -- Logan Nine Fingers
When you say "...Now, this incremental data, I need to load into the target table based upon the keys found or not found...", do you do only inserts, or are there updates too? That is, if the key is found, do you also update?
-Raghu
Hey Craig, I am doing exactly what Ken has suggested, but Raghu seems to have come up with a different idea. I was just trying to find out clearly what he is suggesting, so that it would help. It wouldn't hurt to know if there is another way, right?
Kris
Another way, IMHO, is to use several DB stages as your source.
Give a key window in your WHERE clause, for example:
in the first stage, WHERE key <= 1000000
in the second stage, WHERE key > 1000000 AND key <= 2000000
and so on.
Collect the output via a link collector and load it into a hashed file. For 80 million records, you will see your performance boost tremendously, provided you have multiple CPUs.
Try it out; it may well work.
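The key-window idea, sketched in Python: each window becomes a separate range query (one per DB stage in the job), and the results are merged into one lookup, so the queries can run concurrently on multiple CPUs (window boundaries and data are illustrative):

```python
# Partition the key space into windows, one query per window, then
# merge the results -- mimicking multiple DB stages plus a link collector.
windows = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

# Pretend source table: key -> row. In the real job each window is a
# separate "WHERE key > lo AND key <= hi" query run in parallel.
source = {k: f"row-{k}" for k in (5, 1_500_000, 2_999_999)}

def fetch_window(lo, hi):
    return {k: v for k, v in source.items() if lo < k <= hi}

collected = {}
for lo, hi in windows:
    collected.update(fetch_window(lo, hi))
```

Because the windows are disjoint and cover the key range, the merged result equals a single full-table extract, just fetched in parallel pieces.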
Regards,
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.