Suggestions needed to improve performance (hashed file)

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


raoraghunandan
Charter Member
Posts: 19
Joined: Sun Jul 20, 2003 4:29 am

Post by raoraghunandan »

kris007 wrote:
Kris, that's basically what Ken explained in his post.
Hey Craig, I am doing exactly what Ken has suggested, but Raghu seems to have come up with a different idea. I was just trying to find out what he is suggesting, clearly, so that it would help. It wouldn't hurt to know if there is another way... right? :wink:

Kris
Here is what I was thinking:

1) You have 80 million records in the target, with a timestamp.

2) You load these 80 million keys into a hashed file (this is the problem process and takes a lot of time, according to you).

3) You get 1 million incremental records. You look up the hashed file and either update or insert based on whether the key is already there. This is not a problem process according to you. So now you have (say) 80.6 million records in the target, assuming 0.4 million records were updates.

4) Now insert the 1 million records into your hashed file based on your timestamp, i.e. do not clear the hashed file before you insert. Your hashed file will now have 80.6 million records too. This time the job will be quick because the number of records is small.

Repeat step 3) and step 4) for your incremental loads (a sketch of the timestamp-based extract is shown below).

Wouldn't this work? Have I missed something?

However, this design is not complete and still has other problems, such as: what happens when you want to do a full load? You might have to handle that through other jobs.
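To make step 4) concrete, here is a rough sketch of the timestamp-driven extract that would feed the incremental hashed file load. The table and column names (TGT_TABLE, NAT_KEY, LOAD_TS) and the :last_run_ts parameter are made up for illustration - substitute whatever your target actually uses.

    -- Pull only the rows touched since the previous run; these are the keys
    -- that get appended to the (uncleared) hashed file in step 4).
    SELECT NAT_KEY, LOAD_TS
    FROM   TGT_TABLE
    WHERE  LOAD_TS > :last_run_ts;

Because a hashed file is keyed, re-inserting a key that is already present simply overwrites that entry rather than duplicating it, which is why the record count stays at 80.6 million.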

Thanks,
Raghu
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Kris isn't processing 80 million records; that's what's in the target.
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I understand that, Craig. But he needs to build that hashed file from the target for the lookup, right? Or am I on the wrong track...?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The key here is to not load all 80 million target records into the hashed file used for the lookup - there's no need for that. That's what Ken explained - a method to load only the natural keys from the 80 million target records that match up with the staged incremental data. That way your lookup has the minimum amount of information needed for the current run.
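For illustration only - the table and column names below (TGT_TABLE, STG_DELTA, NAT_KEY, LOAD_TS) are hypothetical - the shape of that extraction query is something like this:

    -- Stage the incremental feed first, then pull only the target keys that
    -- the current run can actually hit; this is what populates the hashed file.
    SELECT t.NAT_KEY, t.LOAD_TS
    FROM   TGT_TABLE t
    INNER JOIN STG_DELTA s
            ON s.NAT_KEY = t.NAT_KEY;

With 1 million staged rows, the hashed file ends up with at most 1 million entries instead of 80 million.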
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Even in that case, when he is building the hashed file for the first time, if the natural key count is in the millions, then with multiple CPUs at our service he will definitely see a performance boost. And for the daily runs he can maintain a separate job for the incremental load, as raoraghunandan suggested.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
raoraghunandan
Charter Member
Posts: 19
Joined: Sun Jul 20, 2003 4:29 am

Post by raoraghunandan »

chulett wrote: The key here is to not load all 80 million target records into the hashed file used for the lookup - there's no need for that. That's what Ken explained - a method to load only the natural keys from the 80 million target records that match up with the staged incremental data. That way your lookup has the minimum amount of information needed for the current run.
I am only suggesting processing the hashed file load incrementally. While this may be a storage issue, my question regarding Ken's suggestion is: wouldn't joining an 80 million row table with another big table be slow in itself?

-Raghu
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

I am quite sure that we are never going to do a full load now or in the near future, so it's all a matter of doing the daily incremental loading. Also, when I load the hashed file with more than 80 million keys, it has to be a 64-bit hashed file (since it's going to hit the 2.2 GB limit), which again I think is going to affect my performance. At this time, given my situation, I feel that Ken's solution is the most approachable one, since it just involves an extra step of staging the delta data, which is also going to be much faster than my previously designed job.
DSguru2B wrote: And for the daily runs he can maintain a separate job for the incremental load, as raoraghunandan suggested.
I am under the impression that a daily load and incremental loading are the same if we are doing it with respect to the timestamp in the target table. Please correct me if I am wrong.

Kris
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

That's exactly what I meant, Kris. I am more used to the term 'daily' over here at my client, as our incremental loads are of daily frequency.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
raoraghunandan
Charter Member
Posts: 19
Joined: Sun Jul 20, 2003 4:29 am

Post by raoraghunandan »

kris007 wrote: I am under the impression that a daily load and incremental loading are the same if we are doing it with respect to the timestamp in the target table. Please correct me if I am wrong.
It depends on whether your timestamp goes down to hours and minutes. All the best, and let us know your results.

-Raghu
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

raoraghunandan wrote: my question regarding Ken's suggestion is: wouldn't joining an 80 million row table with another big table be slow in itself?
Not if it's done properly. :wink:

We do this all the time; it's a matter of ensuring the query is as optimal as possible. If it takes hints or an additional index, so be it. The net savings over loading everything into a gi-huge-ic hashed file can be substantial.
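As a sketch only - the names are hypothetical and the right plan depends on your data and your database:

    -- An index on the natural key lets the optimizer probe the 80 million row
    -- target rather than scan it.
    CREATE INDEX IX_TGT_NATKEY ON TGT_TABLE (NAT_KEY);

    -- The staged delta drives the join; check the explain plan and only add a
    -- hint if the optimizer picks a bad one.
    SELECT t.NAT_KEY, t.LOAD_TS
    FROM   STG_DELTA s
    JOIN   TGT_TABLE t
      ON   t.NAT_KEY = s.NAT_KEY;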
-craig

"You can never have too many knives" -- Logan Nine Fingers
thurmy34
Premium Member
Posts: 198
Joined: Fri Mar 31, 2006 8:27 am
Location: Paris

Post by thurmy34 »

Hi all
This topic is great because we all have this kind of problem.
What about the 'Replace existing rows completely' update action? (I have this option in the DB2 plug-in.)
You don't have to do a lookup, but I don't know the impact on the database.
Hope This Helps
Regards
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That 'update action' first deletes any record and then inserts a new one in its place. If you are willing to burn two DML statements for every record when one would do, I suppose it could be used in that manner. :?
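In generic SQL terms, with hypothetical table and column names, per incoming row it works out to this:

    -- What a delete-then-insert 'update action' issues, whether or not the
    -- row already exists:
    DELETE FROM TGT_TABLE WHERE NAT_KEY = :key;
    INSERT INTO TGT_TABLE (NAT_KEY, SOME_COL, LOAD_TS)
    VALUES (:key, :some_col, :load_ts);

    -- Versus the single statement you could issue once the lookup tells you
    -- the row is already there:
    UPDATE TGT_TABLE
    SET    SOME_COL = :some_col,
           LOAD_TS  = :load_ts
    WHERE  NAT_KEY  = :key;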
-craig

"You can never have too many knives" -- Logan Nine Fingers