Suggestions needed to improve the performance (hashed file)
Moderators: chulett, rschirm, roy
Hi All,
I have a job designed to do incremental loading based upon the last-updated date in the target table. I stage all the incremental data in a sequential file and look it up against a hashed file containing all the key values of the target table; based upon whether the key is found or not found, I do my inserts and updates. Now, my problem is that the target table has around 80 million records, so I need to load these 80 million keys into a hashed file for lookup purposes. Firstly, a 32-bit hashed file is not big enough to hold this data, so I created a 64-bit hashed file. Now it's taking very long to load all these keys into the hashed file, around 100-120 minutes. I would like to know a better way to do this. My incremental data is always around 1 million rows. So, basically, I have to wait until my hashed file is loaded with all the key column values (80 million rows) and then just look up those 1 million rows from the source and see if they exist in the target or not.
I somehow have a feeling that there should be a better way to do this. Any guidance would be of great help.
Thanks
Kris.
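For illustration, the found/not-found routing described above can be sketched in Python, with a set standing in for the hashed file of target-table keys (names and values are made up for the example):

```python
# Simulate the hashed-file lookup: route delta rows to inserts or updates.
# The set stands in for the hashed file holding the target table's keys.
target_keys = {101, 102, 103}          # keys already present in the target

delta = [
    {"key": 102, "value": "updated"},  # key found   -> update
    {"key": 200, "value": "new"},      # key missing -> insert
]

inserts, updates = [], []
for row in delta:
    (updates if row["key"] in target_keys else inserts).append(row)
```

The pain point in the thread is building `target_keys` when the target holds 80 million rows, not the routing itself.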
The problem is not so much hashed file performance as it is job design. Can you not flag changed records in your table so that instead of reloading 80 million rows each and every time you run this job you will just update those values that have changed?
Take your delta data set, scan it, and build a distinct list of natural keys. Bulk load those keys into a work table in your target database. Inner join that work table to your target table and spool the rows (natural and surrogate keys, or ideally the entire row) out to your hashed file. The hashed file will be created fresh on each run and will contain only the minimum necessary rows.
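This approach can be sketched with SQLite standing in for the target database (table and column names here are illustrative, not from the original job):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Target table: 80M rows in practice; a handful here.
cur.execute("CREATE TABLE target (natural_key INTEGER PRIMARY KEY, payload TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

# Step 1: distinct natural keys from the delta, bulk-loaded into a work table.
delta_keys = [2, 2, 4, 9]  # 9 is new, not yet in the target
cur.execute("CREATE TABLE work_keys (natural_key INTEGER PRIMARY KEY)")
cur.executemany("INSERT OR IGNORE INTO work_keys VALUES (?)",
                [(k,) for k in delta_keys])

# Step 2: the inner join spools only the matching target rows -- this
# (not all 80M rows) is what gets written to the hashed file.
rows = cur.execute("""
    SELECT t.natural_key, t.payload
    FROM target t JOIN work_keys w ON t.natural_key = w.natural_key
""").fetchall()

hashed_file = dict(rows)   # stands in for the lookup hashed file
```

Only the keys that actually appear in the delta end up in the lookup, so its size is bounded by the ~1 million delta rows rather than the 80 million target rows.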
Why maintain a complete copy of your target tables in your ETL tool? Restartability, corruption, etc. are all issues with your design. We talk a lot about this method on the forum; you would enjoy reading those discussions. There are refinements, such as loading two hashed files: one of natural-to-surrogate key cross-references, and the other a full complement of columns for the row. This way you can either reference just the keys during foreign-key substitution or have the full detail. It also leads to discussion about rapid comparison of current rows to new rows using CRC32 methods.
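The CRC32 comparison mentioned above can be sketched like this: checksum the concatenated non-key columns of each row and compare checksums instead of comparing column by column. This is a sketch only; CRC32 collisions are possible, so it trades a tiny false-match risk for speed:

```python
import zlib

def row_crc(row, key_cols):
    # Checksum of all non-key columns, in a fixed column order.
    payload = "|".join(str(v) for k, v in sorted(row.items()) if k not in key_cols)
    return zlib.crc32(payload.encode("utf-8"))

current          = {"id": 7, "name": "Acme", "city": "Sydney"}
incoming_same    = {"id": 7, "name": "Acme", "city": "Sydney"}
incoming_changed = {"id": 7, "name": "Acme", "city": "Perth"}

unchanged = row_crc(current, {"id"}) == row_crc(incoming_same, {"id"})
changed   = row_crc(current, {"id"}) != row_crc(incoming_changed, {"id"})
```

Storing the CRC alongside the keys in the hashed file means a changed row is detected with one integer comparison rather than comparing every column.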
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Re: Suggestions needed to improve the performance (hashed file)
You have 80 million key values?! Are you sure? Can you explain your target table design in brief?
Thanks,
Raghu
kris007 wrote: What I meant was, there are 80 million rows in my target table, each row identified by a distinct key. Is that what you wanted to know?
Kris
If you have 80 million unique records in your target table, then:
- Time-stamp your target table.
- Incrementally load your hashed file. Assuming 1 million records per load, the hashed file load should be much quicker. Is this not possible?
-Raghu
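This suggestion, sketched in Python: instead of rebuilding the 80-million-key lookup from scratch each run, keep the hashed file between runs and merge in only the keys written during the last load (a sketch with assumed values; the dict stands in for the persistent hashed file):

```python
# Persistent lookup of target keys, kept between runs (stands in for
# the hashed file on disk).
lookup = {1: "2024-01-01", 2: "2024-01-01"}   # key -> last-updated stamp

# After each load, merge in only the keys just written (the ~1M delta)
# instead of re-extracting all 80M keys from the target table.
delta_loaded = {2: "2024-02-01", 3: "2024-02-01"}
lookup.update(delta_loaded)
```

The trade-off, raised elsewhere in the thread, is that a persistent copy of the target keys in the ETL tool brings restartability and corruption concerns.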
I do have a timestamp field in my target table, but I didn't quite follow you on how to incrementally load the hashed file.
Based upon the timestamp in my target table, I pull only those records from my source which have changed since the target table was last loaded. Now, I need to load this incremental data into the target table based upon whether the keys are found or not found. It is in this second part where I had a problem. Please let me know your ideas, as I didn't seem to follow your post.
Thanks
Kris.
Kris, that's basically what Ken explained in his post.
-craig
"You can never have too many knives" -- Logan Nine Fingers
When you say "...Now, this incremental data, I need to load into the target table based upon the keys found or not found...", do you do only inserts, or are there updates too? That is, if the key is found, do you also update?
-Raghu
Hey Craig, I am doing exactly what Ken has suggested, but Raghu seems to have come up with a different idea. I was just trying to find out clearly what he is suggesting, so that it would help. It wouldn't hurt to know if there is another way, right?
Kris
Another way, IMHO, is to use several DB stages as your source.
Give a key window in your WHERE clause, for example:
in the first stage, WHERE key <= 1000000
in the second stage, WHERE key > 1000000 AND key <= 2000000
and so on.
Collect the output via a link collector and load it into a hashed file. For 80 million records, you will see your performance boost tremendously, provided you have multiple CPUs.
Try it out; it may well work.
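The key-window idea, sketched in Python: each window becomes a separate range query (one per DB stage in the job), and the results are merged into one lookup, so the queries can run concurrently on multiple CPUs (window boundaries and data are illustrative):

```python
# Partition the key space into windows, one query per window, then
# merge the results -- mimicking multiple DB stages plus a link collector.
windows = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

# Pretend source table: key -> row. In the real job each window is a
# separate "WHERE key > lo AND key <= hi" query run in parallel.
source = {k: f"row-{k}" for k in (5, 1_500_000, 2_999_999)}

def fetch_window(lo, hi):
    return {k: v for k, v in source.items() if lo < k <= hi}

collected = {}
for lo, hi in windows:
    collected.update(fetch_window(lo, hi))
```

Because the windows are disjoint and cover the key range, the merged result equals a single full-table extract, just fetched in parallel pieces.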
Regards,
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.