I have a table that has 6000000 (6 million) records. I am loading this table into a hash file to eliminate duplicate records. And then from the hash file I am loading the target table.
SrcTbl --> HashFile --> TgtTbl
The hash file is Type 30 (Dynamic). The problem is that the job gets stuck after processing 4,996,953 records. It shows that it is processing records at a rate of 82 rows/sec, but the total number of records processed does not increase.
Putting DISTINCT in the source query takes a lot of time. The query runs for hours because it has to get distinct records out of 6 million rows. Moreover, we may have two distinct records with the same combination of values in the columns that are defined as keys in the target table.
How reliable are the statistics that you see when you monitor a job? Maybe the job is still processing records in the background but it is not getting reflected in the job's Monitor screen. Is that possible?
sumitgulati wrote:Putting DISTINCT in the source query takes a lot of time. The query runs for hours because it has to get distinct records out of 6 million rows. Moreover, we may have two distinct records with the same combination of values in the columns that are defined as keys in the target table.
Regards,
Sumit
Selecting 6 million records from the database and writing them into a hash file is never going to be instantaneous (hint: define a good index and your DISTINCT will fly).
As for your last problem: it can be solved in SQL by grouping by the key values and then tacking on the extra data that needs to be moved inside a MIN or MAX function.
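A minimal sketch of that GROUP BY approach, using SQLite as a stand-in database; the table and column names (src, id, amount) are illustrative, not from this thread:

```python
import sqlite3

# In-memory database standing in for the source table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 30), (2, 25), (3, 5)])

# One row per key: GROUP BY the key columns and pull the extra,
# non-key data through an aggregate (MAX here; MIN works the same way).
rows = con.execute(
    "SELECT id, MAX(amount) FROM src GROUP BY id ORDER BY id"
).fetchall()
print(rows)  # [(1, 20), (2, 30), (3, 5)]
```

Which duplicate "wins" is now explicit (the MAX), instead of depending on which row the job happened to process last.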
Anyway, your current solution also sounds shaky: the last row processed for a given key is the one that will appear in your target.
Thanks Ogmios. I enforced uniqueness on the key columns in the source query itself and knocked the hash file out of the job. The job is performing much better now.
When you have a hash file that large you need a minimum modulus. Do a search; this has been covered a lot. Usually when a hash file appears to hang it is not actually hung, it is resizing. Dynamic hash files resize by splitting one group of records into two. So as the modulo grows from 1 to several hundred thousand you get lots of splits, and in each one a group of records is written out twice. Not very efficient. With a minimum modulus you eliminate most of these splits. At TCL run:
ANALYZE.FILE HashFileName
This will tell you the current modulo. Also, when a hash file is this large, use the 64-bit option, which allows the hash file to grow very large. Do not use this option on all your hash files; there is some overhead to it.
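A back-of-the-envelope sketch of why pre-sizing helps (this is an illustration of the split arithmetic, not DataStage internals): a dynamic file that grows one group at a time performs one split per increment of the modulo, and every split rewrites a group of records. Starting at an adequate minimum modulus skips all of those intermediate splits.

```python
def splits_to_reach(target_modulo, minimum_modulo=1):
    # Each split of one group into two raises the modulo by one,
    # so growing from minimum_modulo to target_modulo costs
    # (target_modulo - minimum_modulo) splits.
    return max(0, target_modulo - minimum_modulo)

# Growing from the default modulo of 1 up to 200,000 groups:
print(splits_to_reach(200_000))            # 199999 splits along the way
print(splits_to_reach(200_000, 200_000))   # 0 splits when pre-sized
```

The 200,000 figure is just a placeholder; ANALYZE.FILE tells you the real modulo your data needs.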
32-bit addressable files have a 2.2 gigabyte limit. DS Server uses 32-bit hash files by default. You can set DS to use 64-bit as the default, or manually create hash files supplying the 64BIT keyword option. Search the forum; it's covered a lot.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
sonia jacob wrote:What about using an aggregator stage to remove duplicates?
The only problem with the Aggregator is that it is slow. It does perform a crude group-by if that is what you need, but I would still suggest doing the GROUP BY on the database server.
KCBLAND: 32-bit addressable files have a 2.2 gigabyte limit. DS Server uses 32-bit hash files by default. You can set DS to use 64-bit as the default, or manually create hash files supplying the 64BIT keyword option. Search the forum; it's covered a lot.
Search the forum, we've covered this before. It's in the uvconfig file, so requires a uvregen and is a server wide setting, affecting all projects.
Kenneth Bland