I have a table that has 6000000 (6 million) records. I am loading this table into a hash file to eliminate duplicate records. And then from the hash file I am loading the target table.
SrcTbl --> HashFile --> TgtTbl
The hash file is Type 30 (Dynamic). The problem is that the job gets stuck after processing 4,996,953 records. It shows that it is processing records at a rate of 82 rows/sec, but the total number of records processed does not increase.
Putting DISTINCT in the source query takes a lot of time. The query runs for hours because it has to get distinct records out of 6 million rows. Moreover, we may have two distinct records with the same combination of values in the columns that are defined as keys in the target table.
How reliable are the statistics that you see when you monitor a job? Maybe the job is still processing records in the background but it is not getting reflected in the job's Monitor screen. Is that possible?
sumitgulati wrote:Putting DISTINCT in the source query takes a lot of time. The query runs for hours because it has to get distinct records out of 6 million rows. Moreover, we may have two distinct records with the same combination of values in the columns that are defined as keys in the target table.
Regards,
Sumit
Selecting 6 million records from the database and writing them into a hash file is never going to be instantaneous (hint: define a good index and your DISTINCT will fly).
As for your last problem: it can be solved in SQL by grouping by the key values and then tacking on the extra data that needs to be moved inside a MIN or MAX function.
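A minimal sketch of that GROUP BY approach, using SQLite as a stand-in database; the table and column names (src, id, amount) are illustrative, not from this thread:

```python
import sqlite3

# In-memory database standing in for the source table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 30), (2, 25), (3, 5)])

# One row per key: GROUP BY the key columns and pull the extra,
# non-key data through an aggregate (MAX here; MIN works the same way).
rows = con.execute(
    "SELECT id, MAX(amount) FROM src GROUP BY id ORDER BY id"
).fetchall()
print(rows)  # [(1, 20), (2, 30), (3, 5)]
```

Which duplicate "wins" is now explicit (the MAX), instead of depending on which row the job happened to process last.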
Anyway, your current solution also sounds shaky: the last row processed for a given key is the one that will appear in your target.
Thanks Ogmios. I enforced uniqueness on the key columns in the source query itself and knocked the hash file out of the job. The job is performing much better now.
When you have a hash file that large you need a minimum modulus. Do a search; this has been covered a lot. Usually when a hash file appears to hang it is not actually hung, it is resizing. Dynamic hash files resize by splitting one group of records into two. So as the modulo grows from 1 to several hundred thousand you get lots of splits, and in each one a group of records is written out twice. Not very efficient. With a minimum modulus you eliminate most of these splits. At TCL run:
ANALYZE.FILE HashFileName
This will tell you the current modulo. Also, when a hash file is this large, use the 64-bit option, which allows the hash file to grow very large. Do not use this option on all your hash files; there is some overhead to it.
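A back-of-the-envelope sketch of why pre-sizing helps (this is an illustration of the split arithmetic, not DataStage internals): a dynamic file that grows one group at a time performs one split per increment of the modulo, and every split rewrites a group of records. Starting at an adequate minimum modulus skips all of those intermediate splits.

```python
def splits_to_reach(target_modulo, minimum_modulo=1):
    # Each split of one group into two raises the modulo by one,
    # so growing from minimum_modulo to target_modulo costs
    # (target_modulo - minimum_modulo) splits.
    return max(0, target_modulo - minimum_modulo)

# Growing from the default modulo of 1 up to 200,000 groups:
print(splits_to_reach(200_000))            # 199999 splits along the way
print(splits_to_reach(200_000, 200_000))   # 0 splits when pre-sized
```

The 200,000 figure is just a placeholder; ANALYZE.FILE tells you the real modulo your data needs.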
32-bit addressable files have a 2.2 gigabyte limit. DS Server uses 32-bit hash files by default. You can set DS to use 64-bit as the default, or manually create hash files supplying the 64BIT keyword option. Search the forum; it's covered a lot.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
sonia jacob wrote:What about using an aggregator stage to remove duplicates?
The only problem with the Aggregator is that it is slow. It does perform a crude group-by if that is what you need, but I would still suggest doing the GROUP BY on the database server.
KCBLAND: 32-bit addressable files have a 2.2 gigabyte limit. DS Server uses 32-bit hash files by default. You can set DS to use 64-bit as the default, or manually create hash files supplying the 64BIT keyword option. Search the forum; it's covered a lot.
Search the forum, we've covered this before. It's in the uvconfig file, so requires a uvregen and is a server wide setting, affecting all projects.
Kenneth Bland