Large hashed file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

First, if you know that DATA.30 and OVER.30 grow into the 6-8 GB range every time the job runs, you should create the file manually using mkdbfile with a minimum modulus of something like 2000000. The files may then never need to grow; the thing to watch is the timestamp on the files. If they're changing, that means data is being placed somewhere within the DATA file itself, without needing to spill into the OVER file.
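To put a rough number on the pre-sizing, here is a back-of-the-envelope sketch (Python, purely illustrative, not DataStage code). It assumes the default 2 KB group size of a Type 30 dynamic file (GROUP.SIZE 1) and a target fill of about 80%; both of those figures are assumptions, only the 6-8 GB size comes from the post.

```python
# Rough MINIMUM.MODULUS estimate for a Type 30 (dynamic) hashed file.
# Assumptions (not from the post): 2048-byte groups (GROUP.SIZE 1) and a
# target fill of ~80% so records stay in DATA.30 instead of OVER.30.

def suggest_minimum_modulus(expected_bytes, group_bytes=2048, target_fill=0.8):
    """Return a group count large enough to hold expected_bytes of record data."""
    return int(expected_bytes / (group_bytes * target_fill)) + 1

# 6-8 GB of data per the original post; size for the upper end.
print(suggest_minimum_modulus(8 * 1024**3))   # about 5.2 million groups
```

A minimum modulus only pre-allocates groups; the dynamic file can still split and grow beyond it, so a figure like the 2000000 suggested above is a sensible starting allocation rather than a hard ceiling.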

Second, we don't know whether the problem is Oracle or DataStage. If you're holding an open SELECT for hours, you could be running into issues on the Oracle side with rollback and keeping that result set open. Seriously consider spooling to a sequential file, or better yet, using multiple job instances with a ranging query so that several spooling processes dump out the data in parallel. Then concatenate the multiple output files together and load the result into the hashed file.
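As an illustration of the ranging idea only (the table name, key column, and instance count below are hypothetical, not from the post), the WHERE clauses for the multiple job instances might be generated along these lines:

```python
# Sketch: split one large extract into N ranged queries, one per job instance.
# SRC_TABLE and ORDER_ID are placeholder names; the MOD-based split assumes a
# reasonably evenly distributed numeric key.

N_INSTANCES = 8

def ranged_queries(n=N_INSTANCES):
    """One WHERE clause per instance, partitioning rows by MOD on the key."""
    return [
        f"SELECT * FROM SRC_TABLE WHERE MOD(ORDER_ID, {n}) = {i}"
        for i in range(n)
    ]

for q in ranged_queries():
    print(q)
```

Each instance spools its rows to its own sequential file; afterwards the files are concatenated (for example with cat) and the single combined file is loaded into the hashed file.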
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Pedro,

setting the uvconfig value for 64-bit files to 1 means that any and all hashed files created thereafter will be 64-bit. This means that all of the hundreds of small hashed files that make up your projects and jobs will use the bigger file pointers, which are slower. It is generally recommended not to set this flag.

If you only have this one large file, you should think about making it a partitioned file. If you can use a simple algorithm on your key to split this file into 10 or perhaps more "groups", you can do a few things that will speed up your hashed file load and perhaps increase your lookup speed as well (a MOD-style split is sketched after this list):

(a) you can run your hashed file load in parallel with 10 processes each writing directly to the appropriate partfile and

(b) you can avoid having one big physical dynamic file and have 10 small, manageable and more efficient dynamic files all grouped together as one virtual dynamic file.

(c) If your job process allows it, you might structure your lookups to use only the partfiles, and perhaps even be able to load those individual files into memory for faster throughput.
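As a small illustration of the "simple algorithm on your key" idea (the names and the part count here are hypothetical), a MOD-based split that maps a numeric key to one of 10 part files might look like this:

```python
# Illustrative partitioning algorithm for a partitioned (distributed) hashed
# file: map a numeric key to one of N_PARTS part numbers. The same expression
# would drive both the parallel load jobs and any partfile-only lookups.

N_PARTS = 10

def part_number(key: int) -> int:
    """Part numbers are 1-based; keys with the same remainder share a part file."""
    return (key % N_PARTS) + 1

# Ten parallel load streams: stream i handles only rows where
# part_number(key) == i, writing straight to its own part file
# (hypothetical names like CUSTHASH_P1 .. CUSTHASH_P10).
for k in (101, 205, 310, 417):
    print(k, "->", part_number(k))
```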
pramod_au
Participant
Posts: 30
Joined: Thu Feb 06, 2003 8:30 am
Location: London,UK

Post by pramod_au »

Hi

How about trying a distributed hashed file using data segmentation? That could be another workaround.

Cheers
Pramod