Concatenate Hashfiles.

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

nag0143
Premium Member
Posts: 159
Joined: Fri Nov 14, 2003 1:05 am

Concatenate Hashfiles.

Post by nag0143 »

Is there any command I can use as a before-stage routine to concatenate hash files?
I know I can concatenate them with a Link Collector, but is there a BASIC command, or anything like cat for sequential files?

Thanks
Nag
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Post by chucksmith »

Within the server engine (UniVerse) there is a COPY command. You could use it to copy rows from one hash file to another. It requires VOC pointers to both files.

As I recall, the format was:

Code:

COPY FROM HashFile1 TO HashFile2 ALL OVERWRITING
Check the IBM UniVerse documentation first!

I think it is better to parameterize your hash file names, and run the job twice.
nag0143
Premium Member
Posts: 159
Joined: Fri Nov 14, 2003 1:05 am

Post by nag0143 »

Chuck,
Let me explain my scenario. I am splitting a source into four different jobs, each job populating a different hash file (my hash files are pathed to a directory), and then in the next stream I want to concatenate these 4 hash files into one file and process the data.

I am doing the above using a Link Collector.

Is it possible to concatenate the pathed hash files and then use the result as above?
Does the COPY command do the cat of pathed hash files?

Thanks again for your time
Nag
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Why not just have your four jobs all write to the same hash file? Save you the 'concatenate' step. I don't believe there will be a problem with them all doing this at the same time. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Post by chucksmith »

I agree with Craig. I would just run your process job four times, all four writing to the same hash file. Merging the files seems like a waste of computer time.

I have also had some issues with the Link Collector at version 6 if each link does not receive the same number of rows: eventually the collector just waits once the row difference gets over a certain threshold. I understand this has changed in more recent releases.

Another consideration: if your final output is a hash file and the four jobs are running in parallel, you do not want to enable write caching in the output hash file stage. I would test both serial execution with caching enabled and parallel execution with it disabled, and see which performs better.

The COPY command requires VOC pointers. You can create VOC pointers using my CreateVocPtr routine; look at the DataStage Tools page of www.anotheritco.com.
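
For reference, what such a routine boils down to is writing an F-type record into the project's VOC file that points at the pathed hash file and its dictionary. A minimal sketch follows (this is not the CreateVocPtr code itself; the routine name, the VOC entry name, and both paths are examples only):

Code:

      SUBROUTINE CreateHashVocPtr(InputArg, ErrorCode)
* Minimal sketch only. The entry name and both paths are examples --
* point them at your own hash file and at wherever DataStage created
* its D_ dictionary.
      ErrorCode = 0
      OPEN "VOC" TO F.VOC ELSE
         ErrorCode = 1
         RETURN
      END
      VocRec = ""
      VocRec<1> = "F"                        ;* F-type (file) pointer
      VocRec<2> = "/data/hash/HashFile1"     ;* data portion of the hash file
      VocRec<3> = "/data/hash/D_HashFile1"   ;* its dictionary
      WRITE VocRec TO F.VOC, "HASHFILE1"
      RETURN

Once the pointer exists, COPY, SQL, and the UV stage can all address the file by its VOC name.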
nag0143
Premium Member
Posts: 159
Joined: Fri Nov 14, 2003 1:05 am

Post by nag0143 »

chulett wrote:Why not just have your four jobs all write to the same hash file? Save you the 'concatenate' step. I don't believe there will be a problem with them all doing this at the same time. :?
I am running all these jobs in parallel as part of a daily job stream. If I run them all at the same time, isn't there a chance that data will be appended to the same hash file each day?
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Post by chucksmith »

Not really. Since a hash file is a keyed file, duplicate rows will overwrite.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That, and you should clear the hash file before you start.
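
If you'd rather not tick 'Clear file before writing' in each of the four jobs (with parallel instances, a later-starting job could wipe rows already written by a sibling), one option is a small before-job routine in the controlling sequence that clears the target once. A minimal sketch, assuming a VOC pointer already exists; the routine and entry names are examples only:

Code:

      SUBROUTINE ClearTargetHash(InputArg, ErrorCode)
* Minimal sketch: clear the shared hash file once, up front.
* Assumes a VOC pointer named HASHFILE1 already exists.
      ErrorCode = 0
      Call DSExecute("UV", "CLEAR.FILE HASHFILE1", Output, SystemReturnCode)
      If SystemReturnCode <> 0 Then ErrorCode = SystemReturnCode
      RETURN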
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Using a UV stage and user-defined SQL you can create a UNION of the rows from all the hashed files. If duplicate rows are possible and should be kept, specify UNION ALL.

Code:

SELECT COL1,COL2,COL3 FROM HF1
UNION ALL
SELECT COL1,COL2,COL3 FROM HF2
UNION ALL
SELECT COL1,COL2,COL3 FROM HF3 
UNION ALL
SELECT COL1,COL2,COL3 FROM HF4 ;
To use the UV stage you will need VOC pointers to all four hashed files.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Hey, don't bust on Nag for splitting his hash files.

If you're attempting to stay under a 32-bit limitation on either the file system or the hash file because of volume, splitting your hash files into separate files can yield a big performance boost.

Get ready to learn some new tricks here:

1. By partitioning a source data stream into 4 smaller pieces, Nag can have 4 smaller hash files being created SIMULTANEOUSLY, instead of one BIG hash file.
2. If one hash file is undersized, you only size up for that one, which is easier and faster than UPSIZING a BIG hash file.
3. You can spread i/o around because you have 4 underlying objects.
4. If you have 4 job instances writing to the same hash file, ALL 4 instances stall while the hash file upsizes; whereas separate hash files unique to each instance upsize on demand and only stall the localized stream.

So, how can Nag quickly merge the 4 separate hash files back into a single stream? My recommendation is to dump the hash files into text files at high speed and then issue a 'cat' command. Full-scanning and dumping a hash file to a sequential file is VERY FAST. DataStage adds overhead, so he could consider writing a very simple BASIC subroutine to dump a hash file to a delimited text file. You don't have to worry about formatting, just do delimited: simply select all rows from the hash file, then read each row, convert @AM to "|", and WRITESEQ out the row. Think of it as a mindless bulk-unload.
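
A minimal sketch of that bulk-unload, assuming pathed files (the routine name, both paths, and the pipe delimiter are placeholders, not a tested implementation):

Code:

      SUBROUTINE DumpHashToSeq(InputArg, ErrorCode)
* Minimal sketch: full-scan a pathed hash file and write each row out
* as pipe-delimited text. Both paths below are examples only.
      ErrorCode = 0
      HashPath = "/data/hash/HashFile1"
      OutPath  = "/data/work/HashFile1.txt"

      OPENPATH HashPath TO F.HASH ELSE
         ErrorCode = 1
         RETURN
      END

      OPENSEQ OutPath TO F.SEQ THEN
         WEOFSEQ F.SEQ                   ;* truncate an existing file
      END ELSE
         CREATE F.SEQ ELSE
            ErrorCode = 2
            RETURN
         END
      END

      SELECT F.HASH                      ;* full scan of the hash file
      LOOP
         READNEXT Id ELSE EXIT
         READ Rec FROM F.HASH, Id THEN
            CONVERT @AM TO "|" IN Rec    ;* field marks become delimiters
            WRITESEQ Id : "|" : Rec TO F.SEQ ELSE ErrorCode = 3
         END
      REPEAT

      CLOSESEQ F.SEQ
      RETURN

Run one of these per hash file, cat the four text files together, and read the result with a single Sequential File stage.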
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Another possibility, since the record structure is identical (and only if the keys are disjoint), is to create a Distributed file as an "umbrella" over the four hashed files.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nag0143
Premium Member
Posts: 159
Joined: Fri Nov 14, 2003 1:05 am

Post by nag0143 »

Thanks KC, I tried your approach and dumped to a sequential file. It's working well.

Ray, can you please elaborate on the technique you suggested, the Distributed file?
Thanks
Nag
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Post by chucksmith »

Ray, that's a great idea :!:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Distributed file effectively gives you the UNION ALL of a number of hashed files. There are two prerequisites to being able to use a Distributed file: there must be no duplicate key values across the entire set of hashed files, and there must be some algorithmic method for determining in which particular hashed file any particular key belongs.

If your data satisfy these conditions, you allocate a unique (positive) part number to each hashed file and use a DEFINE.DF command to establish the Distributed file. This is a UniVerse command, so it must be executed in a dssh environment.

Within this command you specify a file name, the name of each "part file" (hashed file) and its corresponding part number, and the partitioning algorithm used to allocate key values to particular part files.
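
As a rough illustration only (the Distributed file and part names below are invented, and the clause naming the partitioning algorithm is deliberately omitted; take its exact syntax from the UniVerse Guide to Distributed Files), the command has the general shape:

Code:

DEFINE.DF ALL_HF PART 1 HF1 PART 2 HF2 PART 3 HF3 PART 4 HF4

where ALL_HF is the umbrella file you then reference in a UV stage, and HF1 through HF4 are VOC pointers to the four part files.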
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

The distributed hash file has a lot of maintenance overhead, in that you must define partitioning algorithms. A distributed hash file is the same concept as a partitioned table in Oracle. You must include the partition key column when doing any lookup/join to benefit from partition pruning.

The point of having separate hash files is lost when throwing your lot in with a distributed hash file. Let's say you have 100M orders to process, doing a lookup against a 10M row customer dimension. If you put all 10M rows into a single hash file, even with extraneous/superfluous columns removed, you still exceed 2.2 gigabytes and have to go 64-bit, which means manual creation, etc. In order to leverage a multi-CPU server, you're going to want to partition the 100M orders into subsets and use an instantiated transform job to spread processing across all the CPUs.

Create job parameters PartitionNumber and PartitionCount. Set PartitionCount to 10 and run 10 instances of the following jobs, setting PartitionNumber to 1..10 across the instances:

You build a hash lookup creation job like: SEQ (customers) --> XFM w/MOD statement on customer ID --> HASH w/ PartitionNumber in the filename

Your main transformation job looks like: SEQ (orders) --> XFM w/MOD statement on customer ID to partition data --> XFM w/hash lookup on customer ID where the hash file name has PartitionNumber --> SEQ
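
For instance, the MOD statement in both jobs could be a transformer constraint along these lines (the link and column names are illustrative, assuming a numeric customer key and the two job parameters above):

Code:

MOD(inLink.CUSTOMER_ID, PartitionCount) + 1 = PartitionNumber

and the hash file stage in both jobs would use a name such as CustHash#PartitionNumber#, so each instance builds and reads only its own partition.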

This gives you smaller, faster hash files to reference, keeps you in the 32-bit realm, distributes CPU load, and lets you balance processing better. As you add more CPUs, you can simply increase the PartitionCount value. With intelligent job control, this becomes just a runtime variable (how many instances should I create of this job?).

I use this technique in non-PX implementations to leverage all the CPU power on high-end SMP boxes. It keeps things nice and neat.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle