Performance of "Merge" stage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Performance of "Merge" stage

Post by dickfong »

I'm trying to merge 2 sequential files on a key X using the "Merge" stage.

The "merge" stage sounds perfectly matched what I need to do but it perform extremely slow ( somewhat like
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Try pre-sorting the two source files for the merge stage, each on the key.
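Outside DataStage, the pre-sort amounts to something like the following (a minimal Python sketch, assuming comma-delimited files with the key X in the first column; the file names are placeholders):

    import csv

    def presort(in_path, out_path, key_col=0):
        # read all rows, sort them on the merge key, and write them back out
        with open(in_path, newline='') as f:
            rows = list(csv.reader(f))
        rows.sort(key=lambda r: r[key_col])
        with open(out_path, 'w', newline='') as f:
            csv.writer(f).writerows(rows)

    presort('file1.txt', 'file1_sorted.txt')
    presort('file2.txt', 'file2_sorted.txt')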


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

Thanks Wurlod. I've tried that, and the performance is better (from
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Load one into a UniVerse table (= hashed file) using INSERT.
Load the other into the same UniVerse table using UPDATE.
SELECT the result.
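In other words: build a keyed table from file 1, overlay file 2 on the same keys, then read the combined rows back. A minimal Python sketch of that logic, purely to illustrate the idea (the column layout and file names are assumptions, and exactly how the columns combine depends on your record design):

    import csv

    table = {}    # stands in for the keyed (hashed) table, keyed on X

    # "INSERT" pass: load file 1, one record per key
    with open('file1.txt', newline='') as f:
        for row in csv.reader(f):
            table[row[0]] = row

    # "UPDATE" pass: overlay file 2 on the same keys
    with open('file2.txt', newline='') as f:
        for row in csv.reader(f):
            base = table.get(row[0], [row[0]])
            table[row[0]] = base + row[1:]    # assumed: append file 2's non-key columns

    # "SELECT" the combined result
    with open('merged.txt', 'w', newline='') as f:
        csv.writer(f).writerows(table[k] for k in sorted(table))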


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Why not extract file 2 into file 1, adding the extra column (see the sketch below the options)? Then you can process the full file.
OR
Process the data in 2 ETL jobs run in sequence.
OR
Create a job for file 1 with the required transformations and outputs. Convert everything after the input stage into a shared container. Create a second job for file 2 which converts the data into 4 columns and then feeds it to the shared container. You will have two jobs to run; however, the transformation and output logic will be shared, making maintenance easier.
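For the first option, reshaping file 2 into file 1's layout could look roughly like this (a hypothetical Python sketch; the delimiter, the column order and the constant used for the extra column are all assumptions):

    import csv

    # rewrite file 2 into file 1's layout, adding the extra column, and append it
    # to file 1 so a single job can process the combined file
    with open('file2.txt', newline='') as src, open('file1.txt', 'a', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row + ['extra_value'])   # assumed constant for the extra column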


Vincent McBurney
Data Integration Services
www.intramatix.com
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

Thanks for the reply.

We have tried using a SELECT with the data in a UniVerse stage, but /tmp filled up before we got the result. (We are using /tmp as the temp directory for DataStage, as I read from uvconfig. Can we specify a temp directory for a particular UniVerse stage instead of changing the DataStage-wide default? Where would be the normal practice to place the DataStage temp directory?)


vmcburney, could you elaborate more on your third method?

We found an error message "Error occured while deleting temporary file" in a job with only a Merge stage (source) and a Sequential File stage (target). We have cleaned the specified temp directory and even changed the job setting to another temp directory, but none of these makes the job run again. Any idea about this?

Thanks a lot.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

See new topic "Temp Directory for DataStage" (so that it's easier for future users to search for).
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

You are seeing one of the drawbacks of the Merge stage, also true of the Aggregator stage: DataStage builds the merge within the temp directory space, which adds a performance overhead. You may find it faster to load file 2 onto file 1.

Getting back to shared containers: if you want to process two files whose fields just don't match up, and the transforms are quite complex with many lookups, you don't want to maintain two jobs that are virtually the same, because that is time-consuming and creates additional risk. A way around this is to put all the transform and output logic into a shared container and hook it into two jobs that handle the reading of the files.

Shared containers are a version 5 and 6 feature. The automatic creation of a shared container is just about the most visually interesting thing you'll ever see DataStage do! You create a full job, select those parts of it you want to turn into a shared container, and generate the container from the menu option. All those items disappear from your job and magically appear in a separate shared container, leaving behind a shared container input stage.

Next you create your second job to load file 2, add the shared container to that job, and link file 2 to the shared container's input fields.

So you have two jobs reading inputs with different file structures, both sending the data to the same transform and output stages.


Vincent McBurney
Data Integration Services
www.intramatix.com