Performance of "Merge" stage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Performance of "Merge" stage

Post by dickfong »

I'm trying to merge 2 sequential files on a key X using the "Merge" stage.

The "merge" stage sounds perfectly matched what I need to do but it perform extremely slow ( somewhat like
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Try pre-sorting the two source files for the merge stage, each on the key.
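Outside DataStage, the pre-sort amounts to something like the following (a minimal Python sketch, assuming comma-delimited files with the key X in the first column; the file names are placeholders):

    import csv

    def presort(in_path, out_path, key_col=0):
        # read all rows, sort them on the merge key, and write them back out
        with open(in_path, newline='') as f:
            rows = list(csv.reader(f))
        rows.sort(key=lambda r: r[key_col])
        with open(out_path, 'w', newline='') as f:
            csv.writer(f).writerows(rows)

    presort('file1.txt', 'file1_sorted.txt')
    presort('file2.txt', 'file2_sorted.txt')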


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

Thanks Wurlod. I've tried that, and the performance is better (from
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Load one into a UniVerse table (= hashed file) using INSERT.
Load the other into the same UniVerse table using UPDATE.
SELECT the result.
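In other words: build a keyed table from file 1, overlay file 2 on the same keys, then read the combined rows back. A minimal Python sketch of that logic, purely to illustrate the idea (the column layout and file names are assumptions, and exactly how the columns combine depends on your record design):

    import csv

    table = {}    # stands in for the keyed (hashed) table, keyed on X

    # "INSERT" pass: load file 1, one record per key
    with open('file1.txt', newline='') as f:
        for row in csv.reader(f):
            table[row[0]] = row

    # "UPDATE" pass: overlay file 2 on the same keys
    with open('file2.txt', newline='') as f:
        for row in csv.reader(f):
            base = table.get(row[0], [row[0]])
            table[row[0]] = base + row[1:]    # assumed: append file 2's non-key columns

    # "SELECT" the combined result
    with open('merged.txt', 'w', newline='') as f:
        csv.writer(f).writerows(table[k] for k in sorted(table))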


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Why not extract file 2 into file 1, adding the extra column (see the sketch below the options)? Then you can process the full file.
OR
Process the data in 2 ETL jobs run in sequence.
OR
Create a job for file 1 with the required transformations and outputs. Convert everything after the input stage into a shared container. Create a second job for file 2 which converts the data into 4 columns and then feeds it to the shared container. You will have two jobs to run; however, the transformation and output logic will be shared, making maintenance easier.
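For the first option, reshaping file 2 into file 1's layout could look roughly like this (a hypothetical Python sketch; the delimiter, the column order and the constant used for the extra column are all assumptions):

    import csv

    # rewrite file 2 into file 1's layout, adding the extra column, and append it
    # to file 1 so a single job can process the combined file
    with open('file2.txt', newline='') as src, open('file1.txt', 'a', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row + ['extra_value'])   # assumed constant for the extra column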


Vincent McBurney
Data Integration Services
www.intramatix.com
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

Thanks for the reply.

We have tried using a SELECT with the data in a UniVerse stage, but /tmp filled up before we got the result. (We are using /tmp as the temp directory for DataStage, as I read from uvconfig. Can we specify a temp directory for a particular UniVerse stage instead of changing the DataStage-wide default? Where would be the normal practice to place the DataStage temp directory?)


vmcburney, could you elaborate more on your third method?

We found an error message "Error occured while deleting temporary file" in a job with only a Merge stage (source) and a Sequential File stage (target). We have cleaned the specified temp directory and even changed the job setting to another temp directory, but none of these makes the job run again. Any idea about this?

Thanks a lot.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

See new topic "Temp Directory for DataStage" (so that it's easier for future users to search for).
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

You are seeing one of the drawbacks of the Merge stage, also true of the Aggregator stage: DataStage builds the merge within the temp directory space, which adds a performance overhead. You may find it faster to load file 2 onto file 1.

Getting back to shared containers: if you want to process two files whose fields just don't match up, and the transforms are quite complex with many lookups, you don't want to maintain two jobs that are virtually the same, because that is time-consuming and creates additional risk. A way around this is to put all the transform and output logic into a shared container and hook it into two jobs that handle the reading of the files.

Shared containers are a version 5 and 6 feature. The automatic creation of a shared container is just about the most visually interesting thing you'll ever see DataStage do! You create a full job, select those parts of it you want to turn into a shared container, and generate the container from the menu option. All those items disappear from your job and magically appear in a separate shared container, leaving behind a shared container input stage.

Next you create your second job to load file 2, add the shared container to that job, and link file 2 to the shared container's input fields.

So you have two jobs reading inputs with different file structures, both sending the data to the same transform and output stages.


Vincent McBurney
Data Integration Services
www.intramatix.com