Merge stage

Post questions here relating to DataStage Enterprise/PX Edition, covering such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

psriva
Participant
Posts: 44
Joined: Fri Aug 11, 2006 8:40 am

Merge stage

Post by psriva »

I have 2 input files (which are not sorted) and I want to sort and merge them.

Now, when I use the Merge stage, should I explicitly use a Sort stage to sort before merging, or can I use the sort option within the Merge stage?

Which option gives better performance, and what is the best partitioning method in each case?

Thank you all in advance.
ps
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Post by meena »

Hi,

The input to the Merge stage must be key-partitioned and sorted.
Sorting the data with a Sort stage before merging is more effective than the other option (the link sort).
I think using the Same partitioning option on the Merge stage is good.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Use Hash partitioning in the Sort stage on the keys you want to sort by, and Same partitioning in the following Merge stage.
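
To make the idea concrete, here is a rough Python sketch (my own illustration, not DataStage code; the cust_id column and the row values are made up) of why both inputs need to be identically hash-partitioned and sorted on the merge key: matching rows land in the same partition, so each partition pair can be merged in a single ordered pass.

Code:

# Toy illustration of "hash partition on the key, sort each partition, then merge".
# This is not how the engine implements it; it only shows why both inputs must be
# partitioned and sorted the same way before a key-based merge.

NUM_PARTITIONS = 4  # stands in for the number of nodes in the config file

def hash_partition(rows, key):
    """Assign each row to a partition based on a hash of its key column."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[hash(row[key]) % NUM_PARTITIONS].append(row)
    return partitions

def sort_partitions(partitions, key):
    """Sort every partition independently on the merge key."""
    return [sorted(p, key=lambda r: r[key]) for p in partitions]

def merge_partition(master, update, key):
    """Single ordered pass over one partition pair: master rows pick up update columns."""
    out, i = [], 0
    for m in master:
        while i < len(update) and update[i][key] < m[key]:
            i += 1  # unmatched update rows are skipped (think: reject link)
        merged = dict(m)
        if i < len(update) and update[i][key] == m[key]:
            merged.update(update[i])
        out.append(merged)
    return out

# Hypothetical data -- "cust_id" is just an example merge key.
master  = [{"cust_id": 3, "name": "A"}, {"cust_id": 1, "name": "B"}, {"cust_id": 2, "name": "C"}]
updates = [{"cust_id": 2, "balance": 10}, {"cust_id": 3, "balance": 20}]

m_parts = sort_partitions(hash_partition(master,  "cust_id"), "cust_id")
u_parts = sort_partitions(hash_partition(updates, "cust_id"), "cust_id")

# Because both sides used the same partitioner and key, partition i of the master
# only ever needs partition i of the updates.
for mp, up in zip(m_parts, u_parts):
    print(merge_partition(mp, up, "cust_id"))
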
Kris

Where's the "Any" key?-Homer Simpson
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

I came across this sentence in the Plug-in stage documentation for the Merge stage (page 20-2):

"Choosing the auto partitioning method will ensure that partitioning and sorting is done."

Does this mean that partitioning and sorting are handled automatically by the Merge stage? The Merge stage does allow you to perform a sort and Hash partitioning (which is key-based partitioning) on its input links.

Do you need to do this explicitly in the Merge stage?
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

It depends on the volume of records your job is handling.
If the volume is small, you can perform Hash partitioning in the Merge stage itself for both input links, on the keys you are going to merge on, and also check the Perform Sort option.
But if the volume is in the millions or more, then put an explicit Sort stage after the input files. Sort on the merge keys and also do Hash partitioning on the same keys in the Sort stage itself. In the Merge stage, use Same partitioning for both input links.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune

Post by Nageshsunkoji »

Hi All,

First of all, how does a link sort differ from an explicit Sort stage?

As far as I know, both use the tsort operator and both use the same scratch disk for the sort.

Unless you have a specific requirement that needs the Sort stage (for example, implementing some logic using the cluster key change option), go for the separate stage only in that case; otherwise perform the sort on the link. I don't think there is any performance problem either way. You can specify Hash partitioning and the link sort in the same stage.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Inspect the score. If you choose Auto as the partitioning algorithm, the composed score will have tsort operators inserted on each input link, and will report the chosen partitioning algorithm. This *should* be one of the key-based algorithms, but why rely on that? Propose a specific algorithm (usually Hash, but Modulus may be more efficient for a single integer key, or Range if the data are otherwise badly skewed) and propose explicit sorting. Can you use an upstream Sort stage set to "don't sort (previously sorted)" to avoid unnecessary re-sorting?
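
For illustration only, a small Python sketch (my own, not engine code) of the difference between Hash and Modulus on a single integer key: Modulus simply takes key mod number-of-partitions, so it skips the hashing step but only applies to a single integer column; both are key-based, so rows with equal keys always land in the same partition.

Code:

# Toy comparison of Hash vs Modulus partition assignment for an integer key.
# Real partitioners are engine-internal; this only shows the idea.

NUM_PARTITIONS = 4  # stands in for the number of nodes in the config file

def hash_partition_no(key_value):
    # Hash the key, then fold into the partition count (works for any key type).
    return hash(key_value) % NUM_PARTITIONS

def modulus_partition_no(key_value):
    # Modulus skips the hashing step: valid only for a single integer key.
    return key_value % NUM_PARTITIONS

for k in [101, 102, 103, 104, 105]:
    print(k, "hash ->", hash_partition_no(k), "modulus ->", modulus_partition_no(k))
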
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

ray, can you explain what you mean by "score" and "composed score"?

Thanks.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Introduction to the Orchestra

Post by ray.wurlod »

When a parallel job starts, the Conductor process reads the generated osh and the configuration file, and from these composes the "score". The score is what all nodes "play" (execute).

The score is then distributed to the Section Leader processes; that this occurs can be verified by enabling the APT_STARTUP_STATUS environment variable.

You can have the score dumped into a job log event by setting the APT_DUMP_SCORE environment variable. The score shows all the data sets, partitioners, collectors, operators and processes that will be involved in executing the job.

The score is "played" by the section leader processes and governs which player processes execute at any particular time.

Anyone who is going to be serious as a developer of parallel jobs really does need to learn how to read the score; it is one of the most fundamentally important diagnostic tools.
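
Purely as a mental model (a toy Python sketch, not DataStage internals), the hierarchy described above looks roughly like this: the conductor composes the score from the operator list and the configuration file, one section leader runs per node, and one player runs per operator per node. The node names and operators below are invented for illustration; a real score is produced by the engine and viewed via APT_DUMP_SCORE.

Code:

# Toy model of the runtime hierarchy: conductor -> section leaders -> players.

config_nodes = ["node1", "node2"]                                   # stands in for the APT config file
operators    = ["tsort(master)", "tsort(update)", "merge", "peek"]  # stands in for the generated osh

def compose_score(nodes, ops):
    """The 'score' in this toy: which player processes run on which node."""
    return {node: [f"player: {op}" for op in ops] for node in nodes}

score = compose_score(config_nodes, operators)

print("conductor: composed the score, distributing it to the section leaders")
for node, players in score.items():
    print(f"  section leader on {node}")
    for p in players:
        print(f"    {p}")
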
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

The reason I suggested a separate Sort stage as the volume increases is that the Sort stage has an option called 'Restrict Memory Usage', which defaults to 20 MB. By increasing this value we can allocate more memory for sorting, which improves performance for higher volumes of records.
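
To see why that limit matters, here is a generic external merge sort sketch in Python (not the tsort implementation; CHUNK_SIZE is a made-up row count rather than a byte limit): when the data does not fit in the sort memory, it is sorted in runs that spill to disk and get merged afterwards, so a bigger limit means fewer, larger runs and less merge work.

Code:

# Generic external merge sort: sort fixed-size runs in memory, spill them to
# temporary files, then do a k-way merge of the runs. This is the classic
# technique behind any memory-restricted sort.
import heapq
import os
import random
import tempfile

CHUNK_SIZE = 1000  # analogue of "restrict memory usage": rows sorted in memory at once

def external_sort(values):
    run_files = []
    # Phase 1: sort chunks that fit in "memory" and spill each sorted run to disk.
    for start in range(0, len(values), CHUNK_SIZE):
        run = sorted(values[start:start + CHUNK_SIZE])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, run)))
        f.close()
        run_files.append(f.name)

    # Phase 2: merge the sorted runs. Fewer, larger runs (i.e. more sort memory)
    # means less work here.
    handles = [open(name) for name in run_files]
    merged = list(heapq.merge(*(map(int, h) for h in handles)))
    for h in handles:
        h.close()
    for name in run_files:
        os.remove(name)
    return merged

data = [random.randint(0, 10_000) for _ in range(5_000)]
assert external_sort(data) == sorted(data)
print("sorted", len(data), "rows using in-memory runs of", CHUNK_SIZE)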