Merge stage

Post questions here relating to DataStage Enterprise/PX Edition, covering such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

psriva
Participant
Posts: 44
Joined: Fri Aug 11, 2006 8:40 am

Merge stage

Post by psriva »

I have 2 input files (which are not sorted) and I want to sort and merge them.

Now, when I use the Merge stage, should I explicitly use a Sort stage to sort before merging, or can I use the sort option within the Merge stage?

Which option gives better performance, and what is the best partitioning method in each case?

Thank you all in advance.
ps
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Post by meena »

Hi,

The input to the Merge stage must be key-partitioned and sorted.
Sorting the data with a Sort stage before merging is more effective than the other option (the link sort).
I think using the Same partitioning option on the Merge stage is good.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Use Hash partitioning in the Sort stage on the keys you want to sort by, and Same partitioning in the following Merge stage.
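
To make the idea concrete, here is a rough Python sketch (my own illustration, not DataStage code; the cust_id column and the row values are made up) of why both inputs need to be identically hash-partitioned and sorted on the merge key: matching rows land in the same partition, so each partition pair can be merged in a single ordered pass.

Code:

# Toy illustration of "hash partition on the key, sort each partition, then merge".
# This is not how the engine implements it; it only shows why both inputs must be
# partitioned and sorted the same way before a key-based merge.

NUM_PARTITIONS = 4  # stands in for the number of nodes in the config file

def hash_partition(rows, key):
    """Assign each row to a partition based on a hash of its key column."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[hash(row[key]) % NUM_PARTITIONS].append(row)
    return partitions

def sort_partitions(partitions, key):
    """Sort every partition independently on the merge key."""
    return [sorted(p, key=lambda r: r[key]) for p in partitions]

def merge_partition(master, update, key):
    """Single ordered pass over one partition pair: master rows pick up update columns."""
    out, i = [], 0
    for m in master:
        while i < len(update) and update[i][key] < m[key]:
            i += 1  # unmatched update rows are skipped (think: reject link)
        merged = dict(m)
        if i < len(update) and update[i][key] == m[key]:
            merged.update(update[i])
        out.append(merged)
    return out

# Hypothetical data -- "cust_id" is just an example merge key.
master  = [{"cust_id": 3, "name": "A"}, {"cust_id": 1, "name": "B"}, {"cust_id": 2, "name": "C"}]
updates = [{"cust_id": 2, "balance": 10}, {"cust_id": 3, "balance": 20}]

m_parts = sort_partitions(hash_partition(master,  "cust_id"), "cust_id")
u_parts = sort_partitions(hash_partition(updates, "cust_id"), "cust_id")

# Because both sides used the same partitioner and key, partition i of the master
# only ever needs partition i of the updates.
for mp, up in zip(m_parts, u_parts):
    print(merge_partition(mp, up, "cust_id"))
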
Kris

Where's the "Any" key?-Homer Simpson
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

I came across this sentence in the Plug-in stage documentation for the Merge stage (page 20-2):

"Choosing the auto partitioning method will ensure that partitioning and sorting is done."

Does this mean that partitioning and sorting are handled automatically by the Merge stage? The Merge stage does allow you to perform a sort and Hash partitioning (which is key-based partitioning) on its input links.

Do you need to do this explicitly in the Merge stage?
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

It depends on the volume of records your job is handling.
If the volume is small, you can perform Hash partitioning in the Merge stage itself for both input links, on the keys you are going to merge on, and also check the Perform Sort option.
But if the volume is in the millions or more, then put an explicit Sort stage after the input files. Sort on the merge keys and also do Hash partitioning on the same keys in the Sort stage itself. In the Merge stage, use Same partitioning for both input links.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune

Post by Nageshsunkoji »

Hi All,

First of all, how does a link sort differ from an explicit Sort stage?

As far as I know, both use the tsort operator and both use the same scratch disk for the sort.

Unless you have a specific requirement that needs the Sort stage (for example, implementing some logic using the cluster key change option), go for the separate stage only in that case; otherwise perform the sort on the link. I don't think there is any performance problem either way. You can specify Hash partitioning and the link sort in the same stage.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Inspect the score. If you choose Auto as the partitioning algorithm, the composed score will have tsort operators inserted on each input link, and will report the chosen partitioning algorithm. This *should* be one of the key-based algorithms, but why rely on that? Propose a specific algorithm (usually Hash, but Modulus may be more efficient for a single integer key, or Range if the data are otherwise badly skewed) and propose explicit sorting. Can you use an upstream Sort stage set to "don't sort (previously sorted)" to avoid unnecessary re-sorting?
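
For illustration only, a small Python sketch (my own, not engine code) of the difference between Hash and Modulus on a single integer key: Modulus simply takes key mod number-of-partitions, so it skips the hashing step but only applies to a single integer column; both are key-based, so rows with equal keys always land in the same partition.

Code:

# Toy comparison of Hash vs Modulus partition assignment for an integer key.
# Real partitioners are engine-internal; this only shows the idea.

NUM_PARTITIONS = 4  # stands in for the number of nodes in the config file

def hash_partition_no(key_value):
    # Hash the key, then fold into the partition count (works for any key type).
    return hash(key_value) % NUM_PARTITIONS

def modulus_partition_no(key_value):
    # Modulus skips the hashing step: valid only for a single integer key.
    return key_value % NUM_PARTITIONS

for k in [101, 102, 103, 104, 105]:
    print(k, "hash ->", hash_partition_no(k), "modulus ->", modulus_partition_no(k))
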
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

ray, can you explain what you mean by "score" and "composed score"?

Thanks.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Introduction to the Orchestra

Post by ray.wurlod »

When a parallel job starts, the Conductor process reads the generated osh and the configuration file, and from these composes the "score". The score is what all nodes "play" (execute).

The score is then distributed to the Section Leader processes; that this occurs can be verified by enabling the APT_STARTUP_STATUS environment variable.

You can have the score dumped into a job log event by setting the APT_DUMP_SCORE environment variable. The score shows all the data sets, partitioners, collectors, operators and processes that will be involved in executing the job.

The score is "played" by the section leader processes and governs which player processes execute at any particular time.

Anyone who is going to be serious as a developer of parallel jobs really does need to learn how to read the score; it is one of the most fundamentally important diagnostic tools.
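
Purely as a mental model (a toy Python sketch, not DataStage internals), the hierarchy described above looks roughly like this: the conductor composes the score from the operator list and the configuration file, one section leader runs per node, and one player runs per operator per node. The node names and operators below are invented for illustration; a real score is produced by the engine and viewed via APT_DUMP_SCORE.

Code:

# Toy model of the runtime hierarchy: conductor -> section leaders -> players.

config_nodes = ["node1", "node2"]                                   # stands in for the APT config file
operators    = ["tsort(master)", "tsort(update)", "merge", "peek"]  # stands in for the generated osh

def compose_score(nodes, ops):
    """The 'score' in this toy: which player processes run on which node."""
    return {node: [f"player: {op}" for op in ops] for node in nodes}

score = compose_score(config_nodes, operators)

print("conductor: composed the score, distributing it to the section leaders")
for node, players in score.items():
    print(f"  section leader on {node}")
    for p in players:
        print(f"    {p}")
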
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

The reason I suggested a separate Sort stage as the volume increases is that the Sort stage has an option called 'Restrict Memory Usage', which defaults to 20 MB. By increasing this value we can allocate more memory for sorting, which improves performance for higher volumes of records.
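
To see why that limit matters, here is a generic external merge sort sketch in Python (not the tsort implementation; CHUNK_SIZE is a made-up row count rather than a byte limit): when the data does not fit in the sort memory, it is sorted in runs that spill to disk and get merged afterwards, so a bigger limit means fewer, larger runs and less merge work.

Code:

# Generic external merge sort: sort fixed-size runs in memory, spill them to
# temporary files, then do a k-way merge of the runs. This is the classic
# technique behind any memory-restricted sort.
import heapq
import os
import random
import tempfile

CHUNK_SIZE = 1000  # analogue of "restrict memory usage": rows sorted in memory at once

def external_sort(values):
    run_files = []
    # Phase 1: sort chunks that fit in "memory" and spill each sorted run to disk.
    for start in range(0, len(values), CHUNK_SIZE):
        run = sorted(values[start:start + CHUNK_SIZE])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, run)))
        f.close()
        run_files.append(f.name)

    # Phase 2: merge the sorted runs. Fewer, larger runs (i.e. more sort memory)
    # means less work here.
    handles = [open(name) for name in run_files]
    merged = list(heapq.merge(*(map(int, h) for h in handles)))
    for h in handles:
        h.close()
    for name in run_files:
        os.remove(name)
    return merged

data = [random.randint(0, 10_000) for _ in range(5_000)]
assert external_sort(data) == sorted(data)
print("sorted", len(data), "rows using in-memory runs of", CHUNK_SIZE)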