Merge versus join stage with four nodes

blazDelParis · Post by **blazDelParis** » Tue Mar 29, 2011 8:45 am

DATASET_1_______ 
DATASET_2_______ |MERGE|---->output

Here is a simple shcema of my job.
I've got 667 000 lines in dataset_2 and 28 millions in dataset_1.
There is no duplicate data in the two datasets.
I use a hash partition option , set on the common primary key.
The job runs with 4 nodes.

I want to join the two datasets, the dataset_2 is the main flow. So firstly, I've merged the two datasets using a Merge stage, with a "keep master row" option, setting dataset_2 as the master flow.

I get the 667 000 lines on the output, but 30 000 of them are not fulfilled with data from the dataset_1. among these 30 000 lines, most of them should have been matched.

If I replace the merge stage with a join stage, setting dataset_2 as the left part, and choosing left_outer_join option, I'll get the 667 000 lines, but only 1 500 lines not updated with the data from the dataset_1.
(which is correct)

Could you explain the difference ?

Thanks for your help.

PS : sorry for my bad english.

antonyraj.deva · Post by **antonyraj.deva** » Wed Mar 30, 2011 4:31 am

Any specific reason for choosing Hash Partioning?

And by the way your English is perfectly fine...

blazDelParis · Post by **blazDelParis** » Wed Mar 30, 2011 5:49 am

hi,
thank you for your answer.

Well, I use hash partition, because these is the only one I can expect what it will do.
I am not sure that others options (like round robin, Range or Random for exemple) would partition the data the same way with the two datasets, because round robin will paritition considering the number of data rows.

As the two datasets have differents rows number, I cannot be sure that, I'll get matching data from the two datasets in the same nodes while hash partition option, will operate a data isolation per nodes, according the value of the hash key, rather than the total number of rows in a dataset.

However, I am not a specialist about partition options, so could an another partition option explain the difference between merge and join stages ?

sparq · Post by **sparq** » Wed Mar 30, 2011 9:07 pm

Blaz,

Are the datasets sorted? if they aren't, then there will be a lot of records not matched.

blazDelParis · Post by **blazDelParis** » Thu Mar 31, 2011 2:11 am

hi,
I've forgotten to say about that.
They are partitioned and sorted on the same field. (which is the "primary key")

jwiles · Post by **jwiles** » Thu Mar 31, 2011 10:00 am

You state that the files are partitioned and sorted on the primary key. Do you mean one column out of multiple? While you can partition the data on only the primary key column, you must SORT it on ALL key columns in order to meet the requirements of the Merge or Join stages.

The partitioning method is used to bring the records into the same partition. The Sorting is used to group those records together within the partition.

Regards,

blazDelParis · Post by **blazDelParis** » Thu Mar 31, 2011 10:40 am

thank you for the answer.

well, the 2 datasets are partitioned and sorted on the same key. (one field only in the two datasets.)
The 2 datasets are merged on this key.

When I join the two datasets, in the other case, I still join them on the same key.

jwiles · Post by **jwiles** » Fri Apr 01, 2011 12:13 am

Are you using different datatypes for the key in the two versions of the job? Whether on purpose or accidentally? The datatypes of the key should match coming into the join or merge, otherwise you could miss matches in an unobvious manner.

Regards,

blazDelParis · Post by **blazDelParis** » Fri Apr 01, 2011 2:19 am

No they are both Decimal 38,10.

I know it is not the most efficient datatype, but when I do import the structure of the database table where the dataset structure is coming from, this is what Datastage do use as datatype to handle an Oracle number.