DATASET_1_______
DATASET_2_______ |MERGE|---->output
Above is a simple diagram of my job.
I have 667 000 lines in dataset_2 and 28 million in dataset_1.
Neither dataset contains duplicate data.
I use hash partitioning on the common primary key.
The job runs with 4 nodes.
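To make the partitioning setup concrete, here is a minimal Python sketch of what I understand hash partitioning on the key to do: rows with the same key value should land on the same node in both datasets. This is only an illustration. `partition_of` and `hash_partition` are hypothetical names, and CRC32 is a stand-in, not the hash function DataStage actually uses.

```python
import zlib

NUM_NODES = 4  # the job runs on 4 nodes


def partition_of(key: str, num_nodes: int = NUM_NODES) -> int:
    """Return the node index for a key (CRC32 is only a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def hash_partition(rows, key_field, num_nodes=NUM_NODES):
    """Split rows (dicts) into num_nodes lists by hashing the key field."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # The key is converted to text before hashing, so both datasets
        # must represent the key identically to land on the same node.
        node = partition_of(str(row[key_field]), num_nodes)
        partitions[node].append(row)
    return partitions
```

With this scheme, a key present in both datasets always ends up in the same partition number on each side, which is what I expect from the job.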
I want to join the two datasets, with dataset_2 as the main flow. I first merged the two datasets using a Merge stage with the "keep master row" option, setting dataset_2 as the master flow.
I get the 667 000 lines on the output, but 30 000 of them are not populated with data from dataset_1. Most of these 30 000 lines should have been matched.
If I replace the Merge stage with a Join stage, setting dataset_2 as the left input and choosing the left outer join option, I still get the 667 000 lines, but only 1 500 of them are not updated with data from dataset_1.
(which is correct)
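For reference, this is the behaviour I expect from both stages, written as a minimal Python sketch of a left outer join that keeps every master row. The function name, column names and toy data are made up for the example and are not taken from my job. It assumes unique keys in the update dataset, as in my case.

```python
def left_outer_join(master_rows, update_rows, key):
    """Keep every master row; attach update columns when the key matches."""
    # Build a lookup on the update side (assumes keys are unique there).
    lookup = {row[key]: row for row in update_rows}
    joined = []
    unmatched = 0
    for m in master_rows:
        u = lookup.get(m[key])
        if u is None:
            unmatched += 1
            joined.append(dict(m))      # master row kept, no update columns
        else:
            joined.append({**u, **m})   # master columns win on name clashes
    return joined, unmatched


# Toy data: "master" plays the role of dataset_2, "update" of dataset_1.
master = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
update = [{"id": 1, "value": 10}, {"id": 3, "value": 30}, {"id": 4, "value": 40}]

joined, unmatched = left_outer_join(master, update, "id")
print(len(joined), unmatched)  # prints: 3 1
```

In my job the same count would be 667 000 output lines, with the unmatched count being the number I am comparing between the two stages (30 000 versus 1 500).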
Could you explain why the two stages give different results?
Thanks for your help.
PS: Sorry for my bad English.