Merge versus join stage with four nodes

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

blazDelParis
Participant
Posts: 19
Joined: Wed Sep 08, 2010 6:47 am

Merge versus join stage with four nodes

Post by blazDelParis »

Code:

    DATASET_1 ----\
                   |MERGE|----> output
    DATASET_2 ----/


Here is a simple schema of my job.
I have 667,000 rows in dataset_2 and 28 million in dataset_1.
There are no duplicate keys in either dataset.
I use the hash partitioning option, set on the common primary key.
The job runs with 4 nodes.

I want to join the two datasets, with dataset_2 as the main flow. So first, I merged the two datasets using a Merge stage with the "keep master row" option, setting dataset_2 as the master input.

I get the 667,000 rows on the output, but 30,000 of them are not populated with data from dataset_1. Among these 30,000 rows, most should have been matched.

If I replace the Merge stage with a Join stage, setting dataset_2 as the left input and choosing the left outer join option, I get the 667,000 rows, but only 1,500 of them are not populated with data from dataset_1 (which is correct).

Could you explain the difference?

Thanks for your help.

PS: sorry for my bad English.
it's good to give a fish to a hungry man
it's better to teach him fishing
antonyraj.deva
Premium Member
Posts: 138
Joined: Wed Jul 16, 2008 9:51 pm
Location: Kolkata

Post by antonyraj.deva »

Any specific reason for choosing hash partitioning?

And by the way, your English is perfectly fine.
TONY
ETL Manager
Infotrellis India

"Do what you can, with what you have, from where you are and to the best of your abilities."
blazDelParis
Participant
Posts: 19
Joined: Wed Sep 08, 2010 6:47 am

Post by blazDelParis »

Hi,
Thank you for your answer.

Well, I use hash partitioning because it is the only option whose behavior I can predict. I am not sure that the other options (round robin, range, or random, for example) would partition the data the same way for both datasets, because round robin distributes rows according to their position in the stream.

Since the two datasets have different row counts, I cannot be sure that matching rows from the two datasets would land on the same nodes, whereas hash partitioning isolates data per node according to the value of the hash key rather than the total number of rows in the dataset.

However, I am not a specialist in partitioning options, so could another partitioning option explain the difference between the Merge and Join stages?
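The reasoning above can be sketched in plain Python (this is an illustration of the partitioning idea, not actual DataStage code; all names here are made up for the example):

```python
# Why hash partitioning keeps matching keys together while round-robin
# does not: hash placement depends only on the key value, round-robin
# placement depends only on a row's position in the stream.

def hash_partition(rows, key, nodes=4):
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[key]) % nodes].append(row)  # same key -> same partition
    return parts

def round_robin_partition(rows, nodes=4):
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)  # placement depends on row position
    return parts

ds1 = [{"pk": k} for k in range(10)]       # the big "master" input
ds2 = [{"pk": k} for k in (2, 5, 7)]       # a smaller input with fewer rows

h1 = hash_partition(ds1, "pk")
h2 = hash_partition(ds2, "pk")
# For every key present in both inputs, hash partitioning picks the same
# partition index on both sides, so matching rows meet on the same node.
# Round-robin gives no such guarantee when the row counts differ.
```

With round-robin, pk=2 is the third row of ds1 but the first row of ds2, so the two copies land in different partitions and can never meet for a match.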
sparq
Premium Member
Posts: 21
Joined: Wed Jun 09, 2010 6:06 am
Location: Australia

Post by sparq »

Blaz,

Are the datasets sorted? If they aren't, there will be a lot of unmatched records.
blazDelParis
Participant
Posts: 19
Joined: Wed Sep 08, 2010 6:47 am

Post by blazDelParis »

Hi,
I forgot to mention that.
They are partitioned and sorted on the same field (which is the "primary key").
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

You state that the files are partitioned and sorted on the primary key. Do you mean one column out of several? While you can partition the data on only the primary key column, you must sort it on ALL key columns in order to meet the requirements of the Merge or Join stages.

The partitioning method brings matching records into the same partition; the sorting groups those records together within the partition.
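To see why the sort matters so much for Merge specifically, here is a plain-Python sketch (an illustration of single-pass merge matching, not actual DataStage internals): a Merge-style stage reads both inputs in one sequential pass and assumes they arrive sorted on the key, so unsorted input silently loses matches, while sorting the same data first recovers them.

```python
# One-pass merge: advances through both inputs in key order. If the
# master input is NOT actually sorted, a matching update row can be
# passed by and the match is silently missed.

def sequential_merge(master, update, key):
    out, i = [], 0
    for m in master:
        while i < len(update) and update[i][key] < m[key]:
            i += 1                      # skip update rows "behind" the master
        if i < len(update) and update[i][key] == m[key]:
            out.append({**m, **update[i]})  # matched: enrich the master row
        else:
            out.append(dict(m))             # keep master row, unmatched
    return out

master = [{"pk": 1}, {"pk": 3}, {"pk": 2}]              # NOT sorted on pk
update = [{"pk": 2, "val": "b"}, {"pk": 3, "val": "c"}]

missed = sequential_merge(master, update, "pk")          # pk=2 match is lost
ok = sequential_merge(sorted(master, key=lambda r: r["pk"]),
                      sorted(update, key=lambda r: r["pk"]), "pk")
```

In `missed`, the pk=2 master row comes out without its update columns because the cursor has already moved past pk=2 in the update input; in `ok`, both pk=2 and pk=3 are enriched. A Join stage that buffers one input (rather than streaming both) would not miss these rows, which is consistent with the behavior described in this thread.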

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
blazDelParis
Participant
Posts: 19
Joined: Wed Sep 08, 2010 6:47 am

Post by blazDelParis »

Thank you for the answer.

Well, the two datasets are partitioned and sorted on the same key (a single field in both datasets).
The two datasets are merged on this key.

In the other version of the job, when I join the two datasets, I still join them on the same key.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Are you using different datatypes for the key in the two versions of the job, whether on purpose or accidentally? The datatypes of the key should match coming into the Join or Merge; otherwise you could miss matches in a non-obvious manner.

Regards,
- james wiles


blazDelParis
Participant
Posts: 19
Joined: Wed Sep 08, 2010 6:47 am

Post by blazDelParis »

No, they are both Decimal(38,10).

I know it is not the most efficient datatype, but when I import the structure of the database table that the dataset structure comes from, this is the datatype DataStage uses to represent an Oracle NUMBER.