HASH Partition not working for Checksum values

Posted: Mon Jul 31, 2017 4:25 am
by rohit_mca2003
Hi,

I need to join (using the Join stage) on columns that hold MD5 hash values (generated with the Checksum stage).

The source and target contain the same data, so I expect every record to match, but the join is not working properly. I HASH partition both inputs before the join. When I analysed the output of the HASH partitioning, the record count in each partition was different for the source and the target.

It seems the source and target are not being partitioned in the same way.
Please help if you know the reason and how I can resolve this, as I have to join on a column holding MD5 hash values.

Thanks.

Posted: Mon Jul 31, 2017 9:47 am
by UCDI
If you hash partition on the MD5 value, it should put matching records together properly. Are you hashing on multiple keys that could be different?
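
To illustrate the point above, here is a minimal Python sketch of how a hash partitioner behaves. It is not DataStage's actual hash function: `partition_of`, the SHA-256 stand-in, the 4-partition count, the extra date column and the 40-character padding are all assumptions made up for the illustration. It only shows that identical key values hashed on identical key columns always land together, while an extra key column or a padded value on one side scatters them.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of nodes


def md5_hex(text: str) -> str:
    """MD5 checksum as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def partition_of(key_fields, num_partitions=NUM_PARTITIONS):
    """Stand-in hash partitioner: hash all key fields, take the remainder."""
    joined = "\x1f".join(key_fields).encode("utf-8")
    digest = hashlib.sha256(joined).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


checksums = [md5_hex(f"row-{i}") for i in range(1000)]

# Same single key on both links: every record pair is co-located.
same_key = sum(partition_of([c]) == partition_of([c]) for c in checksums)

# One link hashed on an extra (hypothetical) column: many pairs are split up.
extra_key = sum(
    partition_of([c]) == partition_of([c, "2017-07-31"]) for c in checksums
)

# Same column, but one side padded to a fixed width (hypothetical 40 chars).
padded = sum(partition_of([c]) == partition_of([c.ljust(40)]) for c in checksums)

print("co-located pairs:", same_key, extra_key, padded)
```

Only the first count covers all 1000 records; the other two drop to roughly what chance alone would give.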

Posted: Mon Jul 31, 2017 12:22 pm
by chulett
Right, was thinking much the same thing.

If that turns out to not be the issue, I for one would need more clarification about certain aspects of this. For example, is the join "not happening properly" because the partitioning is wrong (i.e. matching join keys don't go to the same partition), or do you mean something else? And it seems to me the simplest test to see if your core logic is sound is to run it on a single node. Is that something you've tried?

Posted: Mon Jul 31, 2017 9:01 pm
by rohit_mca2003
When I say the join is not happening properly, I mean that if I run the join in sequence or with Entire partitioning then it works fine.
But with HASH partitioning, the partitioning does not seem to work correctly, and records from both sides (with the same key) seem to end up on different partitions.
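
One way to confirm that suspicion, sketched in Python under some assumptions: the key value and partition number of each record have been dumped from both links (how they are captured is up to you; a Transformer's @PARTITIONNUM is one option), and the rows below are made-up sample data. `compare_sides` is a hypothetical helper, not part of any product. It pairs keys after trimming and lower-casing them, then reports pairs whose raw value or partition differs, using `repr` so hidden padding or case differences become visible.

```python
def compare_sides(source_rows, target_rows):
    """Each row is (raw_key, partition). Return mismatching key pairs."""

    def normalize(key):
        return key.strip().lower()

    src = {normalize(k): (k, p) for k, p in source_rows}
    tgt = {normalize(k): (k, p) for k, p in target_rows}

    findings = []
    for nk in sorted(src.keys() & tgt.keys()):
        (src_key, src_part), (tgt_key, tgt_part) = src[nk], tgt[nk]
        if src_part != tgt_part or src_key != tgt_key:
            findings.append((repr(src_key), src_part, repr(tgt_key), tgt_part))
    return findings


# Made-up sample: the second target key is upper-cased with a trailing space.
source_rows = [
    ("9e107d9d372bb6826bd81d3542a419d6", 0),
    ("e4d909c290d0fb1ca068ffaddf22cbd0", 2),
]
target_rows = [
    ("9e107d9d372bb6826bd81d3542a419d6", 0),
    ("E4D909C290D0FB1CA068FFADDF22CBD0 ", 3),
]

findings = compare_sides(source_rows, target_rows)
for finding in findings:
    print(finding)
```

If the raw values differ in any way (case, padding, data type), the two sides are not really hashing the same key, even though the checksums look identical at a glance.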

Posted: Tue Aug 01, 2017 5:58 am
by chulett
... which unfortunately doesn't answer the question asked.

Posted: Tue Aug 01, 2017 6:43 am
by chulett
Well... in continuing to ponder this, it seems we can infer an answer. So the join itself is in fact working, assuming that "in sequence" means "sequentially", a.k.a. either on a single node or with the stage being constrained. Which means we're back to: exactly what are you partitioning on? Please detail that information for us (words, screenshot).
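
The inference above can be sketched as a small Python simulation. It is a toy model, not DataStage: `bucket`, `partitioned_join`, the SHA-256 stand-in hash and the `load_date` column are all hypothetical. It joins each partition independently, the way a parallel join would, and shows that a single partition always finds every match, whereas several partitions only do so when both inputs are hashed on the same key.

```python
import hashlib
from collections import defaultdict


def bucket(rows, key_fn, num_partitions):
    """Distribute rows into partitions by hashing key_fn(row)."""
    parts = defaultdict(list)
    for row in rows:
        digest = hashlib.sha256(key_fn(row).encode("utf-8")).digest()
        parts[int.from_bytes(digest[:8], "big") % num_partitions].append(row)
    return parts


def partitioned_join(left, right, left_key_fn, right_key_fn, num_partitions):
    """Count inner-join matches on 'md5', joining each partition separately."""
    left_parts = bucket(left, left_key_fn, num_partitions)
    right_parts = bucket(right, right_key_fn, num_partitions)
    matches = 0
    for p in range(num_partitions):
        right_keys = {row["md5"] for row in right_parts[p]}
        matches += sum(1 for row in left_parts[p] if row["md5"] in right_keys)
    return matches


def by_md5(row):
    return row["md5"]


def by_md5_and_date(row):
    return row["md5"] + "|" + row["load_date"]


source = [
    {"md5": hashlib.md5(f"row-{i}".encode()).hexdigest(), "load_date": "2017-07-31"}
    for i in range(500)
]
target = [dict(row) for row in source]  # identical data on the other side

sequential = partitioned_join(source, target, by_md5, by_md5_and_date, 1)
aligned = partitioned_join(source, target, by_md5, by_md5, 4)
misaligned = partitioned_join(source, target, by_md5, by_md5_and_date, 4)

print(sequential, aligned, misaligned)
```

The sequential and aligned runs match all 500 records; the misaligned run loses most of them, which is the symptom described in this thread.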