I am facing a weird issue with the Remove Duplicates stage. The job is a straightforward one:
input_dataset===>remove_duplicates===>output_dataset
But when we changed the partitioning on the Remove Duplicates stage's input tab from Auto to Hash (with the two columns set as the key columns), it removed the duplicates successfully.
I was assuming that when we use the Remove Duplicates stage on a single node, we don't have to worry about partitioning. Is my assumption wrong? Has anybody faced a similar issue?
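To make my assumption concrete, here is a rough Python sketch of how I picture it. This is not DataStage code: the function names are made up, round-robin is only my stand-in for whatever Auto picks, and the per-partition dedup is a simplification of what the stage does. It only shows why I expected partitioning to matter on several nodes but not on one.

```python
# Toy model (assumption, not DataStage internals): rows are split into
# partitions, and duplicates are removed independently inside each partition.

def partition_round_robin(rows, nodes):
    """Deal rows out to partitions in turn, ignoring the key columns."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def partition_hash(rows, nodes, key):
    """Send every row with the same key to the same partition."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(key(row)) % nodes].append(row)
    return parts

def dedup_per_partition(parts, key):
    """Keep the first row per key within each partition, then concatenate."""
    out = []
    for part in parts:
        seen = set()
        for row in part:
            k = key(row)
            if k not in seen:
                seen.add(k)
                out.append(row)
    return out

# Hypothetical sample data: the first two columns are the key columns.
rows = [("A", 1, "x"), ("A", 1, "z"), ("B", 2, "y"), ("B", 2, "w")]
key = lambda r: (r[0], r[1])

two_nodes_rr = dedup_per_partition(partition_round_robin(rows, 2), key)
two_nodes_hash = dedup_per_partition(partition_hash(rows, 2, key), key)
one_node_rr = dedup_per_partition(partition_round_robin(rows, 1), key)
one_node_hash = dedup_per_partition(partition_hash(rows, 1, key), key)

print(len(two_nodes_rr), len(two_nodes_hash), len(one_node_rr), len(one_node_hash))
```

In this toy model the two-node round-robin case leaves duplicates behind because matching rows land in different partitions, while on a single node both methods give the same result. That is why the behaviour I am seeing surprises me.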
Please help.
Thank you.