Remove duplicates stage unable to remove duplicates on 1 node

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Remove duplicates stage unable to remove duplicates on 1 node

Post by Minhajuddin »

Hi all,

I am facing a weird issue with the Remove Duplicates stage. The job is a simple straight-through design:

Code: Select all

input_dataset===>remove_duplicates===>output_dataset
It just has two columns, both varchars (these were trimmed before the data was inserted). When we run the job on the default configuration (default.apt has only one node, and that is shown in the Director too), the job does not remove the duplicates.

But when we changed the partitioning on the Remove Duplicates stage's Input tab from Auto to Hash (with the two columns set as the key columns), it removed the duplicates successfully.

I was assuming that when we use the Remove Duplicates stage on a single node, we don't have to worry about partitioning. Is my assumption wrong? Has somebody faced a similar issue?
Please help.

Thank you.
Minhajuddin

Vasanth
Participant
Posts: 10
Joined: Tue Apr 10, 2007 1:37 am
Location: M nagar

Post by Vasanth »

First of all, partitioning and node configuration are two different concepts. Partitioning arranges similar groups of data together and places them in an organized way.

Nodes are used for achieving parallelism: the data is distributed across several machines (think of several CPUs) based on how many nodes you configure, and is processed in parallel.

What happened in your scenario is that the data was not sorted, and that caused problems during execution. The best way to resolve this is to sort or partition your data on the key columns, which is what you did.
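As an aside, a rough way to picture hash partitioning on key columns (a sketch in Python for illustration only, not DataStage's actual hashing algorithm): rows whose key columns carry identical values always hash to the same value, so they are always routed to the same partition, and all copies of a duplicate end up together where a downstream stage can see them side by side.

```python
# Illustrative sketch of hash partitioning -- NOT DataStage's real algorithm.
# Rows with identical key values always land in the same partition.

def hash_partition(rows, key_cols, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Build the partitioning key from the chosen key columns.
        key = tuple(row[c] for c in key_cols)
        # Same key -> same hash -> same partition, every time.
        partitions[hash(key) % num_partitions].append(row)
    return partitions

rows = [
    {"a": "x", "b": "1"},
    {"a": "y", "b": "2"},
    {"a": "x", "b": "1"},  # duplicate of the first row
]
parts = hash_partition(rows, ["a", "b"], 4)
# Both copies of ("x", "1") are guaranteed to sit in the same partition,
# so a dedup operator reading that partition can eliminate one of them.
```

This is why hashing on the two varchar columns groups the duplicates together regardless of their original order in the input.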

Thanks,
Vasanth
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Vasanth wrote:First of all, partitioning and node configuration are two different concepts. Partitioning arranges similar groups of data together and places them in an organized way.

Nodes are used for achieving parallelism: the data is distributed across several machines (think of several CPUs) based on how many nodes you configure, and is processed in parallel.

What happened in your scenario is that the data was not sorted, and that caused problems during execution. The best way to resolve this is to sort or partition your data on the key columns, which is what you did.

Thanks,
Vasanth
Thanks for the reply Vasanth.

But the data is not partitioned when the job runs on a single node, so you see, the two are related. Unless I am completely off track.
Minhajuddin

tcs
Premium Member
Premium Member
Posts: 5
Joined: Thu Mar 17, 2005 1:46 pm

Post by tcs »

I would agree with you: you shouldn't need to repartition with only one node. A couple of ideas:

Does your input_dataset have only one partition? If it has two or more, and you have the preserve-partitioning flag set, I don't know whether the Auto partition mode will collect all of the records into one partition. Probably one of the gurus could answer this.

Are your data sorted? DataStage should insert the correct sort operators, but many developers prefer to specify the sort in the job design. Your score dump will indicate whether sorts were inserted. Again, it doesn't seem as though this should be affected by the partition method, but it's something else to poke at the problem with.
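To illustrate why the sort matters (a generic Python sketch of how remove-duplicates operators typically work, not DataStage's actual code): such an operator usually compares each record only with the one before it, like Unix uniq, so it can only drop duplicates that are adjacent. On unsorted input, non-adjacent duplicates survive, which matches the symptom in this thread.

```python
# Sketch of adjacent-only duplicate removal (the general technique behind
# remove-duplicates operators; illustrative, not DataStage internals).

def remove_adjacent_duplicates(rows, key_cols):
    out = []
    prev_key = object()  # sentinel that never equals a real key
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key != prev_key:  # only compares against the previous record
            out.append(row)
        prev_key = key
    return out

unsorted = [{"k": "a"}, {"k": "b"}, {"k": "a"}]
# The two "a" rows are not adjacent, so both survive -- duplicates remain.
print(remove_adjacent_duplicates(unsorted, ["k"]))

sorted_rows = sorted(unsorted, key=lambda r: r["k"])
# After sorting, the "a" rows are adjacent, so one is dropped.
print(remove_adjacent_duplicates(sorted_rows, ["k"]))
```

This is also why hash partitioning on the key columns appeared to fix the job: changing the partitioning typically causes the framework to insert a sort on those keys, making the duplicates adjacent.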
manishk
Participant
Posts: 32
Joined: Tue Oct 25, 2005 8:45 pm

Re: Remove duplicates stage unable to remove duplicates on 1 node

Post by manishk »

Seems like it's a sorting problem.

Minhajuddin wrote:Hi all,

I am facing a weird issue with the Remove Duplicates stage. The job is a simple straight-through design:

Code: Select all

input_dataset===>remove_duplicates===>output_dataset
It just has two columns, both varchars (these were trimmed before the data was inserted). When we run the job on the default configuration (default.apt has only one node, and that is shown in the Director too), the job does not remove the duplicates.

But when we changed the partitioning on the Remove Duplicates stage's Input tab from Auto to Hash (with the two columns set as the key columns), it removed the duplicates successfully.

I was assuming that when we use the Remove Duplicates stage on a single node, we don't have to worry about partitioning. Is my assumption wrong? Has somebody faced a similar issue?
Please help.

Thank you.
Thanks
Manish
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Thanks for all the replies.

I'll check for the tsort operator in the score (I guess it should be there).
I'll also see if the dataset consists of a single partition.
Minhajuddin

Post Reply