Remove duplicates stage unable to remove duplicates on 1 node

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Remove duplicates stage unable to remove duplicates on 1 node

Post by Minhajuddin »

Hi all,

I am facing a weird issue with the Remove Duplicates stage. The job is a simple straight-through design:

Code: Select all

input_dataset===>remove_duplicates===>output_dataset
It just has two columns, both varchars (these were trimmed before the data was inserted). When we run the job on the default configuration (default.apt has only one node, and that is shown in the Director too), the job does not remove the duplicates.

But when we changed the partitioning on the Remove Duplicates stage's Input tab from Auto to Hash (with the two columns set as the key columns), it removed the duplicates successfully.

I was assuming that when we use the Remove Duplicates stage on a single node, we don't have to worry about partitioning. Is my assumption wrong? Has somebody faced a similar issue?
Please help.

Thank you.
Minhajuddin

Vasanth
Participant
Posts: 10
Joined: Tue Apr 10, 2007 1:37 am
Location: M nagar

Post by Vasanth »

First of all, partitioning and node configuration are two different concepts. Partitioning arranges similar groups of data together and places them in an organized way.

Nodes are used for achieving parallelism: the data is distributed across several machines (think of several CPUs) based on how many nodes you configure, and is processed in parallel.

What happened in your scenario is that the data was not sorted, and that caused problems during execution. The best way to resolve this is to sort or partition your data on the key columns, which is what you did.
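As an aside, a rough way to picture hash partitioning on key columns (a sketch in Python for illustration only, not DataStage's actual hashing algorithm): rows whose key columns carry identical values always hash to the same value, so they are always routed to the same partition, and all copies of a duplicate end up together where a downstream stage can see them side by side.

```python
# Illustrative sketch of hash partitioning -- NOT DataStage's real algorithm.
# Rows with identical key values always land in the same partition.

def hash_partition(rows, key_cols, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Build the partitioning key from the chosen key columns.
        key = tuple(row[c] for c in key_cols)
        # Same key -> same hash -> same partition, every time.
        partitions[hash(key) % num_partitions].append(row)
    return partitions

rows = [
    {"a": "x", "b": "1"},
    {"a": "y", "b": "2"},
    {"a": "x", "b": "1"},  # duplicate of the first row
]
parts = hash_partition(rows, ["a", "b"], 4)
# Both copies of ("x", "1") are guaranteed to sit in the same partition,
# so a dedup operator reading that partition can eliminate one of them.
```

This is why hashing on the two varchar columns groups the duplicates together regardless of their original order in the input.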

Thanks,
Vasanth
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Vasanth wrote:First of all, partitioning and node configuration are two different concepts. Partitioning arranges similar groups of data together and places them in an organized way.

Nodes are used for achieving parallelism: the data is distributed across several machines (think of several CPUs) based on how many nodes you configure, and is processed in parallel.

What happened in your scenario is that the data was not sorted, and that caused problems during execution. The best way to resolve this is to sort or partition your data on the key columns, which is what you did.

Thanks,
Vasanth
Thanks for the reply Vasanth.

But the data is not partitioned when the job runs on a single node, so you see, the two are related. Unless I am completely off track.
Minhajuddin

tcs
Premium Member
Premium Member
Posts: 5
Joined: Thu Mar 17, 2005 1:46 pm

Post by tcs »

I would agree with you: you shouldn't need to repartition with only one node. A couple of ideas:

Does your input_dataset have only one partition? If it has two or more, and you have the preserve-partitioning flag set, I don't know whether the Auto partition mode will collect all of the records into one partition. Probably one of the gurus could answer this.

Are your data sorted? DataStage should insert the correct sort operators, but many developers prefer to specify the sort in the job design. Your score dump will indicate whether sorts were inserted. Again, it doesn't seem as though this should be affected by the partition method, but it's something else to poke at the problem with.
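To illustrate why the sort matters (a generic Python sketch of how remove-duplicates operators typically work, not DataStage's actual code): such an operator usually compares each record only with the one before it, like Unix uniq, so it can only drop duplicates that are adjacent. On unsorted input, non-adjacent duplicates survive, which matches the symptom in this thread.

```python
# Sketch of adjacent-only duplicate removal (the general technique behind
# remove-duplicates operators; illustrative, not DataStage internals).

def remove_adjacent_duplicates(rows, key_cols):
    out = []
    prev_key = object()  # sentinel that never equals a real key
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key != prev_key:  # only compares against the previous record
            out.append(row)
        prev_key = key
    return out

unsorted = [{"k": "a"}, {"k": "b"}, {"k": "a"}]
# The two "a" rows are not adjacent, so both survive -- duplicates remain.
print(remove_adjacent_duplicates(unsorted, ["k"]))

sorted_rows = sorted(unsorted, key=lambda r: r["k"])
# After sorting, the "a" rows are adjacent, so one is dropped.
print(remove_adjacent_duplicates(sorted_rows, ["k"]))
```

This is also why hash partitioning on the key columns appeared to fix the job: changing the partitioning typically causes the framework to insert a sort on those keys, making the duplicates adjacent.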
manishk
Participant
Posts: 32
Joined: Tue Oct 25, 2005 8:45 pm

Re: Remove duplicates stage unable to remove duplicates on 1 node

Post by manishk »

Seems like it's a sorting problem.

Minhajuddin wrote:Hi all,

I am facing a weird issue with the Remove Duplicates stage. The job is a simple straight-through design:

Code: Select all

input_dataset===>remove_duplicates===>output_dataset
It just has two columns, both varchars (these were trimmed before the data was inserted). When we run the job on the default configuration (default.apt has only one node, and that is shown in the Director too), the job does not remove the duplicates.

But when we changed the partitioning on the Remove Duplicates stage's Input tab from Auto to Hash (with the two columns set as the key columns), it removed the duplicates successfully.

I was assuming that when we use the Remove Duplicates stage on a single node, we don't have to worry about partitioning. Is my assumption wrong? Has somebody faced a similar issue?
Please help.

Thank you.
Thanks
Manish
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Thanks for all the replies.

I'll check for the tsort operator in the score (I guess it should be there).
I'll also see if the dataset consists of a single partition.
Minhajuddin

Post Reply