
Composite Key Duplicate Check

Posted: Mon Dec 06, 2010 5:37 pm
by josh.guffey
Hi Everyone,

I have a small table with a 3 data element composite key. My requirement is to check for duplicates within this composite key. Here is the approach that I am taking.

Create a derived column D in a transformer that holds the values of these fields by using this logic.

Code: Select all

If IsNotNull(Src.A) Then
 If IsNotNull(Src.B) Then
  If IsNotNull(Src.C) Then
   Src.A : Src.B : Src.C
  Else
   -1
 Else
  -1
Else
 -1
The next stage is a Sort Stage and I am sorting the data on my new derived column and setting the 'Create Key Change Column' attribute = True. After the Sort Stage I am validating my duplicates in another transformer by using this logic:

Code: Select all

If IsNotNull(DupeCheck.D) Then
 If DupeCheck.keyChange = 0 Then 
  'InternalErrorCode' 
 Else 
  SetNull() 
Else
 SetNull()
2 questions regarding this example:

Is there a better way to handle a potentially NULL input column other than assigning a -1 (or some other arbitrary value) to the derived column and rejecting this record to a DataSet or Error Table?

Is this the best practice method of checking for duplicates in DataStage? I think that the new functionality in 8.5 may prove extremely useful in this case, but I just wanted to see if there are any other suggestions.

Thanks for reading this post!

Josh

Posted: Mon Dec 06, 2010 6:21 pm
by jhmckeever
Hi Josh,

There are a number of approaches to duplicate checking, and the one you take depends on the context in which your processing is taking place.

For the code you quoted, you're probably better off using ...

Code: Select all

If IsNotNull(Src.A) AND IsNotNull(Src.B) AND IsNotNull(Src.C) Then 
   Src.A : Src.B : Src.C 
Else 
   SetNull()
... then using constraints based on IsNull().
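One way to wire that up (just a sketch; the stage variable name svKey is only for illustration) is to hold the derivation above in a stage variable and constrain the output links on it:

Code: Select all

Good rows link constraint:      IsNotNull(svKey)
Reject/error link constraint:   IsNull(svKey)

Rows with a NULL in any of the three key columns then land on your error link without needing a -1 marker value.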

Bear in mind that you don't need to concatenate the columns just for duplicate checking as you can specify multiple columns in the downstream duplicate check.

If you're still keen on the concatenation approach, watch out: concatenating the values 10, 11 and 12 gives the same result ("101112") as concatenating 101, 11 and 2, unless you cater for these situations. Using a delimiter is one approach.
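For example (the '|' delimiter is only an illustration; any character that cannot appear in the key values will do):

Code: Select all

Src.A : '|' : Src.B : '|' : Src.C

That keeps 10|11|12 and 101|11|2 distinct.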

John.

Posted: Mon Dec 06, 2010 6:53 pm
by stuartjvnorton
John's got the logic sorted for the method you're using, so I'm not going to go there. ;-)


As for other ways to do it, it depends on what you want to do with the dupes.

If you're happy to drop the dupes, sort by the 3 keys and then run the data through a Remove Duplicates stage, using those 3 keys.

If you want to keep the dupes, use a Sort stage sorting by the 3 keys and, under Properties/Options, set "Create Key Change Column" to true. The first row in each group will be set to 1 and the dupes that follow will be 0, and you can take it from there.
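As a rough sketch of that last step, assuming the sorted link is called DupeCheck as in Josh's example, the downstream transformer constraints could look like:

Code: Select all

Keep (first in group) link constraint:  DupeCheck.keyChange = 1
Duplicate rows link constraint:         DupeCheck.keyChange = 0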

Cheers,
Stuart.

Re: Composite Key Duplicate Check

Posted: Tue Dec 07, 2010 12:30 am
by sachinag
Hi,

Please make a single column by concatenating all the composite key columns, then use a Remove Duplicates stage after the sort.


[quote="josh.guffey"]Hi Everyone,

I have a small table with a 3 data element composite key. My requirement is to check for duplicates within this composite key. Here is the approach that I am taking.

Create a derived column D in a transformer that holds the values of these fields by using this logic.

[code]
If IsNotNull(Src.A) Then
If IsNotNull(Src.B) Then
If IsNotNull(Src.C) Then
Src.A : Src.B : Src.C
Else
-1
Else
-1
Else
-1
[/code]

The next stage is a Sort Stage and I am sorting the data on my new derived column and setting the 'Create Key Change Column' attribute = True. After the Sort Stage I am validating my duplicates in another transformer by using this logic:

[code]
If IsNotNull(DupeCheck.D) Then
If DupeCheck.keyChange = 0 Then
'InternalErrorCode'
Else
SetNull()
Else
SetNull()
[/code]

2 questions regarding this example:

Is there a better way to handle a potentially NULL input column other than assigning a -1 (or some other arbitrary value) to the derived column and rejecting this record to a DataSet or Error Table?

Is this the best practice method of checking for duplicates in DataStage? I think that the new functionality in 8.5 may prove extremely useful in this case, but I just wanted to see if there are any other suggestions.

Thanks for reading this post!

Josh[/quote]