Composite Key Duplicate Check

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
josh.guffey
Participant
Posts: 40
Joined: Thu Apr 17, 2008 1:52 pm
Location: Huntsville, AL

Composite Key Duplicate Check

Post by josh.guffey »

Hi Everyone,

I have a small table with a 3 data element composite key. My requirement is to check for duplicates within this composite key. Here is the approach that I am taking.

Create a derived column D in a transformer that holds the values of these fields by using this logic.

Code: Select all

If IsNotNull(Src.A) Then
 If IsNotNull(Src.B) Then
  If IsNotNull(Src.C) Then
   Src.A : Src.B : Src.C
  Else
   -1
 Else
  -1
Else
 -1
The next stage is a Sort Stage and I am sorting the data on my new derived column and setting the 'Create Key Change Column' attribute = True. After the Sort Stage I am validating my duplicates in another transformer by using this logic:

Code: Select all

If IsNotNull(DupeCheck.D) Then
 If DupeCheck.keyChange = 0 Then 
  'InternalErrorCode' 
 Else 
  SetNull() 
Else
 SetNull()
2 questions regarding this example:

Is there a better way to handle a potentially NULL input column other than assigning a -1 (or some other arbitrary value) to the derived column and rejecting this record to a DataSet or Error Table?

Is this the best practice method of checking for duplicates in DataStage? I think that the new functionality in 8.5 may prove extremely useful in this case, but I just wanted to see if there are any other suggestions.

Thanks for reading this post!

Josh
jhmckeever
Premium Member
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia
Contact:

Post by jhmckeever »

Hi Josh,

there are a number of approaches to duplicate checking, and the one you take depends upon the context in which your processing is taking place.

For the code you quoted you're probably better using ...

Code: Select all

If IsNotNull(Src.A) AND IsNotNull(Src.B) AND IsNotNull(Src.C) Then 
   Src.A : Src.B : Src.C 
Else 
   SetNull()
... then using constraints based on IsNull().

Bear in mind that you don't need to concatenate the columns just for duplicate checking as you can specify multiple columns in the downstream duplicate check.

If you're still keen on the concatenation approach then watch out that concatenating "10 : 11 : 12" will be the same as "101 : 11 : 2" unless you cater for these situations. Using a delimiter is one approach.

John.
<b>John McKeever</b>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
<a href="http://www.datamigrators.com/"><img src="https://www.datamigrators.com/assets/im ... l.png"></a>
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

John's got the logic sorted for method you're using to go about it, so I'm not going to go there. ;-)


As for other ways to do it, it depends what you want to do with the dupes.

If you're happy to drop dupes, then sort by the 3 keys and then run it through a Remove Duplicates stage, using those 3 keys.

If you want to keep the dupes, use a Sort stage sorting by the 3 keys and under Properties/Options, set "Creat Key Change Column" to true. The first in the group will be set to 1, and the dupes that follow will be 0, and you can take it from there.

Cheers,
Stuart.
sachinag
Participant
Posts: 12
Joined: Fri Mar 12, 2010 4:38 am

Re: Composite Key Duplicate Check

Post by sachinag »

Hi,

please make a single column using concatenate operator for all the composite key column,then use remove duplicate stage after sort.


[quote="josh.guffey"]Hi Everyone,

I have a small table with a 3 data element composite key. My requirement is to check for duplicates within this composite key. Here is the approach that I am taking.

Create a derived column D in a transformer that holds the values of these fields by using this logic.

[code]
If IsNotNull(Src.A) Then
If IsNotNull(Src.B) Then
If IsNotNull(Src.C) Then
Src.A : Src.B : Src.C
Else
-1
Else
-1
Else
-1
[/code]

The next stage is a Sort Stage and I am sorting the data on my new derived column and setting the 'Create Key Change Column' attribute = True. After the Sort Stage I am validating my duplicates in another transformer by using this logic:

[code]
If IsNotNull(DupeCheck.D) Then
If DupeCheck.keyChange = 0 Then
'InternalErrorCode'
Else
SetNull()
Else
SetNull()
[/code]

2 questions regarding this example:

Is there a better way to handle a potentially NULL input column other than assigning a -1 (or some other arbitrary value) to the derived column and rejecting this record to a DataSet or Error Table?

Is this the best practice method of checking for duplicates in DataStage? I think that the new functionality in 8.5 may prove extremely useful in this case, but I just wanted to see if there are any other suggestions.

Thanks for reading this post!

Josh[/quote]
Post Reply