If you don't want to use a hased file or aggregator, then I hope your source file is pre-sorted.
Here's some ideas:
- If your source is sorted, you can use Stage Variables in a transformer to save the PK of the previous row, and check it against the PK of the current row. If it is the same, reject it.
- If your source is not sorted, I think there is a plug-in Sorter stage. Don't know how efficient it is. Otherwise, you could use an OS sort command (eg. Unix sort) and place it in the Filter Command of your SEQ file stage.
- Without sorting, you can implement a hashed file using a library routine rather than having the hashed file stages cluttering your job design. The benefits of this are very arguable - the hashed file stages may handle large volumes more efficiently. Despite this, I developed such a routine. The source code is below:
FUNCTION IsUnique(pTEMPDIR, pHashFile, pInitialize, pVal)
Code: Select all
DEFFUN NVL(A,B) calling "DSU.NVL"
DEFFUN IsEquivalent(A,B) calling "DSU.IsEquivalent"
Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle
EQUATE RoutineName TO 'IsUnique'
Ans = 0
HashPath = pTEMPDIR : "/" : pHashFile
* If the Initialize parameter is set, remove and rebuild the hash file
If pInitialize Then
Initialized = 0
* The Unix command does 3 things (and fails immediately if any of them fail):
* : Go to the TEMP Directory
* : Remove the existing hash file if it exists
* : Create a new hash file with mkdbfile
* I had to use this method, because it would fail with "unable to create ..." error if I created it with
* full path name from another directory. Dont know why.
StrCommand = "cd " : pTEMPDIR : " && rm -rf " : pHashFile : " && $DSHOME/bin/mkdbfile " : pHashFile : " 30 1 4 20 50 80 1628"
Call DSExecute("UNIX", StrCommand , OutPut, RetCode)
if RetCode <> 0 then
Call DSLogWarn("Unix error (" : RetCode : ") in ": StrCommand : " ::::: " : OutPut, "IsUnique")
end
* Attempt to open the file.
Openpath HashPath TO UniqFileHandle
Then
Initialized = 1
UniqFileName = HashPath
End
Else
Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
Ans = -1
End
End
If Initialized Then
* If the currently open file is a different file, then open the correct file
If Not(IsEquivalent(UniqFileName, HashPath)) Then
Openpath HashPath TO UniqFileHandle
Then
Initialized = 1
UniqFileName = HashPath
End
Else
Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
Ans = -1
End
End
If Ans = 0 Then
* NULL values are always unique
If IsNull(pVal) Then
Ans = 1
End
Else
* Check if Val already exists
Read ValExists From UniqFileHandle, pVal
Then
* Return FALSE if exists
Ans = 0
End
Else
* Return TRUE and add to file if not exists
Write pVal On UniqFileHandle, pVal
Then
Ans = 1
End
Else
Call DSLogWarn("Unable to write hash file " : NVL(pHashFile,"<NULL>"), "IsUnique")
Ans = -1
End
End
End
End
End
Code: Select all
Use this routine to ensure that all values of a given column in a transformation are unique.
How to use it:
Create two Stage Variables in a Transformer:
Variable Default Expression
----------- ------------ -----------------------
XXXIsUnique CheckUnique(#$TEMPDIR#, hashfilename, Initialize, LK_Link.Column)
Initialize 1 0
Since "Initialize" is below "XXXIsUnique" in the Stage Variables, the first time "XXXIsUnique" is evaluated, "Initialize" will be 1.
After the first row, "Initialize" is set to "0" and is therefore 0 for each subsequent call.
In this way, the hash file in IsUnique() will be initialized only once.
IsUnique() will return 1 if the column value has not yet been seen, or 0 if it has been seen.