rejecting duplicate rows

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rafidwh
Participant
Posts: 179
Joined: Mon Oct 10, 2005 11:30 pm

rejecting duplicate rows

Post by rafidwh »

Hi,
I have a sequential file as the source and Oracle OCI as the target, and it is just a one-to-one mapping. Now I need to reject the duplicates, and I don't want to use a hashed file or an Aggregator.

My doubt is this: if I use another flat file as the reject target, and duplicate records exist that Oracle will not accept, will those records automatically go to the reject flat file, or does a condition need to be given? If a condition is required, please provide it.

Awaiting your suggestions.

Thanks in advance
Saik
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

If you don't want to use a hashed file or aggregator, then I hope your source file is pre-sorted. :wink:

Here are some ideas:
- If your source is sorted, you can use Stage Variables in a transformer to save the PK of the previous row and check it against the PK of the current row. If it is the same, reject it (see the sketch after this list).
- If your source is not sorted, I think there is a plug-in Sorter stage. I don't know how efficient it is. Otherwise, you could use an OS sort command (e.g. Unix sort) and place it in the Filter Command of your SEQ file stage.
- Without sorting, you can implement a hashed file using a library routine rather than having hashed file stages cluttering your job design. The benefits of this are very arguable - the hashed file stages may handle large volumes more efficiently. Despite this, I developed such a routine; the source code follows the sketch below.
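A minimal sketch of the sorted-input approach (the link and column names are only placeholders, and it assumes the key is never an empty string) - two Stage Variables in this order:

   Variable     Default   Derivation
   ----------   -------   ----------------------------------------
   svIsDup      0         If IN.KEY_COL = svPrevKey Then 1 Else 0
   svPrevKey    ''        IN.KEY_COL

   Constraint on the OCI load link:     svIsDup = 0
   Constraint on the reject file link:  svIsDup = 1

For the unsorted case, the Filter Command idea amounts to putting something like sort -t'|' -k1,1 in the SEQ stage's Filter Command, so the stage reads a sorted stream before the transformer sees it.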

FUNCTION IsUnique(pTEMPDIR, pHashFile, pInitialize, pVal)

Code: Select all

      DEFFUN NVL(A,B) calling "DSU.NVL"
      DEFFUN IsEquivalent(A,B) calling "DSU.IsEquivalent"

      Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle


      EQUATE RoutineName TO 'IsUnique'

      Ans = 0
      HashPath = pTEMPDIR : "/" : pHashFile



      * If the Initialize parameter is set, remove and rebuild the hash file
      If pInitialize Then
         Initialized = 0

         * The Unix command does 3 things (and fails immediately if any of them fail):
         * : Go to the TEMP Directory
         * : Remove the existing hash file if it exists
         * : Create a new hash file with mkdbfile
         * I had to use this method, because it would fail with an "unable to create ..." error if I created it with
         * the full path name from another directory. Don't know why.

         StrCommand = "cd " : pTEMPDIR : " && rm -rf " : pHashFile : " && $DSHOME/bin/mkdbfile " : pHashFile : " 30 1 4 20 50 80 1628"

         Call DSExecute("UNIX", StrCommand , OutPut, RetCode)
         if RetCode <> 0 then
            Call DSLogWarn("Unix error (" : RetCode : ") in ": StrCommand : " ::::: " : OutPut, "IsUnique")
         end

         * Attempt to open the file.
         Openpath HashPath TO UniqFileHandle
         Then
            Initialized = 1
            UniqFileName = HashPath
         End
         Else
            Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
            Ans = -1
         End
      End


      If Initialized Then

         * If the currently open file is a different file, then open the correct file

         If Not(IsEquivalent(UniqFileName, HashPath)) Then
            Openpath HashPath TO UniqFileHandle
            Then
               Initialized = 1
               UniqFileName = HashPath
            End
            Else
               Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
               Ans = -1
            End
         End




         If Ans = 0 Then

            * NULL values are always unique
            If IsNull(pVal) Then
               Ans = 1
            End
            Else
               * Check if Val already exists
               Read ValExists From UniqFileHandle, pVal
               Then
                  * Return FALSE if exists
                  Ans = 0
               End
               Else
                  * Return TRUE and add to file if not exists
                  Write pVal On UniqFileHandle, pVal
                  Then
                     Ans = 1
                  End
                  Else
                     Call DSLogWarn("Unable to write hash file " : NVL(pHashFile,"<NULL>"), "IsUnique")
                     Ans = -1
                  End
               End
            End
         End
      End

Use this routine to ensure that all values of a given column in a transformation are unique.

How to use it:

Create two Stage Variables in a Transformer (in this order):

   Variable      Default   Expression
   -----------   -------   -------------------------------------------------------------
   XXXIsUnique             IsUnique(#$TEMPDIR#, hashfilename, Initialize, LK_Link.Column)
   Initialize    1         0

Since "Initialize" is below "XXXIsUnique" in the Stage Variables, the first time "XXXIsUnique" is evaluated, "Initialize" will still be 1 (its default).
After the first row, "Initialize" is set to 0 and is therefore 0 for each subsequent call.
In this way, the hashed file in IsUnique() is initialized only once.

IsUnique() returns 1 if the column value has not yet been seen, 0 if it has already been seen, or -1 if the hashed file could not be opened or written.
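One way to wire this up in the Transformer (the link names are only examples):

   Constraint on the Oracle load link:       XXXIsUnique = 1
   Constraint on the reject flat file link:  XXXIsUnique <> 1   (also catches the -1 error value)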
Ross Leishman
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Or you could simplify things tremendously by 'cluttering up your job design' with a hashed file between the sequential and OCI stages. :wink:
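Spelled out, the idea is roughly this (the key column name is just an example): write the rows through a Transformer into a hashed file keyed on the PK, then read the hashed file back out to the OCI stage.

   Sequential File --> Transformer --> Hashed File (key = CUST_ID)
   Hashed File     --> Transformer --> OCI

Because the hashed file is keyed, the last row for each key overwrites the earlier ones, so the load sees each key exactly once. The trade-off is that the duplicates simply disappear rather than landing in a reject file.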

Out of curiosity, why the restriction?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sanjeev_sithara
Participant
Posts: 15
Joined: Wed May 26, 2004 6:30 am

Post by sanjeev_sithara »

I think your best bet would be either:

1) Use a hashed file stage, or
2) Sort the records in Unix using the sort command (as a before-job subroutine) and then, using stage variables in the transformer, reject the duplicate records. The addition of a Sorter stage would surely slow down your job.
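A sketch of option 2's before-job step (the path and delimiter are only examples): set the job's before-job subroutine to ExecSH with an input value along the lines of

   sort -t'|' -k1,1 /data/src/input.txt -o /data/src/input_sorted.txt

then point the Sequential File stage at the sorted file. Plain sort (without -u) keeps the duplicates in the stream, so the stage-variable compare against the previous key can still route them to a reject file.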

HTH
ajith
Participant
Posts: 86
Joined: Thu Nov 10, 2005 11:10 pm

Post by ajith »

sanjeev_sithara wrote:I think your best bet would be either:

1) Use a hashed file stage, or
2) Sort the records in Unix using the sort command (as a before-job subroutine) and then, using stage variables in the transformer, reject the duplicate records. The addition of a Sorter stage would surely slow down your job.

HTH
Yeah, that's the way we achieved it and it is working fine.
mauherga
Participant
Posts: 31
Joined: Thu Mar 02, 2006 9:47 am
Location: Mexico

Post by mauherga »

Hi,

If you use the sort command in UNIX, you can remove duplicate records with the -u parameter, and in that way you avoid the routine.
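For example (the delimiter and key position are illustrative), either as a before-job command or in the Sequential File stage's Filter Command, where the stage pipes the file through the command:

   sort -t'|' -u -k1,1

With -u plus -k1,1, uniqueness is judged on the first field only, so one row per key survives. The discarded duplicates are not captured anywhere, though, so this only suits the case where a reject file isn't actually needed.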
maurik