rejecting duplicate rows

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rafidwh
Participant
Posts: 179
Joined: Mon Oct 10, 2005 11:30 pm

rejecting duplicate rows

Post by rafidwh »

Hi,
I have a sequential file as the source and Oracle OCI as the target, and it is just a one-to-one mapping. Now I need to reject the duplicates, and I don't want to use a hashed file or an Aggregator.

My doubt is this: if I use another flat file as the reject target, and duplicate records exist that Oracle will not accept, will those records automatically go to the reject flat file, or does a condition need to be given? If a condition is required, please provide it.

Awaiting your suggestions.

Thanks in advance
Saik
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

If you don't want to use a hashed file or aggregator, then I hope your source file is pre-sorted. :wink:

Here are some ideas:
- If your source is sorted, you can use Stage Variables in a transformer to save the PK of the previous row and check it against the PK of the current row. If it is the same, reject it (see the sketch after this list).
- If your source is not sorted, I think there is a plug-in Sorter stage. I don't know how efficient it is. Otherwise, you could use an OS sort command (e.g. Unix sort) and place it in the Filter Command of your SEQ file stage.
- Without sorting, you can implement a hashed file using a library routine rather than having hashed file stages cluttering your job design. The benefits of this are very arguable - the hashed file stages may handle large volumes more efficiently. Despite this, I developed such a routine; the source code follows the sketch below.
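A minimal sketch of the sorted-input approach (the link and column names are only placeholders, and it assumes the key is never an empty string) - two Stage Variables in this order:

   Variable     Default   Derivation
   ----------   -------   ----------------------------------------
   svIsDup      0         If IN.KEY_COL = svPrevKey Then 1 Else 0
   svPrevKey    ''        IN.KEY_COL

   Constraint on the OCI load link:     svIsDup = 0
   Constraint on the reject file link:  svIsDup = 1

For the unsorted case, the Filter Command idea amounts to putting something like sort -t'|' -k1,1 in the SEQ stage's Filter Command, so the stage reads a sorted stream before the transformer sees it.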

FUNCTION IsUnique(pTEMPDIR, pHashFile, pInitialize, pVal)

Code: Select all

      DEFFUN NVL(A,B) calling "DSU.NVL"
      DEFFUN IsEquivalent(A,B) calling "DSU.IsEquivalent"

      Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle


      EQUATE RoutineName TO 'IsUnique'

      Ans = 0
      HashPath = pTEMPDIR : "/" : pHashFile



      * If the Initialize parameter is set, remove and rebuild the hash file
      If pInitialize Then
         Initialized = 0

         * The Unix command does 3 things (and fails immediately if any of them fail):
         * : Go to the TEMP Directory
         * : Remove the existing hash file if it exists
         * : Create a new hash file with mkdbfile
         * I had to use this method, because it would fail with an "unable to create ..." error if I created it with
         * the full path name from another directory. Don't know why.

         StrCommand = "cd " : pTEMPDIR : " && rm -rf " : pHashFile : " && $DSHOME/bin/mkdbfile " : pHashFile : " 30 1 4 20 50 80 1628"

         Call DSExecute("UNIX", StrCommand , OutPut, RetCode)
         if RetCode <> 0 then
            Call DSLogWarn("Unix error (" : RetCode : ") in ": StrCommand : " ::::: " : OutPut, "IsUnique")
         end

         * Attempt to open the file.
         Openpath HashPath TO UniqFileHandle
         Then
            Initialized = 1
            UniqFileName = HashPath
         End
         Else
            Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
            Ans = -1
         End
      End


      If Initialized Then

         * If the currently open file is a different file, then open the correct file

         If Not(IsEquivalent(UniqFileName, HashPath)) Then
            Openpath HashPath TO UniqFileHandle
            Then
               Initialized = 1
               UniqFileName = HashPath
            End
            Else
               Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
               Ans = -1
            End
         End




         If Ans = 0 Then

            * NULL values are always unique
            If IsNull(pVal) Then
               Ans = 1
            End
            Else
               * Check if Val already exists
               Read ValExists From UniqFileHandle, pVal
               Then
                  * Return FALSE if exists
                  Ans = 0
               End
               Else
                  * Return TRUE and add to file if not exists
                  Write pVal On UniqFileHandle, pVal
                  Then
                     Ans = 1
                  End
                  Else
                     Call DSLogWarn("Unable to write hash file " : NVL(pHashFile,"<NULL>"), "IsUnique")
                     Ans = -1
                  End
               End
            End
         End
      End

Use this routine to ensure that all values of a given column in a transformation are unique.

How to use it:

Create two Stage Variables in a Transformer (in this order):

   Variable      Default   Expression
   -----------   -------   -------------------------------------------------------------
   XXXIsUnique             IsUnique(#$TEMPDIR#, hashfilename, Initialize, LK_Link.Column)
   Initialize    1         0

Since "Initialize" is below "XXXIsUnique" in the Stage Variables, the first time "XXXIsUnique" is evaluated, "Initialize" will still be 1 (its default).
After the first row, "Initialize" is set to 0 and is therefore 0 for each subsequent call.
In this way, the hashed file in IsUnique() is initialized only once.

IsUnique() returns 1 if the column value has not yet been seen, 0 if it has already been seen, or -1 if the hashed file could not be opened or written.
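One way to wire this up in the Transformer (the link names are only examples):

   Constraint on the Oracle load link:       XXXIsUnique = 1
   Constraint on the reject flat file link:  XXXIsUnique <> 1   (also catches the -1 error value)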
Ross Leishman
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Or you could simplify things tremendously by 'cluttering up your job design' with a hashed file between the sequential and OCI stages. :wink:
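Spelled out, the idea is roughly this (the key column name is just an example): write the rows through a Transformer into a hashed file keyed on the PK, then read the hashed file back out to the OCI stage.

   Sequential File --> Transformer --> Hashed File (key = CUST_ID)
   Hashed File     --> Transformer --> OCI

Because the hashed file is keyed, the last row for each key overwrites the earlier ones, so the load sees each key exactly once. The trade-off is that the duplicates simply disappear rather than landing in a reject file.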

Out of curiosity, why the restriction?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sanjeev_sithara
Participant
Posts: 15
Joined: Wed May 26, 2004 6:30 am

Post by sanjeev_sithara »

I think your best bet would be either:

1) Use a hashed file stage, or
2) Sort the records in Unix using the sort command (as a before-job subroutine) and then, using stage variables in the transformer, reject the duplicate records. The addition of a Sorter stage would surely slow down your job.
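A sketch of option 2's before-job step (the path and delimiter are only examples): set the job's before-job subroutine to ExecSH with an input value along the lines of

   sort -t'|' -k1,1 /data/src/input.txt -o /data/src/input_sorted.txt

then point the Sequential File stage at the sorted file. Plain sort (without -u) keeps the duplicates in the stream, so the stage-variable compare against the previous key can still route them to a reject file.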

HTH
ajith
Participant
Posts: 86
Joined: Thu Nov 10, 2005 11:10 pm

Post by ajith »

sanjeev_sithara wrote:I think your best bet would be either:

1) Use a hashed file stage, or
2) Sort the records in Unix using the sort command (as a before-job subroutine) and then, using stage variables in the transformer, reject the duplicate records. The addition of a Sorter stage would surely slow down your job.

HTH
Yeah, that's the way we achieved it and it is working fine.
mauherga
Participant
Posts: 31
Joined: Thu Mar 02, 2006 9:47 am
Location: Mexico

Post by mauherga »

Hi,

If you use the sort command in UNIX, you can remove duplicate records with the -u parameter, and in that way you avoid the routine.
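For example (the delimiter and key position are illustrative), either as a before-job command or in the Sequential File stage's Filter Command, where the stage pipes the file through the command:

   sort -t'|' -u -k1,1

With -u plus -k1,1, uniqueness is judged on the first field only, so one row per key survives. The discarded duplicates are not captured anywhere, though, so this only suits the case where a reject file isn't actually needed.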
maurik