Duplicates/Sorting question

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Post by Emilio »

Ross,
Thank you for your code!!......This sounds what I need.....!! :D
:oops: Sorry for taking a long time to contact you but I was directed into another project and just now I am able to work on this one. I have a couple questions for your, pardon for my ignorance but I am new with Ascential and my questions might be very, very, basic. :?:
1) The Unix command that needs to be translated to dos - Can you describe what that command(s) do or intends to do?
2) The temp hashfiles that are created, if I have 10 columns in my file to delete duplicate values, does it create 10 hashfiles? and are the hashfiles deleted after the program is finished?
Thanks,
Emilio :)


rleishman wrote:Emilio,

I can't believe the most interesting topic of the week was posted when I was asleep. Well now you Yanks are going home to bed and it's Australia's turn...

Here's my understanding of your requirement: given an input file with n columns, you want to produce an output file where no single value of any of the n columns is repeated.
What is not clear to me is whether the number of output rows is important. For example, would you accept the following as an alternative output for your example above?

Code: Select all

NONE, 100, 
, 200, 
,,AUCTION 
If so, then I think I have a solution. I have attached a routine below that I used for a different problem; I just wanted a generic method of removing duplicates. Here's what you do:
- Copy the code below and create a DS Routine called IsUnique with 4 parameters: TEMPDIR, HashFile, Initialize, and Val
- In your DS Job, create a stage variable for each output column with expressions of the form:

Code: Select all

If IsUnique("X:\xxx\yyy\temp", DSJOB : "<col_name>", Initialize, In_link.col) Then 
    In_link.col 
Else 
    @NULL
- Create another stage variable called Initialize BELOW the other stage variables. Set the default value to 1, and the expression to 0. This means that it will be 1 for the first row processed (causing the IsUnique hash file mechanism to be initialized) and then 0 for each subsequent row.
- Map each of the stage variables to their respective columns.

How does it work? For each column, a different hash file is created in a temp directory of your choosing. You pass a value into the routine (and tell it which hash file to look in) and it tells you whether it has seen the value before (success) or not (failure). If the routine returns success, then you send the column value to the output link, otherwise send @NULL.

A couple of notes:
- I wrote this for Unix only. There is a Unix command I have highlighted with stars that you will have to translate into DOS. It is basically this:

Code: Select all

CD <TEMPDIR>
DEL <TEMPDIR>\<HashFile>
%DSHOME%\bin\mkdbfile <HashFile> 30 1 4 20 50 80 1628
although in Unix I separated the commands in such a way that if one failed the next would not run.
- I'm a UV BASIC beginner. Hopefully some bigger brains than mine will critique this routine (nicely of course :) )so that I can improve it - listen to them. :!:
- I originally designed it without the ability handle NULLs in the Val param. I made a change in this version that should work but didn't test it.
- I had to remove some references to other custom routines in our library. It should still compile, but...

Good luck.

Code: Select all


      Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle


      EQUATE RoutineName TO 'IsUnique'

      Ans = 0
      HashPath = TEMPDIR : "" : HashFile


      * If the Initialize parameter is set, remove and rebuild the hash file
      If Initialize Then

         Initialized = 0

         * The Unix command does 3 things (and fails immediately if any of them fail):
         * : Go to the TEMP Directory
         * : Remove the existing hash file if it exists
         * : Create a new hash file with mkdbfile
         * I had to use this method, because it would fail with "unable to create ..." error if I created it with
         * full path name from another directory. Dont know why.

****************************
         StrCommand = "cd " : TEMPDIR : " && rm -rf " : HashFile : " && $DSHOME/bin/mkdbfile " : HashFile : " 30 1 4 20 50 80 1628"
****************************

         Call DSExecute("UNIX", StrCommand , OutPut, RetCode)
         if RetCode <> 0 then
            Call DSLogWarn("Unix error (" : RetCode : ") in ": StrCommand : " ::::: " : OutPut, "IsUnique")
         end

         * Attempt to open the file.
         Openpath HashPath TO UniqFileHandle
         Then
            Initialized = 1
            UniqFileName = HashPath
         End
         Else
            Call DSLogWarn("Unable to open hash file " : HashPath, "IsUnique")
            Ans = -1
         End
      End


      If Initialized Then

         * If the currently open file is a different file, then open the correct file

         If UniqFileName <> HashPath Then
            Openpath HashPath TO UniqFileHandle
            Then
               Initialized = 1
               UniqFileName = HashPath
            End
            Else
               Call DSLogWarn("Unable to open hash file " : HashPath, "IsUnique")
               Ans = -1
            End
         End



         If Ans = 0 And Not(IsNull(Val)) Then

            * Check if Val already exists
            Readu ValExists From UniqFileHandle, Val
            Then
               * Return FALSE if exists
               Ans = 0
            End
            Else
               * Return TRUE and add to file if not exists
               Writeu Val On UniqFileHandle, Val
               Then
                  Ans = 1
               End
               Else
                  Call DSLogWarn("Unable to write hash file " : HashFile, "IsUnique")
                  Ans = -1
               End
            End
         End
      End
[/code]
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The UNIX command in Ross's code will probably work in DOS also. At worst you may need to change the pathname delimters from "/" to "\".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rleishman
Premium Member
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

Since my earlier post, I have made some changes to the code of this routine. The new code is below.

The unix command goes to the directory where the hash files are stored, removes the hash file if it exists (ie. from a previous run) and re-creates it. It sould be pretty simple to do in DOS, but you might need to put them all together in a .BAT file.
A word of warning - this code comes with no guarantees. If you cannot understand it, don't use it. Cos if it goes wrong, you're on your own.

If you have 10 column that you want to capture duplicates on INDIVIDUALLY, then you will have 10 hash files. If you want to detect a single duplicate based on the combination of the 10 columns, then you will have to concatenate the 10 columns somehow, but you'll have only 1 hash file.

Good luck.

Code: Select all

      DEFFUN NVL(A,B) calling "DSU.NVL"
      DEFFUN IsEquivalent(A,B) calling "DSU.IsEquivalent"

      Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle


      EQUATE RoutineName TO 'IsUnique'

      Ans = 0
      HashPath = pTEMPDIR : "/" : pHashFile



      * If the Initialize parameter is set, remove and rebuild the hash file
      If pInitialize Then
         Initialized = 0

         * The Unix command does 3 things (and fails immediately if any of them fail):
         * : Go to the TEMP Directory
         * : Remove the existing hash file if it exists
         * : Create a new hash file with mkdbfile
         * I had to use this method, because it would fail with "unable to create ..." error if I created it with
         * full path name from another directory. Dont know why.

         StrCommand = "cd " : pTEMPDIR : " && rm -rf " : pHashFile : " && $DSHOME/bin/mkdbfile " : pHashFile : " 30 1 4 20 50 80 1628"

         Call DSExecute("UNIX", StrCommand , OutPut, RetCode)
         if RetCode <> 0 then
            Call DSLogWarn("Unix error (" : RetCode : ") in ": StrCommand : " ::::: " : OutPut, "IsUnique")
         end

         * Attempt to open the file.
         Openpath HashPath TO UniqFileHandle
         Then
            Initialized = 1
            UniqFileName = HashPath
         End
         Else
            Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
            Ans = -1
         End
      End


      If Initialized Then

         * If the currently open file is a different file, then open the correct file

         If Not(IsEquivalent(UniqFileName, HashPath)) Then
            Openpath HashPath TO UniqFileHandle
            Then
               Initialized = 1
               UniqFileName = HashPath
            End
            Else
               Call DSLogWarn("Unable to open hash file " : NVL(HashPath,"<NULL>"), "IsUnique")
               Ans = -1
            End
         End




         If Ans = 0 Then

		* NULL values are always unique
            If IsNull(pVal) Then
               Ans = 1
            End
            Else
               * Check if Val already exists
               Read ValExists From UniqFileHandle, pVal
               Then
                  * Return FALSE if exists
                  Ans = 0
               End
               Else
                  * Return TRUE and add to file if not exists
                  Write pVal On UniqFileHandle, pVal
                  Then
                     Ans = 1
                  End
                  Else
                     Call DSLogWarn("Unable to write hash file " : NVL(pHashFile,"<NULL>"), "IsUnique")
                     Ans = -1
                  End
               End
            End
         End
      End
Ross Leishman
Post Reply