Pattern matching

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

in_finity307
Participant
Posts: 20
Joined: Sat Aug 09, 2008 1:53 pm

Post by in_finity307 »

Hi,

I have the following requirement.

There is a table containing some patterns like

_ _ _ A B C
A _ _ * * *
A B C * * *


Now, from the input stream, let's say we get an input field value such as
X Y Z A B C.

We have to do pattern matching for this record using the following rules.

1) We check whether the length of the input string matches the length of any pattern in the patterns table. We select the first pattern in the table whose length matches.
2) Then we check that the corresponding characters in the input string and the pattern string match. Here an '_' or '*' in the pattern string means that any character at the corresponding position in the input string is acceptable.
For example, input string 'X Y Z A B C' matches the first pattern in the patterns table, '_ _ _ A B C'.
3) Once a match is found in the patterns table, we apply the following logic to derive the output field.
a) For every '_' found in the matched pattern, we copy the character found at that position in the input string.
b) For every '*' found in the matched pattern, we output a '*' at that position in the output string.

For example,

1)

Input : - 'X Y Z A B C'
Matched Pattern : - '_ _ _ A B C'
Output : - 'X Y Z A B C'

2)

Input : - 'A Y Z R S T'
Matched Pattern : - 'A _ _ * * *'
Output : - 'A Y Z * * *'

3)

Input : - 'A B C X Y Z'
Matched Pattern : - 'A B C * * *'
Output : - 'A B C * * *'
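For reference, the rules and the three examples above can be sketched outside DataStage in a few lines of Python. This is only an illustration of the logic, not a DataStage implementation. The names `matches`, `derive_output` and `mask` are made up for this sketch, and it assumes that the first pattern in table order that matches both on length and on characters is the one to use.

```python
from typing import Optional, Sequence

# Pattern characters that accept any input character (rule 2).
WILDCARDS = {"_", "*"}

# The sample patterns from the post, in table order.
PATTERNS = ["_ _ _ A B C", "A _ _ * * *", "A B C * * *"]


def matches(value: str, pattern: str) -> bool:
    """Rules 1 and 2: same length, and every non-wildcard position is equal."""
    if len(value) != len(pattern):
        return False
    return all(p in WILDCARDS or p == v for v, p in zip(value, pattern))


def derive_output(value: str, pattern: str) -> str:
    """Rule 3: '*' positions are output as '*', all others copy the input."""
    return "".join("*" if p == "*" else v for v, p in zip(value, pattern))


def mask(value: str, patterns: Sequence[str] = PATTERNS) -> Optional[str]:
    """Apply the first matching pattern (assumed priority: table order)."""
    for pattern in patterns:
        if matches(value, pattern):
            return derive_output(value, pattern)
    return None  # no pattern matched; what to do here is not specified


print(mask("A Y Z R S T"))  # A Y Z * * *
```

Note that with table-order priority, 'A B C X Y Z' is caught by 'A _ _ * * *' before 'A B C * * *' is reached. The output is the same for this sample data, but the priority rule would need to be confirmed.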


The patterns are stored in a patterns table in the database, while the input stream is read from a sequential file.

Could you please suggest the best way to implement this logic in DataStage? Are Routines, SQL Procedures or Unix scripting the best option, or is there some other easy way?

Thanks a lot for your help.
    chulett
    Charter Member
    Posts: 43085
    Joined: Tue Nov 12, 2002 4:34 pm
    Location: Denver, CO

    Post by chulett »

    Do you need to do this in a Parallel job, as you marked the job type, or in a Server job, based on the forum you posted in? For the former, I'll move the post, for the latter I'll... adjust... your job type appropriately.
    -craig

    "You can never have too many knives" -- Logan Nine Fingers
    in_finity307
    Participant
    Posts: 20
    Joined: Sat Aug 09, 2008 1:53 pm

    Post by in_finity307 »

    I am sorry, this is for a parallel job. My mistake, I didn't see the forum name. I don't know how to move it to the parallel forum.

    I would be grateful if you could move it to the right forum.
    chulett
    Charter Member
    Posts: 43085
    Joined: Tue Nov 12, 2002 4:34 pm
    Location: Denver, CO

    Post by chulett »

    No worries, you can't move a post... I, however, can. Welcome to the PX forum. :wink:
    -craig
    in_finity307
    Participant
    Posts: 20
    Joined: Sat Aug 09, 2008 1:53 pm

    Post by in_finity307 »

    Okay. Thanks again :-)

    Could you please suggest a way to implement the above logic?
    chulett
    Charter Member
    Posts: 43085
    Joined: Tue Nov 12, 2002 4:34 pm
    Location: Denver, CO

    Post by chulett »

    If I had one, I'd post it. However, it's been a long day and it's way past my bedtime, so I'll leave it in the hands of folks in other time zones.
    -craig
    Sainath.Srinivasan
    Participant
    Posts: 3337
    Joined: Mon Jan 17, 2005 4:49 am
    Location: United Kingdom

    Post by Sainath.Srinivasan »

    Did you try something like

    SELECT yourData
    FROM yourDataTable, yourPatternTable
    WHERE yourData LIKE yourPattern
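A caveat on the LIKE idea: in SQL, '_' matches exactly one character, but '*' is not a LIKE wildcard, so the '*' positions would have to be translated first. The sketch below tries this with Python's built-in sqlite3 module purely as a stand-in database. The table and column names (`patterns`, `seq`, `input_data`, `value`) and the function `find_matches` are hypothetical, and the `seq` column is an assumed way of recording pattern order.

```python
import sqlite3


def find_matches(values, patterns):
    """Join input values to patterns with LIKE, after mapping '*' to '_'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patterns (seq INTEGER, pattern TEXT)")
    conn.execute("CREATE TABLE input_data (value TEXT)")
    conn.executemany(
        "INSERT INTO patterns VALUES (?, ?)",
        list(enumerate(patterns, start=1)),
    )
    conn.executemany(
        "INSERT INTO input_data VALUES (?)", [(v,) for v in values]
    )
    # '_' already means "any single character" in LIKE; '*' does not,
    # so it is rewritten to '_' for the comparison only.
    # Note: SQLite's LIKE is case-insensitive for ASCII by default,
    # which may differ from the target database.
    rows = conn.execute(
        """
        SELECT d.value, p.pattern
        FROM input_data d
        JOIN patterns p
          ON d.value LIKE replace(p.pattern, '*', '_')
        ORDER BY d.value, p.seq
        """
    ).fetchall()
    conn.close()
    return rows


rows = find_matches(
    ["X Y Z A B C", "A Y Z R S T", "A B C X Y Z"],
    ["_ _ _ A B C", "A _ _ * * *", "A B C * * *"],
)
for value, pattern in rows:
    print(value, "->", pattern)
```

With the sample data, 'A B C X Y Z' comes back twice (once for 'A _ _ * * *' and once for 'A B C * * *'), so a rule for picking one pattern per input row is still needed, and the join only finds the match; it does not build the output string.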
    MT
    Premium Member
    Posts: 198
    Joined: Fri Mar 09, 2007 3:51 am

    Re: Pattern matching

    Post by MT »

    in_finity307 wrote: Could you please suggest the best way to implement this logic in DataStage? Are Routines, SQL Procedures or Unix scripting the best option or is there some other easy way?

    Thanks a lot for your help.

      Hi,

      You are looking for an easy way - well, this depends...

      I think it is important to split your problem into two:
      1. pattern matching
      2. output modification

      Doing both in one step is just complicating things.

      I think you are free to add some more columns to your pattern table,
      so that the real pattern is, for example, ABC at positions 4-6.
      You could even store there the exact substring expression needed to extract the part of the input that is compared to the real pattern (i.e. "ABC").
      You could also add the length of the string, to filter on length as a first step. These things depend very much on the number of rows you are going to process and the number of patterns.

      For the output format, only the "*" characters matter, because they overwrite the input text. You could, for example, easily take a substring of the real text and then just concatenate "***" to it.
      The "_" characters are unimportant and, in my eyes, should not be part of the logic.

      I did not think through the exact details, such as what happens if more than one pattern matches, but I hope this helps you find an implementation.
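A rough sketch of this two-step idea in Python, to make the extra columns concrete. Everything here is hypothetical: the column names (`length`, `literal_start`, `literal_text`, `keep`, `stars`), the function names, and the assumption that the strings are stored without separating spaces and that the '*' positions always sit at the end of the pattern.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PatternRow:
    length: int         # required length of the input string
    literal_start: int  # 1-based position where the "real" pattern begins
    literal_text: str   # the "real" pattern, e.g. "ABC"
    keep: int           # leading input characters copied to the output
    stars: int          # number of '*' appended after the kept part


# Assumed encodings of "___ABC", "A__***" and "ABC***".
ROWS = [
    PatternRow(6, 4, "ABC", 6, 0),
    PatternRow(6, 1, "A", 3, 3),
    PatternRow(6, 1, "ABC", 3, 3),
]


def is_match(value: str, row: PatternRow) -> bool:
    """Step 1: filter on length, then compare one substring."""
    start = row.literal_start - 1
    end = start + len(row.literal_text)
    return len(value) == row.length and value[start:end] == row.literal_text


def build_output(value: str, row: PatternRow) -> str:
    """Step 2: substring of the real text plus the trailing '*' run."""
    return value[: row.keep] + "*" * row.stars


def process(value: str, rows: Sequence[PatternRow] = ROWS) -> Optional[str]:
    for row in rows:  # first matching row wins (an assumption)
        if is_match(value, row):
            return build_output(value, row)
    return None


print(process("AYZRST"))  # AYZ***
```

The '_' positions never appear in the logic, as suggested: only the literal substring and the number of trailing '*' are needed.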
      Last edited by MT on Thu Jul 08, 2010 2:35 am, edited 1 time in total.
      ray.wurlod
      Participant
      Posts: 54607
      Joined: Wed Oct 23, 2002 10:52 pm
      Location: Sydney, Australia

      Post by ray.wurlod »

      Let's nail that last point before seeking a solution. Why does "A B C D E F" match "A B C * * *" and not match "A _ _ * * *"?

      Assuming there is a small number of patterns, I'd be using stage variables to determine the matches (one variable for each pattern with a six-way AND expression to test the individual elements in each case). Another set of stage variables could accomplish the substitutions.
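The stage-variable approach might look roughly like the following if written out in Python instead of Transformer derivations. This is a sketch only: the names `sv_match1` to `sv_match3` are made-up stand-ins for stage variables, the input is assumed to be six space-separated elements, and the order in which the flags are tested is an assumption, since the priority between overlapping patterns is exactly the open question above. Wildcard positions would always evaluate to true in the six-way AND, so they are left out.

```python
from typing import Optional


def derive(value: str) -> Optional[str]:
    e = value.split(" ")
    if len(e) != 6:
        return None  # length check fails for every sample pattern

    # One flag per pattern; only the literal positions are tested.
    sv_match1 = e[3] == "A" and e[4] == "B" and e[5] == "C"  # _ _ _ A B C
    sv_match2 = e[0] == "A"                                  # A _ _ * * *
    sv_match3 = e[0] == "A" and e[1] == "B" and e[2] == "C"  # A B C * * *

    # A second set of expressions performs the substitutions.
    if sv_match1:
        out = e                        # nothing is masked
    elif sv_match2 or sv_match3:
        out = e[:3] + ["*", "*", "*"]  # last three elements are masked
    else:
        return None
    return " ".join(out)


print(derive("A B C X Y Z"))  # A B C * * *
```

For the sample patterns, the second and third flags lead to the same output, which is why the overlap does not show up in the posted examples.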
      IBM Software Services Group
      Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.