Pattern matching

Posted: Tue Jul 06, 2010 9:44 pm
by in_finity307
Hi,

I have the following requirement.

There is a table containing some patterns like

_ _ _ A B C
A _ _ * * *
A B C * * *


Now, from the input stream, let's say we get an input field value of
X Y Z A B C.

We have to do pattern matching for this record using the following rules.

1) We check whether the length of the input string matches the length of any pattern in the patterns table. We select the first pattern in the table whose length matches.
2) Then we check that the corresponding characters in the input string and the pattern string match. A '_' or '*' in the pattern means that any character is acceptable at the corresponding position in the input string.
For example, the input string 'X Y Z A B C' matches the first pattern in the patterns table, '_ _ _ A B C'.
3) Once a match is found in the patterns table, we apply the following logic to derive the output field.
a) For every '_' in the matched pattern, we copy the character found at that position in the input string.
b) For every '*' in the matched pattern, we output a '*' at that position in the output string.

For example,

1)

Input : - 'X Y Z A B C'
Matched Pattern : - '_ _ _ A B C'
Output : - 'X Y Z A B C'

2)

Input : - 'A Y Z R S T'
Matched Pattern : - 'A _ _ * * *'
Output : - 'A Y Z * * *'

3)

Input : - 'A B C X Y Z'
Matched Pattern : - 'A B C * * *'
Output : - 'A B C * * *'
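
The rules and the three examples above can be sketched in a few lines of Python. This is only an illustration of the logic, not a DataStage implementation. It assumes that patterns and input use the same single-space layout, that "first pattern" means the first pattern in table order that passes both the length check and the character check, and that literal positions are copied from the input. The function names are hypothetical.

```python
from typing import Optional, Sequence

WILDCARDS = {"_", "*"}  # both accept any input character


def matches(pattern: str, value: str) -> bool:
    """Rules 1 and 2: equal length, and every non-wildcard position is equal."""
    if len(pattern) != len(value):
        return False
    return all(p in WILDCARDS or p == v for p, v in zip(pattern, value))


def derive_output(pattern: str, value: str) -> str:
    """Rule 3: '*' positions are masked, all other positions come from the input."""
    return "".join("*" if p == "*" else v for p, v in zip(pattern, value))


def transform(value: str, patterns: Sequence[str]) -> Optional[str]:
    """Return the output for the first matching pattern, or None if none match."""
    for pattern in patterns:
        if matches(pattern, value):
            return derive_output(pattern, value)
    return None


PATTERNS = ["_ _ _ A B C", "A _ _ * * *", "A B C * * *"]
print(transform("A Y Z R S T", PATTERNS))
```

Note that with this table order the input 'A B C X Y Z' is caught by 'A _ _ * * *' before 'A B C * * *' is reached. The output is the same here, but the matched pattern differs from example 3.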


The patterns are stored in a patterns table in the database, while the input stream is read from a sequential file.

Could you please suggest the best way to implement this logic in DataStage? Are Routines, SQL Procedures or Unix scripting the best option, or is there some other easy way?

Thanks a lot for your help.

    Posted: Tue Jul 06, 2010 10:27 pm
    by chulett
    Do you need to do this in a Parallel job, as you marked the job type, or in a Server job, based on the forum you posted in? For the former, I'll move the post; for the latter I'll... adjust... your job type appropriately.

    Posted: Tue Jul 06, 2010 10:32 pm
    by in_finity307
    I am sorry, this is for a parallel job. My mistake, I didn't see the forum name. I don't know how to move it to the parallel forum.

    Thanks if you could move it to the right forum.

    Posted: Tue Jul 06, 2010 10:38 pm
    by chulett
    No worries, you can't move a post... I, however, can. Welcome to the PX forum. :wink:

    Posted: Tue Jul 06, 2010 10:41 pm
    by in_finity307
    Okay. Thanks again :-)

    Could you please suggest a way to implement the above logic?

    Posted: Tue Jul 06, 2010 11:01 pm
    by chulett
    If I had one, I'd post it. However, it's been a long day and it's way past my bedtime, so I'll leave it in the hands of folks in other timezones.

    Posted: Wed Jul 07, 2010 3:24 am
    by Sainath.Srinivasan
    Did you try something like

    SELECT yourData
    FROM yourDataTable, yourPatternTable
    WHERE yourData LIKE yourPattern
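
One caveat with the LIKE approach: in SQL, '_' already matches exactly one character, but '*' has no special meaning, so a stored pattern such as 'A _ _ * * *' would need its '*' translated first. The sketch below tries the idea against an in-memory SQLite database from Python. The table and column names (`patterns`, `seq`, `like_pattern`) are hypothetical, and the `seq` column is an assumption added to make "first pattern" well defined. SQLite's LIKE is case-insensitive for ASCII by default, and a '%' in a stored pattern would also act as a wildcard, so behaviour in the real database may differ.

```python
import sqlite3


def to_like(pattern: str) -> str:
    """Translate '*' to '_' so that both wildcards mean 'any one character' in LIKE."""
    return pattern.replace("*", "_")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patterns (seq INTEGER PRIMARY KEY, pattern TEXT, like_pattern TEXT)"
)
for seq, pat in enumerate(["_ _ _ A B C", "A _ _ * * *", "A B C * * *"], start=1):
    conn.execute("INSERT INTO patterns VALUES (?, ?, ?)", (seq, pat, to_like(pat)))


def first_match(value: str):
    """Return the first stored pattern (by seq) that the value matches, else None."""
    row = conn.execute(
        "SELECT pattern FROM patterns WHERE ? LIKE like_pattern ORDER BY seq LIMIT 1",
        (value,),
    ).fetchone()
    return row[0] if row else None


print(first_match("A Y Z R S T"))
```

Because the translated pattern contains only '_' wildcards, LIKE also enforces the length rule. The output substitution still has to happen afterwards, using the original pattern.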

    Re: Pattern matching

    Posted: Thu Jul 08, 2010 12:32 am
    by MT
    in_finity307 wrote: Could you please suggest the best way to implement this logic in DataStage? Are Routines, SQL Procedures or Unix scripting the best option, or is there some other easy way?

    Thanks a lot for your help.

      Hi,

      You are looking for an easy way. Well, that depends...

      I think it is important to split your problem into two:
      1. pattern matching
      2. output modification

      Doing both in one step is just complicating things.

      I think you are free to add some more columns to your pattern table -
      so the real pattern is ABC on position 4-6 for example.
      So you could even specify the exact substring statement there which you need to compare to the real pattern (i.e. "ABC").
      You could also add the length of the string, so that you can filter on length as a first step. These things depend very much on the number of rows you are going to process and the number of patterns.

      For the output format only the "*" are interesting because they overwrite the other string. You could, for example, easily take a substring of the real text and then just concatenate the "***" to it.
      The "_" are unimportant and should not be part of the logic in my eyes.

      I have not thought through the exact details, such as what happens if more than one pattern matches, but I hope this helps you find an implementation.
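
As a rough illustration of the extra-columns idea, here is a Python sketch that precomputes a length, a literal with its position, and the number of leading characters to keep. It assumes patterns are stored in compact form without spaces (for example "___ABC"), that each pattern has a single contiguous literal run, and that any "*" appear only at the end. The `PatternRow` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PatternRow:
    length: int     # required length of the input value
    lit_start: int  # 0-based offset where the literal begins
    literal: str    # literal text to compare, e.g. "ABC"
    keep: int       # leading characters copied to the output; the rest become '*'


def to_row(pattern: str) -> PatternRow:
    """Precompute the extra columns (one literal run, '*' only at the tail)."""
    literal = pattern.strip("_*")
    lit_start = pattern.index(literal) if literal else 0
    keep = len(pattern.rstrip("*"))
    return PatternRow(len(pattern), lit_start, literal, keep)


def apply_rows(value: str, rows: Sequence[PatternRow]) -> Optional[str]:
    """Step 1: find the first matching row. Step 2: substring plus '*' padding."""
    for row in rows:
        if len(value) != row.length:
            continue  # cheap length filter first
        end = row.lit_start + len(row.literal)
        if value[row.lit_start:end] == row.literal:
            return value[:row.keep] + "*" * (row.length - row.keep)
    return None


ROWS = [to_row(p) for p in ["___ABC", "A__***", "ABC***"]]
print(apply_rows("AYZRST", ROWS))
```

The matching step and the output step only share the chosen row, which is the split suggested above.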

      Posted: Thu Jul 08, 2010 2:21 am
      by ray.wurlod
      Let's nail that last point before seeking a solution. Why does "A B C D E F" match "A B C * * *" and not match "A _ _ * * *"?

      Assuming there is a small number of patterns, I'd be using stage variables to determine the matches (one variable for each pattern with a six-way AND expression to test the individual elements in each case). Another set of stage variables could accomplish the substitutions.
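
To make the ambiguity concrete, the Python sketch below evaluates one match flag per pattern, much as one stage variable per pattern would, and lists every pattern that matches. It then shows two possible ways to pick a winner: table order, or the pattern with the most literal characters. The second rule is purely an assumption for illustration; the thread does not say which rule is wanted. All names are hypothetical.

```python
from typing import List, Sequence

WILDCARDS = {"_", "*"}


def is_match(pattern: str, value: str) -> bool:
    """Equivalent of the per-pattern AND expression over each position."""
    return len(pattern) == len(value) and all(
        p in WILDCARDS or p == v for p, v in zip(pattern, value)
    )


def all_matches(value: str, patterns: Sequence[str]) -> List[str]:
    """Every pattern that matches, in table order."""
    return [p for p in patterns if is_match(p, value)]


def specificity(pattern: str) -> int:
    """Number of literal (non-wildcard, non-space) characters in the pattern."""
    return sum(ch not in WILDCARDS and ch != " " for ch in pattern)


PATTERNS = ["_ _ _ A B C", "A _ _ * * *", "A B C * * *"]
hits = all_matches("A B C D E F", PATTERNS)
print(hits)                        # both 'A _ _ * * *' and 'A B C * * *' match
print(hits[0])                     # winner by table order
print(max(hits, key=specificity))  # winner by most literal characters
```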