Check for Substring in a string - substring in reference fil

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
devidotcom
Participant
Posts: 247
Joined: Thu Apr 27, 2006 6:38 am
Location: Hyderabad

Check for Substring in a string - substring in reference fil

Post by devidotcom »

Hi All,

I have a requirement to count the number of occurance of a substring in a string. I am aware of the count function which does this for us.

The issue is that the substring values are 8 and I have to use this function 8 times... with OR condition between them. This is fine also.

But these 8 substring values are in a sequential file. I have to access these values from there. We may have a requirement of these values extending to more than 8 in future.

The input file is a dataset.

Any help on this.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The best function is Index(). Its task is to search for (the nth occurrence of) a substring within a string.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
devidotcom
Participant
Posts: 247
Joined: Thu Apr 27, 2006 6:38 am
Location: Hyderabad

Post by devidotcom »

Thanks Ray.

I am not sure if you understand the question:

Reference file contains: ACCT,ACCOUNT,TRUST

Input records are
BANK ACCOUNT
CREDIT ACCT
TRUST ACCT
HAPPY WORLD


Now we need to check the occurance of the words in the reference to the input file...

Output
Column name, Occurance
BANK ACCOUNT, 1
CREDIT ACCT,1
TRUST ACCT,1
HAPPY WORLD,0
How to implement the job by reading the reference file contents.
First issue is how to read the file and use the Index() function if I have to use it...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Code: Select all

Index(RefLink.ColName, "TRUST", 1) > 0 Or Index(RefLink.ColName, "ACCOUNT", 1) > 0 Or Index(RefLink.ColName, "ACCT", 1) > 0 
Put the most likely one first and the least likely one last, for maximum efficiency.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dsusr
Premium Member
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

Bit of crude logic on top of my mind.....

1) Just create your reference file as each word in a separate row with same key value asume it to be 1 so that your file wud look like:-

1, ACCT
1, ACCOUNTANT
1, TRUST

Even if you have a file with single row you can do such kind of transformation in the job as well.

2) In the input link add the field with key value 1 so that you can join with each record of refence file using this key value

3) After the join you will get the folowing output:-

key, Input Field, Reference field
1, BANK ACCOUNT, ACCT
1, BANK ACCOUNT, ACCOUNTANT
1, BANK ACCOUNT, TRUST
1, CREDIT ACCT, ACCT
1, CREDIT ACCT, ACCOUNTANT
and so on

3) Just use a tranformer to count the staring in Reference field in the Input Field data using Count() function.

4) Put an aggregator after the trasnsformer to group the data on Key field and Input field data and sum the count values.

Although it seems a bit crude logic as the number of records would become very large at the input to the transformer.

In this logic you can in future add all the values in refernce files.

Will try to think of some other logic as well since I dont have any work to do at my client location ;)

--------
dsusr
Post Reply