Compare column and give the variation in %

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
suresh_dsx
Participant
Posts: 160
Joined: Tue May 02, 2006 7:49 am

Compare column and give the variation in %

Post by suresh_dsx »

Hi,
i need compare the data and give the output in terms of percentage

Code: Select all

Source data

empid,ename,eaddess,sal
"1000","abcd","road7 stree2","1000"
"2000","dddd","roadno2 streeno12","2000"

Reference data

empid,ename,eaddess,sal
"1000","abcd","road7 stree2","1000"
"2000","dddd","road2 stree12","2000"

Output 

empid,ename,eaddess,sal,percentageofeaddress
"1000","abcd","road7 stree2","1000",100%
"2000","dddd","road2 stree12","2000",80%
Join based on the empid and percentage variation on the eaddress
the percentageofeaddress column should give the difference of the column eaddress.

Based on the forums, i have tried with function compare, but I am unable to get the direct function which give the percentage of variation of the column

Any help greatly appreciated.

Thanks -suresh
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

Would the Levenshtein Distance be what you are after?
https://secure.wikimedia.org/wikipedia/ ... n_distance
The article has source code to calculate it, which you could port to DataStage BASIC.
Phil Hibbs | Capgemini
Technical Consultant
suresh_dsx
Participant
Posts: 160
Joined: Tue May 02, 2006 7:49 am

Post by suresh_dsx »

Hi,

instead of writing a code any direct function is avaliable in the transformer stage. it will be easy to manage.

Searched in the forums did not get the relevent information.
Any help greatly appriciated.

Thanks -Suresh
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Re: Compare column and give the variation in %

Post by pandeesh »

the percentageofeaddress column should give the difference of the column eaddress.
i am curious how the difference between "roadno2" and "road2" has 80% variation? can you explain in detail?
pandeeswaran
vishal_rastogi
Participant
Posts: 47
Joined: Thu Dec 09, 2010 4:37 am

Post by vishal_rastogi »

please elaborate little more how you are calculating the percentage in output?
for first row it is 100%
in second row it is 80 %(the keys are matching but values are not matching)
Vish
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

Yes, I'd have thought that "road2 stree12" is 72% the same as "roadno2 streeno12", and the other way around, to go from "roadno2 streeno12" to "road2 stree12" you need to change 22% of the characters, so that would make 78%.

I'm fairly sure that there is no built-in function for calculating this difference. You might be able to calculate it in a Transformer Loop, but it would be tricky if at all possible.
Phil Hibbs | Capgemini
Technical Consultant
Post Reply