Page 1 of 1

Compare column and give the variation in %

Posted: Wed Jul 13, 2011 3:34 am
by suresh_dsx
Hi,
i need compare the data and give the output in terms of percentage

Code: Select all

Source data

empid,ename,eaddess,sal
"1000","abcd","road7 stree2","1000"
"2000","dddd","roadno2 streeno12","2000"

Reference data

empid,ename,eaddess,sal
"1000","abcd","road7 stree2","1000"
"2000","dddd","road2 stree12","2000"

Output 

empid,ename,eaddess,sal,percentageofeaddress
"1000","abcd","road7 stree2","1000",100%
"2000","dddd","road2 stree12","2000",80%
Join based on the empid and percentage variation on the eaddress
the percentageofeaddress column should give the difference of the column eaddress.

Based on the forums, i have tried with function compare, but I am unable to get the direct function which give the percentage of variation of the column

Any help greatly appreciated.

Thanks -suresh

Posted: Wed Jul 13, 2011 6:04 am
by PhilHibbs
Would the Levenshtein Distance be what you are after?
https://secure.wikimedia.org/wikipedia/ ... n_distance
The article has source code to calculate it, which you could port to DataStage BASIC.

Posted: Thu Jul 14, 2011 12:26 am
by suresh_dsx
Hi,

instead of writing a code any direct function is avaliable in the transformer stage. it will be easy to manage.

Searched in the forums did not get the relevent information.
Any help greatly appriciated.

Thanks -Suresh

Re: Compare column and give the variation in %

Posted: Thu Jul 14, 2011 2:18 am
by pandeesh
the percentageofeaddress column should give the difference of the column eaddress.
i am curious how the difference between "roadno2" and "road2" has 80% variation? can you explain in detail?

Posted: Thu Jul 14, 2011 3:17 am
by vishal_rastogi
please elaborate little more how you are calculating the percentage in output?
for first row it is 100%
in second row it is 80 %(the keys are matching but values are not matching)

Posted: Thu Jul 14, 2011 5:30 am
by PhilHibbs
Yes, I'd have thought that "road2 stree12" is 72% the same as "roadno2 streeno12", and the other way around, to go from "roadno2 streeno12" to "road2 stree12" you need to change 22% of the characters, so that would make 78%.

I'm fairly sure that there is no built-in function for calculating this difference. You might be able to calculate it in a Transformer Loop, but it would be tricky if at all possible.