Page 1 of 1

calculating MEDIAN (instead of AVERAGE)

Posted: Wed Sep 15, 2004 4:26 am
by goethami
Hi,

I do have a table with 30 key indicators for all companies in Belgium for 10 years. I also included in the table their community.

Now, I want to calculate per year per community the median for those 30 key indicators. So, I need to do a "group by" by community and "group by" by year. But I do not have a function for the MEDIAN (only the AVERAGE, which is not statisticaly correct in my case).

Please can someone help me ? Probably a SDK-routine, ... ?

Posted: Wed Sep 15, 2004 4:48 am
by HSBCdev
I can imagine some row processing function to read in relevant records and work out what is the middle record having put the results into a temporary file? But this sounds like overkill. There is probably a better way.

Posted: Wed Sep 15, 2004 4:15 pm
by ray.wurlod
If you think about it, you cannot calculate median without sorting the data set. You also need the total row count (let's call it N) so that you can determine the (N/2)th position. Slight variations depend on whether N is odd or even, but you can use Int(N/2+1) with no loss of generality.

A DataStage solution would be to sort the column into - for example - a text file, then read from this through a Transformer stage with its constraint expression set to @INROWNUM = svMiddleRowNumber, where the stage variable is preset to Int(RowCount/2+1) and remains unchanging.

If you're in the same job, RowCount can be obtained using DSGetStageInfo or DSGetLinkInfo from a stage or link in the earlier part of the job, perhaps the link that wrote to the text file.

Posted: Wed Sep 15, 2004 4:34 pm
by mhester
Ray has defined one method for determining median when the count of elements/rows is odd. If the number of elements/rows is even then you "should" find the middle pair of numbers, add them together and divide by two and this value then becomes your median. As Ray pointed out - the data must be sorted for this to work correctly.

Regards,

Posted: Wed Sep 15, 2004 8:21 pm
by vmcburney
You need to find multiple row counts, one for each community id, before you can process your community rows. Consider saving row counts off to a hash file.

Do an aggregation on your community value table where you retrieve the community id and a count of rows for that id, write this to a hash file. This hash file contains a lookup for retrieving row counts for each community.

Following Ray and Mike's method you still need to read in your source sorted by community id and value column. In a transformer lookup the row count for that community id. Using a stage variable maintain a counter that counts the number of rows for that community. Use other stage variables to find out when you have reached the median and to calculate what it should be.

The following stage variables are used to calculate medians:

Code: Select all

CommunityCount:  IF Input.Community_Id = SavedCommunity_Id THEN CommunityCount + 1 ELSE 1
SavedCommunityId:  Input.Community_Id
IsEvenMedian:  (CommunityCount-1) * 2 = Lookup.CommunityCount
IsOddMedian:  CommunityCount * 2 = Lookup.CommunityCount + 1
EvenMedianFirstValue: IF CommunityCount  * 2 = Lookup.CommunityCount THEN Input.Value ELSE 0.
EvenMedianValue: if IsEvenMedian THEN (EvenMedian1 + Input.Value)/ 2 ELSE 0
OddMedianValue: if IsOddMedian THEN Input.Value ELSE 0
A constraint ensures rows are only written out if IsEvenMedian or IsOddMedian is equal to TRUE. The two output columns are "Input.CommunityId" and "EvenMedianValue + OddMedianValue" (one of them will be set, the other will be 0).

You will only output one row per community and you will only output a row when the median has been reached and calculated. This determines the median for each community. To get the median for each community for each year then just add year to the hash file aggregation and to the row count lookup and stage variable counting logic.

Code: Select all

CommunityCounter:  IF Input.CommunityId = SavedCommunityId AND Input.Year = SavedYear THEN CommunityCounter + 1 ELSE 1
SavedYear = Input.Year
SavedCommunityId = Input.CommunityId 
etc

Posted: Thu Sep 16, 2004 1:05 am
by ray.wurlod
You also need to make some decision about how to handle NULL if it is possible that one may occur. Basically, you have to omit any NULL from the calculation.

Posted: Sun Jun 17, 2012 11:33 am
by hiral.chauhan
Hello Experts,

I realize this is a very old thread, but I will be really grateful if anyone can help me in some doubts I have.

So far, for Even number of rows, I have the 2 rows which will give me median. But I am not able to use the above stage variables to calculate median because when IsEvenMedian is TRUE at that time EvenMedianFirstValue is 0.

My question is:

Can I calculate EvenMedianFirstValue and EvenMedianValue in one transformer?

I appreciate your valuable input and time in advance.

Thanks,
Hiral