calculating MEDIAN (instead of AVERAGE)

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
goethami
Participant
Posts: 2
Joined: Wed Nov 19, 2003 10:43 am
Location: Belgium
Contact:

calculating MEDIAN (instead of AVERAGE)

Post by goethami »

Hi,

I do have a table with 30 key indicators for all companies in Belgium for 10 years. I also included in the table their community.

Now, I want to calculate per year per community the median for those 30 key indicators. So, I need to do a "group by" by community and "group by" by year. But I do not have a function for the MEDIAN (only the AVERAGE, which is not statisticaly correct in my case).

Please can someone help me ? Probably a SDK-routine, ... ?
HSBCdev
Premium Member
Premium Member
Posts: 141
Joined: Tue Mar 16, 2004 8:22 am
Location: HSBC - UK and India
Contact:

Post by HSBCdev »

I can imagine some row processing function to read in relevant records and work out what is the middle record having put the results into a temporary file? But this sounds like overkill. There is probably a better way.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you think about it, you cannot calculate median without sorting the data set. You also need the total row count (let's call it N) so that you can determine the (N/2)th position. Slight variations depend on whether N is odd or even, but you can use Int(N/2+1) with no loss of generality.

A DataStage solution would be to sort the column into - for example - a text file, then read from this through a Transformer stage with its constraint expression set to @INROWNUM = svMiddleRowNumber, where the stage variable is preset to Int(RowCount/2+1) and remains unchanging.

If you're in the same job, RowCount can be obtained using DSGetStageInfo or DSGetLinkInfo from a stage or link in the earlier part of the job, perhaps the link that wrote to the text file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Ray has defined one method for determining median when the count of elements/rows is odd. If the number of elements/rows is even then you "should" find the middle pair of numbers, add them together and divide by two and this value then becomes your median. As Ray pointed out - the data must be sorted for this to work correctly.

Regards,
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You need to find multiple row counts, one for each community id, before you can process your community rows. Consider saving row counts off to a hash file.

Do an aggregation on your community value table where you retrieve the community id and a count of rows for that id, write this to a hash file. This hash file contains a lookup for retrieving row counts for each community.

Following Ray and Mike's method you still need to read in your source sorted by community id and value column. In a transformer lookup the row count for that community id. Using a stage variable maintain a counter that counts the number of rows for that community. Use other stage variables to find out when you have reached the median and to calculate what it should be.

The following stage variables are used to calculate medians:

Code: Select all

CommunityCount:  IF Input.Community_Id = SavedCommunity_Id THEN CommunityCount + 1 ELSE 1
SavedCommunityId:  Input.Community_Id
IsEvenMedian:  (CommunityCount-1) * 2 = Lookup.CommunityCount
IsOddMedian:  CommunityCount * 2 = Lookup.CommunityCount + 1
EvenMedianFirstValue: IF CommunityCount  * 2 = Lookup.CommunityCount THEN Input.Value ELSE 0.
EvenMedianValue: if IsEvenMedian THEN (EvenMedian1 + Input.Value)/ 2 ELSE 0
OddMedianValue: if IsOddMedian THEN Input.Value ELSE 0
A constraint ensures rows are only written out if IsEvenMedian or IsOddMedian is equal to TRUE. The two output columns are "Input.CommunityId" and "EvenMedianValue + OddMedianValue" (one of them will be set, the other will be 0).

You will only output one row per community and you will only output a row when the median has been reached and calculated. This determines the median for each community. To get the median for each community for each year then just add year to the hash file aggregation and to the row count lookup and stage variable counting logic.

Code: Select all

CommunityCounter:  IF Input.CommunityId = SavedCommunityId AND Input.Year = SavedYear THEN CommunityCounter + 1 ELSE 1
SavedYear = Input.Year
SavedCommunityId = Input.CommunityId 
etc
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You also need to make some decision about how to handle NULL if it is possible that one may occur. Basically, you have to omit any NULL from the calculation.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hiral.chauhan
Premium Member
Premium Member
Posts: 45
Joined: Fri Nov 07, 2008 12:22 pm

Post by hiral.chauhan »

Hello Experts,

I realize this is a very old thread, but I will be really grateful if anyone can help me in some doubts I have.

So far, for Even number of rows, I have the 2 rows which will give me median. But I am not able to use the above stage variables to calculate median because when IsEvenMedian is TRUE at that time EvenMedianFirstValue is 0.

My question is:

Can I calculate EvenMedianFirstValue and EvenMedianValue in one transformer?

I appreciate your valuable input and time in advance.

Thanks,
Hiral
Thanks,
Hiral Chauhan
Post Reply