Page 1 of 3

Calculating top vaues

Posted: Mon Mar 05, 2007 7:53 am
by pravin1581
Hi all,

How can we extract out top 10 % value of any field in DataStage ?

thanx in advance .

Pravin

Posted: Mon Mar 05, 2007 7:55 am
by ray.wurlod
DataStage is not a reporting tool.

You do this in SQL. Hopefully in some variant of SQL that gives you some display functions such as RANK in its query language, and some means to restrict a query based on rank.

Posted: Fri Mar 09, 2007 3:02 pm
by Smeitei
What is your reporting tool . I think it is more of a reporting requirement. OLAP tool like MicroStrategy, BO will handle that very easily or write custom sql in DS stages to get it

Posted: Fri Mar 09, 2007 11:50 pm
by pravin1581
Smeitei wrote:What is your reporting tool . I think it is more of a reporting requirement. OLAP tool like MicroStrategy, BO will handle that very easily or write custom sql in DS stages to get it
I need to extract the top 10% values from a flat file , hence a user defined SQL will be of no help.

Posted: Sat Mar 10, 2007 12:00 am
by kumar_s
You could have used Sample stage if you were in PX. Here you need to count the total number of rows using wc and use split command to read first 10%.

Posted: Sat Mar 10, 2007 7:14 am
by chulett
In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records.

Posted: Sat Mar 10, 2007 8:41 am
by ray.wurlod
Of course you CAN use SQL - via an ODBC driver for text file. Just don't expect it to be fast. But for sufficiently small text files performance may be acceptable.

Posted: Sat Mar 10, 2007 5:09 pm
by DSguru2B
If the top 10% is to be calculated from a bunch of values (column values) in a single row, then you can enhance the functionlity that's been provided in your median query.

Posted: Sun Mar 11, 2007 7:32 pm
by kumar_s
Hi Pravin,
Here the response given are for two different purposes.
Is it the top 10% of rows from the file or top 10% of values from each field.
Pls do clarify.

Posted: Sun Mar 11, 2007 11:22 pm
by narasimha
You can write a script to find the total number of lines in the field in question.
Calculate the 10 % from that field.
You might have to use some sed or awk for this.

Posted: Mon Mar 12, 2007 1:06 am
by pravin1581
kumar_s wrote:Hi Pravin,
Here the response given are for two different purposes.
Is it the top 10% of rows from the file or top 10% of values from each field.
Pls do clarify.
It is top 10% of a particular column in a file.

Posted: Mon Mar 12, 2007 1:08 am
by pravin1581
chulett wrote:In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records. ...
I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .

Posted: Mon Mar 12, 2007 1:12 am
by kumar_s
Bit confused again.
If you file has 100 rows with column a,b,c, do you expect first 10rows or you expect first 10 values of a,b,c. Which in case may not be the same as first 10 rows.

Posted: Mon Mar 12, 2007 6:53 am
by DSguru2B
pravin1581 wrote: It is top 10% of a particular column in a file.
You need to sort these values ascending, line them up in a single row, get the max value (last value), calculate the 10 percent breaking point, filter all the records equal to or less than that breaking point.
If we are talking about Top 10% on values then you have to follow the method I gave above. If its Top 10% positional values, then all you need to do is get the number of records, sort again and get the 10% of the number of records. Get all records before that breaking point.

Posted: Mon Mar 12, 2007 7:10 am
by chulett
pravin1581 wrote:
chulett wrote:In Server get the record count and Sort the data descending on the value in question. Constrain your output to the 'top 10%' of the records. ...
I am facing problems in record count also as DSLINKINFo always returns 0 . How can we restrict the data to top 10 % .
Get the record count before the job runs. For UNIX or MKS 'wc -l' was suggested earlier so I didn't repost that. Pass that number in as a job parameter. After you sort, use a constraint where @OUTROWNUM <= 10% of the input rows.