Page 1 of 1

collecting top 200 upc

Posted: Fri Sep 23, 2005 6:59 am
by pavan_test
Hello Everyone.,


i have a file ( it is a retailer file ). say costco. the file has the information about the various units sold in the different stores belong to costco across USA.
eg: the file is costco

store location: upc sold
denver 500
dallas 10000
houston 100
miami 5000
atlanta 700
california 3500

the size of the file is around 1 million records. in the file i am concerned only about the above 2 columns and i am not concerned about other data.

how do i pick up top 200 upc sold from the file. can anyone please help me in the process.

initially picking up top 200 upc can be hard coded but later on it should be a configurable variable.

thanks
pavan

Posted: Fri Sep 23, 2005 7:04 am
by chulett
Generic solution - sort descending by upc sold. A job parameter for the threshold can then pull the first X records out of the stream afterwards via a constraint as the 'top X'... unless you need to worry about ties. In that case stage variables could be employed to check for changes and increment only when the value changes, tracking the 'rank' accordingly.

selecting top 200 upc

Posted: Fri Sep 23, 2005 12:31 pm
by pavan_test
[quote="chulett"]Generic solution - sort descending by upc sold. A job parameter for the threshold can then pull the first X records out of the stream afterwards via a constraint as the 'top X'... unless you need to worry about ties. In that case stage variables could be employed to check for changes and increment only when the value changes, tracking the 'rank' accordingly.[/quote]

Sorry, my fault, i did not phrase my question correctly ....
In the above example, let's say from denver, i have a lot of upc(items) starting from 500 and are sorted in descending order. The next (venue) is dallas and say again i have some UPC's in descending order, my goal is to pick the top 200 from every venue(dallas, denver, miami.,etc)

for eg
Venue: upc sold
denver 500
denver 300
denver 200
..
..
dallas 10000
dallas 800
dallas 600
dallas 300
..
..
..

houston 100
miami 5000
atlanta 700
california 3500

Similarly for all other venues i have a descending order of UPC's, i cannot use an external filter stage since in the stage if i pass a unix command of head -200 i will get the top 200 from the entire file, but not from every venue. My idea is to get the top 200 items(UPC) from each and every venue and the number of venues in my file will be atleast 5000...
I cannot think of a way of doing it, can you suggest anything...please

Thanks
pawan

Posted: Fri Sep 23, 2005 12:36 pm
by pnchowdary
Hi Pavan,

In that case, you can still use Craig's solution, but instead of just sorting on UPC alone, you need to sort on both venue column and then the UPC column.

Posted: Fri Sep 23, 2005 5:00 pm
by ray.wurlod
Change your expectations. DataStage is NOT a reporting tool.

Most business intelligence tools have this kind of capability out of the box. Some databases have a RANK function.

If your client insists, resist. Tell them that ETL and reporting are different. They are wrong.

Further, as soon as you move to more than one processing node, how will you ensure that your ranks are being calculated correctly? You will have to sort/hash on the grouping columns.