
Finding the value that repeats the maximum in a file

Posted: Tue Aug 23, 2005 5:21 am
by manojmathai
Hi

I have an input file that contains the fields ProcessingMonth, Customer Name, Place, ... I have to set the Actual Processing Month on every record to the Processing Month that is repeated most often in the input file.

For example:
Source
-------
Proc Mth,Name,Place
-------------------------------
200405,aaaa,abc
200406,bbbb,xyz
200405,cccc,sdf
200405,dddd,lkj
200404,eeee,rst
200404,ffff,wer

The Output should be

Act ProcMth,Proc Mth,Name,Place
----------------------------------------------------------------------
200405,200405,aaaa,abc
200405,200406,bbbb,xyz
200405,200405,cccc,sdf
200405,200405,dddd,lkj
200405,200404,eeee,rst
200405,200404,ffff,wer

Can anybody suggest a smart way to do this?

Regards
Manoj

Posted: Tue Aug 23, 2005 5:34 am
by ArndW
Manoj,

you will have to do 2 passes through your source file no matter which solution path you take. If your file doesn't have millions of rows, so that lookup performance isn't of paramount importance, then I would solve this with 3 jobs: a Sequencer and 2 server jobs.

(0) write a sequence to call (a) then (b)

(a) get the most repeated month from the file. You could use an Aggregator stage or Transformer stage variables to get this. Write the single value to a hashed file with the key = 1 and the data = your string.

(b) use this hashed file as a lookup, putting it into memory and always using the constant "1" to read the lookup data value.
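Outside of DataStage, the same two-pass idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are made up for the example, a dict with the single key 1 stands in for the hashed file, and a tie between months goes to the one seen first, since the thread does not say how ties should be handled.

```python
from collections import Counter

# Sample records from the original post (header line left out).
SAMPLE = [
    "200405,aaaa,abc",
    "200406,bbbb,xyz",
    "200405,cccc,sdf",
    "200405,dddd,lkj",
    "200404,eeee,rst",
    "200404,ffff,wer",
]

def most_repeated_month(records):
    """Pass 1: count the first field of every record, return the most frequent."""
    counts = Counter(rec.split(",", 1)[0] for rec in records)
    # Assumption: on a tie, the month encountered first wins.
    return counts.most_common(1)[0][0]

def add_actual_month(records, lookup):
    """Pass 2: prefix every record with the value stored under the constant key 1."""
    return [f"{lookup[1]},{rec}" for rec in records]

# The dict plays the role of the hashed file written by job (a).
lookup = {1: most_repeated_month(SAMPLE)}
output = add_actual_month(SAMPLE, lookup)
for line in output:
    print(line)
```

For the sample data this prints the six records with 200405 as the new first field, matching the desired output in the question.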


If runtime performance is hugely important, then I would modify (a) to write to a sequential file and write my own function to return this value from the file. I would pass that value as a parameter to job (b), which then no longer needs a lookup because it already has the value as a parameter.

Posted: Tue Aug 23, 2005 6:20 am
by Sainath.Srinivasan
As ArndW suggested, you need to identify the most frequently occurring period using a sort, an aggregation, or any external method.

I would suggest passing that value as a parameter to your next job, or using Unix command(s) to prefix it to each record.
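As a rough illustration of the parameter variant, the second step could look something like the Python sketch below. It assumes the most frequent month has already been determined elsewhere and is simply handed in; the name prefix_records is hypothetical and not part of any DataStage or Unix tool.

```python
def prefix_records(lines, actual_month):
    """Yield each record with actual_month prepended as a new first field."""
    for line in lines:
        record = line.rstrip("\n")
        if record:  # skip blank lines
            yield f"{actual_month},{record}"

# The month arrives as a plain parameter, so no lookup is needed here.
records = ["200406,bbbb,xyz\n", "200404,eeee,rst\n"]
result = list(prefix_records(records, "200405"))
for line in result:
    print(line)
```

Because the function works on any iterable of lines, it can be fed an open file object and processes one record at a time without holding the whole file in memory.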