Finding the value that repeats the maximum in a file

manojmathai
Participant
Posts: 23
Joined: Mon Jul 04, 2005 6:25 am

Post by manojmathai »

Hi

I have an input file that contains the fields ProcessingMonth, Customer Name, Place ... I need to set the Actual Processing Month on every row to the Processing Month that occurs most often in the input file.

For example:
Source
-------
Proc Mth,Name,Place
-------------------------------
200405,aaaa,abc
200406,bbbb,xyz
200405,cccc,sdf
200405,dddd,lkj
200404,eeee,rst
200404,ffff,wer

The output should be:

Act ProcMth,Proc Mth,Name,Place
----------------------------------------------------------------------
200405,200405,aaaa,abc
200405,200406,bbbb,xyz
200405,200405,cccc,sdf
200405,200405,dddd,lkj
200405,200404,eeee,rst
200405,200404,ffff,wer

Can anybody suggest a smart way to do this?

Regards
Manoj
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Manoj,

You will have to make 2 passes through your source file no matter which solution you choose. If your file doesn't have millions of rows, so that lookup performance isn't of paramount importance, then I would solve this with 3 jobs: a Sequence and 2 server jobs.

(0) Write a sequence to call (a), then (b).

(a) Get the most repeated month from the file. You could use an Aggregator stage or Transformer stage variables to do this. Write the single value to a hashed file with key = 1 and data = your string.

(b) Use this hashed file as a lookup, loading it into memory and always using the constant "1" as the key to read the lookup value.


If runtime performance is hugely important, then I would modify (a) to write to a sequential file and write my own function to return this value from the file. I would pass that value as a parameter to job (b), which then no longer needs a lookup because it already has the value.
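
To make the two passes concrete outside DataStage, here is a minimal Python sketch of steps (a) and (b), using the sample rows from the question. It is only an illustration, not a DataStage job: the function names are made up for this example, and a plain dict with the constant key "1" stands in for the single-row hashed file.

```python
from collections import Counter

def most_repeated_month(rows):
    """Pass 1, step (a): return the Proc Mth value that occurs most often."""
    counts = Counter(row.split(",")[0] for row in rows)
    month, _count = counts.most_common(1)[0]
    return month

def add_actual_month(rows, lookup):
    """Pass 2, step (b): prefix each row with the value stored under key "1"."""
    actual = lookup["1"]
    return [f"{actual},{row}" for row in rows]

# Sample data rows from the question (header line left out).
rows = [
    "200405,aaaa,abc",
    "200406,bbbb,xyz",
    "200405,cccc,sdf",
    "200405,dddd,lkj",
    "200404,eeee,rst",
    "200404,ffff,wer",
]

# Stand-in for the hashed file: key = "1", data = the most repeated month.
lookup = {"1": most_repeated_month(rows)}
result = add_actual_month(rows, lookup)

for line in result:
    print(line)
```

The thread does not say what should happen when two months tie for the highest count. In this sketch the month that reaches that count first in input order wins, which may or may not be what you want.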
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

As ArndW suggested, you need to identify the most frequently occurring period using a sort, an aggregation, or any external method.

I would suggest passing that value as a parameter to your next job, or using Unix command(s) to prefix it to each record.
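
As a rough sketch of the "external method" route, the Python below works directly on a comma-separated file. It assumes the file has one header line and at least one data row, and the function names are hypothetical. The first function finds the period to use as the parameter; the second takes that value as a parameter and prefixes it to each record.

```python
import csv
from collections import Counter

def find_actual_month(path):
    """First pass: return the most frequent value in the first column."""
    with open(path, newline="") as src:
        reader = csv.reader(src)
        next(reader)  # skip the header line (assumed to be present)
        counts = Counter(row[0] for row in reader if row)
    return counts.most_common(1)[0][0]

def prefix_records(src_path, dst_path, actual_month):
    """Second pass: write each record prefixed with the given month."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        header = next(reader)
        writer.writerow(["Act ProcMth"] + header)
        for row in reader:
            if row:  # ignore blank lines
                writer.writerow([actual_month] + row)
```

You would call find_actual_month on the input file first, then hand its result to prefix_records (or to the next job as a parameter) together with the input and output paths.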