Generic job for transforming multiple files

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

chitrangadsingh
Participant
Posts: 12
Joined: Mon Jul 18, 2005 4:07 am

Generic job for transforming multiple files

Post by chitrangadsingh »

Hi all,

We have a requirement in which 17 files, each with different metadata, are to be read and some generic transformations and data cleansing applied. These mainly consist of stripping white space from char fields and removing leading/trailing zeroes from decimal fields.
The idea is to design a generic job that can do this for any file passed as a parameter, along with its metadata. Designing a separate job for each file is the obvious approach, but it's not preferred since the number of files and their layouts may change in future.
The roadblock I foresee is how to generalize the transformations, because to perform trimming etc. one needs to know each column's name and data type.

Is there any way this can be achieved in PX? I would appreciate any pointers.

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Design a single job for each file, and maintain them into the future.

As soon as you wish to apply any kind of transformation, even a Trim() function, you must name the input column and output column explicitly. That rules out runtime column propagation.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chitrangadsingh
Participant
Posts: 12
Joined: Mon Jul 18, 2005 4:07 am

Post by chitrangadsingh »

Thanks, Ray, for the prompt response as always.

But I guess I'm unlucky not to have a premium membership yet!
Anyway, what I could gather from the visible part of your reply is that I should build a separate job for each file and add transformations and column names as and when requirements come up? Isn't there a workaround in terms of routines or UNIX scripting? Can any non-premium poster help me :oops: ?
I will be getting a premium ID soon to read your entire reply. Thanks anyway.
s_boyapati
Premium Member
Posts: 70
Joined: Thu Aug 14, 2003 6:24 am
Contact:

I think you know your solution....

Post by s_boyapati »

There is a workaround for this that our gurus have used for years on UNIX with awk and Perl. Develop scripts that take a metadata file (containing field names and definitions), with one metadata file per data file you want to process. Your main script should read that metadata file and use the information to apply generic functions as required. The scripts can get ugly and difficult to maintain once they hold tons of code, so group the files by their definitions and by the type of functions to run on them, and develop template-based scripts to process them. I would suggest one script per group (or family) of files. I did that kind of work before EE came on board.
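
A minimal sketch of the metadata-driven idea, written in Python rather than awk or Perl. Everything named here is an assumption made for illustration: the layout format (one `name,width,type` line per field), the type names `char` and `decimal`, and the helper names `parse_layout`, `clean_decimal` and `clean_record`. It assumes fixed-width records and unsigned or sign-prefixed decimals.

```python
def parse_layout(lines):
    """Parse hypothetical 'name,width,type' lines into (name, width, type) tuples."""
    layout = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, width, ftype = (part.strip() for part in line.split(","))
        layout.append((name, int(width), ftype.lower()))
    return layout


def clean_decimal(text):
    """Remove leading zeros from the whole part and trailing zeros from the fraction."""
    text = text.strip()
    if not text:
        return ""
    sign = "-" if text.startswith("-") else ""
    whole, _, frac = text.lstrip("+-").partition(".")
    whole = whole.lstrip("0") or "0"
    frac = frac.rstrip("0")
    number = whole + ("." + frac if frac else "")
    return "0" if number == "0" else sign + number


def clean_record(record, layout):
    """Slice one fixed-width record by the layout and clean each field by its type."""
    fields = []
    pos = 0
    for _name, width, ftype in layout:
        raw = record[pos:pos + width]
        pos += width
        if ftype == "decimal":
            fields.append(clean_decimal(raw))
        else:
            fields.append(raw.strip())  # char fields: strip surrounding white space
    return fields
```

With a layout of `cust_name,10,char` and `balance,11,decimal`, the 21-character record `  Smith   000123.4500` would come out as the two fields `Smith` and `123.45`. Adding an 18th file would then mean writing one more layout file, not one more script.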

Sree
novneet
Participant
Posts: 28
Joined: Tue Jan 17, 2006 2:19 pm
Location: PUNE(INDIA)

Need more Details...

Post by novneet »

Hi,

Can you tell me whether the metadata will be the same for all the files or will differ?
If the metadata is going to differ, will all the char fields store only alphabetic characters (no numbers)?
Also, is the source file delimited or fixed width, and are the character fields enclosed in double quotes?
Please give me the above details and I might be able to help you.
Regards,
Novneet Jain
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I would opt for a sed/awk script. That would be much easier, and the code wouldn't be huge either. There are many sed one-liners that perform similar tasks. The script would even be independent of the file's metadata. Look into it.
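
As a rough illustration of what a metadata-independent pass might look like, here is a sketch in Python rather than sed/awk. It assumes delimited input with a `|` delimiter (an assumption, not something stated in the thread), and it guesses each field's type from its content: anything that looks like a plain number is treated as a decimal. The names `clean_field` and `clean_line` are hypothetical.

```python
import re

# A field counts as numeric if it is an optional sign, digits, and an optional fraction.
NUMERIC = re.compile(r"[+-]?\d+(?:\.\d+)?")


def clean_field(field):
    """Trim white space; if the field looks numeric, also drop redundant zeros."""
    field = field.strip()
    if not NUMERIC.fullmatch(field):
        return field  # treated as character data
    sign = "-" if field.startswith("-") else ""
    whole, _, frac = field.lstrip("+-").partition(".")
    whole = whole.lstrip("0") or "0"
    frac = frac.rstrip("0")
    number = whole + ("." + frac if frac else "")
    return "0" if number == "0" else sign + number


def clean_line(line, delimiter="|"):
    """Clean every field of one delimited line without knowing the layout."""
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join(clean_field(f) for f in fields)
```

The trade-off is that guessing types from content can go wrong: a char field holding a code such as `00123` would lose its leading zeros. A fixed-width file would also still need the column positions from somewhere, so this approach only fits delimited data where every all-digit field really is numeric.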
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chitrangadsingh
Participant
Posts: 12
Joined: Mon Jul 18, 2005 4:07 am

Post by chitrangadsingh »

Thanks all for your time and suggestions.
Novneet: Metadata will differ. Files are fixed-length with no quote character. Let's assume char fields will hold only character data and numeric fields only numbers.
Let me know if you have any suggestions on this.

Thanks.


novneet
Participant
Posts: 28
Joined: Tue Jan 17, 2006 2:19 pm
Location: PUNE(INDIA)

Post by novneet »

Sorry, my DataStage server is down :shock: . As soon as it is up I will test and post the code.
Regards,
Novneet Jain