Best approach for unknown varied source


Amit Jaiswal
Premium Member
Posts: 38
Joined: Fri Apr 22, 2005 6:07 am

Best approach for unknown varied source

Post by Amit Jaiswal »

Hi All,

We have a critical requirement under which a new file can arrive in any format, and we simply have to plug that file into the existing transformation logic. The source can be anything: a fixed-width flat file, a CSV file, XML, SOAP, etc. I am thinking of the following approach:
Create the DataStage job by manipulating a template export XML dump that contains the transformation logic. Manipulating here means adding the source part based on configuration information the users enter, such as file type, source-to-target mapping, etc.
I am thinking of writing a Perl script that will manipulate the exported job XML and add the source information to it. The same script will then import the job into my project and compile it.

Can anyone tell me whether this approach is feasible?

Thanks in advance.
-Amit
nick.bond
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

WOW - I would love to see that!

Would it not be easier to convert the file into a 'standard format' that the same job can process each time, rather than rewriting the job to fit the file?
Regards,

Nick.
nick.bond
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

Out of interest, why do you not know what format the file will come in?

Is there a finite set of file formats? If there is, create a small job per format that reformats the file into the standard format, then run the result through your main job, which holds the transformations. Users can choose which reformatting job to run based on the format of the file.
Regards,

Nick.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Best approach for unknown varied source

Post by chulett »

Amit Jaiswal wrote:We have some critical requirement in which any new file can come in any format and we have to just plug that file with the existing transformation logic. Source file can be anything like fixed width flat file, csv file, XML, SOAP, etc. I am thinking on following approach:<snip>
Wow... that's just crazy talk. I would be thinking: go take a long walk on a short pier. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Amit Jaiswal
Premium Member
Posts: 38
Joined: Fri Apr 22, 2005 6:07 am

Post by Amit Jaiswal »

Hi,

The requirement is to make everything flexible: any new vendor can be added and can send files in its own format.
Since the data may run to millions of records (1-2 GB files), I think reshuffling the columns into a common standard format will be a heavy, recurring overhead. Converting every other type of file into sequential file format will be yet another overhead if we have to use a single job for processing.

Thanks,
-Amit
nick.bond
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

That is why you split it into 2 processes:

1) Reformat the input file into a common structure: one job per customer, which will be quick to build.

2) Process all reformatted files through your complex logic: only one multi-instance job.

It will be quicker to build a new reformat job for each customer than to build that script you were talking about! :?:
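
To illustrate step 1 outside DataStage, here is a minimal Python sketch of what "reformat into a common structure" could look like for a CSV feed and a fixed-width feed. The standard column names, the pipe delimiter, and the helper names are all assumptions made up for this example; in practice this would be a small per-customer DataStage job rather than a script.

```python
import csv
from typing import Iterable, Iterator, List, Sequence, TextIO, Tuple

# Hypothetical standard layout that the single main job would expect.
STANDARD_COLUMNS = ["vendor_id", "txn_date", "amount"]


def read_csv(lines: Iterable[str], column_order: Sequence[str]) -> Iterator[List[str]]:
    """Yield rows from a CSV feed, reordered to match the standard layout.

    column_order lists the vendor's header names in standard-column order.
    """
    for row in csv.DictReader(lines):
        yield [row[name].strip() for name in column_order]


def read_fixed(lines: Iterable[str], slices: Sequence[Tuple[int, int]]) -> Iterator[List[str]]:
    """Yield rows from a fixed-width feed; slices are (start, end) offsets."""
    for line in lines:
        yield [line[start:end].strip() for start, end in slices]


def write_standard(rows: Iterable[Sequence[str]], out: TextIO) -> None:
    """Write rows as a pipe-delimited file with the standard header."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow(STANDARD_COLUMNS)
    for row in rows:
        writer.writerow(row)
```

Each new vendor then only needs a new column_order or slices definition, and the main job never changes.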
Regards,

Nick.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I agree with Nick, but you need to identify the format first. Write something that figures out what the format is, then transform that format into your base format.
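
As a rough idea of what "something which figures out the format" might be, here is a small Python sketch that guesses the format from the first chunk of a file. The function name and the rules are assumptions for illustration only; real feeds will need stricter checks (a comma inside a fixed-width record would fool this, for example).

```python
def detect_format(sample: str) -> str:
    """Guess a feed format from the start of a file (heuristic only).

    Returns one of "soap", "xml", "csv", "fixed", or "unknown".
    """
    # Ignore a byte-order mark and leading whitespace before inspecting.
    text = sample.lstrip("\ufeff \t\r\n")
    if text.startswith("<"):
        # Assumption: markup that mentions an Envelope element near the
        # top is a SOAP message; any other markup is treated as plain XML.
        if "Envelope" in text[:2048]:
            return "soap"
        return "xml"
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line:
        return "unknown"
    if "," in first_line:
        return "csv"
    # Assumption: anything else is a fixed-width flat file.
    return "fixed"
```

The result can then be used to pick which reformat job to run for that file.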
Mamu Kim
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

Regarding compiling on the fly ... it is not as nice as it sounds. The command-line compilation tool dscc is available only with the Windows client, so if your scripts run on a non-Windows platform you cannot compile there.
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
Amit Jaiswal
Premium Member
Posts: 38
Joined: Fri Apr 22, 2005 6:07 am

Use of Java Pack in Datastage?

Post by Amit Jaiswal »

Hi All,

Thanks for all this valuable information and the suggestions.

I'm now thinking about the alternative solution, which uses two jobs: one job to convert the source file to a predefined format, and a multi-instance job to do the actual transformation.

I have another query related to the same topic: what can we achieve with the Java Pack?

In short the requirement is as below:
We need to create a web-based application/package/service that performs the following tasks in batch, driven by the scheduler and configuration information:
1. Fetch files from various vendor locations with Java, over various protocols.
2. Uncompress and decrypt each file and store it on the ETL server.
3. Use DataStage EE to transform this data quickly and load it into the Oracle target.
In the proposed solution, jobs will be executed both in batch and on demand. On demand means we may deploy the whole package to various groups so that each can take care of processing the data for a particular category of vendor. If the job for some feed/file fails, that group should be able to re-execute only that job from the web-based user interface.

My query is: can we achieve this web service, or the invocation of a DS job, using Java or the Java Pack?
Can we invoke DS services through a JMS and EJB client? If we have a JMS queue, couldn't the Java framework post messages to this queue, with DS reading the messages from it and invoking other jobs? Essentially, the service in DS listening for JMS messages would be the controller within DataStage, and we could build a config for DS similar to the Java framework's (to define which DS job to run for which kind of JMS message request).
If this is not possible using the Java Pack, can we achieve it using DataStage SOA Edition?
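
To make the config idea concrete, here is a minimal Python sketch of the mapping from message type to job. The message types, job names, and the resolve_job helper are all hypothetical, and nothing here talks to JMS or DataStage; it only shows the lookup the controller would do. The "JobName.InvocationId" form for the multi-instance job is my assumption about how each feed would get its own instance.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JobTarget:
    """One routing entry: which job to run for a given message type."""
    job_name: str
    multi_instance: bool


# Hypothetical routing config: message type -> job to run.
ROUTES: Dict[str, JobTarget] = {
    "reformat.csv": JobTarget("ReformatCsv", multi_instance=False),
    "reformat.fixed": JobTarget("ReformatFixed", multi_instance=False),
    "transform": JobTarget("MainTransform", multi_instance=True),
}


def resolve_job(message_type: str, feed_id: str,
                routes: Dict[str, JobTarget] = ROUTES) -> str:
    """Return the job reference to run for one incoming message.

    Multi-instance jobs get the feed id appended as an invocation id,
    so a failed feed can be re-run on its own.
    """
    try:
        target = routes[message_type]
    except KeyError:
        raise ValueError(f"no job configured for message type {message_type!r}") from None
    if target.multi_instance:
        return f"{target.job_name}.{feed_id}"
    return target.job_name
```

For example, resolve_job("transform", "vendorA") gives "MainTransform.vendorA", while an unconfigured message type is rejected instead of being silently dropped.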

Thanks in advance.
-Amit Jaiswal
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Re: Use of Java Pack in Datastage?

Post by sud »

I am not sure if you should open another post for this.

Well, to invoke jobs from Java you will need the SOA Edition. The Java Pack allows the opposite: invoking Java programs from DataStage.
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Returning to the original topic, the "Best approach for unknown varied source" is, in my opinion, to push back hard against such an insane requirement. At least demand a small and finite number of possible formats, and some easy means to detect the format in the first line of the source file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Mike3000
Participant
Posts: 24
Joined: Mon Mar 26, 2007 9:16 am

Post by Mike3000 »

Ray is 1000% correct; just don't allow your client "to kill" you. Ray has excellent advice about a finite number of formats and easy detection. I have hands-on experience with the same type of "crazy" requirements. We followed the client, and only in the middle of the project, when money and time had been lost, did everybody (including the client) realize that "pure theory is a lot different from harsh practice"...
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Hey, I said it first - but y'all probably thought I was joking. As noted, I too would push back hard on whoever decided this was a 'good' approach to take.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Nah, you just thought it. We said it. 8)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Wow, Ray. Give Craig a little credit once in a while. Besides I would be more impressed if you thought it and he wrote it.
Mamu Kim