Complex flat file to parent-child transforms

estevesm
Participant
Posts: 7
Joined: Wed Jun 30, 2004 10:25 am

Complex flat file to parent-child transforms

Post by estevesm »

I'd like opinions/suggestions on the following problem:
I want to load two tables in a target database, let's call them PARENT and CHILD. I have a flat file generated from the mainframe containing records at the CHILD level. I'd like to load both PARENT and CHILD from this file, deriving the PARENT data either by aggregation (say, summing up all the numeric fields) or by derivation (for example, selecting the earliest item date from all the child records).
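
Just to make the transform concrete, here is a rough sketch of what I mean in plain Python (the column names parent_id, amount and item_date are made up, and I know a real DataStage job won't look like this):

```python
# Made-up child-level records; in reality these come from the mainframe flat file.
from datetime import date

child_records = [
    {"parent_id": "P1", "child_id": "C1", "amount": 100.0, "item_date": date(2004, 6, 1)},
    {"parent_id": "P1", "child_id": "C2", "amount": 250.0, "item_date": date(2004, 5, 15)},
    {"parent_id": "P2", "child_id": "C3", "amount": 75.0,  "item_date": date(2004, 6, 20)},
]

parents = {}
for rec in child_records:
    key = rec["parent_id"]
    if key not in parents:
        parents[key] = {"parent_id": key, "total_amount": 0.0,
                        "earliest_item_date": rec["item_date"]}
    # aggregation: sum the numeric fields across all children of this parent
    parents[key]["total_amount"] += rec["amount"]
    # derivation: keep the earliest item date seen for this parent
    parents[key]["earliest_item_date"] = min(parents[key]["earliest_item_date"],
                                             rec["item_date"])

print(list(parents.values()))   # one derived PARENT row per parent key
```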
It may sound stupid but here it goes... ;-)
1) Is it possible to do this in DataStage? (I'm a new developer.)
2) Any suggestions on how to do it? (which stages to use, how to divide the work, etc.)
3) What if this information is sent through MQ? What design changes would I have to take into consideration?

Thanks in advance for any help and sorry if the questions sound stupid... :-)
Marcelo Esteves Silva
Senior Technology Officer
Securities Lending Technology - Investor Services
JPMorganChase
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

It's a smart question and this is the right place to ask it.

You need to identify the primary key of your parent record. You then de-duplicate your child records based on this key to create your parent records. Here is one design:
Read in your child records and write them out to the target child database table. Within this same job, also write out to a hash file that matches the definition of your parent table. The key fields on the parent hash file remove the duplicates and leave you with unique parent records.

This job has a flat file input stage, a transformer, and two outputs: a database stage (child) and a hash file stage (parent). An additional job loads the hash file into the parent table. A sequence job runs the two jobs in order.
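
To see why the hash file gives you the de-duplication, think of a hashed file keyed on the parent key as a dictionary where the last write for each key wins. A plain Python sketch of the idea (column names invented, not DataStage syntax):

```python
# Writing child rows keyed on the parent key de-duplicates them:
# repeated keys overwrite each other, leaving one row per parent.
child_rows = [
    {"parent_id": "P1", "child_id": "C1", "region": "EMEA"},
    {"parent_id": "P1", "child_id": "C2", "region": "EMEA"},
    {"parent_id": "P2", "child_id": "C3", "region": "APAC"},
]

parent_hash = {}                      # stands in for the parent hash file
for row in child_rows:
    # only keep the columns defined on the parent table
    parent_hash[row["parent_id"]] = {"parent_id": row["parent_id"],
                                     "region": row["region"]}

print(parent_hash)                    # unique parent records: P1 and P2
```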

If you run this multiple times, you may find that incoming child records already have parents loaded in the parent table. If this is the case, load your existing parent table rows into a hash file and use it as a lookup in your transformer to determine whether the parent already exists. Then only new parents are written to the output parent hash file.
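
Roughly, that lookup does this (again just a Python sketch of the idea, with invented names):

```python
# Parents already in the target table, loaded into a hash file for the lookup.
existing_parents = {"P1"}

# Parent keys arriving on the incoming child records.
incoming_parent_keys = ["P1", "P2", "P2", "P3"]

new_parents = {}
for key in incoming_parent_keys:
    if key not in existing_parents:   # lookup miss: this parent is new
        new_parents[key] = {"parent_id": key}

print(list(new_parents))              # only P2 and P3 go on to the parent load job
```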

If your source is MQ then you have a couple of options. There is an MQ plug-in for DataStage, which is available at an additional cost. There is also RTI, which lets you turn a DataStage job into a web service and may be compatible with MQ. Without either of these you can probably get MQ to write records to a flat file for batch processing.
estevesm
Participant
Posts: 7
Joined: Wed Jun 30, 2004 10:25 am

Post by estevesm »

vmcburney wrote: (full reply quoted above, snipped for length)
Thanks for the reply Vincent!

I'll be running it multiple times a day, so I'll probably be doing something similar to your second suggestion.
I'm not sure yet whether I'll be required to supply the data through MQ. If so, my company has already bought the MQ plug-in, so that won't be a problem. What I'm thinking is that the design may change if I use MQ. With a file I can treat the whole file as a unit of work; in MQ, each message is a unit of work (correct me here if I'm wrong). If so, I'll have to come up with a way to load both the child and parent tables based on 0..n messages containing child data.

Of course, if I know I'll get the messages in "bursts" I can use your idea of creating a hash file of the parent table and using it as a lookup when loading the child records. What I'm not sure about is how I would update the parent table, since aggregation and derivation are necessary (maybe a link to an Aggregator stage?). Say I get one message: I create a parent and a child record. If I then get a second message for another child of the same parent, I'd create the child record and update the parent, adding to the totals and maybe updating some other columns.
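
Just so my question is concrete, this is roughly the insert-or-update behaviour I'm picturing for the parent table as each message arrives (plain Python sketch, made-up column names, not DataStage):

```python
from datetime import date

parent_table = {}                     # stands in for the target PARENT table

def apply_child_message(msg):
    """Insert the child record (not shown) and insert or update its parent."""
    key = msg["parent_id"]
    parent = parent_table.get(key)
    if parent is None:
        # first child seen for this parent: create the parent row
        parent_table[key] = {"parent_id": key,
                             "total_amount": msg["amount"],
                             "earliest_item_date": msg["item_date"]}
    else:
        # later children: add to the totals and re-derive the other columns
        parent["total_amount"] += msg["amount"]
        parent["earliest_item_date"] = min(parent["earliest_item_date"],
                                           msg["item_date"])

apply_child_message({"parent_id": "P1", "child_id": "C1",
                     "amount": 100.0, "item_date": date(2004, 6, 1)})
apply_child_message({"parent_id": "P1", "child_id": "C2",
                     "amount": 50.0, "item_date": date(2004, 5, 20)})
print(parent_table["P1"])   # total_amount 150.0, earliest_item_date 2004-05-20
```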

Sorry for the long post, but this thing is costing me some brain cells... ;-)

Regards
Marcelo
Marcelo Esteves Silva
Senior Technology Officer
Securities Lending Technology - Investor Services
JPMorganChase