Parsing XML into two dB tables, creating keys

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Parsing XML into two dB tables, creating keys

Post by mellidav »

Hi,
I have an XML file with records that have a header, then multiple lines (repeated).
I am creating a job to push this into two seperate database tables, header and detail.
However, the data in the XML does not have any keys to link the detail lines to the header lines.
Is there a way that Datastage can create keys linking the detail lines to the header lines as it processes the job?
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Can you outline this with more detail?

You mention "you have an xml file with header and detail". That doesn't really make sense, unless perhaps you have a "file" that contains "many" xml documents (one full document per line), and the first xml document (first physical record of the file) is a "header" xml document and the next ones are independent xml documents, each with "detail" ?

A single "xml file" would be complete...perhaps with header type info and detail type info, but it would be described as such according to its tags, not by physical record number.

Please describe the file with more detail and we can suggest various approaches.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Moved to the proper forum.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Post by mellidav »

Sure sorry,
The XML file is roughly in the format of:

Code: Select all

<Purchase Order 1>
     <PO Line 1>
     <PO Line 2>
<Purchase Order 2>
     <PO Line 1>
     <etc>
I am bringing all the <Purchase Order> information into one dB table (i.e. date, customer, store, etc), and all the 'PO Line' information into a seperate dB table (i.e. Product, Qty, Price, etc)

However, in the XML document there are no key identifiers to link the PO lines to the PO headers when they are in seperate tables.

I wondered if there was a way for DS to create parent/child keys as it populates the dB tables to ensure the relationship between the <Purchase Order> and <PO Line> information can be maintained?

Thanks,
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Yes...absolutely. The xml structure itself represents the "parent" relationships here.....

...and DataStage will read it in that way.....

PO line 1...."belongs" to the first Purchase Order by nature of how xml works (provided that this is a valid XML document with appropriate open and close tags).

This is not a place for an xml tutorial, but to get you started, first try opening the document in your favorite browser. Find the first "PO".....see if it has a dash next to it....click it and see if the "whole PO" collapses and now has a "+"......that is a fairly simple way to help illustrate the relationship that all the "lower level" data items have with that parent.

More formally, the main purchase order element should have a starting tag like <purchaseOrder> ...and then somewhere else, after all the line items and before the next purchase order, you will see a closing tag for that purchase order .....</purchaseOrder>. That's a quick and overly simplistic description, but should get you started. Look on the web for xml tutorials --- you'll find many....and then come back and search thru the forum here for how to get started with DataStage and xml......

DataStage will ultimately read the line items as multiple rows, with the correct parent purchase order as one of the initial columns.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Post by mellidav »

Thanks, but that's not what I see when I have imported the PO Header and PO Line information into two tables.

e.g. the xml is presenting this (but in xml format)

PO1, Customer1, Address1
POline-a, Qty-a, Product-a, Cost-a
POline-b, Qty-b, Product-b, Cost-b
PO2, Customer2, Address2
POline-c, Qty-c, Product-c, Cost-c
PO3 etc

But when I import it into two dB tables 1. PO Header and 2. PO Line I get

PO Header table:
PO1, Customer1, Address1
PO2, Customer2, Address2
PO3, etc

and

PO Line table:
POline-a, Qty-a, Product-a, Cost-a
POline-b, Qty-b, Product-b, Cost-b
POline-c, Qty-c, Product-c, Cost-c

The problem is the original XML does not have an actual key linking the PO headers to the PO Line, they are clear when they are in the XML structure, but when imported into to seperate dB tables there is no way I can now tell what PO lines are for what PO headers.

So is there a way that datastage can somehow add parent/child keys into the dB tables when running the job, so the database can maintain the relationships?

If so could you please explain how?
Many thanks,
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

How are you reading in the data?

You need to be using the xml stages....sounds like maybe you are reading this file using only the Sequential Stage and then parsing out the pieces in a Transformer?

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Post by mellidav »

Hi Ernie,

No I'm definitely using the XML stage, it's reading an XML file (correctly) and importing two database tables...below is an example of the XML (it has an associated xsd also)

<batch PO>
--<PO Header>
------<PO number>...</PO number>
------<Customer>...</Customer>
------<Address>...</Address>
------<Items>
-----------<Item>
--------------<Line number>...</Line number>
--------------<Item ID>...</Item ID>
--------------<Cost>...</Cost>
-----------</Item>
-----------<Item>
--------------<Line number>...</Line number>
--------------<Item ID>...</Item ID>
--------------<Cost>...</Cost>
------------</Item>
-------</Items>
----</PO Header>
</batch PO>

There are two 'lists' therefore created in DS, the 'PO Header' and the 'Item'.
I need to seperate these out into two database tables for the Headers and the Items.
The DS extract is working and I have the data pulled into two database tables, but as you can see from the data there is no key in the data that links the items to the PO Headers. It's clear when they are in the XML, but when I pull the data into two separate tables I cannot reference back which Items belong to which PO Header.
So was wondering if there is a way that DS can apply a primary key to the tables on export, as it knows that they are parent/child items, but on extract we lose that?

Thanks,
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

ah. Ok. "Nested" lists can be retrieved on the same link.... The only time you absolutely _have_ to use separate links is when they are independent lists.... Imagine if you had line items and also "multiple addresses" for each PO. You'd need a link for PO header info and addresses.......and another for PO header info and line items...

...but in your example below, PO, while being a list, is the "parent" list for items....nested within it is a repeating set of items. So they can both be on the same output link.

Build a new Job...and this time, in the Assembly, BEFORE YOU DO ANYTHING ELSE....map your "items" list to a single output link. Once you do that, you can then map any of the PO header fields also to that same link, and your parent/child relationships will be maintained. Each line item detail value will appear on a row along with its corresponding PO header info.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Post by mellidav »

Hi Ernie,

Sorry but I don't know if that helps me. Can you comment on how I can pass parent/child keys on to the two separate database tables?

Can you provide any instruction on how to do so?
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

If you review and implement the notes above, you will have one output link...one set of rows.......with columns, on each row, for Purchase Order, other header info, the line item, line item details, etc. Everything all together. Just like if the PO and its details came in from a single flat file.

Now you can do all kinds of things --- like create unique counters that change whenever the PO changes, or store the PO number with each of the line item details.....whatever...just like you would if this was a flat file.

Is this still an xml question?

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Post by mellidav »

Thanks Ernie, i've joined the premium account so that I can now read the thread, but it's taking an age to process my upgrade.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It does, I'm afraid. Scroll down to the Site/Forum forum and read other people's experience before you start to worry too much.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mellidav
Premium Member
Premium Member
Posts: 11
Joined: Mon Oct 13, 2014 4:50 am
Location: manchester

Post by mellidav »

Thanks Ernie, got it sorted now,
David
Post Reply