XML output generation

VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

eostic wrote:....
On the output side, have a column called myNewXML....give it varchar and a long length mostly for doc purposes, with a single "/" in the Description property (also without quotes).

Should work like a charm.

Ernie
Hi Ernie,
As I am not an expert on the XML stages in DS, I would appreciate it if you could help me understand this a bit more.
When the XML doc chunk(s) are passed to the final XML Output stage in the above manner, does that add overhead to the overall processing?
Also, if one were to extrapolate this to a case where millions of records are being output to the file, would that cause any issues?

Thanks in advance for your time,
-V
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Not exactly sure of your question. Creating complex, multi-node documents is not a simple process, nor great on the performance side, but it can work. It just requires that you build each "relational" node (one that could easily be represented as a normalized table or a single set of rows and columns) individually. These are the "chunks" referred to above. None are complete documents, just pieces. Then you bring them all together at the end. There's no real question about "overhead"... this is the only way to do it if you want to get it done via DataStage and not write anything external.

Ernie
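
To make the "build the chunks, then bring them together" idea concrete, here is a minimal sketch in Python rather than DataStage. It is only an analogy: the row data, the function names, and the use of xml.etree.ElementTree are assumptions for illustration, and nothing here reflects how the XMLOutput Stage is implemented internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat rows, one list/dict per "relational" node.
entries = [
    {"org_cd": "USSTD", "org_level_cd": "DIV"},
    {"org_cd": "00010", "org_level_cd": "REG"},
]
statuses = {
    "USSTD": {"code": "Active", "eff_date": "01/01/1900"},
    "00010": {"code": "Active", "eff_date": "01/01/1900"},
}


def build_status_chunk(row):
    """Build one <status> fragment; it is a piece, not a complete document."""
    return ET.Element("status", row)


def build_entry_chunk(row, status_chunk):
    """Build one <entry> fragment and attach its child chunk."""
    entry = ET.Element("entry", row)
    entry.append(status_chunk)
    return entry


def assemble(chunks, root_name="orgapiload"):
    """Bring all the chunks together under a single root at the end."""
    root = ET.Element(root_name)
    root.extend(chunks)
    return root


chunks = [
    build_entry_chunk(row, build_status_chunk(statuses[row["org_cd"]]))
    for row in entries
]
document = assemble(chunks)
xml_text = ET.tostring(document, encoding="unicode")
print(xml_text)
```

Each chunk is built on its own and only becomes part of a document in the final assemble step, which is roughly the role the last XMLOutput Stage plays in the job design described above.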
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Hi Ernie,
I think that answers my question. I was not sure if there was some other way to generate the document without passing the chunks through additional stages. It appears that DataStage job designers have to keep this in mind (I will, as I continue to work on XML :D ).
Now I can relate it to a conventional DOM object building process, where one has to work from different "fragments" and finally build the document.

Thanks again, for your time and response.
-V
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you go to Kim Duke's website, there is a loverly older Ascential document on XML 'Best Practices' that does a great job of illustrating these techniques. Free for the downloading. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

I already grabbed it, Craig. It serves me well, especially when I run into these issues.
The other question I had on this topic is about the buffer length that DataStage can handle when one tries to bring together huge chunks and assemble them into one final master document. Any benchmarks or guidelines on that?
I have a few other queries on the XML output stage, but I will post those in a separate thread :idea: and not hijack this one :)

Thanks,
-V
Last edited by VCInDSX on Thu May 24, 2007 7:39 am, edited 1 time in total.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Kinda the same as what we've referenced before in several threads... if you let XMLOutput write the content to disk, you can probably go higher than if you send the output further down the job (especially if you are using EE)... but you should be able to get fairly high with Server Edition. Avoid validation on output unless you must.

Ernie
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Pardon my amateur queries....
If I understand this correctly, if the xml chunks/fragments are persisted to file(s) before they are combined into the final document that should help with performance, correct?
I wonder if that would be additional file I/O time that adds up to the total processing time for the job?

In case of a simple XML generation job,
Stage 1. Read data from 1 or more (if joins) tables. (Yields 2Mil records)
Stage 2. Apply transforms (timestamps, null validations et al)
Stage 3. Write out to XML Output stage (With schema validation)

In this job, Stage 2 waits for completion of Stage 1 and Stage 3 waits for completion of Stage 2.

Does this job stand to gain anything special if it were to be designed as Parallel as opposed to Server?

Thanks again for your time and input,
-V
velagapudi_k
Premium Member
Posts: 142
Joined: Mon Jun 27, 2005 5:31 pm
Location: Atlanta GA

Post by velagapudi_k »

Hi Guys,
Is there a way to get rid of the following when generating XML using the DataStage XML Output stage?

<?xml version="1.0" encoding="UTF-8"?>

Appreciate your help.
Venkat Velagapudi
velagapudi_k
Premium Member
Posts: 142
Joined: Mon Jun 27, 2005 5:31 pm
Location: Atlanta GA

Post by velagapudi_k »

eostic wrote:
Absolutely. Just pass it thru "one more" XMLOutput Stage (which is what you'd do if it were all in a single job). You can't just "paste" the xml snippets together without wrapping them in a higher level element(s).

So... have an XMLOutput stage with an input link and an output link.... one column on each.

On the input side, feed it your XML.... and have (minimally) a Description property for that column that is just "/Orgapiload " (without the quotes).

On the output side, have a column called myNewXML....give it varchar and a long length mostly for doc purposes, with a single "/" in the Description property (also without quotes).

Should work like a charm.
Hi Ernie, I am trying to do what you described above.

My input is
<entry org_cd="USSTD" org_level_cd="DIV" name="US Stores Division" locationname="USSTO">
<orgrel org_cd="USSTO" org_level_cd="AREA" eff_date="01/01/1900"/>
<status code="Active" eff_date="01/01/1900"/>
</entry>
<entry org_cd="00010" org_level_cd="REG" name="REGION 10 SOUTHEAST" locationname="USSTD">
<orgrel org_cd="USSTD" org_level_cd="DIV" eff_date="01/01/1900"/>
<status code="Active" eff_date="01/01/1900"/>
</entry>
which I want to wrap in a higher-level element, orgapiload.
I am using a Sequential File stage, an XML Output stage, and a Sequential File stage, in that order. On the input side, I feed in my XML, with the Description property for that column set to "/Orgapiload ".
On the output side, I have a column called myXML with a single "/" in the Description.

The output I am expecting is
<orgapiload>
<entry org_cd="USSTD" org_level_cd="DIV" name="US Stores Division" locationname="USSTO">
<orgrel org_cd="USSTO" org_level_cd="AREA" eff_date="01/01/1900"/>
<status code="Active" eff_date="01/01/1900"/>
</entry>
<entry org_cd="00010" org_level_cd="REG" name="REGION 10 SOUTHEAST" locationname="USSTD">
<orgrel org_cd="USSTD" org_level_cd="DIV" eff_date="01/01/1900"/>
<status code="Active" eff_date="01/01/1900"/>
</entry>
</orgapiload>


But the output I am getting is

<?xml version="1.0" encoding="UTF-8"?>
<orgapiload>
<entry org_cd="USSTD" org_level_cd="DIV" name="US Stores Division" locationname="USSTO">

</orgapiload>
<orgapiload>
<orgrel org_cd="USSTO" org_level_cd="AREA" eff_date="01/01/1900"/>

</orgapiload>
<orgapiload>
<status code="Active" eff_date="01/01/1900"/>

</orgapiload>
<orgapiload>
</entry>

</orgapiload>
<orgapiload>
<entry org_cd="00010" org_level_cd="REG" name="REGION 10 SOUTHEAST" locationname="USSTD">

</orgapiload>
<orgapiload>
<orgrel org_cd="USSTD" org_level_cd="DIV" eff_date="01/01/1900"/>

</orgapiload>
<orgapiload>
<status code="Active" eff_date="01/01/1900"/>

</orgapiload>
<orgapiload>
</entry>
</orgapiload>


Please help me with this.
Venkat Velagapudi
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

There is a check box called "Generate XML Chunk" on one of the output tabs..... check that and it will leave off the header...

Ernie
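
As an analogy outside DataStage, the difference between a full document and a "chunk" is just whether the XML declaration is written in front of the root element. The sketch below uses Python's standard xml.etree.ElementTree (the xml_declaration argument to tostring needs Python 3.8 or later); the element names are borrowed from this thread, and this is not a description of what the "Generate XML Chunk" check box does internally.

```python
import xml.etree.ElementTree as ET

root = ET.Element("orgapiload")
ET.SubElement(root, "status", {"code": "Active"})

# Full document: serialized with the XML declaration (the header) in front.
full_document = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# Chunk: the same element serialized without the declaration, so it can be
# embedded inside a larger document later.
chunk = ET.tostring(root, encoding="UTF-8", xml_declaration=False)

print(full_document.decode("utf-8"))
print(chunk.decode("utf-8"))
```

A chunk is what you want whenever the output is going to be wrapped by another element downstream, since an XML declaration is only allowed at the very start of a document.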
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

...this is assuming that your "input" xml chunk defined above has already been collapsed or built as a single column and single row coming into the XMLOutput Stage...

Ernie
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

I just realized that we never discussed VCInDSX's last entry.... thoughts on this interlaced below... [ernie]


Pardon my amateur queries....
If I understand this correctly, if the xml chunks/fragments are persisted to file(s) before they are combined into the final document that should help with performance, correct?

[evo]...typically but that's probably only because lookups make it fairly simple to pick up the chunks "later" in the job....there may be more creative ways to "carry" the xml content forward after it is created....

I wonder if that would be additional file I/O time that adds up to the total processing time for the job?

[evo]. XML to XML is already slow....what's a little more I/O :)

In case of a simple XML generation job,
Stage 1. Read data from 1 or more (if joins) tables. (Yields 2Mil records)
Stage 2. Apply transforms (timestamps, null validations et al)
Stage 3. Write out to XML Output stage (With schema validation)

In this job, Stage 2 waits for completion of Stage 1 and Stage 3 waits for completion of Stage 2.

[evo]...this is not entirely correct. Saying that Stage 2 "waits" for completion of "Stage 1" implies that the data is staged somewhere before the first row is transformed in Stage 2..... That is not true, as Stage 2 will start performing Transforms immediately upon receiving the first row. Stage 3 is not waiting for all the Transforms either, although it may "appear" that way because XMLOutput is a naturally blocking Stage. But it will be receiving rows continually as they are transformed.
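
The row-at-a-time flow with a blocking stage at the end can be illustrated with Python generators. This is only a sketch of the behavior described, not of the DataStage engine; the stage functions and the events list are hypothetical names used to record the order in which things happen.

```python
events = []


def read_rows(n):
    """Stage 1: yield rows one at a time instead of staging them all first."""
    for i in range(n):
        events.append(f"read {i}")
        yield {"id": i}


def transform(rows):
    """Stage 2: transform each row as soon as it arrives."""
    for row in rows:
        events.append(f"transform {row['id']}")
        yield {**row, "checked": True}


def xml_output(rows):
    """Stage 3: receive rows continually, but emit the document only at the end."""
    parts = []
    for row in rows:
        events.append(f"receive {row['id']}")
        parts.append(f'<row id="{row["id"]}"/>')
    events.append("emit document")
    return "<rows>" + "".join(parts) + "</rows>"


document = xml_output(transform(read_rows(3)))
print(events)
print(document)
```

The recorded events interleave (read, transform, receive for row 0, then the same for row 1, and so on), and only the final "emit document" step has to wait for every row. That is the sense in which the last stage "appears" to wait.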

Does this job stand to gain anything special if it were to be designed as Parallel as opposed to Server?

[evo] ...this is more difficult. A Parallel job will typically only be as fast as its slowest piece. Ultimately, the XMLOutput stage at the end is going to take its time to create the final document. Depending on the transforms being performed, or the degree of parallelism exploited at the sources, it could be very possible that EE will deliver rows more quickly to the XMLOutput Stage (which would be running sequentially), and the framework itself will do a better job getting those rows thru the links....but the XMLOutput Stage may not be able to keep up anyway, and the benefits would be lost..... (for that job anyway --- who knows what else might be going on in the system, or the added flexibility you would get if you were running the Transform on another node, thus freeing up some processes on "this" box for other things, etc.).
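
A back-of-the-envelope way to see the "slowest piece" point: the sustained rate of a pipeline is bounded by its slowest stage. The rates below are made-up numbers for illustration only, not measurements of any DataStage job; the 2 million row count is the one from the question.

```python
# Hypothetical sustained rates in rows per second; NOT measurements.
stage_rates = {
    "source": 50_000,
    "transform": 40_000,
    "xml_output": 5_000,
}


def pipeline_rate(rates):
    """A pipeline cannot sustain more than its slowest stage."""
    return min(rates.values())


def seconds_for(rows, rates):
    """Rough elapsed time if every stage streams at its sustained rate."""
    return rows / pipeline_rate(rates)


before = seconds_for(2_000_000, stage_rates)

# Make everything upstream of the XML stage four times faster.
faster = {**stage_rates, "source": 200_000, "transform": 160_000}
after = seconds_for(2_000_000, faster)

print(before, after)
```

With these assumed numbers the elapsed time is the same before and after, because the sequential XML stage is still the bottleneck. Any real gain would have to come from side effects such as freeing resources on the box.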


Thanks again for your time and input,

_________________
-V
velagapudi_k
Premium Member
Posts: 142
Joined: Mon Jun 27, 2005 5:31 pm
Location: Atlanta GA

Post by velagapudi_k »

eostic wrote:
...this is assuming that your "input" xml chunk defined above has already been collapsed or built as a single column and single row coming into the XMLOutput Stage...
Ernie, is there any way to achieve what I want?
Venkat Velagapudi
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hi....

This whole thing might simply be because you are using the Sequential File stage to read this xml input. I was assuming that your display below was for readability... but looking at the output, it appears that you are reading in the xml with CRLFs and probably getting 8 rows from the sequential file to the XMLOutput Stage. If so, that's the problem. Use the Folder Stage to read in this xml document.... IT MUST BE IN ONE COLUMN....THE ENTIRE CONTENT, whether from this Stage or from another XMLOutput Stage somewhere, before you send it into the XMLOutput Stage for the final wrapper.

This works perfectly as described in the other responses above, with your XML string.

Ernie
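
The difference between the two ways of reading the input can be reproduced in a few lines of Python. This is a sketch of the symptom only, using the XML from this thread; the wrap and is_well_formed helpers are hypothetical and simply stand in for "wrap whatever arrives as one row".

```python
import xml.etree.ElementTree as ET

xml_input = """<entry org_cd="USSTD" org_level_cd="DIV" name="US Stores Division" locationname="USSTO">
<orgrel org_cd="USSTO" org_level_cd="AREA" eff_date="01/01/1900"/>
<status code="Active" eff_date="01/01/1900"/>
</entry>
<entry org_cd="00010" org_level_cd="REG" name="REGION 10 SOUTHEAST" locationname="USSTD">
<orgrel org_cd="USSTD" org_level_cd="DIV" eff_date="01/01/1900"/>
<status code="Active" eff_date="01/01/1900"/>
</entry>"""


def wrap(content, root="orgapiload"):
    """Wrap one incoming value in the higher-level element."""
    return f"<{root}>\n{content}\n</{root}>"


def is_well_formed(text):
    """Return True if the text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Line-oriented read: every physical line becomes its own row, so each
# line gets its own wrapper.
rows = xml_input.splitlines()
per_row = "\n".join(wrap(row) for row in rows)

# Whole-content read: the entire text arrives as one value in one column.
single = wrap(xml_input)

print(len(rows), per_row.count("<orgapiload>"), is_well_formed(per_row))
print(single.count("<orgapiload>"), is_well_formed(single))
```

The line-oriented read yields 8 rows and 8 wrappers that do not parse as a document, matching the output posted above, while the single-value read yields one well-formed wrapper around both entries.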