Process concatenated XML data

santosh · Post by **santosh** » Fri Feb 05, 2016 6:37 am

Background:
we are trying to process XML data (present in text column in sql server) using Datastage 11.3 (windows), Hierarchical stage. Objective is to convert XMLs into columnar data. There is another columnn in source table that tells us what is source of this data (XML type). As of now we have not got a XSD for each XML type, hence we are generating XSD using XMLspy and using it. Current jobs are having 3 stages i.e.

ODBC_connector --> Hierarchical_stage (input, parset and output step) --> Netezza Connector

Problem:
For below mentioned xml type, we can see multiple xmls are stacked one above another. so effectively they are made up multiple concatenated xmls (All having same definations). Is there a way to split XMLs on fly before it reaches Hierarchical_stage. As volumes are huge, we are looking to process without landing data

Sample XML below:
...... signifies, there are more tags in XML
------ signifies, there are multiple xmls. overall there are 6 to 7 of them. Numbers vary.

Code: Select all

<?xml version="1.0"?>
<SchemeResult Ref="COPC2" Completed="Y" ErrorCount="0">
  <PolData Type="Output">
    <Vehicle>
      <Vehicle_VehiclePrn Val="1.0"/>
      <Ncd>
        <Ncd_GrantedEntitlementReason Val="11"/>
        <Ncd_GrantedPct Val="5.0"/>
        <Ncd_GrantedYears Val="9.0"/>
      </Ncd>
    </Vehicle>
    <Cover>
      <Cover_VolXsAllowed Val="150.0"/>
      <Cover_VehPrn Val="1.0"/>
    </Cover>
	.......
  </PolData>
</SchemeResult><?xml version="1.0"?>
<SchemeResult Ref="ADPC2" Completed="Y" ErrorCount="0">	
  <PolData Type="Output">
    <Vehicle>
      <Vehicle_VehiclePrn Val="1.0"/>
      <Vehicle_Count Val="13.0"/>
      <Ncd>
        <Ncd_GrantedEntitlementReason Val="11"/>
        <Ncd_GrantedPct Val="1.0"/>
        <Ncd_GrantedYears Val="9.0"/>
      </Ncd>
    </Vehicle>
    <Cover>
      <Cover_VolXsAllowed Val="150.0"/>
      <Cover_VehPrn Val="1.0"/>
    </Cover>
	........
	  </PolData>
</SchemeResult>
--------

eostic · Post by **eostic** » Fri Feb 05, 2016 11:59 am

If these are all in "one row" from the relational table, then yes, you will need to split them out into individual rows..... lots of potential ways to do that...you need to come up with a pivot that is based on each occurrence of the <?xml header.

How big are "one row" (one string) of these xml's? I would probably look at something simple, like writing these rows out to a sequential file after using a set of ereplace() functions and changing all <?xml to char(13):char(10), which is a CRLF. ...then pick up reading the sequential file downstream (server Job) or in another Job (parallel), sending each "row" as an independent document for "content" to one of the xml Stages.

Ernie