Parsing Complex XML in Hierarchical Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Larz
Premium Member
Premium Member
Posts: 7
Joined: Thu Jan 17, 2013 6:28 pm

Parsing Complex XML in Hierarchical Stage

Post by Larz »

I am having a heck of a time correctly parsing an XML file which is very similar to a problem that ArndW posted in Dec.

I have tried what was suggested by moving the innermost or lowest data element, in this case the element called 'FIELD", to the top of my output and mapping it first which resulted in all 400,000 rows getting the same parent elements assigned.

Take the following data sample

Code: Select all

<BULK_DATA_EXPORTS>
  <BULK_DATA_EXPORT>
    <BULK_DATA_EXPORT_ID>548499</BULK_DATA_EXPORT_ID>
    <PATIENT>
      <PATIENT_NAME>BOSS HOGG</PATIENT_NAME>
      <PATIENT_AGENCY_CODE>1</PATIENT_AGENCY_CODE>
      <PATIENT_CODE>40</PATIENT_CODE>
      <PATIENT_INSURANCE_1>MEDIC</PATIENT_INSURANCE_1>
      <PATIENT_INSURANCE_2></PATIENT_INSURANCE_2>
      <PATIENT_INSURANCE_3></PATIENT_INSURANCE_3>
      <PATIENT_INSURANCE_4></PATIENT_INSURANCE_4>
      <PATIENT_INSURANCE_5></PATIENT_INSURANCE_5>
    </PATIENT>
    <BULK_DATA_FIELDS>
      <FORM_NAME>OTHER Nurse Start of Care</FORM_NAME>
      <FORM_CODE>99999SOC</FORM_CODE>
      <FORM_DATE>20120615</FORM_DATE>
      <FIELD NAME='M1020cVE'></FIELD>
      <FIELD NAME='M1020c'>DMII WO CMP NT ST UNCNTR</FIELD>
      <FIELD NAME='M1030_THH_ENT_NUTRITION'></FIELD>
      <FIELD NAME='M1020b'>OTHER POSTOP INFECTION</FIELD>
      <FIELD NAME='M1020e'>HYPERTENSION NOS</FIELD>
      <FIELD NAME='M1020d'>PARAPLEGIA NOS</FIELD>
      <FIELD NAME='M1020g'>ATTEN TO CYSTOSTOMY</FIELD>
      <FIELD NAME='M1020iOE'></FIELD>
      <FIELD NAME='M1020f'>ANXIETY STATE NOS</FIELD>
      <FIELD NAME='M1020i'></FIELD>
      <FIELD NAME='M1020h'>LONG-TERM (CURRENT) USE OF ASP</FIELD>
      <FIELD NAME='M0150_CPAY_MCAID_HMO'></FIELD>
      <FIELD NAME='M1020jOE'></FIELD>
      <FIELD NAME='M0032_ROC_DT'></FIELD>
      <FIELD NAME='M1023_OTH_DIAG1_SEVERITY'></FIELD>
      <FIELD NAME='M1021gDate'></FIELD>
      <FIELD NAME='M1011_14_DAY_INP5_EICD'></FIELD>
      <FIELD NAME='M1020a'>Cellulitis and abscess of butt</FIELD>
      <FIELD NAME='M1025_OPT_DIAG_ICD_F3'></FIELD>
      <FIELD NAME='M1023_OTH_DIAG4_SEVERITY'></FIELD>
      <FIELD NAME='M1025_OPT_DIAG_ICD_F4'></FIELD>
      <FIELD NAME='M0150_CPAY_MCARE_HMO'></FIELD>
      <FIELD NAME='M1016_CHGREG_ICD_NA'></FIELD>
      <FIELD NAME='M0040City'>DAYTON</FIELD>
      <FIELD NAME='M1033_HOSP_RISK_HSTRY_FALLS'></FIELD>
      <FIELD NAME='M1016_CHGREG_EICD5'></FIELD>
      <FIELD NAME='M1308_NSTG_CVRG'></FIELD>
      <FIELD NAME='M0032_ROC_DT_NA'>on</FIELD>
      <FIELD NAME='M1016_CHGREG_ICD3'></FIELD>
      <FIELD NAME='M0072UK'></FIELD>
      <FIELD NAME='M1016_CHGREG_ICD4'></FIELD>
      <FIELD NAME='M0150_CPAY_MCARE_FFS'>on</FIELD>
      <FIELD NAME='M1016_CHGREG_ICD5'></FIELD>
      <FIELD NAME='M1016_CHGREG_ICD6'></FIELD>
      <FIELD NAME='M1016_CHGREG_ICD1'>998.59</FIELD>
      <FIELD NAME='M1016_CHGREG_ICD2'>682.5</FIELD>
      <FIELD NAME='M1010_14_DAY_INP5_ICD'></FIELD>
      <FIELD NAME='M1322_NBR_PRSULC_STG1'>0</FIELD>
      <FIELD NAME='M1011_14_DAY_INP_NA'></FIELD>
      <FIELD NAME='M1025_OPT_DIAG_ICD_E3'></FIELD>
      <FIELD NAME='M1022_OTH_DIAG9_ICD'></FIELD>
      <FIELD NAME='M1018_PRIOR_UR_INCON'></FIELD>
      <FIELD NAME='M1025_OPT_DIAG_ICD_E4'></FIELD>
      <FIELD NAME='M1023_OTH_DIAG7_SEVERITY'></FIELD>
      <FIELD NAME='M1022_OTH_DIAG10_ICD'></FIELD>
      <FIELD NAME='M1000_DC_NONE_14_DA'>on</FIELD>
      <FIELD NAME='M1024_PMT_DIAG_ICD_L4'></FIELD>
      <FIELD NAME='M1022_OTH_DIAG12_ICD'></FIELD>
      <FIELD NAME='M1410_RESPTX_OXYGEN'></FIELD>
      <FIELD NAME='M1024_PMT_DIAG_ICD_L3'></FIELD>
      <FIELD NAME='M1025_OPT_DIAG_ICD_D3'></FIELD>
      <FIELD NAME='M1021bVE2'></FIELD>
      <FIELD NAME='M1025_OPT_DIAG_ICD_D4'></FIELD>
      <FIELD NAME='M1016_CHGREG_CICD3'></FIELD>
      <FIELD NAME='M1023_OTH_DIAG8_SEVERITY'></FIELD>
      <FIELD NAME='M1020fDate'>01/01/2012</FIELD>
      <FIELD NAME='M0069_PAT_GENDER'></FIELD>
      <FIELD NAME='M1610_UR_INCONT'>2</FIELD>
    </BULK_DATA_FIELDS>
  </BULK_DATA_EXPORT>
</BULK_DATA_EXPORTS>
here is my XSD...

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
   <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
     <xs:element name="BULK_DATA_EXPORTS">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="BULK_DATA_EXPORT">
                 <xs:complexType>
               <xs:sequence>
                 <xs:element name="BULK_DATA_TYPE" type="xs:string"></xs:element>
                 <xs:element name="BULK_DATA_EXPORT_ID" type="xs:int"></xs:element>
                 <xs:element name="PATIENT">
                       <xs:complexType>
                     <xs:sequence>
                       <xs:element name="PATIENT_NAME" type="xs:string"></xs:element>
                       <xs:element name="PATIENT_AGENCY_CODE" type="xs:int"></xs:element>
                       <xs:element name="PATIENT_CODE" type="xs:int"></xs:element>
                       <xs:element name="PATIENT_INSURANCE_1" type="xs:string"></xs:element>
                       <xs:element name="PATIENT_INSURANCE_2"></xs:element>
                       <xs:element name="PATIENT_INSURANCE_3"></xs:element>
                       <xs:element name="PATIENT_INSURANCE_4"></xs:element>
                       <xs:element name="PATIENT_INSURANCE_5"></xs:element>
                     </xs:sequence>
                       </xs:complexType>
                 </xs:element>
                 <xs:element name="BULK_DATA_FIELDS">
                       <xs:complexType>
                     <xs:sequence>
                       <xs:element name="FORM_NAME" type="xs:string"></xs:element>
                       <xs:element name="FORM_CODE" type="xs:string"></xs:element>
                       <xs:element name="FORM_DATE" type="xs:int"></xs:element>
                       <xs:element name="FIELD" maxOccurs="unbounded">
                             <xs:complexType>
                               <xs:attribute name="NAME" type="xs:string"></xs:attribute>
                             </xs:complexType>
                       </xs:element>
                     </xs:sequence>
                       </xs:complexType>
                 </xs:element>
               </xs:sequence>
                 </xs:complexType>
           </xs:element>
         </xs:sequence>
       </xs:complexType>
     </xs:element>
   </xs:schema>
Each element with attribute name="NAME" needs to be output as a column with it's corresponding value. Using as an example from the XML sample above...
<FIELD NAME='M1016_CHGREG_ICD1'>998.59</FIELD>

So far I've found 14 distinct values for FORM_NAME with each of those having a different number of 'FIELD' elements.

How can I correctly parse these varying number of 'FIELD' elements into columns for each row. My file is around 380,000 rows which should result in 1000 output rows.

Any help is greatly appreciated.
Dislexic, Agnostic, Insomniac - lays awake at night wondering if there is a dog.
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Laz,
This is a different XML structure. The lists in this XML are completely independent

As per your requirement, from the list BULK DATA FIELDS you will be able to generate records for each of the values in the field FIELD and as far as I know you won't be able to transpose those records into columns inside the hierarchical stage, but you definitely could do it in other stages downstream

Regards
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Yeah...this is one of those poorly designed xml structures. It is a lazy design...far better would be a schema that has M1023_OTH_DIAG1_SEVERITY (for example) as an "element" and not just the "name" attribute of an element called "FIELD".

DataStage, like any other ETL tool or xml parsing mechanism, is shredding the document into its individual rows...and those rows are defined by the lowest level repeating element for each node path. In this case, that is "FIELD". The parent is "bulk data fields". Usually what is required in these situations is to dump the rows according to a higher level key value (such as the patient) and then use various Stages downstream to perform a pivot. You likely are looking for one "row" with each of the Mxxx data values in separate columns.

There is a pivot step inside of the hierarchical stage, but you are better off doing that downstream, with the obvious Stages that support this function and make it visually apparent and provide the best support from a parallel processing perspective.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
Larz
Premium Member
Premium Member
Posts: 7
Joined: Thu Jan 17, 2013 6:28 pm

Post by Larz »

eostic wrote:Yeah...this is one of those poorly designed xml structures. It is a lazy design...far better would be a schema that has M1023_OTH_DIAG1_SEVERITY (for example) as an "element" and not just the "name" attribute of an element called "FIELD".
Ernie
My thoughts exactly though I am not a web developer. Appreciate your insights and assistance. Do you think an XSLT would help in formatting a better output for parsing? Also, I had considered passing the Bulk data as a large chunk to parse downstream but with the varying number of 'FIELD' columns would have to make the output a ridiculous size to account for the maximum known number of occurrences.

On a related note, being able to call a restful API from DataStage is incredibly powerful in my onion but not necessarily straight forward given the wild west nature of restful API standards, or rather lack of standards. I plan to write up a post detailing this experience as there is not a lot detailed help specific to the requirements of my current project.

Again, your help and insights are greatly appreciated!
Dislexic, Agnostic, Insomniac - lays awake at night wondering if there is a dog.
cyclogenisis
Premium Member
Premium Member
Posts: 48
Joined: Wed Jan 07, 2015 3:30 pm

Larz

Post by cyclogenisis »

I am trying to complete the exact same task for JSON off a RESTful get. I ran into this same issue this week... and the V-Pivot is failing me.
Larz
Premium Member
Premium Member
Posts: 7
Joined: Thu Jan 17, 2013 6:28 pm

Post by Larz »

Ernie, begging your pardon for my learning curve with XML, do you think an XSLT would assist here in making the 'FIELD' into a propagated column name that the value can then be parsed into? What is the best approach in your opinion for this scenario?

In this example...
<FIELD NAME='M1016_CHGREG_ICD1'>998.59</FIELD>

COLUMN = 'M1016_CHGREG_ICD1' value = 998.59
Dislexic, Agnostic, Insomniac - lays awake at night wondering if there is a dog.
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

That depends on many things.......

Who is designing the code?
Who is supporting it? tomorrow? next week? five years from now?
How well do you know xslt?
What are the performance requirements?

I spend a lot of time with xml, so I know enough xslt to be dangerous --- but I'm a DataStage person, so in the time it would take me to review my xslt and test/build/deploy an xslt solution, I will have finished this Job and ten others..... and then a DataStage person in the future will be able to pick it up and maintain it without thinking.

Of course, I might re-think that if the solution I quickly construct runs into issues in performance or limits, etc. etc. etc. , but that's where "I" am going to start..... everyone's site and conditions are different.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
Post Reply