XML File Transformations

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

XML File Transformations

Post by collabxchange »

Hi,
Being not very familiar with XML files, I am struggling a bit to achieve the following objective. It will be very helpful if someone can please direct me to right approach or design.

Objective: Read a XML file, apply some transformations to its data (e.g. switch cases, change some values, etc) and output it to another XML file with the same layout.

A sample record from the input file:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<import_data>
<so_product company="XYZ-US" part_number="81001074" category="Product Associations.Packaging" description="Ring,Crimp,1/2",Pkg/10 " >
<so_product_attribute name="Brand-XYZ" value="ABCDRD"/>
<so_product_attribute name="BrandName" value="ABCD Radiant"/>
<so_product_attribute name="Description-PriceGroup-XYZ" value="Onix/Hydro"/>
<so_product_attribute name="Description-ProductLine-XYZ" value="PEX Parts"/>
<so_product_attribute name="Description1-XYZ" value="Ring,Crimp,1/2",Pkg/10"/>
<so_product_attribute name="EANUCCType" value="UK"/>
<so_product_attribute name="EnvironmentalIdentifier" value="NO_ENVIRONMENTAL_ATTRIBUTE_SPECIFIED"/>
<so_product_attribute name="FinishedGood-XYZ" value="YES"/>
<so_product_attribute name="FunctionalName_en-US" value=""/>
<so_product_attribute name="GtinName_en-US" value=""/>
<so_product_attribute name="HazardousMaterialClassification" value="NO_MSDS_AND_NOT_REGULATED_BY_DOT_(CFR49)"/>
<so_product_attribute name="IsItemAvailableForDirectToConsumerDelivery" value="false"/>
<so_product_attribute name="IsItemAvailableForSpecialOrder" value="false"/>
<so_product_attribute name="IsItemSubjectToUSPatent" value="false"/>
<so_product_attribute name="IsNetContentDeclarationIndicated" value="false"/>
<so_product_attribute name="IsNonSoldTradeItemReturnable" value="false"/>
<so_product_attribute name="IsPackagingMarkedAsRecyclable" value="false"/>
<so_product_attribute name="IsPackagingMarkedReturnable" value="false"/>
<so_product_attribute name="IsSecurityTagPresent" value="false"/>
<so_product_attribute name="IsTradeItemABaseUnit" value="false"/>
<so_product_attribute name="IsTradeItemAConsumerUnit" value="false"/>
<so_product_attribute name="IsTradeItemADespatchUnit" value="true"/>
<so_product_attribute name="IsTradeItemAnInvoiceUnit" value="false"/>
<so_product_attribute name="IsTradeItemAnOrderableUnit" value="false"/>
<so_product_attribute name="IsTradeItemMarkedAsRecyclable" value="false"/>
<so_product_attribute name="IsTradeItemRecalled" value="false"/>
<so_product_attribute name="IsWoodAComponentOfThisItem" value="false"/>
<so_product_attribute name="ItemStatus-XYZ" value="AC"/>
<so_product_attribute name="ItemType-XYZ" value="cring"/>
<so_product_attribute name="Kit-Ferguson" value="N"/>
<so_product_attribute name="LegacyPartNumber-XYZ" value="PCCR3X-10"/>
<so_product_attribute name="ListPriceUnitOfMeasure" value="EACH CONSUMER UNIT"/>
<so_product_attribute name="LongDescription_en-US" value="ABCD Radiant "/>
<so_product_attribute name="ManufacturerName" value="ABCD"/>
<so_product_attribute name="MFGName" value="ABCD"/>
<so_product_attribute name="NSFLeadFreeListed" value="N"/>
<so_product_attribute name="NSFSafetyListing-XYZ" value="N"/>
<so_product_attribute name="OrderableUnitIndicator" value="EA"/>
<so_product_attribute name="OrderingLeadTimeUnit" value="DA"/>
<so_product_attribute name="OrderQuantityMinimum" value="1"/>
<so_product_attribute name="OrderQuantityMultiple" value="1"/>
<so_product_attribute name="PackageDepth-XYZ" value="5" unit_name="inch"/>
<so_product_attribute name="PackageGTIN-XYZ" value="20840213013657"/>
<so_product_attribute name="PackageHeight-XYZ" value="4" unit_name="inch"/>
<so_product_attribute name="PackageQtyPer-XYZ" value="10"/>
<so_product_attribute name="PackageWeight-XYZ" value="1.5" unit_name="pound"/>
<so_product_attribute name="PackageWidth-XYZ" value="5" unit_name="inch"/>
<so_product_attribute name="PiecesPerTradeItem" value="1"/>
<so_product_attribute name="PiecesPerTradeItemUnit" value="EA"/>
<so_product_attribute name="PIMComplete-XYZ" value="NO"/>
<so_product_attribute name="PriceGroup-XYZ" value="211"/>
<so_product_attribute name="PrimarySite-XYZ" value="A09"/>
<so_product_attribute name="PriorityCode-XYZ" value="C"/>
<so_product_attribute name="ProductLine-XYZ" value="K722"/>
<so_product_attribute name="ProductManager-XYZ" value="BARRETSR"/>
<so_product_attribute name="ProductTypeCode" value="CA"/>
<so_product_attribute name="PurMfg-XYZ" value="M"/>
<so_product_attribute name="ReturnGoodsPolicy" value="CALL_FOR_AUTHORIZATION"/>
<so_product_attribute name="StockStatus-XYZ" value="STK"/>
<so_product_attribute name="TargetAudience-XYZ" value="NS"/>
<so_product_attribute name="Unit of Measure" identifying="true" value="MASTER"/>
<product_association name="Packaging" company="XYZ-US" part_number="81001074" >
<product_association_attribute name="Quantity" value="10">
</product_association_attribute>
</product_association>
</so_product>
<import_data>


Thank you very much in advance.
Last edited by collabxchange on Tue Feb 03, 2015 3:46 pm, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Do you have an XSD that describes that layout? If yes, import the "table definition" from that XSD. Then using any of the XML stage types will be a doddle.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

Yes, I used the above data to generate the XSD which looks like below

<?xml version="1.0"?>
<xs:schema id="import_data" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
<xs:element name="import_data" msdata:IsDataSet="true" msdata:UseCurrentLocale="true">
<xs:complexType>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="so_product">
<xs:complexType>
<xs:sequence>
<xs:element name="so_product_attribute" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:attribute name="name" type="xs:string" />
<xs:attribute name="value" type="xs:string" />
<xs:attribute name="unit_name" type="xs:string" />
<xs:attribute name="identifying" type="xs:string" />
</xs:complexType>
</xs:element>
<xs:element name="product_association" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="product_association_attribute" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:attribute name="name" type="xs:string" />
<xs:attribute name="value" type="xs:string" />
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="name" type="xs:string" />
<xs:attribute name="company" type="xs:string" />
<xs:attribute name="part_number" type="xs:string" />
<xs:attribute name="description" type="xs:string" />
</xs:complexType>
</xs:element>
<xs:element name="product_company_association" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:attribute name="company_name" type="xs:string" />
<xs:attribute name="type" type="xs:string" />
<xs:attribute name="part_number" type="xs:string" />
<xs:attribute name="status" type="xs:string" />
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="company" type="xs:string" />
<xs:attribute name="part_number" type="xs:string" />
<xs:attribute name="category" type="xs:string" />
<xs:attribute name="description" type="xs:string" />
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>


When I import it using the "table Definition", do I import the child elements like "Product Association"? Or just the elements underneath them?

Thank you!
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

First goal --- break this down into its individual components. Read the file and get the data out into rows...... once that is working perfectly, then it will be time to review how to get it back into this structure. You have already stated that you want to perform transformations on the data after it is pulled from the xml. Work on all of that first. Parsing xml has a learning curve in any tooling --- so does construction, each with their own challenges.

As for this xml, it is fairly simple --- share with us how much volume you have (like "2 gigabytes of this every night" or "a few hundred of these, each 1 meg in size", or something else). That could impact which Stage(s) you choose for the Job.

The xml Stage and the use of xsd's is more efficient, but is vastly more difficult to learn when using it the first time. Your xml below would work just fine with the xmlInput Stage also, but if it is really HUGE, I might recommend that you go with the new xml Stage from the beginning.

If you choose the xmlInput Stage, and use the XML importer (import...table definitions....xml...), as an initial test you will want to identify the inner attributes of the so_product_attriubute element. I would check ONLY those two (little reddish orange icons indicate the attributes). Make one of them a key. That resulting table definition will go onto the output link of your xml Input Stage. Also, spend some time in this forum doing searches on the xml stages -- there are hundreds of useful posts, and also review the posts at my blog at www.dsrealtime.com (go to the table of contents for posts on xml, especially how to define your sources).

Your xml has a challenge. It isn't a very nice design. Far better would be if the "metadata" was truly the names of the attributes. For example MFGName is a "value" of an attribute --- having it as <MFGName>..some name</MFGName> would have been a much better xml design. But it's still ok, but you will have to do some creative pivoting for the most functionality. Right now you will get "n" rows, one for each of the different so_product_attribute lines, when really this is a mostly flat record with multiple columns. On the other hand, you could just retrieve the rows as they are and then have a giant if-then-else statement that reviews the value of the "name" attribute before making transformations on the "value" attribute going downstream. This is hard to appreciate until you get your first successful retrieval of the document. Concentrate on that first.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

Hello Ernie,
Thank you very much for the details. The XML file in question is of 1 GB size. Hence, I am using the HierarchicalData Stage (formerly XML Stage). I have already imported the XSD as XML Table Definition like Ray suggested. Now what should I use in the assembly palette to read the data? And on a high level, should the design look like

HD --> Transformer --> XML Output
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Ok...well... there is more to learn. ; )

With the xml stage, you don't import the xsd as a table definition, you import it via the library manager. You will need to do that first.

Then, when it imports successfully, open it (still inside the library manager). Look at the resulting tree. It will have a variety of colored icons. Your inner most repeating group (each so_product_attribute line) will have a blue "list" icon next to it.

Now you open a Stage instance and go to the Assembly editor....there are more details than I can place here, but the real key to getting this right is that when you get to the Output Step, be sure to map the blue "list" icon on the left (for your so_product lines) to the blue "list" icon that represents your output link.....then you can map the two columns and get the same multiple rows I was discussing above in the prior post.

As for how to source from somewhere, see my blog as noted above.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

Not just more but way too much to learn. :)

So, I imported the XSD into the library manager and here is how it looks like.

Image

I have highlighted the the blue "list" icons in yellow.

In the assembly editor, I have 3 steps. Input Step; XML Parser Step; Output Step.

The problem is that in the output step, I can only map it to one of these blue "list" icons. The question is how do I output the columns from all the blue "list" icons list into the single output?

I am still reading your blog and its a great source of information. Thanks.
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Three separate output links. Each list (unless nested) is its own independent set of rows. Of course, if they are nested, just map the "lowest one that you care about", and you can bring the others along for the ride.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

Actually 4 output links as "Product_Association" has another one nested called "Product_Association_Attribute" as shown in the picture.

Now I have 4 output links containing the respective elements from each blue "list" icons. I apply the transformation, etc using the transformer stage(outside HD Stage) and data is cleaned. How do I merge them back together and output it to a XML format in the same layout as the source file?
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

You have to learn how to "compose". Suggest that you start with this redbook....

http://www.redbooks.ibm.com/redbooks/pdfs/sg247987.pdf

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

I read the contents in the link and made some progress but its just not working and I am not sure where it is going wrong.

To summarize,

The XSD structure is below

SO_PRODUCT (Parent P1)
---->SO_PRODUCT_ATTRIBUTE (child C1)
---->PRODUCT_ASSOCIATION (child C2)
--------->PRODUCT_ASSOCIATION_ATTRIBUTE (child of child C2 CC1)
---->PRODUCT_COMPANY_ASSOCIATION (child C33)

My design looks like

Hierarchical Data(Inp) ---> Transformer ---> Hierarchical Data (Out)

I have 5 links coming out of Inp HD for respective parent and child links mentioned above.

In the Out HD, I am using the H-Pivot to join the Parent with each child. But when I run, it none of the fields from child elements in the hierarchical structure if being populated.

The Out HD looks like
Input Step
HJOIN_1 (P1 - C1)
HJOIN_2 (P1 - C2)
HJOIN_3 (P1 - C3)
HJOIN_4 (C2 - CC1)
XML_Composer
Output Step

Do you see anything that I not doing right here?
Post Reply