Reading data from an XML file

ThilSe · Post by **ThilSe** » Fri Nov 04, 2005 6:00 am

Hi,

I want to read the data from an XML file.

I created a job like the one below.

Folder Stage ---> XMLInput ---> Transformer ---> SeqFile

The input XML file is:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
<a> ASK </a>
<b> BSK </b>
</root>

I need to extract the values enclosed in the tags and write them to the file.
Required output:

ASK
BSK

When I imported the metadata from this file, I got the following metadata:

Column   SQLType        Description
root     Unknown        /root
a        Varchar(255)   /root/a/#PCDATA
b        Varchar(255)   /root/b/#PCDATA

I have set 'b' as key.

When I execute the job, I get the following error:

Unexpected token!pattern = '#PCDATA'(Unknown URI, 50, 34)
Remaining tokens: ('#PCDATA')

Please guide me on this issue.

Thanks
Senthil

chulett · Post by **chulett** » Fri Nov 04, 2005 8:07 am

Not an XML expert by any stretch, but that XPath information it imported looks wrong. Try a couple of things.

Change the XPath bits in the Description field from #PCDATA to just text() and see if that works. Also, only select a field as a Key if it is a repeating element. If you always just get simple pairs like that, you shouldn't need to mark either of them as a key.

Give that a shot.
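
For anyone following along outside DataStage, here is a minimal Python sketch of what a path like /root/a/text() selects: the text node inside the element. It assumes the sample file from the first post. Python's standard-library ElementTree does not evaluate a trailing text() step, so the sketch reads the equivalent .text attribute instead. The function name extract_values is made up for illustration, and this shows the XPath idea only, not how the XML Input stage works internally.

```python
import xml.etree.ElementTree as ET

# Sample document from the first post (XML declaration omitted for brevity).
SAMPLE = """<root>
<a> ASK </a>
<b> BSK </b>
</root>"""


def extract_values(xml_text, tags=("a", "b")):
    """Return the stripped text content of each named child of the root.

    Hypothetical helper for illustration; not part of DataStage.
    """
    root = ET.fromstring(xml_text)
    values = []
    for tag in tags:
        element = root.find(tag)  # relative path, like /root/<tag>
        if element is not None and element.text is not None:
            # .text is the leading text node, i.e. what text() would select.
            values.append(element.text.strip())
    return values


if __name__ == "__main__":
    for value in extract_values(SAMPLE):
        print(value)
```

Run as a script, this prints ASK and BSK on separate lines, which matches the required output above. By contrast, #PCDATA is DTD vocabulary rather than an XPath step, which would be consistent with the "Unexpected token" error.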

gpatton · Post by **gpatton** » Fri Nov 04, 2005 8:22 am

You should use the #PCDATA tag.

Make sure the tag is fully qualified.

Do not set the key until you write the file in the output of the transformer.

chulett · Post by **chulett** » Fri Nov 04, 2005 8:48 am

gpatton wrote: Make sure the tag is fully qualified.

You should probably explain what that means, g.

ThilSe · Post by **ThilSe** » Sat Nov 05, 2005 5:16 am

Chulett,

chulett wrote: Change the XPath bits in the Description field from #PCDATA to just text()

I tried using text() instead of #PCDATA. It runs successfully.

I thank all of you for your inputs and time!

Thanks
Senthil

chulett · Post by **chulett** » Sat Nov 05, 2005 8:37 am

Still curious what the #PCDATA tag is supposed to mean.

ThilSe · Post by **ThilSe** » Sun Nov 06, 2005 9:49 pm

Hi,

PCDATA means parsed character data.

It is the text found between the start tag and the end tag of an XML element. This text will be parsed by the parser.

For example:
<Details>
<name>Senthil</name>
<address>
<street>10 th main road</street>
<city>Chennai</city>
</address>
</Details>

If <address> is defined as #PCDATA and the tags <street> and <city> are defined, then the tags <street> and <city> will be parsed by the XML parser and expanded.

If <address> is defined as CDATA, then

<street>10 th main road</street><city>Chennai</city>

will be treated as text. The tags <street> and <city> will not be recognized by the XML parser.
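
A small Python sketch of the difference, using the standard-library ElementTree parser rather than DataStage. It assumes the CDATA case is written as a CDATA section (<![CDATA[ ... ]]>), which is one way to make a parser treat markup as plain text. The helper name describe is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Normal (parsed) content: the parser sees <street> and <city> as elements.
PARSED = (
    "<address>"
    "<street>10 th main road</street><city>Chennai</city>"
    "</address>"
)

# Same characters wrapped in a CDATA section: the parser treats them as text.
UNPARSED = (
    "<address><![CDATA["
    "<street>10 th main road</street><city>Chennai</city>"
    "]]></address>"
)


def describe(xml_text):
    """Return (child tag names, direct text) for the root element.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    return [child.tag for child in root], root.text


if __name__ == "__main__":
    print(describe(PARSED))
    print(describe(UNPARSED))
```

In the first case the root has two child elements and no direct text. In the second case it has no children, and the whole string, tags included, comes back as the element's text.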

Hope this clarifies.

More info can be found at
http://www.w3schools.com/dtd/dtd_building.asp

Thanks
Senthil

aartlett · Post by **aartlett** » Mon Nov 07, 2005 4:46 pm

Senthil,
People here may think I'm not a big advocate of the DataStage XML system, and they'd probably be right. I like it for very small amounts of data, or for data coming in as a feed rather than from a static source.

My preference in your situation would be an XSLT translator. This allows you to create your sequential files directly from the XML without DataStage at all. Saxon is one I have used, and there are others out there for most platforms.

If you do need to use the D/S XML then the previous suggestions should get you going. The metadata handling is one of the reasons I really dislike the D/S XML.

<<end of transmission>>

vmcburney · Post by **vmcburney** » Mon Nov 07, 2005 6:13 pm

You are spot on: the XML Input and XML Output stages sit in the "Real Time" folder for a reason. They are better at handling small volumes. Good suggestion on the XSLT translator, I will have to check it out.

aartlett · Post by **aartlett** » Mon Nov 07, 2005 8:34 pm

Vince,
Have a look at the XSLT tools on the Apache web site. I think they were supplied in part by IBM. The licence allows commercial use so long as no money is charged further on (it's either the GPL or the Apache one, I can't remember).

The XML stages are great for an MQ feed, like you said, real time.

On my last gig I changed 45 jobs running D/S XML, which ran for 2.5 to 3 hours, to Java-based XSLT (I used some awk scripts and the DDL to create the XSLT files). They ran 15 at a time, for an end-to-end time of 20 minutes across 3 XML files. CPU ran at about 95% (a wasted CPU cycle is a lost CPU cycle). This could have been reduced if I had used the C++ version, but I couldn't get the admins to load the libraries I needed, whereas with the Java version I could do it myself.