Reading data from an XML file

Posted: Fri Nov 04, 2005 6:00 am
by ThilSe
Hi,

I want to read the data from an XML file.

I created a job like the one below.

Folder Stage ---> XMLInput ---> Transformer ---> SeqFile

The input XML file is

<?xml version="1.0" encoding="UTF-8" ?>
<root>
<a> ASK </a>
<b> BSK </b>
</root>
I need to extract the values enclosed in the tags and write them to the file.

Required output:

ASK
BSK

When I imported the metadata from this file, I got the following metadata:

Column    SQLType         Description
root      Unknown         /root
a         Varchar(255)    /root/a/#PCDATA
b         Varchar(255)    /root/b/#PCDATA

I have set 'b' as the key.

When I execute the job, I get the following error:
Unexpected token!pattern = '#PCDATA'(Unknown URI, 50, 34)
Remaining tokens: ('#PCDATA')
Please guide me on this issue.

Thanks
Senthil

Posted: Fri Nov 04, 2005 8:07 am
by chulett
Not an XML expert by any stretch, but that XPath information it imported looks wrong. Try a couple of things.

Change the XPath bits in the Description field from #PCDATA to just text() and see if that works. Also, only select a field as a Key if it is a repeating element. If you always just get simple pairs like that, you shouldn't need to mark either of them as a key.

Give that a shot.

Posted: Fri Nov 04, 2005 8:22 am
by gpatton
You should use the #PCDATA tag.

Make sure the tag is fully qualified.

Do not set the key until you write the file in the output of the transformer.

Posted: Fri Nov 04, 2005 8:48 am
by chulett
gpatton wrote: Make sure the tag is fully qualified.
You should probably explain what that means, g.

Posted: Sat Nov 05, 2005 5:16 am
by ThilSe
Chulett,
chulett wrote: Change the XPath bits in the Description field from #PCDATA to just text()
I tried using text() instead of #PCDATA. It runs successfully.
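
For anyone who wants to check the expected output outside DataStage, here is a minimal Python sketch of the same extraction. It is not how the XML Input stage works internally, just an illustration under a few assumptions: the function name extract_values is made up for this example, the standard library's ElementTree is used (its XPath subset has no text() step, so the element's .text is read instead, which corresponds to /root/a/text() and /root/b/text()), and surrounding whitespace is stripped, which the stage itself may or may not do.

```python
import xml.etree.ElementTree as ET

# The sample file from the first post, as bytes so the encoding
# declaration is honoured by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<root>
<a> ASK </a>
<b> BSK </b>
</root>"""


def extract_values(xml_bytes):
    """Return the text of <a> and <b> under <root> (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    values = []
    for tag in ("a", "b"):
        # findtext("a") on the root element plays the role of /root/a/text().
        text = root.findtext(tag, default="")
        # Assumption: strip the padding spaces around the value.
        values.append(text.strip())
    return values


if __name__ == "__main__":
    # One value per line, as in the required output.
    print("\n".join(extract_values(SAMPLE)))
```

Run on the sample above, this gives ASK and BSK on separate lines.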

I thank all of you for your inputs and time!

Thanks
Senthil

Posted: Sat Nov 05, 2005 8:37 am
by chulett
Still curious what the #PCDATA tag is supposed to mean. :?

Posted: Sun Nov 06, 2005 9:49 pm
by ThilSe
Hi,

PCDATA means parsed character data.

It is the text found between the start tag and the end tag of an XML element. This text will be parsed by a parser.

e.g.
<Details>
<name>Senthil</name>
<address>
<street>10 th main road</street>
<city>Chennai</city>
</address>
</Details>

If <address> is defined as #PCDATA and the tags <street> and <city> are defined, then the tags <street> and <city> will be parsed by the XML parser and expanded.

If <address> is defined as CDATA, then
<street>10 th main road</street><city>Chennai</city>
will be treated as text. The tags <street> and <city> will not be recognized by the XML parser.
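
The difference can be seen with a small Python sketch. It rests on an assumption about what "defined as CDATA" means here: the example wraps the content of <address> in a CDATA section (<![CDATA[ ... ]]>), which is one common way to make a parser treat markup as plain text. The helper name describe is hypothetical, and the parser is the standard library's ElementTree rather than anything DataStage uses.

```python
import xml.etree.ElementTree as ET

# <address> with ordinary (parsed) content: the child tags are markup.
PARSED = (
    "<address><street>10 th main road</street>"
    "<city>Chennai</city></address>"
)

# <address> whose content sits in a CDATA section: the same characters
# are handed over as text and are not interpreted as tags.
CDATA = (
    "<address><![CDATA[<street>10 th main road</street>"
    "<city>Chennai</city>]]></address>"
)


def describe(xml_text):
    """Return (child tag names, direct text) of the root element."""
    elem = ET.fromstring(xml_text)
    return [child.tag for child in elem], elem.text


if __name__ == "__main__":
    print(describe(PARSED))  # two child elements, no direct text
    print(describe(CDATA))   # no child elements, markup kept as text
```

In the first case the parser reports street and city as child elements; in the second it reports no children and returns the whole string, tags included, as the text of <address>.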

Hope this clarifies.

More info can be found at
http://www.w3schools.com/dtd/dtd_building.asp


Thanks
Senthil

Posted: Mon Nov 07, 2005 4:46 pm
by aartlett
Senthil,
People here may think I'm not a big advocate of the DataStage XML system, and they'd probably be right :). I like it for very small amounts of data, or for data coming in as a feed rather than from a static source.

My preference in your situation would be an XSLT translator. This allows you to create your seq. files directly from the XML without DataStage at all. Saxon is one I have used, and there are others out there for most platforms.

If you do need to use the D/S XML then the previous suggestions should get you going. The metadata handling is one of the reasons I really dislike the D/S XML.

<<end of transmission>>

Posted: Mon Nov 07, 2005 6:13 pm
by vmcburney
You are spot on. The XML Input and XML Output stages sit in the "Real Time" folder for a reason: they are better at handling small volumes. Good suggestion on the XSLT translator, I will have to check it out.

Posted: Mon Nov 07, 2005 8:34 pm
by aartlett
Vince,
Have a look at the XSLT processors on the Apache web site. I think they were supplied in part by IBM. The licence allows commercial use so long as no money is charged further on (it's either the GPL or the Apache one, I can't remember).

The XML stages are great for an MQ feed, like you said, real time.

On my last gig I converted 45 jobs using D/S XML, which ran for 2.5 to 3 hours, to Java-based XSLT (I used some awk scripts and the DDL to create the XSLT files). Running 15 at a time, the end-to-end time was 20 minutes for 3 XML files. CPU ran at about 95% (a wasted CPU cycle is a lost CPU cycle). This could have been reduced if I had used the C++ version, but I couldn't get the admins to load the libraries I needed, whereas with the Java version I could do it myself.