Reading data from an XML file
Posted: Fri Nov 04, 2005 6:00 am
by ThilSe
Hi,
I want to read the data from an XML file.
I created a job like the one below.
Folder Stage ---> XMLInput ---> Transformer ---> SeqFile
The input XML file is
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<a> ASK </a>
<b> BSK </b>
</root>
I need to extract the value enclosed in the tags and write into the file.
Op Reqd:
When I imported the metadata from this file, I got the following metadata:
Column   SQLType        Description
root     Unknown        /root
a        Varchar(255)   /root/a/#PCDATA
b        Varchar(255)   /root/b/#PCDATA
I have set 'b' as key.
When I execute the job, I get the following error:
Unexpected token!pattern = '#PCDATA'(Unknown URI, 50, 34)
Remaining tokens: ('#PCDATA')
Please guide me in this issue.
Thanks
Senthil
Posted: Fri Nov 04, 2005 8:07 am
by chulett
Not an XML expert by any stretch, but that XPath information it imported looks wrong. Try a couple of things.
Change the XPath bits in the Description field from #PCDATA to just text() and see if that works. Also, only select a field as a Key if it is a repeating element; if you always just get simple pairs like that, you shouldn't need to mark either of them as a key.
Give that a shot.
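For anyone who wants to check what /root/a/text() should return outside DataStage, here is a minimal Python sketch. It is only an illustration, not what the XML Input stage does internally: it uses the standard library's xml.etree.ElementTree, which has no text() step, so findtext() stands in for it. The function name extract_values is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample document from the first post in this thread.
XML_DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<root>
<a> ASK </a>
<b> BSK </b>
</root>"""


def extract_values(xml_bytes):
    """Return the character data of <a> and <b> under <root> (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)  # root is the <root> element
    # findtext("a") reads the text that the XPath /root/a/text() would select
    # for a simple element; strip() drops the padding spaces around the value.
    return {tag: (root.findtext(tag) or "").strip() for tag in ("a", "b")}


print(extract_values(XML_DOC))  # {'a': 'ASK', 'b': 'BSK'}
```

If this prints the values you expect, the XPath itself is fine and the problem is in the imported metadata.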
Posted: Fri Nov 04, 2005 8:22 am
by gpatton
You should use the #PCDATA tag.
Make sure the tag is fully qualified.
Do not set the key until you write the file in the output of the transformer.
Posted: Fri Nov 04, 2005 8:48 am
by chulett
gpatton wrote: Make sure the tag is fully qualified.
You should probably explain what that means, g.
Posted: Sat Nov 05, 2005 5:16 am
by ThilSe
Chulett,
Change the XPath bits in the Description field from #PCDATA to just text()
I tried using text() instead of #PCDATA. It runs successfully.
I thank all of you for your inputs and time!
Thanks
Senthil
Posted: Sat Nov 05, 2005 8:37 am
by chulett
Still curious what the #PCDATA tag is supposed to mean.
Posted: Sun Nov 06, 2005 9:49 pm
by ThilSe
Hi,
PCDATA means parsed character data.
It is the text found between the start tag and the end tag of an XML element. This text will be parsed by a parser.
e.g.
<Details>
<name>Senthil</name>
<address>
<street>10 th main road</street>
<city>Chennai</city>
</address>
</Details>
If <address> is defined as #PCDATA and the tags <city> and <street> are defined, then <city> and <street> will be parsed by the XML parser and expanded.
If <address> is defined as CDATA, then
<street>10 th main road</street><city>Chennai</city>
will be treated as text. The tags <street> and <city> will not be identified by the XML parser.
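The difference can be seen with a small Python sketch. It assumes that "defined as CDATA" means wrapping the content in a CDATA section (<![CDATA[ ... ]]>) inside the document, which is one reading of the explanation above; in a DTD, #PCDATA is a content-model keyword rather than a tag. The helper name describe_address is invented for the example.

```python
import xml.etree.ElementTree as ET

# Same <address> content twice: once as ordinary markup, once inside a CDATA section.
PARSED = (
    "<Details><address><street>10 th main road</street>"
    "<city>Chennai</city></address></Details>"
)
LITERAL = (
    "<Details><address><![CDATA[<street>10 th main road</street>"
    "<city>Chennai</city>]]></address></Details>"
)


def describe_address(xml_text):
    """Return (child tag names, direct text) of <address> (hypothetical helper)."""
    address = ET.fromstring(xml_text).find("address")
    child_tags = [child.tag for child in address]
    return child_tags, address.text


# Ordinary markup: the parser sees two child elements and no direct text.
print(describe_address(PARSED))   # (['street', 'city'], None)
# CDATA section: no child elements, the markup arrives as one literal string.
print(describe_address(LITERAL))
```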
Hope this clarifies.
More info can be found at
http://www.w3schools.com/dtd/dtd_building.asp
Thanks
Senthil
Posted: Mon Nov 07, 2005 4:46 pm
by aartlett
Senthil,
People here may think I'm not a large advocate of the DataStage XML system, and they'd probably be right. I like it for very small amounts of data, or for data coming in as a feed rather than from a static source.
My preference in your situation would be an XSLT translator. This allows you to create your seq. files directly from the XML without DataStage at all. Saxon is one I have used, and there are others out there for most platforms.
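To show the general idea of going straight from XML to a delimited file with no DataStage job in between, here is a rough Python sketch. It is not XSLT and not what Saxon does; it is a plain standard-library stand-in for the same kind of conversion. The function name xml_to_delimited, the column list, and the "|" delimiter are all assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_delimited(xml_bytes, columns, delimiter="|"):
    """Flatten the named child elements of the root into one delimited line."""
    root = ET.fromstring(xml_bytes)
    # One output field per requested column; missing elements become empty strings.
    row = [(root.findtext(col) or "").strip() for col in columns]
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerow(row)
    return buffer.getvalue()


sample = b"<root><a> ASK </a><b> BSK </b></root>"
print(xml_to_delimited(sample, ["a", "b"]), end="")  # ASK|BSK
```

An XSLT stylesheet with text output would express the same mapping declaratively, which is what makes it practical to generate the stylesheets from the DDL.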
If you do need to use the D/S XML then the previous suggestions should get you going. The metadata handling is one of the reasons I really dislike the D/S XML.
<<end of transmission>>
Posted: Mon Nov 07, 2005 6:13 pm
by vmcburney
You are spot on: the XML Input and XML Output stages sit in the "Real Time" folder for a reason; they are better at handling small volumes. Good suggestion on the XSLT translator, I will have to check it out.
Posted: Mon Nov 07, 2005 8:34 pm
by aartlett
Vince,
Have a look at the XSLT tools on the Apache web site. I think they were supplied in part by IBM. The licence allows commercial use so long as no money is charged further on (it's either the GPL or the Apache one, I can't remember).
The XML stages are great for a MQ feed, like you said, real time.
On my last gig I changed 45 jobs running D/S XML, which ran for 2.5 - 3 hours, to Java-based XSLT (I used some awk scripts and the DDL to create the XSLT files), running 15 at a time for an end-to-end time of 20 minutes out of 3 XML files. CPU ran at about 95% (a wasted CPU cycle is a lost CPU cycle). This could have been reduced if I had used the C++ version, but I couldn't get the admins to load the libraries I needed, while with the Java version I could do it myself.