Dealing with XHTML Files

Aquilis · Post by **Aquilis** » Tue Jun 08, 2010 6:58 am

Hello All,
Has anybody worked with XHTML data files? I was working with hierarchical XML before, but the client has now come up with XHTML files. I have never worked with these before.
So can anybody elaborate on the possible issues with XHTML compared to XML?
1. I can see that the XHTML files are very bulky compared to simple XML files, since they are a combination of XML and HTML.

I developed a simple job to explore XHTML, but I am ending up with the following error.

XML input document parsing failed. Reason: Xalan fatal error (publicId: , systemId: , line: 0, column: 0): An exception occurred! Type:RuntimeException, Message:The primary document entity could not be opened.

I have tried most things but have not been able to get it working. Does anybody have any suggestions?

chulett · Post by **chulett** » Tue Jun 08, 2010 7:06 am

Can't imagine it is supported in any way. I wonder if there are any XHTML -> XML converters out there?

eostic · Post by **eostic** » Tue Jun 08, 2010 10:47 am

From everything I can tell by looking at samples on the web, it's nothing but pure XML... the tags are very html-like, meaning that they don't convey "metadata," they convey formatting... but you should very definitely be able to pull out the bits and pieces that you want.
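
As a rough illustration of the point that XHTML can be read like any other XML, here is a minimal Python sketch using the standard library parser rather than a DataStage job. The sample document and the helper name `list_items` are made up for this example; the only XHTML-specific detail is the standard XHTML namespace on the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration only.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Staff</title></head>
  <body>
    <h1>Employees</h1>
    <ul>
      <li><b>Alice</b></li>
      <li><b>Bob</b></li>
    </ul>
  </body>
</html>"""

# XHTML elements live in this namespace, so lookups must be qualified.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def list_items(xhtml_text):
    """Return the text of every <li> element, ignoring inline format tags."""
    root = ET.fromstring(xhtml_text)
    return ["".join(li.itertext()).strip() for li in root.findall(".//x:li", NS)]


print(list_items(XHTML))
```

The values come out fine, but note that the element names (`ul`, `li`, `b`) say nothing about what the data means.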

The real problem is that the "repeating" elements are format tags... no structural consistency, and their ordering and repetition are simply based on how the author wanted things to "look", not how they are related to each other... so it may be a messy Job with lots of output links, one for each of the deeper repeating format elements that you might like to pull.

It's pretty ugly though... I can't imagine why anyone would use this over a stylesheet with xml for their data.

Your error above is a normal xml error. Unlikely to have anything to do with the fact that you will be reading an XHTML document vs a regular XML document. How are you picking up the document from disk?

Ernie

Aquilis · Post by **Aquilis** » Wed Jun 09, 2010 8:22 am

Ernie,
Thanks for sharing the information.

You were right, I am using the regular XML Table Definitions approach for the import. Was that wrong?

eostic · Post by **eostic** » Thu Jun 10, 2010 6:10 am

Absolutely... but as I said, be careful... you are going to end up with all kinds of table definitions, depending on the creativity of the author of your xhtml document. Every repeating list or other item within the "body" might need its own link and its own repeating element. Could get ugly, but is certainly do-able. The problem is that xhtml isn't going to care if you have "employees" at the top of the page and "automobiles" listed at the bottom... if it uses the same xhtml tags to treat these as "bold lists" (for example), then they will simply be an unrelated pair of repeating groups with formatting. Pretty useless. I'd see if the data also exists in non-formatted regular xml.
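
The "employees" and "automobiles" situation can be sketched in Python as follows. The document, the values, and the helper name `repeating_groups` are all invented for illustration; the point is only that both lists come back as anonymous groups, distinguishable by position and nothing else.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two unrelated "bold lists" marked up with identical tags.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul><li><b>Alice</b></li><li><b>Bob</b></li></ul>
    <p>Some text in between.</p>
    <ul><li><b>Sedan</b></li><li><b>Pickup</b></li></ul>
  </body>
</html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}


def repeating_groups(xhtml_text):
    """Return one list of item texts per <ul>, in document order."""
    root = ET.fromstring(xhtml_text)
    groups = []
    for ul in root.findall(".//x:ul", NS):
        # Only direct <li> children belong to this particular list.
        items = ["".join(li.itertext()).strip() for li in ul.findall("x:li", NS)]
        groups.append(items)
    return groups


for index, group in enumerate(repeating_groups(XHTML)):
    # Nothing in the markup says which group holds people and which holds cars.
    print(index, group)
```

Each group here would roughly correspond to its own output link and repeating element, and any labeling of the groups has to come from outside the document.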

Ernie