XML2 Reader and ">" Character

vmcburney · Post by **vmcburney** » Mon Jul 05, 2004 10:36 pm

It is a confusing definition. I read it as saying that > must either be replaced by &gt; or escaped as &gt;, and that a > on its own is invalid.

The only thing I can suggest is that you either process this file as a sequential file, parsing the XML into text strings, or pre-process it first, finding and replacing embedded > characters with &gt;. You could use a routine that reads the sequential file and replaces all instances of > that occur between the <answer> and </answer> tags.
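The find-and-replace pre-processing suggested above could look something like the following. This is only a minimal sketch, written in Python rather than DataStage BASIC. It assumes the element really is named answer, that it carries no attributes, and that it holds only text with no nested child elements. The function name escape_gt_in_answers is hypothetical.

```python
import re

# Assumption: the data lives in <answer>...</answer> elements that have
# no attributes and no nested child elements.
ANSWER_RE = re.compile(r"(<answer>)(.*?)(</answer>)", re.DOTALL)


def escape_gt_in_answers(xml_text):
    """Replace every bare '>' inside <answer> element text with '&gt;'."""

    def fix(match):
        open_tag, body, close_tag = match.groups()
        # '&gt;' itself contains no '>' character, so running this twice is harmless.
        return open_tag + body.replace(">", "&gt;") + close_tag

    return ANSWER_RE.sub(fix, xml_text)
```

Text outside the answer elements, including the tags themselves, is left untouched.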

Or better yet change the application that generates the XML.

DaveBaumann · Post by **DaveBaumann** » Wed Jul 07, 2004 3:52 am

Unfortunately this is a data file supplied to us, so we are at the mercy of the quality of the data sent to us. I'd thought about a pre-process routine, and we may eventually have to do this if we don't get an update from the supplier. However, I was hoping there were some options in the XML2 reader stage that could circumvent this. I guess not.

Thanks,
Dave

wdudek · Post by **wdudek** » Thu Jul 08, 2004 11:38 am

The > character is definitely not valid by itself as data in an XML file. Open the XML file containing it in Internet Explorer and it will complain that the file is not formed properly.

ray.wurlod · Post by **ray.wurlod** » Thu Jul 08, 2004 5:01 pm

... and, on that basis, you go back to the providers of the XML file and complain bitterly to them that what they're sending to you is not well-formed XML and therefore impossible to process, and require that they do send you well-formed XML.

It's why we have standards!

DaveBaumann · Post by **DaveBaumann** » Tue Jul 13, 2004 11:32 am

wdudek wrote: The > character is definitely not valid by itself as data in an XML file. Open the XML file containing it in Internet Explorer and it will complain that the file is not formed properly.

The issue being that some newer parsers don't complain about ">" appearing inside an element's data, as they can quite happily identify that the first ">" after "<" must have been the close of a tag and that any others are just part of the data. "<", on the other hand, must always be written as "&lt;" within data, since any reader would obviously interpret a bare "<" as the start of a tag.

In fact, if you try it in Internet Explorer you'll see that this is one such reader that will quite happily handle ">" within the data (well, with IE6 as I'm using here it seems to).
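One quick way to see how a particular parser treats a bare ">" is to feed it a small sample. The sketch below uses Python's standard xml.etree.ElementTree module purely as an illustration. It says nothing about how the XML2 reader stage behaves, and the helper name is_well_formed is hypothetical. For what it is worth, the XML 1.0 specification only requires ">" to be escaped in content when it would complete the sequence "]]>", whereas a bare "<" is never allowed in character data.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if this parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A bare '>' in element data is accepted and read back unchanged.
accepted = is_well_formed("<answer>5 > 3</answer>")
parsed_text = ET.fromstring("<answer>5 > 3</answer>").text

# A bare '<' in element data is rejected.
rejected = is_well_formed("<answer>5 < 3</answer>")
```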

ray.wurlod wrote: ... and, on that basis, you go back to the providers of the XML file and complain bitterly to them that what they're sending to you is not well-formed XML and therefore impossible to process, and require that they do send you well-formed XML.

It's why we have standards!

We have. And in their case they are just using other XML writers that output in this fashion.

The curious thing is that if I go back to the XML 1 pack, the reader appears to handle ">" within data fine!!

DaveBaumann · Post by **DaveBaumann** » Fri Jul 16, 2004 7:59 am

In case anyone is interested and hits a similar issue, here's a routine I've written to circumvent the issue.

It takes in the whole of the XML file data as an input parameter string (P_InputData), so it can be placed in a transformer following a Folder stage, and returns a string in which any ">" characters within element data are written as "&gt;".

Code: Select all

* Pre-size the output buffer: each ">" that does not close a tag grows by 3 characters (">" becomes "&gt;")
L_InputDataLen = Len(P_InputData)
L_NoExtraGts = Count(P_InputData, ">") - Count(P_InputData, "<")
L_OutputData = Space(L_InputDataLen + (L_NoExtraGts * 3))

L_OutChar = 1
L_TagOpen = @FALSE
For i = 1 to L_InputDataLen

  L_CurrChar = P_InputData[i,1]

  * A "<" always starts a tag
  If L_CurrChar = "<" Then
    L_TagOpen = @TRUE
  End
  
  * A ">" seen while no tag is open is element data, so escape it
  If L_CurrChar = ">" AND L_TagOpen = @FALSE Then
    L_OutputData[L_OutChar,4] = "&gt;"
    L_OutChar = L_OutChar + 4
  End Else
    L_OutputData[L_OutChar,1] = L_CurrChar
    L_OutChar = L_OutChar + 1
  End

  * The first ">" after a "<" closes the tag
  If L_CurrChar = ">" AND L_TagOpen = @TRUE Then
    L_TagOpen = @FALSE
  End

Next i

Ans = L_OutputData

(If anyone spots any bugs or potential optimisations, please point them out. Note that this doesn't cater for ">" within attribute data, although I would hope that wouldn't occur in most XML specifications)
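For anyone who wants to try the logic outside DataStage, the same state machine can be sketched in Python. This is an illustrative translation, not the routine itself. The function name escape_gt_outside_tags is hypothetical, and it shares the limitation noted above: a ">" inside an attribute value is treated as the end of the tag.

```python
def escape_gt_outside_tags(xml_text):
    """Escape every '>' that does not close a tag opened by '<'."""
    out = []
    in_tag = False
    for ch in xml_text:
        if ch == "<":
            in_tag = True            # a '<' always starts a tag
        if ch == ">" and not in_tag:
            out.append("&gt;")       # stray '>' in element data
        else:
            out.append(ch)
        if ch == ">" and in_tag:
            in_tag = False           # first '>' after '<' closes the tag
    return "".join(out)
```

Collecting the pieces in a list and joining them once at the end avoids having to pre-calculate the output length, which the BASIC version does with Space().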

kduke · Post by **kduke** » Fri Jul 16, 2004 11:10 am

Why not

Code: Select all

NewLine = change(OldLine, "<", "&lt;")
NewLine = change(NewLine, ">", "&gt;")

Some people prefer ereplace. Change is the same function.

tonystark622 · Post by **tonystark622** » Fri Jul 16, 2004 12:06 pm

Kim,

Will that leave the '<' and '>' in the tags alone? Or change them too? It looks to me like your method will replace the tags.

Tony

kduke · Post by **kduke** » Fri Jul 16, 2004 1:53 pm

I do this before I turn it into XML or HTML. I have BASIC code which generates HTML. Otherwise, how do you separate valid tags from code-generated tags?

If you look at my KgdGenHtmlRoutines job posted on my web site and ADN then you can see how it was implemented.
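The difference between escaping the data before the markup is added and running a blanket replacement afterwards can be shown with a short sketch. It is written in Python using the standard xml.sax.saxutils.escape function rather than DataStage BASIC, and the helper name make_element and the sample values are made up for illustration.

```python
from xml.sax.saxutils import escape


def make_element(tag, text):
    """Escape the data first, then wrap it in tags."""
    return "<{0}>{1}</{0}>".format(tag, escape(text))


# Escaping before the tags exist leaves the tags intact.
good = make_element("answer", "5 > 3 & 2 < 4")

# A blanket replacement on finished markup rewrites the tag delimiters too.
broken = "<answer>5 > 3</answer>".replace("<", "&lt;").replace(">", "&gt;")
```

Here good keeps its answer tags, while broken no longer contains any tags at all.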

tonystark622 · Post by **tonystark622** » Fri Jul 16, 2004 2:25 pm

Ah. You are, as always, a gentleman and a scholar! :D

Thanks,
Tony

DaveBaumann · Post by **DaveBaumann** » Sun Jul 18, 2004 1:12 pm

kduke wrote: I do this before I turn it into XML or HTML.

In this instance we are using it because the XML2 reader can't cope with the XML data that's being sent to us.

kduke · Post by **kduke** » Mon Jul 26, 2004 3:37 pm

Tony

I looked and I did not post this code. So here it is.

Code: Select all

            * Count the leading spaces so the indentation survives in HTML
            done = @false
            SpaceCnt = 0
            for x=1 to len(tmpLine) until done
               check = tmpLine[x,1]
               if check=" " then
                  SpaceCnt += 1
               end else
                  done = @true
               end
            next x
            * Swap each leading space for a non-breaking space
            if SpaceCnt > 0 then
               tmpLine = trim(tmpLine,' ','L')
               for x=1 to SpaceCnt
                  tmpLine = "&nbsp;" : tmpLine
               next x
            end
            * Escape markup characters and turn line ends into <br>
            tmpLine = change(tmpLine, "<", "&lt;")
            tmpLine = change(tmpLine, ">", "&gt;")
            tmpLine = change(tmpLine, char(13):char(10), "<br>")
            * Lines that look like an .htm file name become links
            if field(trim(tmpLine), '.', 2)[1,3] = 'htm' then
               tmpLine = '<A href="':trim(tmpLine):'">':field(trim(tmpLine), '.', 1):'</A>'
            end
            * Keep empty lines visible
            if tmpLine="" then
               tmp := tmpLine:"&nbsp;"
            end else
               tmp := tmpLine
            end

Dave, I was thinking only about output. You could do this before the data is turned into XML or HTML, but if it came in that way then you have broken XML or HTML because of the embedded tags.
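The core of the line formatting above (preserve the indentation, escape the markup characters, keep blank lines visible) can be sketched in Python as follows. This is a loose illustration rather than a translation: the function name to_html_line is hypothetical, the .htm link handling is left out, and html.escape also escapes "&", which the BASIC code does not.

```python
import html


def to_html_line(line):
    """Format one line of source code for display in an HTML page."""
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)            # number of leading spaces
    body = html.escape(stripped, quote=False)     # escapes &, < and >
    body = body.replace("\r\n", "<br>")           # CR LF becomes a line break
    result = "&nbsp;" * indent + body
    return result if result else "&nbsp;"         # keep empty lines visible
```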

This code is from where I generate HTML documentation for routines. The jobs and routines are posted on my web site and ADN. It will loop through the code and wrap html around it. I duplicated what Ascential had done in the job documentation. They did not have anything for routines. I copied the logic out of genroutinedocs or something like that. It was posted on ADN as well. The ASCL post would only do one job. I wanted to document all jobs or just one category as well. I wanted a simple index with links to all the jobs or routines. So it creates a directory and generates html for all the jobs and then one html page called all_index.html with links to all the other pages. It also does the same for routines in a different directory. I think I made the directory KimD/Jobs/20040726 for today. So you could get a snapshot of how the jobs looked on a given day. The ASCL code is super fast. It will document hundreds of jobs in seconds. Same for routines.

If it had a where-used section then you would not need DwNav or MetaStage. Not really, but it is very nice html. It really does need a table lookup section. DwNav has generated similar documentation for a while now. It did not look as pretty until last week. Now it looks the same. I do not want to make money on DwNav. I just need it to fill a hole when I think DataStage or MetaStage lacks something that I want or need. So I wanted the html to all look the same so I could use both together. Anyway, DwNav has an index by table name. So every table name or hash file name has a link to the jobs that read or write to it. That is my idea of where used. The table names are sorted first by what Reporting Assistant calls OLETYPE. This is a stage type like CHashInput or CSeqInput. They are ugly names but I think most of us can figure it out. I thought about building a lookup to change this into something more readable. When I get time I guess. Next version.

On my web site there is a link to the html documentation generated by these jobs. There is a link right above the zip file link. So you can get an idea if you like the way it looks.

Since I downloaded the DataStageBackup.bat file, I like the YYYYMMDD folder concept. I am using it in this documentation stuff. The idea is this. You promote once a month or twice or whatever. You run the backup script, the DSaveAsBmp.bat and the documentation jobs and you take a snapshot of things before and after you change them. Sort of an ETL audit trail.

I am not sure if there is much more I could do to tie all these pieces together except to tie in Version Control. I think if you named your batches YYYYMMDD then that might help. I think we need Craig or Byron, one of the VC experts to figure that out. Trying to build some kind of best practices automated software lifecycle.

What do you think? Waste of time? Overkill?