XML2 Reader and ">" Character

vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

It is a confusing definition. I read it as saying that > should either be replaced by &gt; or escaped as &gt;, and that a > on its own is invalid.

The only thing I can suggest is that you either process this file as a sequential file, parsing the XML as text strings, or pre-process it first, finding and replacing embedded > characters with &gt;. You could use a routine that reads the sequential file and replaces all instances of > that occur between the <answer> and </answer> tags.

Or better yet change the application that generates the XML.
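
A minimal sketch of the pre-processing idea above, written in Python rather than DS Basic purely for illustration. It assumes the element is literally named <answer>, carries no attributes, and contains only text (no child elements); the function name escape_gt_in_answers is made up for this example.

```python
import re

# Assumption: the element is written exactly as <answer>...</answer>,
# with no attributes and no nested child elements.
ANSWER_BODY = re.compile(r"(<answer>)(.*?)(</answer>)", re.DOTALL)


def escape_gt_in_answers(xml_text):
    """Replace bare '>' with '&gt;' inside <answer> bodies only."""
    def fix(match):
        open_tag, body, close_tag = match.groups()
        # An existing '&gt;' contains no '>' character, so it is left alone.
        return open_tag + body.replace(">", "&gt;") + close_tag

    return ANSWER_BODY.sub(fix, xml_text)


print(escape_gt_in_answers("<q><answer>a > b</answer></q>"))
```

Everything outside the <answer> bodies passes through unchanged, so the tags themselves are never touched.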
DaveBaumann
Participant
Posts: 7
Joined: Thu Oct 23, 2003 10:00 am

Post by DaveBaumann »

Unfortunately this is a data file supplied to us, so we are at the mercy of the quality of the data we are sent. I'd thought about a pre-process routine, and we may eventually have to do this if we don't get an update from the supplier. However, I was hoping there were some options in the XML2 reader stage that might circumvent this; I guess not. :(

Thanks,
Dave
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

The > character is definitely not valid by itself as data in an XML file. Open the XML file containing it in Internet Explorer and it will complain that it is not well formed.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

... and, on that basis, you go back to the providers of the XML file and complain bitterly to them that what they're sending to you is not well-formed XML and therefore impossible to process, and require that they do send you well-formed XML.

It's why we have standards!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by DaveBaumann »

wdudek wrote: The > character is definitely not valid by itself as data in an XML file. Open the XML file containing it in Internet Explorer and it will complain that it is not well formed.
The issue is that some newer parsers don't complain about ">" appearing within an element's data, as they can quite happily identify that the first ">" after a "<" must be the close of a tag and that any others are just part of the data. "<" must always be written as "&lt;" within data, since any reader would obviously interpret it as the start of a tag.

In fact, if you try it in Internet Explorer you'll see that this is one such reader that will quite happily handle ">" within the data (well, with IE6 as I'm using here it seems to).
ray.wurlod wrote: ... and, on that basis, you go back to the providers of the XML file and complain bitterly to them that what they're sending to you is not well-formed XML and therefore impossible to process, and require that they do send you well-formed XML.

It's why we have standards!
We have. And in their case they are just using other XML writers that output in this fashion.

The curious thing is that if I go back to the XML 1 pack, the reader appears to handle ">" within data fine!
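
The difference between ">" and "<" in element data can be checked against any parser you have to hand. The sketch below uses Python's standard library parser (not DataStage) only as one example of a parser; the helper name is_well_formed is made up here, and the result says nothing about how the XML2 reader stage behaves.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A bare ">" in element data is accepted by this parser.
print(is_well_formed("<answer>a > b</answer>"))     # True
# The escaped form is accepted as well.
print(is_well_formed("<answer>a &gt; b</answer>"))  # True
# A bare "<" in element data is rejected.
print(is_well_formed("<answer>a < b</answer>"))     # False
```

This matches my reading of the XML 1.0 recommendation, which requires "<" to be escaped in content but only requires ">" to be escaped where it would complete the sequence "]]>".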

Post by DaveBaumann »

In case anyone is interested and hits a similar issue, here's a routine I've written to circumvent the issue.

It takes the whole of the XML file data as an input parameter string (P_InputData), so it can be placed in a Transformer following a Folder stage, and returns a string in which any ">" characters within element data are replaced with "&gt;".

Code: Select all

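* Pre-size the output buffer: each stray ">" grows by 3 characters when it becomes "&gt;".
* Assumes every "<" opens a tag, so the surplus of ">" over "<" is the number of stray ">".
L_InputDataLen = Len(P_InputData)
L_NoExtraGts = Count(P_InputData, ">") - Count(P_InputData, "<")
L_OutputData = Space(L_InputDataLen + (L_NoExtraGts * 3))

L_OutChar = 1
L_TagOpen = @FALSE
For i = 1 to L_InputDataLen

  L_CurrChar = P_InputData[i,1]

  * A "<" always starts a tag
  If L_CurrChar = "<" Then
    L_TagOpen = @TRUE
  End

  * A ">" seen outside a tag is data, so write the 4-character entity instead
  If L_CurrChar = ">" AND L_TagOpen = @FALSE Then
    L_OutputData[L_OutChar,4] = "&gt;"
    L_OutChar = L_OutChar + 4
  End Else
    L_OutputData[L_OutChar,1] = L_CurrChar
    L_OutChar = L_OutChar + 1
  End

  * The first ">" after a "<" closes the tag
  If L_CurrChar = ">" AND L_TagOpen = @TRUE Then
    L_TagOpen = @FALSE
  End

Next i

Ans = L_OutputData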
(If anyone spots any bugs or potential optimisations, please point them out. Note that this doesn't cater for ">" within attribute data, although I would hope that wouldn't occur in most XML specifications)
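
For anyone who wants to try the logic outside DataStage, here is a rough Python port of the same state machine. It is only a sketch for experimenting with test strings, not a replacement for the routine; the function name escape_gt_in_text is made up for this example. The second call shows the attribute limitation mentioned above.

```python
def escape_gt_in_text(xml_text):
    """Escape '>' seen outside a tag, treating the first '>' after '<' as the tag close."""
    out = []
    in_tag = False
    for ch in xml_text:
        if ch == "<":
            in_tag = True
        if ch == ">" and not in_tag:
            # Stray '>' in element data
            out.append("&gt;")
        else:
            out.append(ch)
        if ch == ">" and in_tag:
            in_tag = False
    return "".join(out)


print(escape_gt_in_text("<a>x > y</a>"))
# Limitation: a '>' inside an attribute value is taken as the tag close,
# so the real closing '>' is escaped and the markup is damaged.
print(escape_gt_in_text('<a t="1>0">x</a>'))
```

The same blind spot applies to ">" inside comments and CDATA sections, which this approach also treats as ordinary tags.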
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Why not

Code: Select all

NewLine = Change(OldLine, "<", "&lt;")
NewLine = Change(NewLine, ">", "&gt;")

Some people prefer Ereplace. Change is the same function.
Mamu Kim
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Kim,

Will that leave the '<' and '>' in the tags alone? Or change them too? It looks to me like your method will replace the tags.

Tony

Post by kduke »

I do this before I turn it into XML or HTML. I have BASIC code which generates HTML. How do you separate valid tags from code-generated tags?

If you look at my KgdGenHtmlRoutines job, posted on my web site and ADN, you can see how it was implemented.
Last edited by kduke on Fri Jul 16, 2004 4:40 pm, edited 1 time in total.

Post by tonystark622 »

Ah. You are, as always, a gentleman and a scholar! :D

Thanks,
Tony

Post by DaveBaumann »

kduke wrote:I do this before I turn it into XML or HTML.
In this instance we are using it because the XML2 reader can't cope with the XML data that's being sent to us.

Post by kduke »

Tony

I looked and I did not post this code. So here it is.

Code: Select all

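* Count the leading spaces on the line
done = @false
SpaceCnt = 0
for x=1 to len(tmpLine) until done
   check = tmpLine[x,1]
   if check=" " then
      SpaceCnt += 1
   end else
      done = @true
   end
next x
* Swap each leading space for a non-breaking space so the HTML keeps the indentation
if SpaceCnt > 0 then
   tmpLine = trim(tmpLine,' ','L')
   for x=1 to SpaceCnt
      tmpLine = "&nbsp;" : tmpLine
   next x
end
* Escape angle brackets so the code is displayed rather than read as tags
tmpLine = change(tmpLine, "<", "&lt;")
tmpLine = change(tmpLine, ">", "&gt;")
tmpLine = change(tmpLine, char(13):char(10), "<br>")
* Turn a line that looks like an .htm file name into a link
if field(trim(tmpLine), '.', 2)[1,3] = 'htm' then
   tmpLine = '<A href="':trim(tmpLine):'">':field(trim(tmpLine), '.', 1):'</A>'
end
* Emit a non-breaking space for an empty line so it still takes up room
if tmpLine="" then
   tmp := tmpLine:"&nbsp;"
end else
   tmp := tmpLine
end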

Dave, I was thinking only about output. You could do this before the data is turned into XML or HTML, but if it arrives that way then you have broken XML or HTML because of the embedded tags.
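
The "escape before you wrap it in tags" idea can be sketched as follows. This is Python rather than BASIC, for illustration only; make_element is a made-up helper, while escape is the standard library function from xml.sax.saxutils, which replaces "&" before "<" and ">".

```python
from xml.sax.saxutils import escape


def make_element(tag, text):
    """Build <tag>text</tag>, escaping the text before it is wrapped in markup."""
    return "<{0}>{1}</{0}>".format(tag, escape(text))


print(make_element("answer", "a < b & c > d"))
# prints <answer>a &lt; b &amp; c &gt; d</answer>
```

Because the data is escaped while it is still separate from the markup, there is no need to work out afterwards which angle brackets are tags and which are data.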

This code is when I generate HTML documentation for routines. The jobs and routines are posted on my web site and ADN. It will loop through the code and wrap html around it. I duplicated what Ascential had done in the job documentation. They did not have anything for routines. I copied the logic out of genroutinedocs or something like that. It was posted on ADN as well. The ASCL post would only do one job. I wanted to document all jobs or just one category as well. I wanted a simple index with links to all the jobs or routines. So it creates a directory and generates html for all the jobs and then one html page called all_index.html with links to all the other pages. It also does the same for routines in a different directory. I think I made the directory KimD/Jobs/20040726 for today. So you could get a snapshot of how the jobs looked on a given day. The ASCL code is super fast. It will document hundreds of jobs in seconds. Same for routines.

If it had a where used section then you would not need DwNav or MetaStage. Not really but it is very nice html. It really does need a table lookup section. DwNav has generated similar documentation for a while now. It did not look as pretty until last week. Now it looks the same. I do not want to make money on DwNav. I just need it to fill a hole when I think DataStage or MetaStage lacks something that I want or need. So I wanted the html to all look the same so I could use both together. Anyway DwNav has an index by table name. So all table names or hash file names have a link to the job that reads or writes to it. That is my idea of where used. The table names are sorted first by what Reporting Assistant calls OLETYPE. This is a stage type like CHashInput or CSeqInput. They are ugly names but I think most of us can figure it out. I thought about building a lookup to change this into something more readable. When I get time I guess. Next version.

On my web site there is a link to the html documentation generated by these jobs. There is a link right above the zip file link. So you can get an idea if you like the way it looks.

Since I downloaded the DataStageBackup.bat file, I like the YYYYMMDD folder concept. I am using it in this documentation stuff. The idea is this. You promote once a month or twice or whatever. You run the backup script, the DSaveAsBmp.bat and the documentation jobs and you take a snapshot of things before and after you change them. Sort of an ETL audit trail.

I am not sure if there is much more I could do to tie all these pieces together except to tie in Version Control. I think if you named your batches YYYYMMDD then that might help. I think we need Craig or Byron, one of the VC experts to figure that out. Trying to build some kind of best practices automated software lifecycle.

What do you think? Waste of time? Overkill?