Reading XML Data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rsiem
Premium Member
Premium Member
Posts: 27
Joined: Thu Aug 02, 2007 10:31 am

Reading XML Data

Post by rsiem »

Hi, I've been following these instructions on how to read in XML data:

http://it.toolbox.com/wiki/index.php/Lo ... rallel_job

When I do that the sequential file stage seems to be parsing the XML file into 513 character blocks. When I hit view data on the sequential file stage it shows this split, and this 513 character split also appears when I run the sequential file stage to a peek stage.

I can't figure out what settings to change so that it'll read the whole thing in.

I'm reading in the terror watch list XML - the start of it looks like this:

<CONSOLIDATED_LIST dateGenerated="2009-12-03T00:00:00">
<INDIVIDUALS>
<INDIVIDUAL>
<DATAID>111345</DATAID>
<VERSIONNUM>11</VERSIONNUM>
<FIRST_NAME>MOUSTAFA</FIRST_NAME>
<SECOND_NAME>ABBES</SECOND_NAME>
<UN_LIST_TYPE>Al-Qaida</UN_LIST_TYPE>
<REFERENCE_NUMBER>QI.A.163.04.</REFERENCE_NUMBER>
<LISTED_ON>2004-03-17T00:00:00</LISTED_ON>
<COMMENTS1>Sentenced to three years and six months of imprisonment by the Tribunal of Naples on 19 May 2005. Released on 30 Jan. 2006 due to an order suspending the sentence. Returned to Algeria where he resides as at Nov. 2008. Review pursuant to Security Council resolution 1822 (2008) was concluded on 28 Sep. 2009.</COMMENTS1>
<NATIONALITY>
<VALUE>Algerian</VALUE>
</NATIONALITY>
<LIST_TYPE>
<VALUE>UN List</VALUE>
</LIST_TYPE>
<LAST_DAY_UPDATED>
<VALUE>2004-11-26T00:00:00</VALUE>
<VALUE>2007-12-21T00:00:00</VALUE>
<VALUE>2008-12-02T00:00:00</VALUE>
</LAST_DAY_UPDATED>
<INDIVIDUAL_ALIAS>
<QUALITY>Good</QUALITY>
<ALIAS_NAME>Mostafa Abbes</ALIAS_NAME>
</INDIVIDUAL_ALIAS>
<INDIVIDUAL_ADDRESS>
<COUNTRY>Algeria</COUNTRY>
<NOTE>(as at Nov. 2008)</NOTE>
</INDIVIDUAL_ADDRESS>
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I personally wouldn't use the Sequential File stage as a source for this but rather what Ernie talks about in his blog post here.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rsiem
Premium Member
Premium Member
Posts: 27
Joined: Thu Aug 02, 2007 10:31 am

Post by rsiem »

Same thing - it partitions ever 513 characters...
rsiem
Premium Member
Premium Member
Posts: 27
Joined: Thu Aug 02, 2007 10:31 am

Post by rsiem »

Just to add, I was looking at my transport environment variables and nothing points to this 513 character splitting.

I tried setting $APT_AUTO_TRANSFER_BLOCK_SIZE to true, and I tried increasing the $APT_FILE_IMPORT_BUFFER_SIZE to 4096 and that didn't make a difference either.

I've also raised up the $APT_DEFAULT_TRANSFER_BLOCK_SIZE by quite a bit and the splitting happens at the same 513 character mark.

The xml header is UTF-8 for the encoding so I have NLS set to UTF-8 for the job.

This file is only 900k - it should suck just fine into memory.
rsiem
Premium Member
Premium Member
Posts: 27
Joined: Thu Aug 02, 2007 10:31 am

Post by rsiem »

Lastly, here's a screenshot of my format settings:

http://imgur.com/rNuYo.png
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

If you are using "ls" in the External Source Stage, it is unlikely to split the document into individual strings. ....meaning that the problem, if you are truly sending in filenames, is somewhere else. Test it separately by just having a transformer in front of the XML stage and send the filename in a dumb string column.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
rsiem
Premium Member
Premium Member
Posts: 27
Joined: Thu Aug 02, 2007 10:31 am

Post by rsiem »

Sorry - I was using cat in my external source stage - I thought external source was the output of a program no?

Anyway, is there any settings I should be looking at for this 513 character splitting that's occurring?

This product is an extreme exercise in frustration; I would have been done by now if I parsed the xml in ruby or python or something...

I'm sure there's a logical reason for this 513 character split , I just can't figure it out :-(
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

'ls' is as much of a 'program' as cat is. The External Source stage just give you a mechanism to bring in data from an 'external source' which (as I understand it) can be anything that delivers information to standard out. And if you were using 'cat' there, to me that implies you weren't following Ernie's advice in his blog posting and that you are sending the XML contents to the XML Input stage.

Before you get too much more frustrated, make sure that all you are providing from the External Source stage are the full pathnames of the XML files to process and that the XML Input stage is set to 'URL/Filepath'. There's nothing in the stage itself that should be setting any kind of '513 character split' so I would imagine it's an artifact of the XML (is it 'formatted'?) compounded by cat'ing it to the stage. Hopefully the switch to URL will straighten that out for you.

Or have you already done that and you're still seeing this "split"? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

...exactly. Stop sending the "contents" and send just the fully qualified "name" of the document into the stage.....

....the cat is doing something strange, or another intermediate stage is doing something strange, or it's a combination of cat/another stage/and some strange characters such as LF or CRLF that are at exactly position 513.

...the problem ought to immediately go away once you use "ls" or other mechanism to send the "name" of the document (and don't forget to change the check box on the input link to URL as Craig notes).

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
pniland
Premium Member
Premium Member
Posts: 7
Joined: Tue Jan 26, 2010 9:00 am

Post by pniland »

Hi, I am new to datastage and I too have been suffering from the 512/3 character problem as above since following the tutorial at http://it.toolbox.com/wiki/index.php/Lo ... rallel_job

Following your advice I have replaced the sequential file stage with an External Source stage and when I right click on the stage and select view data the package just hangs.

I have used the following settings. Please can you point out where I have gone wrong.

Properties Tab
Source
Source Method = Specific Program(s)
Source Program = \\mymachine\subfolder\test.xml

Options
Keep File Partitions = False (default - Not sure what this means )
Reject Mode = Continue (default - Not sure what this means )

Format Tab
Record Level
Record type = Implicit (Taken from above tutorial)

Field Defaults
Delimiter = None (Taken from above tutorial)

Columns Tab
Column name = xml
SQL type = LongVarChar
Length = 99999

Sample XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<storeInformation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<s>
<BRANCH_NO>3061</BRANCH_NO>
<BRANCH_NAME>LEAVESDEN GREEN EXPRESS</BRANCH_NAME>
<ADDRESS_1>11 KATHERINE PLACE</ADDRESS_1>
<ADDRESS_2>COLLEGE ROAD</ADDRESS_2>
<TOWN_CITY>ABERDARE</TOWN_CITY>
<COUNTY>RHONDDA CYNON TAFF</COUNTY>
<POSTCODE>CF44 8DL</POSTCODE>
<COUNTRY>Wales</COUNTRY>
<TELEPHONE>0845 6778997</TELEPHONE>
<STORE_TYPE>Superstore</STORE_TYPE>
</s>
</storeInformation>

Thanks for your help in advance

Paul
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

View data has very little use or meaning in this context. Just send your filename via "ls" (unix list command) over to your XMLInput Stage and start parsing.....view things after they have been extracted.

You can also do some initial tests by sending over a hard coded filename from the RowGenerator or from another file or from a Transformer.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
pniland
Premium Member
Premium Member
Posts: 7
Joined: Tue Jan 26, 2010 9:00 am

Post by pniland »

Dear Ernie,

Thanks for your reply.

I was not entirely sure what you meant in your post being new to datastage.

I have spoken to IBM and they have shared me the following solution which I think may be exactly what you were explaining.

Step1: Create a job parameter of type pathname with the location of your xml file in the default value field.

Step2: Create a row genreator - In the output create a column called trigger with a type of char that is not nullable

Step3: Create a Transformer that that has the name of the paramemter defind in step 1 as the Derivation

Step4: create a xml_input stage

Stage Properties: - General tab
I have Validate input xml and enable grammer caching turned on
Stage Properties: - Transformation Settings tab
Include namesace declation is checked
Input Properties: - XML source tab
Column is the name of the column from the transformer task
URL/File Path is checked

Input Properties: - Columns tab
create column "path" not a key of type VarChar tha is not nullable
output Properties: - Transformation settings tab
Inherit stage properties is checked
output Properties: - columns tab


name - BRANCH_NO
derivation - test.BRANCH_NO
key - yes (other columns set to no)
SQL Type VarChar
Extended - Unicode
Length - 10
Nullable = no
Display = 10
Description /storeInformation/s/BRANCH_NO/text()

The rest should be self explanitory as per the tutorial

Cheers,

Paul
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Yes...that's another way to send in filenames to the xmlInputStage....
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
Sreenivasulu
Premium Member
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

It looks like you need to run the 'dos2unix' command to remove the 'control M' characters

Regards
Sreeni
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Strange characters from "elsewhere" will give grief to XML. Fortuntaely, you generatlly don't have to worry about carriage returns and line feeds and blanks.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
Post Reply