XML large files (to be or not to be)

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

alraaayeq
Participant
Posts: 35
Joined: Sun Apr 04, 2004 5:57 am
Location: Riyadh,Saudi Arabia

XML large files (to be or not to be)

Post by alraaayeq »

Hi all;

After trying to deal with large XML files (2 GB and larger), I realised that I am going to have to use Java/C++ instead of DS.
Here is why:

1- Memory limitations and leakage
2- Very slow processing
3- Or the file cannot even be processed successfully

Yet I could not find any confirmation from Ascential that large XML files are beyond DS's capabilities, and still I could not think of any tricks to work around this situation.


So it will not be DS, it will be Java/C++... Any comments?
:?: :?: :?:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you specifically mean processing large XML files with the XML (or Folder) stages in DataStage, then there is an acknowledged problem with processing 'large' files. And from what I understand, large means more like a couple of hundred megabytes or so... 2 gigs would be huge and well beyond the capabilities of the stages. :?

That being said, I don't think this limitation is documented. However, there have been a number of posts here and on ADN on the subject. From what I recall, there is even a semi-official Ascential answer over on ADN acknowledging the issue.

You might want to post this over there and see what comes of it.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jzparad
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Post by jzparad »

Can anyone confirm that there is a 2G limitation on files read by DataStage server?

Is this only for the XML stage?

The initial post on this topic seemed to imply that it was possible but slow.
Jim Paradies
aartlett
Charter Member
Posts: 152
Joined: Fri Apr 23, 2004 6:44 pm
Location: Australia

Post by aartlett »

I've processed standard flat files > 2 GB without problem on a Solaris DS 7.0.1 system.

As for the XML: we hit the same problem and decided to go with the Saxon Java XSLT processor. We wrote some XSL scripts and, bingo, instant extract to pipe-delimited files (well, a few minutes :) ). The extract was much faster than what DataStage was doing, and we HAMMER the system with 15 extracts against 3 XML files going at once.

Let me know if you need more details. We are running under Java 1.2, but I want to upgrade to 1.4+ as it has better memory management (the main problem).

You can also go with the C or Java version of Xerces from Apache. The C version is a little faster, and I found it more robust.

Both of these are free.
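
In case it helps, here is a minimal sketch of how a transform like ours can be driven from Java through the standard JAXP API. It is only an illustration, not our actual scripts: the stylesheet and file names are made up, and the commented-out property is one way to force a Saxon implementation if its jar is on the classpath (otherwise the JDK's default XSLT processor is used).

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlExtract {
    public static void main(String[] args) throws Exception {
        // Uncomment to force Saxon as the JAXP implementation if its jar
        // is on the classpath (the exact class name varies by Saxon release).
        // System.setProperty("javax.xml.transform.TransformerFactory",
        //         "net.sf.saxon.TransformerFactoryImpl");
        TransformerFactory factory = TransformerFactory.newInstance();

        // Compile the stylesheet once, then run it over the large input,
        // writing the result straight to a pipe-delimited output file.
        Transformer t = factory.newTransformer(
                new StreamSource(new File("extract.xsl")));
        t.transform(new StreamSource(new File("big_input.xml")),
                    new StreamResult(new File("extract.psv")));
    }
}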

Now if only the source would stop sending through CTRL-V and CTRL-K in the CDATA fields (both illegal in XML) I'd be much happier.


Andrew
jzparad
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Post by jzparad »

Thanks Andrew.

What I'm trying to do is read data from an Oracle database (9.2) and produce one XML file which will be measured in tens of gigabytes. I know that DataStage uses the Xalan XSLT processor for reading and transforming, but I don't know anything about the way it creates XML documents. Have you tried to create very large files using DataStage XML Pack 2? If so, any comments would be appreciated.
Jim Paradies
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

jzparad wrote:Thanks Andrew.

What I'm trying to do is read data from an Oracle database (9.2) and produce one XML file which will be measured in tens of gigabytes. I know that DataStage uses the Xalan XSLT processor for reading and transforming, but I don't know anything about the way it creates XML documents. Have you tried to create very large files using DataStage XML Pack 2? If so, any comments would be appreciated.
Jim, you're hijacking this thread away from the original poster. :shock: But on the note of your query, you're probably not using the right approach for your volume. You're probably dealing with hundreds of millions of rows of data, so choking that through a Server job (can't tell your OS, release, etc. because it's not your thread :cry: ) is probably not scalable. You're going to have to high-performance spool the output (i.e. NOT in a Server job) and then convert it to XML. For low volumes on a decent machine DS will be all right, but for high volumes you're going to need to get the volume of data out of Oracle, and that requires multiple output streams matched to what your system can handle (probably an output stream per partition, and not more than two streams per CPU). You're going to have to investigate scripted alternatives.
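
To make the second step concrete, here is a rough sketch (one possible approach, not a prescription) of turning an already-spooled pipe-delimited file into XML as a plain character stream, so nothing close to the full tens-of-gigabyte document is ever held in memory. The file names, field layout and element names are invented for the example. Run one copy per spooled partition and stitch the pieces together afterwards if a single document is really required.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

public class SpoolToXml {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("orders_spool.psv"));
        BufferedWriter out = new BufferedWriter(new FileWriter("orders.xml"));
        out.write("<?xml version=\"1.0\"?>\n<orders>\n");

        String line;
        while ((line = in.readLine()) != null) {
            // Hypothetical spool layout: ORDER_ID|CUSTOMER|AMOUNT
            String[] f = line.split("\\|", -1);
            out.write("  <order id=\"" + esc(f[0]) + "\">"
                    + "<customer>" + esc(f[1]) + "</customer>"
                    + "<amount>" + esc(f[2]) + "</amount>"
                    + "</order>\n");
        }

        out.write("</orders>\n");
        out.close();
        in.close();
    }

    // Minimal escaping of the characters XML cannot take literally.
    private static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}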

Of course if you're paid by the hour, and the customer doesn't mind waiting, get all you can. 8)
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
alraaayeq
Participant
Posts: 35
Joined: Sun Apr 04, 2004 5:57 am
Location: Riyadh,Saudi Arabia

Post by alraaayeq »

aartlett wrote: You can also go with the C or Java version of Xerces from Apache. The C version is a little faster, and I found it more robust.

Both of these are free.

Hi Andrew,

I went to http://www.saxonica.com/ and found that Saxon XSLT is not a free product! Am I going to the right page, or was Saxon XSLT not among the free software mentioned above?

Another question, please:
Can you give me a rough estimate of how many records/lines per second these tools produce? I've hit 18,000+ per second :wink:
jzparad
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Post by jzparad »

Jim, you're hijacking this thread away from the original poster.
Sorry about the "hijack" but thanks for your comments. Sounds like I'll be needing a plan B.
Jim Paradies
alraaayeq
Participant
Posts: 35
Joined: Sun Apr 04, 2004 5:57 am
Location: Riyadh,Saudi Arabia

Post by alraaayeq »

kcbland wrote:
Jim, you're hijacking this thread away from the original poster. :shock:
YES he did :lol:, but fortunately he asked the same question that I would like to have answered.

Thanks kcbland for your comments.
alraaayeq
Participant
Posts: 35
Joined: Sun Apr 04, 2004 5:57 am
Location: Riyadh,Saudi Arabia

Post by alraaayeq »

jzparad wrote:Can anyone confirm that there is a 2G limitation on files read by DataStage server?

Is this only for the XML stage?

The initial post on this topic seemed to imply that it was possible but slow.
Personally, as one of Ascential's "HAPPY" customers, I opened a new case (ticket) on 10th January, and not a single response has come to my mail before writing this message!!! :shock: :shock: :shock:

<------------------><-------------------->

PS: on top of Ascential's sick and slow web site, much of the information there is not up to date!
aartlett
Charter Member
Posts: 152
Joined: Fri Apr 23, 2004 6:44 pm
Location: Australia

Post by aartlett »

alraaayeq,
There are two Saxon products, or were when I checked a few days ago: Saxon-A and Saxon-B. One is free, can't remember which :( but it works fine for the XML extraction I'm doing. Don't know what speed it's running at, though.

AA