
XML stage and Pivot stage take most of the job time

Posted: Sun Feb 11, 2007 7:07 am
by waklook
Hi Guys,

We implemented a server job like this:

FolderStage --> transformerStage --> XML inputStage --> transformerStage --> PivotStage --> SequentialFiles + other transformerStages and SequentialFiles

There are around 500,000 XML files per load, and the maximum file size is under 5 MB.

It takes around 24 hours to load all the XML files, 22 of them spent between FolderStage --> transformerStage --> XML inputStage --> transformerStage --> PivotStage.
Is there any way to reduce the loading time to less than 24 hours, or to improve the performance of the whole job?

Thanks and regards
waklook
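For scale, the figures in the post above can be turned into a per-file cost. This is only a back-of-the-envelope sketch in Python (not DataStage code); the function names are made up for illustration, and the inputs are the 500,000 files and 22 hours quoted above.

```python
# Rough throughput implied by the numbers quoted in the post.
FILES_PER_LOAD = 500_000    # "around 500,000 per load"
SLOW_SEGMENT_HOURS = 22     # hours reported for the slow part of the flow


def seconds_per_file(files: int, hours: float) -> float:
    """Average wall-clock seconds spent on one file."""
    return hours * 3600 / files


def files_per_second(files: int, hours: float) -> float:
    """Average number of files completed per second."""
    return files / (hours * 3600)


if __name__ == "__main__":
    print(f"{seconds_per_file(FILES_PER_LOAD, SLOW_SEGMENT_HOURS):.3f} s per file")
    print(f"{files_per_second(FILES_PER_LOAD, SLOW_SEGMENT_HOURS):.1f} files per second")
```

That works out to roughly 0.16 seconds per file, so even a small per-file saving adds up over a full load.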

Posted: Sun Feb 11, 2007 7:41 am
by chulett
Those are two notoriously slow stages, but 24 hours still sounds a wee bit long. However, that *is* a crapload of files to be processing all at once - we think we have a boatload when we get 1/10 that amount. :wink:

Without knowing anything about your server setup or hardware or anything, about all that pops into my head this early in the morning (other than throwing more hardware at it) is splitting the job in two. Have you tried breaking the flow, perhaps landing a file after the XML Input stage and then using a second job for the Pivot and whatnot that follows? Simple enough to try and could help isolate which of the two is the true bottleneck.

Otherwise you may want to have an SA monitor system performance for you during those loads. Since there doesn't seem to be any way to leverage a 'better' JDK for the XML processing (Ernie?) you need to look into other means to help it out. That might be more memory in the machine if it is swapping, that might be leveraging faster disks, or changing striping, journaling or cache options in the file partition you are working in. We did that some time ago on our HP-UX system and made some improvements in XML processing time. Not a magic bullet by any means, but it did help.

Not sure what tweakage would help the Pivot stage... perhaps just memory and CPU. :?

Posted: Sun Feb 11, 2007 12:04 pm
by eostic
Craig is right... those two stages are indeed slow. You "might" have some success by splitting the XMLInput work across multiple stages: take a "chunk" of XML by having a column that holds an entire "node" instead of going all the way down to the /text() level, then send that chunk to another XMLInput Stage for further processing. Then work with inter-process row buffering to get the stages (assuming you are on Server) to run in their own processes. It's hard to tell for sure without a lot more detail about your job and system, though. Those performance settings are not to be dealt with lightly; they also change the behavior of your job in terms of stage processing order. With a perfectly straight path you will probably be OK to experiment with it, but be careful and do a lot of testing.

Ernie
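The "chunk" idea above can be pictured outside DataStage as a two-pass parse. The sketch below is plain Python using the standard `xml.etree.ElementTree` module, not XMLInput stage configuration; the sample document, the `order`/`item` element names and both function names are hypothetical. Pass 1 only cuts out whole nodes as strings, and pass 2 shreds each string into rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, purely for illustration.
SAMPLE = """<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2"><item>pad</item></order>
</orders>"""


def extract_chunks(document: str, node_tag: str) -> list[str]:
    """Pass 1: return every <node_tag> subtree as a serialized XML string."""
    root = ET.fromstring(document)
    return [
        ET.tostring(node, encoding="unicode").strip()
        for node in root.iter(node_tag)
    ]


def parse_chunk(chunk: str) -> list[tuple[str, str]]:
    """Pass 2: shred one chunk into (order id, item text) rows."""
    order = ET.fromstring(chunk)
    return [(order.get("id"), item.text) for item in order.findall("item")]


if __name__ == "__main__":
    for chunk in extract_chunks(SAMPLE, "order"):
        for row in parse_chunk(chunk):
            print(row)
```

Because each chunk is a self-contained string, the second pass can run in a separate process from the first, which is the point of pairing this layout with inter-process row buffering.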

Posted: Sun Feb 11, 2007 4:25 pm
by chulett
Ernie, is it possible to use a different Java runtime with XML jobs? I'm curious how 'hard wired' in the bundled version is.

Thanks Craig and Ernie

Posted: Mon Feb 12, 2007 6:22 am
by waklook
Hi Guys

Thanks a lot, Craig and Ernie.
Here is a little info about our system: the server is HP-UX 11.11 with 8 CPUs, and the clients are Windows XP.
I also have a correction to the flow:
22 hours are spent between FolderStage --> transformerStage --> XML inputStage --> transformerStage --> SequentialFile,
meaning the PivotStage is not included;
the PivotStage and all the other transformations take less than 1 hour.

Would it help to use a Container with Link_Partitioner & Link_Collector? If so, which stages are best to include in the container?

Ernie, I'm not that good with the XML stages; can you explain a little bit more?

thanks all of you guys
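The partitioner/collector question above comes down to splitting the file list into independent buckets, processing each bucket separately, and merging the results. The sketch below shows that shape in plain Python; it is not DataStage configuration. The round-robin rule, the function names and the file names are assumptions for illustration, and the bucket count of 8 simply mirrors the 8 CPUs mentioned above.

```python
def partition_round_robin(paths: list[str], n_partitions: int) -> list[list[str]]:
    """Deal paths into n_partitions buckets, one at a time, like dealing cards."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    buckets: list[list[str]] = [[] for _ in range(n_partitions)]
    for index, path in enumerate(paths):
        buckets[index % n_partitions].append(path)
    return buckets


def collect(buckets: list[list[str]]) -> list[str]:
    """Merge the buckets back into one list (bucket order, not input order)."""
    return [path for bucket in buckets for path in bucket]


if __name__ == "__main__":
    # Hypothetical file names; a real run would list the landing folder.
    files = [f"file_{i:06d}.xml" for i in range(20)]
    for number, bucket in enumerate(partition_round_robin(files, 8)):
        print(number, bucket)
```

Note that the collected output is no longer in the original file order, so anything downstream that depends on row order would need a sort or a key to restore it.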