XML performance: increased volume running very long

FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

XML performance: increased volume running very long

Post by FranklinE »

I'll try to be as brief as possible, so do expect me to respond to requests for more details.

Both jobs have the same general design: XML input, transformers to create mainframe-format rows including headers and trailers, a final funnel and FTP to mainframe.

The input formats are different, but follow the same pattern: single tags with file information (timestamp, record count, etc.) and transaction tag groups of data representing individual items that create the detail records on output. Repeating tags are reliably unique at both levels (file, transaction).
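Roughly, the shape is like this (tag names made up here for illustration, not our actual layout):

Code:

<Batch>
  <FileInfo>
    <Timestamp>2016-05-02T06:00:00</Timestamp>
    <RecordCount>2600</RecordCount>
  </FileInfo>
  <Txn> ...fields for one individual item... </Txn>
  <Txn> ...fields for one individual item... </Txn>
  <!-- one <Txn> group per transaction -->
</Batch>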

One job continues to process very fast, with increases in processing time of seconds. The other job jumped from seconds to minutes. The older data volume was in the few dozens (up to 200 or so) of transactions, and the recent data volume increase was ten-fold. The latest record count was just under 2,600.

I just can't find a difference between the still-fast job and the now-slow job. Ironically, the still-fast job has more branches (two output files, each with a header and trailer) than the now-slow job (one output, header only). They process essentially the same data (a before-and-after sort of thing), though they do have different XML layouts. I don't mind saying that I'm very frustrated that one job just keeps processing as before. :?

I've tried going to single-node and forcing the job to be sequential on every link, and I've spent a few hours with internal support. None of us have found the cause yet. The best advice I can think of is to be told where to look for possible causes. I'm experienced enough to move forward with suggestions.

Basic design:

Code:

External source (filename) --> XMLInput stage (split outputs based on repeating tags) --> transformers to Cobol layouts (some editing for padding and length) --> Funnel (link ordering for header-details-trailer) --> FTP Enterprise

The problem surfaced when FTP was timing out because of the slow throughput; the process was recently changed to write the file out and do the FTP in a separate job.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

For added detail, what is the physical size of the documents that are read by each Job? The number of records each generates is important too, as you have noted, but it would be interesting to know the raw size of each document.

The xmlInput Stage is going to load that whole thing up into memory... one interesting test would be to compare dramatically edited versions of the two Jobs, where they still have their External Source Stage and xmlInput, but their output links just go to dead-end Copy Stages (with no output links).

It's still going to do parsing, but would eliminate most other variables.

Agreed... it sounds like a puzzling one.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! -- https://dsrealtime.wordpress.com/2015/0 ... ere/
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

Thanks for weighing in, Ernie.

File sizes, good job vs. bad job (the fourth pair is where the volume increased and the performance changed):

Code:

Good job    Bad job
   33K         76K
   65K        150K
   45K        101K
  451K       1000K
  690K       1560K

For giggles, I checked the XMLInput stage's Input/Advanced tab and found:
Buffer mode -- default, the following are greyed-out
Max Mem buffer size -- 3145728
Buffer free run percent -- 50
Queue upper bound size -- 0
Disk write increment -- 1048576

That looks like the Designer default settings, and the Output/Advanced tabs are all set the same. After posting this, I'm going to increase the Input settings just to see what happens.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Well... it increased by 50%, but the document sizes overall are absolutely tiny.

So... what else to look for? Hard to say. I'd check for things like:

a) are you doing lots of your own detailed custom xpath or xslt work inside the Stage, or is it just /../../../../text() types of strings in your descriptions on the output link? (See the examples after this list.)

b) are the documents significantly different in structure "type" or style? In other words, is one "all attributes for detail values" (sometimes happens) and the other "all elements"? Mostly: are there odd characteristic differences between them?

c) are they all coming from the same place? (I think you said the External Source Stage reads a list of files... are they all on the same disk?)
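By simple strings I mean output link column descriptions along these lines (tag names are just examples):

Code:

/Batch/FileInfo/Company/text()
/Batch/FileInfo/RecordCount/text()
/Batch/Txn/Id/text()
/Batch/Txn/Amount/text()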

Still, the documents are small... like the other author on one of these threads, I'd probably fire up a Server Job as a quick test just to see what it does by comparison, using a Folder Stage and the exact same output link for your xmlInput (just be sure to change the check box to "xml content" on the input side).

Ernie
Ernie Ostic

blogit!
Open IGC is Here! -- https://dsrealtime.wordpress.com/2015/0 ... ere/
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

Ernie, thanks for the input. I finally found the design problem, by comparing against the other job that continued to work properly.

Briefly put, the XML input has the file-level tag data (what I view as header and/or trailer data) and the individual transactions within the file. There is just one instance of the file-level data (its repeating tag occurs only once) and multiple instances of the transaction repeating tag.

In the "good" job, I have multiple output links from the XMLInput stage, one for each repeating tag. In the "bad" job I forced the input to handle both file-level and transaction-level into one output link, attempting to split them in the transformer stage that came next. The analogy I came up with (while finally not tearing my hair out any more) was doing a full-table scan to find each subsequent transaction record. It bogged down at a fairly low threshold, something my internal support examined and confirmed. Our tests simply never got close to the threshold.

I learned a hard lesson with XML processing. I expect to need to learn quite a bit more. :oops:
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Ok... cool... not sure exactly how you combined the "chunks" into one link, but let us know if you need thoughts on resolving it.

Certainly, with multiple output links, the document only gets loaded once, and then the xpath on each link shreds the rows for each particular node, as desired... perhaps avoiding transformations that were looking through tags manually?
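As a rough stand-in for that behavior (Python purely for illustration, with invented names, not the Stage's actual internals):

Code:

import xml.etree.ElementTree as ET

xml_text = ("<Batch>"
            "<FileInfo><Company>ACME</Company><RecordCount>3</RecordCount></FileInfo>"
            "<Txn><Id>1</Id></Txn><Txn><Id>2</Id></Txn><Txn><Id>3</Id></Txn>"
            "</Batch>")
root = ET.fromstring(xml_text)  # the document is parsed a single time

# Each "output link" then applies its own path to shred its own rows.
header_rows = [(f.findtext("Company"), f.findtext("RecordCount"))
               for f in root.findall("FileInfo")]  # header link: one row
detail_rows = [t.findtext("Id")
               for t in root.findall("Txn")]       # detail link: n rows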

Anyway, glad you found it!

Ernie
Ernie Ostic

blogit!
Open IGC is Here! -- https://dsrealtime.wordpress.com/2015/0 ... ere/
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

The single output link named the transaction repeating tag but also named the file-tag data. I'm not sure exactly what the internal processing looked like, but the result was that the single instance of the file-tag data (company name, record count) was propagated to every instance of the transactions.

One way to look at it, I guess, is that I set the table definition to expect those two file items to be inside the transaction tag data.
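Roughly, the rows coming off that single link looked like this (column names and values invented for illustration):

Code:

Company  RecordCount  TxnId
ACME     2600         0001
ACME     2600         0002
ACME     2600         0003
...

The one instance of the file-tag data repeats on every transaction row.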

For security reasons, I can't show you the actual code. I also don't have time to redact it for posting. Sorry.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872