XML Assembly Performance-Unable to replace HJoin To Regroup

shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

Hello DS Masters,

The requirement is to create an XML file with multiple hierarchies.
Everything worked perfectly until we ran performance testing.
We are doing a lot of H-Join steps inside the Assembly, and that is causing the issue now.
I have tried all the options I could find, such as using "In Memory" and increasing the number of threads, but with no luck.
I came across IBM documentation which suggests doing a DataStage Join outside the XML Assembly and using a Regroup step inside the XML Assembly.

When I try that, I am not able to map the required input columns to my output layout, as they are in different hierarchies.

I will explain with a simple example that has one parent and one child hierarchy.

Dataset 1 (DS1) - EMPLOYEE Dataset
EMPLOYEE_ID, EMPLOYEE_NM, EMPLOYEE_NUMBER

Dataset 2 (DS2) - PHONE_NUMBER Dataset (EMPLOYEE_ID has duplicates)
EMPLOYEE_ID, EMPLOYEE_PH_NUMBER

Previous design :

H-Join:
Parent - DS1
Child - DS2

In the output of the step before the XML Composer, the DS2 H-Join link appeared inside the DS1 parent link, so I was able to map it correctly.

Current Design :

Regroup : DS2
Scope - top (Only thing I could select)
Parent List - EMPLOYEE_ID
Child List - EMPLOYEE_PH_NUMBER
Key - EMPLOYEE_ID

With this design, in the output of the step before the XML Composer, I get DS1 and Regroup:result for DS2 in two different hierarchies.


What needs to be done to make DS2 come inside DS1 using the Regroup step?
Please help.
Regards,
Shank
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Building XML is difficult, using any technology, and it doesn't scream in performance, using any technology either. Using the XML Stage and its Assembly makes writing XML a bit more manageable than requiring the Java skills to do it. It may be enough to be pleased that you are building the required XML successfully.

How much faster do you need it to go? How large is your xml that you are writing? There may need to be other things to consider.

That being said, the join inside the assembly allows you to bring together multiple independent lists....and those joins satisfy the requirements of the assembly editor. I can only imagine a few joins that would work better outside of the Stage, or even be possible....most times you want and need to do the join inside.

Maybe a reverse pivot, then join and then enter the Stage? That might work, but may need another pivot inside the Assembly to rebuild one of your lists. Seems a whole lot more complicated and may even be slower.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Reading this again, it looks like you are using the hjoin to join a regular parent to a repeating child? 1:many? Yes, for that just get the rows perfect in a relational set prior to entering the stage ....so that you have "n" employee phone rows, each with all of the employee "parent" detail repeated.

Then use a regroup for that input link.

Save the hjoin for when you are joining independent groups such as a list of employee phone numbers to a list of employee jobs history and then to a list of employee dependents, etc.
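
The flatten-then-regroup idea above can be pictured outside DataStage. The following is only a rough Python sketch of the data flow, not how the XML Stage is implemented: the sample rows and the helper names `flatten` and `regroup` are hypothetical, and only the column names come from the thread.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data; only the column names come from the thread.
employees = [
    {"EMPLOYEE_ID": 1, "EMPLOYEE_NM": "Ann", "EMPLOYEE_NUMBER": "E001"},
    {"EMPLOYEE_ID": 2, "EMPLOYEE_NM": "Bob", "EMPLOYEE_NUMBER": "E002"},
]
phones = [
    {"EMPLOYEE_ID": 1, "EMPLOYEE_PH_NUMBER": "555-0100"},
    {"EMPLOYEE_ID": 1, "EMPLOYEE_PH_NUMBER": "555-0101"},
    {"EMPLOYEE_ID": 2, "EMPLOYEE_PH_NUMBER": "555-0200"},
]


def flatten(parents, children, key):
    """Join done before the stage: one row per child, parent columns repeated.

    This is an inner join, so parents without children are dropped.
    """
    by_key = {p[key]: p for p in parents}
    rows = [{**by_key[c[key]], **c} for c in children if c[key] in by_key]
    return sorted(rows, key=itemgetter(key))


def regroup(rows, key, parent_cols, child_cols):
    """Single pass over key-sorted rows: each parent once, with its child list."""
    result = []
    for _, group in groupby(rows, key=itemgetter(key)):
        group = list(group)
        parent = {col: group[0][col] for col in parent_cols}
        parent["children"] = [{col: r[col] for col in child_cols} for r in group]
        result.append(parent)
    return result


flat_rows = flatten(employees, phones, "EMPLOYEE_ID")
nested = regroup(
    flat_rows,
    "EMPLOYEE_ID",
    ["EMPLOYEE_ID", "EMPLOYEE_NM", "EMPLOYEE_NUMBER"],
    ["EMPLOYEE_PH_NUMBER"],
)
```

Here `flat_rows` has three rows (one per phone number, with the employee columns repeated), and `nested` has two employees, each carrying its own list of phone numbers. The regroup pass only ever needs the rows for one key at a time, which is why the input has to arrive sorted on that key.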

Ernie
shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

Hi Ernie, Thanks for responding.

I have tried having only one input link to XML Assembly.

DS1
EMPLOYEE_ID, EMPLOYEE_NM, EMPLOYEE_NUMBER,EMPLOYEE_PH_NUMBER

then

Regroup : DS1
Scope : top (again, this is the only option I am able to choose as Scope)

Parent List :
EMPLOYEE_ID, EMPLOYEE_NM, EMPLOYEE_NUMBER

Child List :
EMPLOYEE_PH_NUMBER

Keys : EMPLOYEE_ID


If I do that, in the Mapping Section,

The DS1 Comes under top-->InputLinks-->DS1
The Child List comes under top--> Regroup:result-->DS1

top
-InputLinks
--DS1
---EMPLOYEE_ID
---EMPLOYEE_NM
---EMPLOYEE_NUMBER
---EMPLOYEE_PH_NUMBER
-Regroup:result
--EMPLOYEE_ID
--EMPLOYEE_NM
--EMPLOYEE_NUMBER
--DS1
---EMPLOYEE_PH_NUMBER
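
One way to picture the tree above is as nested data, where every repeating node is a list. The Python sketch below is only an illustration: the sample values and the helper `list_paths` are hypothetical, and the shape is a reading of the mapping tree as described, not output taken from the Assembly.

```python
# Hypothetical picture of the tree as nested Python data.
# Lists stand for repeating nodes; the values are made up.
assembly_output = {
    "top": {
        "InputLinks": {
            "DS1": [  # flat list: one item per input row
                {
                    "EMPLOYEE_ID": 1,
                    "EMPLOYEE_NM": "Ann",
                    "EMPLOYEE_NUMBER": "E001",
                    "EMPLOYEE_PH_NUMBER": "555-0100",
                },
            ],
        },
        "Regroup:result": [  # one item per distinct EMPLOYEE_ID
            {
                "EMPLOYEE_ID": 1,
                "EMPLOYEE_NM": "Ann",
                "EMPLOYEE_NUMBER": "E001",
                "DS1": [  # child list nested under its parent
                    {"EMPLOYEE_PH_NUMBER": "555-0100"},
                ],
            },
        ],
    }
}


def list_paths(node, prefix=""):
    """Return the path of every list (repeating node) in the tree."""
    found = []
    if isinstance(node, dict):
        for name, child in node.items():
            path = f"{prefix}/{name}" if prefix else name
            found.extend(list_paths(child, path))
    elif isinstance(node, list):
        found.append(prefix)
        if node:
            # Inspect the first item to find lists nested inside this one.
            found.extend(list_paths(node[0], prefix))
    return found


repeating_nodes = list_paths(assembly_output)
```

In this picture, the only list that has both the parent columns and the child list beneath it is `top/Regroup:result`; `top/InputLinks/DS1` is a separate branch. That is just a reading of the tree and is not confirmed anywhere in this thread.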

However, the parent record is at the topmost level of the hierarchy in my CFI Layout.

If I choose 'top' as the List for Documentation in the XML Composer -> Mapping, I am able to map the child list without any issues,
but I am not able to map any of the parent list columns since the hierarchy is different.

The CFI Layout requires me to choose 'DS1' (the input link itself) as the List for Documentation in the XML Composer -> Mapping.
But if I do that, I am not able to map Regroup:result-->DS1 for the child list.

Hope I am clear in explaining my issue.
Please help.
Regards,
Shank
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

The XML Stage is hard to describe (meaning --- it is difficult to describe exactly what you are seeing or doing), but we'll get there. What is CFI ?

I have never used an hjoin with a single input link....I reserve its use for special techniques or when I have multiple incoming links, one for each independent sub-node.......so we should be able to get you through this without needing it.

Ernie
shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

Sorry. CFI Layout is the XSD Metadata we configure in the XML Composer Library.
I am not using Regroup when I use Multiple Links.

Currently, the XML Assembly has multiple links and I am using H-Join to create the XML structure I want.
It is not working if the record count is more than 140K in my parent link.
The record count in the child links goes up to 300K.
The XML Assembly now runs for more than 24 hours to create the XML file.

To resolve this performance issue, I am trying to achieve the same result using Regroup.
I am encountering a mapping issue while doing that: I am not able to map the appropriate hierarchies for parent and child.
Regards,
Shank
shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

Also, can you please let me know the significance of the Heap Size (MB) and Stack Size (KB) parameters? Do they have something to do with job performance?
Regards,
Shank
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Let's go back to the beginning.

Is this the only hierarchy that you have in your XML Stage? Maybe it has more columns, but do you only have one nested path to deal with in your output?

...meaning a list of employees.....and beneath that a list of their phone (or other details)?

...and right now, it sounds like you are bringing in two input links for this? One link for the employee parent information and another for the employee detail information? (one:many relationship)?

Ernie
shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

Hi Ernie,

No. I gave an example to post the issue in this forum.
There are 10 Child Input links (Employee-Detail information) and 1 Parent link (Employee Parent information) I am giving to XML Assembly.
10 H-Joins inside the XML Assembly.

Some more information.
I have given this in the XML Assembly stage
Heap size (MB) = 4000
Stack Size (KB) = 4000
Number of threads = 4
Average length of XML string we produce out of Assembly = 24000 bytes (the columns in XML layout is around 350)
Number of records for which we are facing the performance issue = 140 K in the parent Link
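
For a sense of scale, the figures above can be multiplied out. The Python snippet below is back-of-the-envelope arithmetic only. It assumes the 24000-byte average applies to each parent record and that the heap setting is in units of 1024 * 1024 bytes; neither assumption is confirmed in this thread.

```python
# Figures quoted in the post above.
parent_records = 140_000
avg_xml_bytes = 24_000  # assumption: average XML length per parent record
heap_mb = 4000

# Total XML volume produced across all parent records.
total_bytes = parent_records * avg_xml_bytes
total_gib = total_bytes / 1024**3

# Heap setting in bytes, assuming MB here means 1024 * 1024 bytes.
heap_bytes = heap_mb * 1024**2
ratio = total_bytes / heap_bytes

print(f"total XML: {total_bytes:,} bytes ({total_gib:.2f} GiB)")
print(f"total XML as a fraction of heap: {ratio:.2f}")
```

Under those assumptions the composed XML alone comes to 3,360,000,000 bytes, roughly 80% of the configured heap, before counting any intermediate structures the H-Joins hold.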

When I try a smaller stack size (around 500 KB), the job seems to perform somewhat better even with this many H-Joins.

Do these parameters have something to do with performance?
I don't want to just pick random values that work for now and blow up later with a higher volume of records.
Please shed some light on these parameters and their significance.

Thanks for your help and valuable time.
Regards,
Shank
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Not a whole lot... those properties most often come into play only when the Job dies, and you increase them just to get it to run successfully.

......based on what we've seen, it's possible that the heap size might assist with your H-Joins, but the stack and thread setting is probably best left alone. I'd probably at least start with some tests that leave those as the defaults and play with the heap setting.

Another thing to consider is to break up the Composing. I haven't done this in awhile, but there may be benefit in doing some of your Joins in one Stage, and the others in another...and then bringing them together in a third.

Ernie
shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

Very sorry for the very late response.
We used trial and error to determine the values for heap size and stack size. Now the job is running fine for the current production volume.
We are also working on a workaround for the even higher volumes which we expect to hit in a month or two. I will keep this post updated if we find a suitable workaround.
Regards,
Shank
shank
Participant
Posts: 18
Joined: Wed Mar 25, 2009 3:11 am

Post by shank »

The workaround we found for this is replacing the XML Assembly with the XML Transformer stage. (We had to write a lot of transformation code to achieve it, though!)

Now the jobs do not fail at all, even for much higher volumes.
Regards,
Shank
adamfrank321
Premium Member
Posts: 2
Joined: Thu Nov 04, 2010 2:29 pm

Post by adamfrank321 »

shank,

Would you mind sharing the values for heap and stack size that you found worked best?