Need best partitioning method for hierarchy mgmt

reachthiru · Post by **reachthiru** » Mon Jan 30, 2006 3:06 pm

Hi,

I have a data file with data like this:

1,a
1-10,t
1-11,u
1-12,v
1-10-20,x
1-10-25,y
1-11-26,z

I need to convert this data to:

1,a,1-10,t,1-10-20,x
1,a,1-10,t,1-10-25,y
1,a,1-11,u,1-11-26,z
1,a,1-12,v,,

For this, I created a job like Sequential -> Sort -> Transformer -> Sequential.

In the Transformer stage, I used stage variables: I store the incoming data in different variables based on the number of occurrences of the hyphen (-), and write only the final-level rows to the output file, together with the stage variables.

My logic would work fine in a server job, but since this is a parallel job I am not getting the desired output. If I change the partitioning method to 'Entire', I get the proper output, but the results are duplicated because the job runs on more than one node.

The other approach we are considering is to use the data file as a lookup as well and build the hierarchy that way. It would work, but it is a little more complex.

Is there any way to get the result using the first method without changing the number of nodes?
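
To make the symptom concrete, here is a toy Python model of the two behaviours (not DataStage code). It assumes the default partitioning amounts to a round-robin split across two nodes, which is an assumption and not something stated in the thread; `round_robin` and `entire` are illustrative names only.

```python
# Toy model of how rows reach the nodes; this is NOT DataStage code.
rows = ["1,a", "1-10,t", "1-11,u", "1-12,v", "1-10-20,x", "1-10-25,y", "1-11-26,z"]

def round_robin(rows, nodes):
    """Deal rows out to the nodes one at a time (assumed default behaviour)."""
    return [rows[i::nodes] for i in range(nodes)]

def entire(rows, nodes):
    """Give every node a full copy of the data, as 'Entire' does."""
    return [list(rows) for _ in range(nodes)]

rr = round_robin(rows, 2)
en = entire(rows, 2)

for node, part in enumerate(rr):
    print("round robin, node", node, part)
for node, part in enumerate(en):
    print("entire, node", node, part)
```

In the round-robin case the second node never sees the root row `1,a`, so its stage variables for level 1 stay empty. With 'Entire', each node sees everything and produces the full result, hence one duplicate copy per extra node.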
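
A rough Python sketch of that lookup idea, for illustration only: it assumes three levels, that a parent code is always the child code minus its last hyphen-separated segment, and that every parent is present in the file. The names `desc`, `parents` and `flatten` are hypothetical.

```python
# Sample rows from the post, as (code, description) pairs.
rows = [("1", "a"), ("1-10", "t"), ("1-11", "u"), ("1-12", "v"),
        ("1-10-20", "x"), ("1-10-25", "y"), ("1-11-26", "z")]

# The same file used as a lookup: code -> description.
desc = dict(rows)

# Every code that appears as the parent of some other code.
parents = {code.rsplit("-", 1)[0] for code in desc if "-" in code}

def flatten(code, levels=3):
    """Build one output row by looking up each ancestor of `code`."""
    parts = code.split("-")
    out = []
    for depth in range(1, levels + 1):
        if depth <= len(parts):
            key = "-".join(parts[:depth])
            out += [key, desc[key]]
        else:
            out += ["", ""]          # pad levels that do not exist
    return ",".join(out)

# Only codes without children become output rows.
leaves = sorted(code for code in desc if code not in parents)
result = [flatten(code) for code in leaves]
print("\n".join(result))
```

Because each output row depends only on its own code plus the lookup, this version does not care how the main stream is partitioned, which is the attraction of the approach despite the extra complexity.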

Thanks in advance.

vmcburney · Post by **vmcburney** » Mon Jan 30, 2006 4:19 pm

Your input and your output are both sequential files, so try making your parallel job run in sequential mode; it won't run that much slower. The easiest way is to create a config file with just one node in it, add the $APT_CONFIG_FILE environment variable to your job, and set it to the one-node config file. You should then get only one instance of your Sort and Transformer stages.

It will run faster than your Entire option because, instead of copying the data to multiple nodes, it processes it on just one node.

kumar_s · Post by **kumar_s** » Mon Jan 30, 2006 11:06 pm

Hi,

As Vincent suggested, use the environment variable.
Otherwise, since the input is a sequential file, maintain the same partition for the Sort and Transformer stages, which will in turn run sequentially.
Can you explain the logic you have in the Transformer, so that we can look for an optimal partitioning for this scenario?

-Kumar

reachthiru · Post by **reachthiru** » Tue Jan 31, 2006 1:54 pm

Hi Vincent & Kumar,

Thanks for your suggestions. I have already done what Vincent suggested. But the job design I described is for testing purposes only; in the real job I need to write all my final data to a table, and I may have to read a million records.

OK, this is my logic.

First I sort my data, so it becomes:

1,a
1-10,t
1-10-20,x
1-10-25,y
1-11,u
1-11-26,z
1-12,v

Then, in the Transformer stage, I have defined stage variables for each level, such as lvl1, lvl1desc, lvl2, etc. As I read each row, I check the number of hyphens (-): if it is 0, the row goes to lvl1; if it is 1, it goes to lvl2; and so on. I write all my stage variables to the output. I also added a constraint that the hyphen count must be 2 (i.e. my final level). So for my data, the output will be:

1,a,,,,
1,a,1-10,t,,
1,a,1-10,t,1-10-20,x (Constraint True)
1,a,1-10,t,1-10-25,y (Constraint True)
1,a,1-11,u,,
1,a,1-11,u,1-11-26,z (Constraint True)
1,a,1-12,v,,
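
For readers who do not have the job in front of them, the logic above can be approximated in Python. This is only a sketch of the description, not the actual Transformer derivations: `transform` and `final_level` are hypothetical names, and clearing the deeper levels when a new parent arrives is inferred from the sample output above.

```python
rows = [("1", "a"), ("1-10", "t"), ("1-11", "u"), ("1-12", "v"),
        ("1-10-20", "x"), ("1-10-25", "y"), ("1-11-26", "z")]

def transform(sorted_rows, final_level=3):
    """Carry the latest code/description per level, like stage variables."""
    levels = [("", "")] * final_level      # lvl1/lvl1desc, lvl2/lvl2desc, ...
    out = []
    for code, desc in sorted_rows:
        depth = code.count("-")            # 0 -> lvl1, 1 -> lvl2, 2 -> lvl3
        levels[depth] = (code, desc)
        # A new parent invalidates whatever was held for deeper levels.
        for d in range(depth + 1, final_level):
            levels[d] = ("", "")
        # Constraint: only rows at the final level are written out.
        if depth == final_level - 1:
            out.append(",".join(field for pair in levels for field in pair))
    return out

# Sort so that each parent row arrives immediately before its children.
ordered = sorted(rows, key=lambda r: tuple(int(p) for p in r[0].split("-")))
result = transform(ordered)
print("\n".join(result))
```

The sketch makes the dependency obvious: each output row relies on rows seen earlier in the same stream, so the parent rows must pass through the same Transformer instance as their children, and in sorted order.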

I hope I have explained my logic well. As I mentioned earlier, I am getting the output I expect; the only problem is that I cannot run it in parallel, and I am looking for the best solution from the gurus.

Thanks.

gpatton · Post by **gpatton** » Wed Feb 01, 2006 8:41 am

How many "root" levels will you have in your hierarchy (in your example, 1)?

You could partition your data based upon values of the "root" level and then run subsets in parallel.
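
As a rough illustration of that idea in Python (not DataStage code): derive the root from each code and send all rows with the same root to the same node. The second root `2` in the sample data and the names `root_of` and `partition_by_root` are made up for the example, and `zlib.crc32` simply stands in for whatever hash the engine would use.

```python
import zlib
from collections import defaultdict

# Sample data; the rows under root "2" are hypothetical, added for illustration.
rows = [("1", "a"), ("1-10", "t"), ("1-10-20", "x"),
        ("2", "b"), ("2-30", "p"), ("2-30-40", "q")]

def root_of(code):
    """The root level is the part of the code before the first hyphen."""
    return code.split("-")[0]

def partition_by_root(rows, nodes):
    """Hash on the root so a whole sub-tree stays on one node."""
    parts = defaultdict(list)
    for code, desc in rows:
        node = zlib.crc32(root_of(code).encode()) % nodes
        parts[node].append((code, desc))
    return parts

parts = partition_by_root(rows, 2)
for node, part in sorted(parts.items()):
    print("node", node, part)
```

Each node can then sort and flatten its own roots independently. If there is only one root value, every row hashes to the same node, so this scheme gains nothing in that case.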

reachthiru · Post by **reachthiru** » Wed Feb 01, 2006 11:13 am

Hi gpatton,

I have only one root node.