Link Partitioner/Link Collector

Post questions here related to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Link Partitioner/Link Collector

Post by JDionne »

I have a job that uses the Link Partitioner/Link Collector to speed up processing of a large sequential file. It has three partitions for the file. This job worked fine in the past until I added extra data cleansing to it. Now the only way that I can get the job to run without the error "timeout waiting for mutex" is by deleting one of the three partitions. It doesn't seem to matter which one I delete; it seems it can only process the file with two now. Is this a true limitation? Am I doing so much that it can't do it with three partitions, or is there some other technical thing that I am doing wrong?
Jim
Sure I need help... But who doesn't?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You would find that you actually run faster if you split your job into something like this:

Code: Select all

seq.in --->  xfm1 ---> seq1.out

seq.in --->  xfm2 ---> seq2.out

seq.in --->  xfm3 ---> seq3.out
Put a constraint in xfm1 of MOD(@INROWNUM,3)=0, in xfm2 use =1, and in xfm3 use =2. Each stream then gets every third row of the source file, offset by one row from the others. You'll use three CPUs in this design.

Put an after-job routine that does a "copy seq1.out + seq2.out + seq3.out seq_all.out" to recombine the data.
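On Windows, that recombine step could be wired in as the built-in ExecDOS after-job subroutine, with a command value along these lines (paths are illustrative; the /b switch forces a binary copy, since an ASCII-mode concatenation appends a Ctrl-Z end-of-file character to the result):

Code: Select all

copy /b C:\etl\seq1.out + C:\etl\seq2.out + C:\etl\seq3.out C:\etl\seq_all.out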

The link collector sucks.

Now, I've given you an alternative with fixed parallelism, just like your original job design. But if you instead used:

Code: Select all

seq.in --->  xfm ---> seq_#PartitionNumber#.out
with a constraint of

Code: Select all

MOD(@INROWNUM,PartitionCount)=PartitionNumber-1
you could use job instantiation and run gobs of parallel job clones (divide and conquer), then simply recombine the result sets when they're done. Just set PartitionCount to the total number of instances you're going to run, and give each instance a different PartitionNumber from 1 through PartitionCount.
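For example, the clones could be kicked off from the command line with the dsjob client; the project and job names here are invented, and the exact flag spellings should be checked against your release's dsjob documentation:

Code: Select all

dsjob -run -param PartitionCount=3 -param PartitionNumber=1 MyProject MyJob.part1
dsjob -run -param PartitionCount=3 -param PartitionNumber=2 MyProject MyJob.part2
dsjob -run -param PartitionCount=3 -param PartitionNumber=3 MyProject MyJob.part3

Each MyJob.partN is a separate invocation of the same multi-instance job, so all three run concurrently.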

If 3 isn't enough, and you've got lots of CPUs to spare, then up the number. Think of Agent Smith from Matrix 2 (MORE ME).
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Post by JDionne »

kcbland wrote:You would find that you actually run faster if you split your job into something like this:

Code: Select all

seq.in --->  xfm1 ---> seq1.out

seq.in --->  xfm2 ---> seq2.out

seq.in --->  xfm3 ---> seq3.out
Put a constraint in xfm1 of MOD(@INROWNUM,3)=0, in xfm2 use =1, and in xfm3 use =2. Each stream then gets every third row of the source file, offset by one row from the others. You'll use three CPUs in this design.

Put an after-job routine that does a "copy seq1.out + seq2.out + seq3.out seq_all.out" to recombine the data.

The link collector sucks.

Now, I've given you an alternative with fixed parallelism, just like your original job design. But if you instead used:

Code: Select all

seq.in --->  xfm ---> seq_#PartitionNumber#.out
with a constraint of

Code: Select all

MOD(@INROWNUM,PartitionCount)=PartitionNumber-1
you could use job instantiation and run gobs of parallel job clones (divide and conquer), then simply recombine the result sets when they're done. Just set PartitionCount to the total number of instances you're going to run, and give each instance a different PartitionNumber from 1 through PartitionCount.

If 3 isn't enough, and you've got lots of CPUs to spare, then up the number. Think of Agent Smith from Matrix 2 (MORE ME).

I'll need time to run through this. I'll get back to you.
Jim
Sure I need help... But who doesn't?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Ken's approach is the way to go. As for the mutex problem, I see it every time I combine QualityStage stages and collectors in the same job. I try to fix it by increasing the memory cache size in the project properties, but it is better if you can avoid the combination.
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Post by JDionne »

vmcburney wrote:Ken's approach is the way to go. As for the mutex problem, I see it every time I combine QualityStage stages and collectors in the same job. I try to fix it by increasing the memory cache size in the project properties, but it is better if you can avoid the combination.
I'm running through his way now... I hope to be able to claim success soon.
Jim
Sure I need help... But who doesn't?
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Post by JDionne »

JDionne wrote:
vmcburney wrote:Ken's approach is the way to go. As for the mutex problem, I see it every time I combine QualityStage stages and collectors in the same job. I try to fix it by increasing the memory cache size in the project properties, but it is better if you can avoid the combination.
I'm running through his way now... I hope to be able to claim success soon.
Jim
Well, it almost worked... The copy command corrupts the file; it messes up a row's end-of-line character. The combined file will not load to the database, but the segmented files will. Looks like I'm going back to my partitioners/collectors :(
Jim
Sure I need help... But who doesn't?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

NOOO! You probably have "omit last newline" checked in the Sequential File stage definition! Just uncheck it. If you have two files with no final newline delimiter and you concatenate them, the last row of the first file and the first row of the second become one huge line. Omitting the last newline should be a rarely used option in a Windows/Unix world.
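If you want to see the failure mode in isolation, here is a small cmd demonstration (file names invented; the set /p line is just a trick for writing text without a trailing newline):

Code: Select all

echo first_row> file1.out
>>file1.out <nul set /p junk=second_row_no_newline
echo third_row> file2.out
copy /b file1.out + file2.out fused.out
type fused.out

The last line of file1.out and the first line of file2.out print as one row: second_row_no_newlinethird_row.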

You've got to trust me on this. Jeez, I've probably written 5000 jobs and trained 200 developers/consultants/clients.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Post by JDionne »

kcbland wrote:NOOO! You probably have "omit last newline" checked in the Sequential File stage definition! Just uncheck it. If you have two files with no final newline delimiter and you concatenate them, the last row of the first file and the first row of the second become one huge line. Omitting the last newline should be a rarely used option in a Windows/Unix world.

You've got to trust me on this. Jeez, I've probably written 5000 jobs and trained 200 developers/consultants/clients.
I have never doubted you... I'm just in a bit of a rush right now. I have a new month of data that I have to load, and my target window is one business day. I don't have time to fight DS :( I have gotten a good file after rewriting the old job. I'm going to get it loaded and, in the downtime, look at what you suggested.
Jim
Sure I need help... But who doesn't?
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

JDionne wrote:My target window is one business day. I don't have time to fight DS :(
Well, on the bright side, one business day is far too little time to develop a C/C++ version of this type of job.

Relax. When things don't look right, take a sip of your favorite beverage and then look at your jobs again. Make sure you are doing things right. More often than not, it's one little stupid thing that was overlooked. That little stupid thing is harder to find in C/C++ than in DataStage, and there are usually more of them there.

Output data all whacked out? Check the output stage.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Post by JDionne »

Teej wrote:
JDionne wrote:My target window is one business day. I don't have time to fight DS :(
Well, on the bright side, one business day is far too little time to develop a C/C++ version of this type of job.

Relax. When things don't look right, take a sip of your favorite beverage and then look at your jobs again. Make sure you are doing things right. More often than not, it's one little stupid thing that was overlooked. That little stupid thing is harder to find in C/C++ than in DataStage, and there are usually more of them there.

Output data all whacked out? Check the output stage.

-T.J.

Sigh... I hate being the "expert" in something without having the training to do it. It's also irritating having deadlines on top of that learning curve, but I have made it through worse in less time. I did get a reprieve this time: the data that was sent to me is incorrect, so I don't have to load it now :)
I'll keep banging my head against this. I'm sure I'll be in contact.
Jim
Sure I need help... But who doesn't?
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Re: Link Partitioner/Link Collector

Post by shawn_ramsey »

JDionne wrote:I have a job that uses the Link Partitioner/Link Collector to speed up processing of a large sequential file. It has three partitions for the file. This job worked fine in the past until I added extra data cleansing to it. Now the only way that I can get the job to run without the error "timeout waiting for mutex" is by deleting one of the three partitions. It doesn't seem to matter which one I delete; it seems it can only process the file with two now. Is this a true limitation? Am I doing so much that it can't do it with three partitions, or is there some other technical thing that I am doing wrong?
Jim
Jim,

We have experienced the same "timeout waiting for mutex" issue with DataStage 6 on Windows. When I called support, they said they have seen this issue when the transformations in each path are mismatched (and that it is fixed in 7.0.1). Are these transformations in a shared container that is used in each split?
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Re: Link Partitioner/Link Collector

Post by JDionne »

shawn_ramsey wrote:
JDionne wrote:I have a job that uses the Link Partitioner/Link Collector to speed up processing of a large sequential file. It has three partitions for the file. This job worked fine in the past until I added extra data cleansing to it. Now the only way that I can get the job to run without the error "timeout waiting for mutex" is by deleting one of the three partitions. It doesn't seem to matter which one I delete; it seems it can only process the file with two now. Is this a true limitation? Am I doing so much that it can't do it with three partitions, or is there some other technical thing that I am doing wrong?
Jim
Jim,

We have experienced the same "timeout waiting for mutex" issue with DataStage 6 on Windows. When I called support, they said they have seen this issue when the transformations in each path are mismatched (and that it is fixed in 7.0.1). Are these transformations in a shared container that is used in each split?
Nope, not sophisticated enough to use shared containers yet. :) And maybe we will get our support agreement signed with Ascential so that I can upgrade!!! :)
Thanks for the help.
Jim
Sure I need help... But who doesn't?
shawn_ramsey
Participant
Posts: 145
Joined: Fri May 02, 2003 9:59 am
Location: Seattle, Washington. USA

Re: Link Partitioner/Link Collector

Post by shawn_ramsey »

JDionne wrote:Nope, not sophisticated enough to use shared containers yet. :) And maybe we will get our support agreement signed with Ascential so that I can upgrade!!! :)
Thanks for the help.
Jim
Jim,

Do yourself a favor and use a shared container; it will eliminate tons of the work you are doing replicating the transformations in each path. It is a pretty simple process to include a Link Partitioner/Link Collector if you use a shared container.

Here is what I do:
1) Develop the job as a single data flow, without a container or a partitioner.
2) Test and debug the transformation logic to ensure it is doing what you intend.
3) In Designer, select the components that constitute the flow you want to partition.
4) Select Edit -> Construct Container -> Shared. It will ask you for a container name. Now you have a single flow with a container.
5) Drop in your Link Partitioner and Link Collector.
6) Grab the input link to the shared container and redirect it to the Partitioner.
7) Grab the output link on the shared container and drop it on the Collector.
8) The shared container is now floating with no links.
9) Create a new link from the Partitioner to the shared container and another from the shared container to the Collector.
10) Edit the shared container, select the Input tab, drop down "Map to Container Link", and click Validate. It will ask if you want to propagate the column information; select Yes.
11) Do the same for the output link.
12) Now you have a shared container wired between the Partitioner and the Collector.
13) Drag additional shared containers from the repository and repeat steps 8-11 for each one.

I have found that you have to experiment with the number of partitions, since there is a point of diminishing returns on the number of parallel streams.
Shawn Ramsey

"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
JDionne
Participant
Posts: 342
Joined: Wed Aug 27, 2003 1:06 pm

Thought this was dead, didn't you...

Post by JDionne »

kcbland wrote:NOOO! You probably have "omit last newline" checked in the Sequential File stage definition! Just uncheck it. If you have two files with no final newline delimiter and you concatenate them, the last row of the first file and the first row of the second become one huge line. Omitting the last newline should be a rarely used option in a Windows/Unix world.

You've got to trust me on this. Jeez, I've probably written 5000 jobs and trained 200 developers/consultants/clients.
Well, I finally got back to this, and I have narrowed down my problem... I know exactly what's going on, but I don't know how to fix it. Seeing as it has been such a long time and I don't want everyone to have to re-read this whole thread, let's recap:

I was having problems with the Link Partitioners and Collectors, with mutex timeouts. I was instructed to forgo the partitioners and collectors, process the file into separate output files, and use a DOS copy command to combine them into one load file. I then started getting the error:
JOCDEVLoadStage..Sequential_File_0.DSLink3: nls_read_delimited() - row 627743, column LINE, required column missing

After counting up how many lines should be in this file, I find that the total should be 627742. This indicates that an end-of-file character has been missed or misplaced when the copy command joined the files together. That is my problem... does anyone have a solution?
Jim
Sure I need help... But who doesn't?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

So you're saying you have an extra line in the file: an extra carriage return/line feed somewhere. Either the data has it embedded, or the concatenation is somehow introducing it.

You can configure the Sequential File stage to throw away incomplete rows, or to complete incomplete rows, whereby you have to add a transformer constraint to throw them away. That at least moves you forward. You still have to find out which file is bad. Look at the link statistics from the job that produces the individual files to get the row counts, then go to each produced file and verify which one is wrong. If one of them is wrong, you have just proved that the concatenation is not the problem, but the data. If the files match the link statistics, then the concatenation of the files is the issue (very unlikely).
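One quick way to compare those counts on Windows (find /c /v "" prints the number of lines in a file; the file names assume the split-file design from earlier in the thread):

Code: Select all

find /c /v "" seq1.out
find /c /v "" seq2.out
find /c /v "" seq3.out
find /c /v "" seq_all.out

If the three piece counts match the link statistics but seq_all.out is off by one, suspect the concatenation step itself; an ASCII-mode copy appends a Ctrl-Z end-of-file character when combining files, which copy /b avoids.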
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle