Partitioning of Data.

chetan.c
Participant
Posts: 112
Joined: Tue Jan 17, 2012 2:09 am
Location: Bangalore

Partitioning of Data.

Post by chetan.c »

Hi,

My job design looks like this

externalsourcestage ---> transformer ---> sequential file
The script in the External Source stage gives its output in a single column, like below.
sample1.txt
100
sample2.txt
150
..
.
In the Transformer I am deriving the filename in a stage variable with the expression below.
If index(DSLink31.read,".txt",1)<>0 Then DSLink31.read Else svfilename
In the output derivation I'm using this stage variable as the filename. My output derivations look like this.
DSLink31.read ---data
svfilename ----filename

Constraint
index(DSLink31.read,".txt",1)=0

When I run the job with varchar data, the job runs fine.
But when I run the job with data like the above, the filename does not get populated in the output sequential file.
I have kept both the column data types and the stage variable data type as Varchar.

The job runs successfully if the Transformer is made to run sequentially.

So is there a problem with the partitioning of data?

Thanks,
Chetan.C
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chetan.c
Participant
Posts: 112
Joined: Tue Jan 17, 2012 2:09 am
Location: Bangalore

Post by chetan.c »

Thanks, Ray.
But can you please let me know what could be the problem?
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Given the data that is provided by the external source stage, and the logic you have written, running the transformer in sequential mode is the correct choice. There is no other data present that can guarantee that partitioning will keep related records together (in the same partition) when running in parallel. Therefore, parallel execution mode is not appropriate for your current source data and business rules.

If your source provided an additional column which provided some sort of key that indicates the records which belong together, you could partition and sort or group on that key and run the transformer in parallel.
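To illustrate the idea, here is a minimal Python sketch (not DataStage code) of what key-based partitioning would buy you. The `group_id` column is hypothetical, since your source does not currently provide one, and the modulo is only a stand-in for a real hash partitioner. The `transformer` function mimics your stage variable logic, with each partition keeping its own copy of `svfilename`.

```python
from collections import defaultdict


def key_partition(rows, num_partitions):
    """Send every row with the same group_id to the same partition.

    The modulo stands in for hash partitioning; group_id is a hypothetical
    key column that the current source does not supply.
    """
    partitions = defaultdict(list)
    for group_id, read in rows:
        partitions[group_id % num_partitions].append((group_id, read))
    return partitions


def transformer(rows):
    """Mimic the stage variable: remember the last '.txt' row as the filename."""
    svfilename = ""  # each partition starts with its own empty stage variable
    output = []
    for _group_id, read in rows:
        if ".txt" in read:
            svfilename = read                  # filename row: update the variable
        else:
            output.append((read, svfilename))  # data row: emit with its filename
    return output


# Each filename row and its data row share a (hypothetical) key.
rows = [(1, "sample1.txt"), (1, "100"), (2, "sample2.txt"), (2, "150")]

partitions = key_partition(rows, 2)
results = {p: transformer(part) for p, part in sorted(partitions.items())}
print(results)
# {0: [('150', 'sample2.txt')], 1: [('100', 'sample1.txt')]}
```

Because related rows land in the same partition and keep their relative order, every data row still finds its filename. In a real job you would also sort on the key within each partition to guarantee that order.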

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
chetan.c
Participant
Posts: 112
Joined: Tue Jan 17, 2012 2:09 am
Location: Bangalore

Post by chetan.c »

Hi Jwiles,

Thanks.
Yes, I do not have a column on which I can sort and partition.
But what could be the reason for such behaviour of the job?

Thanks,
Chetan.C
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

In a parallel job, when data moves from a stage running in sequential mode to a stage running in parallel mode (multiple partitions), by default DataStage will partition the data between the partitions unless you tell it not to do so. In many cases, it will use Round Robin partitioning, which provides a fairly even distribution of data across the nodes, but does NOT guarantee that related rows will remain in the same partition. This is likely what was happening in your case: rows containing the filename and the rows containing the data for that filename probably ended up in different partitions and therefore your results were not what you were expecting.
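The effect can be simulated outside DataStage. The Python sketch below is only an illustration under assumptions: it uses two partitions, deals rows out in strict rotation, and gives each partition its own copy of the stage variable, starting empty. The function names are made up for the example.

```python
def round_robin(rows, num_partitions):
    """Deal rows out to the partitions in turn, as Round Robin partitioning does."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def transformer(rows):
    """Mimic the stage variable logic for the rows seen by ONE partition."""
    svfilename = ""  # each partition has its own stage variable
    output = []
    for read in rows:
        if ".txt" in read:
            svfilename = read                  # filename row: update the variable
        else:
            output.append((read, svfilename))  # data row passes the constraint
    return output


rows = ["sample1.txt", "100", "sample2.txt", "150"]

# Sequential mode: a single partition sees every row in order.
sequential = transformer(rows)

# Parallel mode on two nodes: filename rows and data rows get separated.
parallel = [transformer(p) for p in round_robin(rows, 2)]

print(sequential)  # [('100', 'sample1.txt'), ('150', 'sample2.txt')]
print(parallel)    # [[], [('100', ''), ('150', '')]]
```

In this simulation the first partition receives only filename rows and the second only data rows, so the data rows are written with an empty filename, which matches the symptom described.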

Because you don't have a separate column on which to properly partition and sort the data, you need to run the job in a single partition (single-node configuration file) or, as you're doing now, run the transformer in sequential mode. This will keep all of your data in the same partition, and your logic will produce the correct results.

Regards,
- james wiles

