Partitioning of Data

Posted: Thu Apr 12, 2012 4:32 am
by chetan.c
Hi,

My job design looks like this


External Source stage ---> Transformer ---> Sequential File

The script in the External Source stage gives its output in a single column, like below:
sample1.txt
100
sample2.txt
150
...
In the Transformer I derive the filename with the following expression in a stage variable:

If Index(DSLink31.read, ".txt", 1) <> 0 Then DSLink31.read Else svfilename
In the output derivation I use this stage variable as the filename. My output derivations look like this:

DSLink31.read ---> data
svfilename ---> filename

Constraint
Index(DSLink31.read, ".txt", 1) = 0
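The stage-variable logic above amounts to carrying the most recently seen filename forward across rows and pairing it with each data row, which only works if the rows are processed in order. A minimal Python sketch of that same sequential logic (function and variable names are illustrative, not DataStage APIs):

```python
def split_rows(rows):
    """Carry the latest filename forward and pair it with each data row.

    Mirrors the Transformer logic: a row containing ".txt" updates
    svfilename; any other row is a data row and is emitted with the
    current svfilename. The constraint drops the filename rows themselves.
    """
    svfilename = ""
    out = []
    for read in rows:
        if ".txt" in read:            # Index(DSLink31.read, ".txt", 1) <> 0
            svfilename = read         # update the stage variable
        else:                         # constraint: Index(...) = 0
            out.append((read, svfilename))
    return out

print(split_rows(["sample1.txt", "100", "sample2.txt", "150"]))
# [('100', 'sample1.txt'), ('150', 'sample2.txt')]
```

Note that the result depends entirely on every row arriving, in order, at the same processing instance, which is why the logic is order- and partition-sensitive.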

When I run the job with plain varchar data it runs fine. But when I run it with data like the sample above, the filename column does not get populated in the sequential file. I have kept both column datatypes and the stage variable datatype as Varchar.

The job runs successfully if the Transformer is made to run sequentially.

So is there a problem with the partitioning of the data?

Thanks,
Chetan.C

Posted: Thu Apr 12, 2012 4:54 am
by ray.wurlod
No.

Posted: Thu Apr 12, 2012 6:57 am
by chetan.c
Thanks Ray.
But can you please let me know what the problem could be?

Posted: Thu Apr 12, 2012 9:07 am
by jwiles
Given the data that is provided by the external source stage, and the logic you have written, running the transformer in sequential mode is the correct choice. There is no other data present that can guarantee that partitioning will keep related records together (in the same partition) when running in parallel. Therefore, parallel execution mode is not appropriate for your current source data and business rules.

If your source provided an additional column containing some sort of key that indicates which records belong together, you could partition and sort (or group) on that key and run the transformer in parallel.
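To illustrate the point about a grouping key: hash partitioning on such a key guarantees that all rows sharing the key land in the same partition, so each parallel Transformer instance sees a complete group. A hedged Python sketch (the key column and names are hypothetical, not from the poster's data):

```python
def hash_partition(rows, nodes=2):
    """Assign each (key, value) row to a partition by hashing the key,
    so every row sharing a key lands in the same partition."""
    parts = [[] for _ in range(nodes)]
    for key, value in rows:
        parts[hash(key) % nodes].append((key, value))
    return parts

rows = [("sample1", "sample1.txt"), ("sample1", "100"),
        ("sample2", "sample2.txt"), ("sample2", "150")]
for p in hash_partition(rows):
    print(p)   # each key's rows appear together in exactly one partition
```

With such a key present, the carry-forward logic would be safe to run in parallel after partitioning and sorting on the key.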

Regards,

Posted: Fri Apr 13, 2012 5:20 am
by chetan.c
Hi Jwiles,

Thanks.
You're right, I do not have a column on which I can sort and partition.
But what could be the reason for this behaviour of the job?

Thanks,
Chetan.C

Posted: Fri Apr 13, 2012 8:18 am
by jwiles
In a parallel job, when data moves from a stage running in sequential mode to a stage running in parallel mode (multiple partitions), by default DataStage will partition the data between the partitions unless you tell it not to do so. In many cases, it will use Round Robin partitioning, which provides a fairly even distribution of data across the nodes, but does NOT guarantee that related rows will remain in the same partition. This is likely what was happening in your case: rows containing the filename and the rows containing the data for that filename probably ended up in different partitions and therefore your results were not what you were expecting.
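A small Python sketch makes the failure mode concrete. Round Robin sends row i to partition i mod N regardless of content, so on a two-node configuration the sample data splits exactly along the filename/data boundary (node count and names are illustrative):

```python
def round_robin(rows, nodes=2):
    """Distribute rows the way Round Robin partitioning does:
    row i goes to partition i mod nodes, ignoring row content."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

rows = ["sample1.txt", "100", "sample2.txt", "150"]
print(round_robin(rows))
# [['sample1.txt', 'sample2.txt'], ['100', '150']]
```

The Transformer instance on partition 1 receives only "100" and "150" and never sees a filename row, so its svfilename stays empty, which matches the unpopulated filename column the poster observed.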

Because you don't have a separate column on which to properly partition and sort the data, you need to run the job in a single partition (single node configuration file) or as you're doing now, run the transformer in sequential mode. This will keep your data all in the same partition and your logic will produce the correct results.

Regards,