Job using dataset files is slower than sequential files


ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

splayer wrote: ...Both versions output the file to the same folder...
No, probably only your dataset descriptor and your sequential file are in the same directory; the dataset's data files are written to the resource disk location. Try writing your sequential file to the directory pointed to by the resource disk setting in the configuration file referenced by APT_CONFIG_FILE. That way you are comparing performance on the same disk partition and removing potential I/O differences from the equation.
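
A quick way to confirm whether two directories really sit on the same filesystem is to compare their device IDs. This is only a minimal sketch in Python, run outside DataStage; the function name and the paths in the usage comment are hypothetical placeholders, not values from this thread.

```python
import os

def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Hypothetical usage: compare the sequential file's target directory with
# the resource disk directory named in your configuration file.
# print(same_filesystem("/path/to/seqfiles", "/path/to/resource_disk"))
```

If this returns False, the two jobs are writing to different devices and a timing difference between them is not surprising.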
patonp
Premium Member
Posts: 110
Joined: Thu Mar 11, 2004 7:59 am
Location: Toronto, ON

Post by patonp »

I've read in another post that datasets containing bounded-length varchar fields can grow quite large, because they allocate almost the full amount of space defined even when only a few characters of the varchar field are actually used.

Is the total size of your datasets much larger than your sequential file? (i.e. could the I/O involved in writing out a larger set of files be the cause of your performance discrepancy?)
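
To get a feel for how large that difference could be, here is a rough back-of-the-envelope sketch in Python. It simply assumes what the post above describes (roughly the full defined length reserved per value in the dataset, only the used characters stored in the sequential file) and ignores delimiters and per-record overhead; the function name and the sample numbers are made up for illustration.

```python
def estimate_sizes(row_count, max_len, avg_used_len):
    """Rough byte counts for one single-byte varchar column.

    Assumption (taken from the post above, not from product documentation):
    the dataset reserves about max_len bytes per value, while the
    sequential file stores only the characters actually present.
    """
    dataset_bytes = row_count * max_len
    sequential_bytes = row_count * avg_used_len
    return dataset_bytes, sequential_bytes

# Illustrative numbers only: 1 million rows, Varchar(255), about 10 characters used.
ds_bytes, seq_bytes = estimate_sizes(1_000_000, 255, 10)
print(ds_bytes, seq_bytes, ds_bytes / seq_bytes)
```

With these made-up numbers the dataset column would take about 255 MB against about 10 MB in the sequential file, so comparing the real file sizes on disk is worth doing before blaming the stage type.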
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

ArndW, when I changed my resource disk path to the Datasets folder where datasets are created, the time taken for the sequential file was the same as for the dataset files: 69 seconds. So performance varies depending on the folder. Do you have any idea why that might be?

I would think that the dataset version should be at least a few seconds faster.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

splayer - look at the filesystems and options used for your two partitions.
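
One way to see the effect of the filesystem independently of DataStage is to time an identical write in each directory. The following is a minimal sketch in Python, not a proper benchmark; the function name, the default sizes, and the paths in the usage comment are all hypothetical.

```python
import os
import tempfile
import time

def time_write(directory, size_mb=64, block_kb=1024):
    """Write size_mb of data to a temporary file in directory; return seconds taken."""
    block = os.urandom(block_kb * 1024)      # random bytes, so compression cannot skew the result
    blocks = (size_mb * 1024) // block_kb    # number of blocks needed to reach size_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())             # push data to the device, not just the OS cache
        return time.perf_counter() - start
    finally:
        os.remove(path)                      # clean up the temporary file

# Hypothetical usage: run the same test against both locations.
# for d in ("/path/to/seqfiles", "/path/to/resource_disk"):
#     print(d, round(time_write(d), 2), "seconds")
```

Run it a few times per directory, since a single measurement can be distorted by other activity on the server.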
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

This is my job:
SeqFile --> Modify --> SurrogateKeyGenerator --> Transformer --> TargetFile(Dataset stage)

To answer balajisr's question, partitioning is set to Auto throughout.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Arnd means to look at the hardware. For example, is one directory on local disk and the other in a SAN?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

That particular directory may be on a different mount point, which might be subject to network congestion.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

Ray, pardon my ignorance about hardware, but what does SAN stand for?
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Storage Area Network

Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not to be confused with NAS. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

DSguruji wrote:Storage Area Network
More usually Storage ARRAY Network - an array of storage devices (disks) connected with intelligent controllers so that they can be managed as a single entity or partitioned as required.