Dataset

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennesse


Post by samsuf2002 »

Hi All,

I am working with datasets and want to clarify a few things. I am running a job that creates a dataset, data.ds, in my working folder, with the data stored in the dsn directories (0-7). However, all the records (1200 rows) are stored in just one dsn folder, "dsn0", and the dataset is Auto partitioned. I would like to know how a dataset behaves under the different partitioning methods, especially Auto.

Sometimes I see more than one file in dsn0 for the same dataset, some of them with no data, like this:

-rw-rw-r--   1 dsadm    datastg      131072 Mar 12 17:03 data.ds.dsadm.....
-rw-rw-r--   1 dsadm    datastg      0      Mar 12 17:03 data.ds.dsadm.....
-rw-rw-r--   1 dsadm    datastg      0      Mar 12 17:03 data.ds.dsadm.....

I am not sure how these files were created. If anyone can help me understand this behavior, that would be great.

Thanks in advance.
hi sam here
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Auto partitioning looks at the adjacent stages and makes a best "guess" at an appropriate partitioning method.

Personally, I never use "Auto" partitioning when it matters. I always make an explicit conscious decision.

In your example, "Auto" seems to have resulted in severe data skew. If you are not doing a key-based operation and want an even distribution of data, prefer "Round Robin" over "Auto".

Mike
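
To illustrate the difference Mike describes, here is a minimal sketch in Python rather than DataStage. It assumes a worst case where all 1200 rows share one key value, and it uses a plain modulo as a stand-in for a key-based partitioner; the engine's real hash function and the method "Auto" actually picks are not stated in this thread.

```python
def round_robin(rows, partitions):
    """Assign row i to partition i mod partitions, ignoring the row's value."""
    counts = [0] * partitions
    for i, _ in enumerate(rows):
        counts[i % partitions] += 1
    return counts


def key_based(rows, partitions):
    """Assign each row by its key; modulo is an assumed stand-in for a real hash."""
    counts = [0] * partitions
    for key in rows:
        counts[key % partitions] += 1
    return counts


# Hypothetical data: 1200 rows that all carry the same key value.
rows = [42] * 1200

rr = round_robin(rows, 8)
kb = key_based(rows, 8)

print(rr)  # [150, 150, 150, 150, 150, 150, 150, 150]
print(kb)  # [0, 0, 1200, 0, 0, 0, 0, 0]
```

Round robin spreads the rows evenly no matter what the key looks like, while the key-based method puts every row with the same key on the same partition.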
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You simply don't have enough data to fill one 128 KB block, much less eight of them. DataStage moves data around in buffers of at least this size, which is why all the rows are in one place.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
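
The arithmetic behind the 128 KB remark can be checked quickly. This is only a rough sketch: it assumes the block holds nothing but row data (no header or padding), and the actual record width of the 1200 rows is not given in the thread.

```python
BLOCK_BYTES = 128 * 1024  # 128 KB, which equals the 131072-byte file in the first post
ROWS = 1200               # row count reported in the first post

# Largest average row width (in bytes) at which all rows still fit in one block,
# under the assumption that the block carries no overhead.
max_avg_row_bytes = BLOCK_BYTES / ROWS

print(BLOCK_BYTES)                  # 131072
print(round(max_avg_row_bytes, 1))  # 109.2
```

So if the rows average less than roughly 109 bytes each, the whole dataset fits inside a single block of that size.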
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

Huh? This has "data skew because of a bad key choice" written all over it. What does buffer size have to do with where the data ends up being written when it is Auto partitioned? Can you explain?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

For a small enough volume of data only one segment file will be created, at least on an SMP server. I have not verified this result in a cluster or grid environment.
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

I have a feeling we're talking about two different things. I created a four-node dataset with an integer as the key, hash partitioned, on an SMP system. All four files were created. I sent it three rows: 1, 2, 3. Three files had data and one was empty. I did the same with 16 rows. All four were populated, with 2, 7, 3 and 4 rows, as verified through Manager. I don't see the byte size of the data influencing where the data goes. That doesn't make sense to me.
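
The experiment above can be mimicked with a small Python sketch. Two things here are assumptions and not from the thread: a plain modulo stands in for the engine's hash function, and the 16 keys are taken to be 1 through 16 (the post does not say which values were sent).

```python
def partition_counts(keys, nodes):
    """Count rows per node when each integer key goes to node (key mod nodes)."""
    counts = [0] * nodes
    for key in keys:
        counts[key % nodes] += 1
    return counts


three_rows = partition_counts([1, 2, 3], 4)
sixteen_rows = partition_counts(range(1, 17), 4)  # assumed keys 1..16

print(three_rows)    # [0, 1, 1, 1]
print(sixteen_rows)  # [4, 4, 4, 4]
```

With three rows the sketch leaves one node empty, as in the post. With 16 rows a plain modulo gives a perfectly even 4/4/4/4 split, so the 2, 7, 3, 4 result reported above suggests the real hash function (or the real key values) behaves differently. Either way, placement depends on the key values, not on the volume of data.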
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The type of partitioning will have an effect on where the data lands.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

OK, my example is where an upstream stage executes in sequential mode, for example a job that only has a Sequential File stage and a Data Set stage. It's in that kind of case that the parallel placement of data is impacted, even though the Data Set stage (copy operator) is shown as executing "in parallel" and a partitioner icon appears on the link.
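
A toy model of this explanation, in Python: a sequential producer hands out whole blocks to the partitions in turn, so data that does not fill one block never reaches a second partition. This is only a sketch of the description in this thread; whether the engine really distributes by block in this situation is not confirmed here, and `distribute_blocks` is a hypothetical helper, not a DataStage function.

```python
def distribute_blocks(total_bytes, block_bytes, partitions):
    """Hand out whole blocks round robin and return the bytes each partition receives."""
    per_partition = [0] * partitions
    block_index = 0
    remaining = total_bytes
    while remaining > 0:
        chunk = min(block_bytes, remaining)  # the last block may be partly filled
        per_partition[block_index % partitions] += chunk
        remaining -= chunk
        block_index += 1
    return per_partition


BLOCK = 128 * 1024  # 128 KB block size mentioned earlier in the thread

small = distribute_blocks(100_000, BLOCK, 8)     # less than one block of data
large = distribute_blocks(10 * BLOCK, BLOCK, 8)  # ten full blocks

print(small)  # [100000, 0, 0, 0, 0, 0, 0, 0]
print(large)  # [262144, 262144, 131072, 131072, 131072, 131072, 131072, 131072]
```

Under this model a small volume lands entirely on the first partition, while a larger volume spreads across all eight.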
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I've seen the behavior Ray is talking about when going from a DB2 API stage (sequential) to a transformer (parallel) with a partitioning method of Auto. I assumed it was a case of "Auto" making a bad guess...

Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Could be - would need to check the Score. (Auto) in that case *should* give you Round Robin partitioning.
samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennesse

Post by samsuf2002 »

Thanks to all for the valuable information.

I am still wondering about the empty files created for the same dataset in the same dsn folder (the example shown in my first post). Can anyone shed some light on them?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Maybe you just had horribly skewed data. Check with the Director's Monitor, with "Show Instances" enabled.