Dataset

Posted: Wed Apr 08, 2009 7:54 am
by samsuf2002
Hi All,

I am working with datasets and want to clarify a few doubts. I am running a job that creates a dataset, data.ds, in my working folder, with the data stored in the dsn directories (0-7). However, all the records (1200 rows) are stored in just one dsn folder, "dsn0", and the dataset is Auto partitioned. I would like to know how a dataset behaves with different partitioning methods, especially Auto.

Sometimes I also see more than one file in dsn0 for the same dataset, some of them with no data, like this:

-rw-rw-r--   1 dsadm    datastg      131072 Mar 12 17:03 data.ds.dsadm.....

-rw-rw-r--   1 dsadm    datastg      0      Mar 12 17:03 data.ds.dsadm.....

-rw-rw-r--   1 dsadm    datastg      0      Mar 12 17:03 data.ds.dsadm.....
Not sure how these were created. If anyone can help me understand this behavior, that would be great.

Thanks in advance.

Posted: Wed Apr 08, 2009 8:23 am
by Mike
Auto partitioning looks at the adjacent stages and makes a best "guess" at an appropriate partitioning method.

Personally, I never use "Auto" partitioning when it matters. I always make an explicit conscious decision.

In your example, "Auto" seems to have resulted in severe data skew. If you are not doing a key-based operation and want an even distribution of data, prefer "Round Robin" over "Auto".
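To illustrate the difference, here is a minimal Python sketch, not DataStage code. It compares round-robin distribution with a key-based distribution when every row carries the same key value. The `region` column, the eight-node layout and the use of `zlib.crc32` as a stand-in hash are all assumptions made for the example; DataStage's real hash function is different.

```python
import zlib

def round_robin(rows, nodes):
    """Deal rows to partitions in turn, ignoring their contents."""
    counts = [0] * nodes
    for i, _ in enumerate(rows):
        counts[i % nodes] += 1
    return counts

def keyed(rows, nodes, key):
    """Send each row to the partition chosen by hashing its key value."""
    counts = [0] * nodes
    for row in rows:
        # crc32 is only a stand-in hash for illustration
        h = zlib.crc32(str(key(row)).encode())
        counts[h % nodes] += 1
    return counts

# Hypothetical data: 1200 rows that all share one key value
rows = [{"id": i, "region": "EAST"} for i in range(1200)]

rr = round_robin(rows, 8)
skewed = keyed(rows, 8, key=lambda r: r["region"])

print("round robin:", rr)      # 150 rows in each of the 8 partitions
print("keyed      :", skewed)  # all 1200 rows in a single partition
```

With a single distinct key value, any key-based method puts every row in one partition, while round robin spreads them evenly regardless of content.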

Mike

Posted: Wed Apr 08, 2009 12:23 pm
by ray.wurlod
You simply don't have enough data to fill one 128Kbyte block, much less eight of them. DataStage moves data around in buffers of at least this size. That's why they're all in one place.

Posted: Wed Apr 08, 2009 1:14 pm
by throbinson
Hunh? This has data skew because of bad key choice written all over it. What's buffer size got to do with where the data ends up being written when auto partitioned? Can you explain?

Posted: Wed Apr 08, 2009 1:43 pm
by ray.wurlod
For a small enough volume of data only one segment file will be created, at least on an SMP server. I have not verified this result in a cluster or grid environment.

Posted: Wed Apr 08, 2009 1:52 pm
by throbinson
I have a feeling we're talking about two different things. I created a four-node dataset with an integer as a key, hash partitioned on an SMP system. All four files were created. I sent it three rows: 1, 2, 3. Three files had data, one was empty. I did the same with 16 rows. All four were populated, with 2, 7, 3 and 4 rows respectively, as verified through Manager. I don't see the byte size of the data influencing where the data goes. That doesn't make sense to me.

Posted: Thu Apr 09, 2009 10:37 am
by DSguru2B
The type of partitioning will have an effect on where the data lands.

Posted: Thu Apr 09, 2009 1:10 pm
by ray.wurlod
OK, my example is where an upstream stage executes in sequential mode, for example a job that only has a Sequential File stage and a Data Set stage. It's in that kind of case that the parallel placement of data is impacted, even though the Data Set stage (copy operator) is shown as executing "in parallel" and a partitioner icon appears on the link.
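A rough Python model of that explanation is below. It is a sketch built on assumptions, not a description of verified DataStage internals: it assumes a sequential upstream stage hands data downstream in whole 128 KB blocks, that blocks rather than individual rows are dealt to partitions in turn, and that each row is about 100 bytes (a made-up figure).

```python
BLOCK_BYTES = 128 * 1024  # 131072 bytes, the block size mentioned above

def blocks_to_partitions(row_count, row_bytes, nodes):
    """Hypothetical model: pack rows into fixed-size blocks, then deal
    whole blocks (not single rows) to partitions in turn."""
    rows_per_block = max(1, BLOCK_BYTES // row_bytes)
    counts = [0] * nodes
    block = 0
    remaining = row_count
    while remaining > 0:
        n = min(rows_per_block, remaining)
        counts[block % nodes] += n
        remaining -= n
        block += 1
    return counts

# 1200 rows of ~100 bytes do not fill one block, so one partition gets them all
small = blocks_to_partitions(1200, 100, 8)

# a much larger volume spans many blocks and reaches every partition
large = blocks_to_partitions(1_200_000, 100, 8)

print(small)
print(large)
```

Under these assumptions a small volume never leaves the first partition, which would match all 1200 rows landing in dsn0. Whether a given job really behaves this way should be confirmed from the Score.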

Posted: Thu Apr 09, 2009 2:37 pm
by Mike
I've seen the behavior Ray is talking about when going from a DB2 API stage (sequential) to a transformer (parallel) with a partitioning method of Auto. I assumed it was a case of "Auto" making a bad guess...

Mike

Posted: Thu Apr 09, 2009 5:29 pm
by ray.wurlod
Could be - would need to check the Score. (Auto) in that case *should* give you Round Robin partitioning.

Posted: Fri Apr 10, 2009 9:41 am
by samsuf2002
Thanks to all for the valuable information.

I am still wondering about those empty files created for the same dataset in the same dsn folder (example shown in my first post). Can anyone shed some light on it?

Posted: Fri Apr 10, 2009 9:52 am
by ray.wurlod
Maybe you just had horribly skewed data. Check with the Director's Monitor, with "Show Instances" enabled.