I am working on datasets and want to clarify some of my doubts. I am running a job that creates a dataset data.ds in my working folder, with the data stored in dsn directories (0-7). What I see is that all the records (1,200 rows) are stored in just one dsn folder, "dsn0", even though the dataset is auto-partitioned. I want to know how a dataset behaves for different partitioning methods, especially Auto.
Sometimes I also see more than one file in dsn0 for the same dataset, with no data in them, like
Auto partitioning looks at the adjacent stages and makes a best "guess" at an appropriate partitioning method.
Personally, I never use "Auto" partitioning when it matters. I always make an explicit conscious decision.
In your example, "Auto" seems to have resulted in severe data skew. If you are not doing a key-based operation and want an even distribution of data, prefer "Round Robin" over "Auto".
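To illustrate the difference, here is a minimal Python sketch (not DataStage code; the skewed key distribution and the simple `hash() % nodes` partitioner are assumptions for illustration): a key-based method sends every row with the same key to the same node, so a dominant key value piles up on one partition, while round robin deals rows out evenly regardless of key.

```python
from collections import Counter

# Hypothetical skewed key distribution: one key value dominates.
keys = ["A"] * 1100 + ["B"] * 100
nodes = 4

# Key-based (hash) partitioning: all rows sharing a key land on one node,
# so the dominant key concentrates 1100 rows on a single partition.
hash_counts = Counter(hash(k) % nodes for k in keys)

# Round robin: rows are dealt to nodes in turn, ignoring the key entirely.
rr_counts = Counter(i % nodes for i in range(len(keys)))

print(sorted(hash_counts.values()))  # heavily skewed
print(sorted(rr_counts.values()))    # perfectly even: [300, 300, 300, 300]
```

The same logic explains why hash partitioning is still the right choice when a downstream stage needs all rows for a key on the same node (joins, aggregations), skew or not.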
You simply don't have enough data to fill one 128 KB block, much less eight of them. DataStage moves data around in buffers of at least this size. That's why they're all in one place.
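The arithmetic behind this claim can be sketched quickly (the 50-byte average record width is an assumption; the 128 KB block size is from the post above):

```python
# DataStage moves data in buffers of at least this size (per the post above).
BLOCK_SIZE = 128 * 1024

rows = 1200
assumed_row_bytes = 50          # hypothetical average record width
total = rows * assumed_row_bytes

# 60,000 bytes: the entire dataset fits comfortably inside one block,
# let alone one block per dsn directory.
print(total, "<", BLOCK_SIZE, total < BLOCK_SIZE)
```

Even with records ten times wider, 1,200 rows would not fill eight 128 KB blocks.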
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hunh? This has "data skew because of bad key choice" written all over it. What's buffer size got to do with where the data ends up being written when auto-partitioned? Can you explain?
For a small enough volume of data only one segment file will be created, at least on an SMP server. I have not verified this result in a cluster or grid environment.
I have a feeling we're talking about two different things. I created a four-node dataset with an integer as the key, hash partitioned, on an SMP system. All four files were created. I sent it three rows: 1, 2, 3. Three of the files had data; one was empty. I did the same with 16 rows. All four were populated, with 2, 7, 3, and 4 rows, as verified through Manager. I don't see the byte size of the data influencing where the data goes. That doesn't make sense to me.
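The three-rows-into-four-nodes result is exactly what key-based partitioning predicts. A toy sketch (a simple modulo stand-in, not DataStage's actual hash function) shows why one of four partition files stays empty when only three distinct key values arrive:

```python
# Toy partitioner: route each integer key to a node by key % nodes.
# This is an illustrative stand-in, not DataStage's real hash algorithm.
nodes = 4
rows = [1, 2, 3]

partitions = {n: [] for n in range(nodes)}
for r in rows:
    partitions[r % nodes].append(r)

print(partitions)  # {0: [], 1: [1], 2: [2], 3: [3]} -- partition 0 is empty
```

With three distinct keys and four partitions, at least one partition must be empty; which one depends only on the hash function, not on the byte size of the data.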
OK, my example is where an upstream stage executes in sequential mode, for example a job that only has a Sequential File stage and a Data Set stage. It's in that kind of case that the parallel placement of data is impacted, even though the Data Set stage (copy operator) is shown as executing "in parallel" and a partitioner icon appears on the link.
I've seen the behavior Ray is talking about when going from a DB2 API stage (sequential) to a transformer (parallel) with a partitioning method of Auto. I assumed it was a case of "Auto" making a bad guess...
I am still wondering about those empty files created for the same dataset in the same dsn folder (example shown in my first post). If anyone can shed some light on it, that would help.