I am working on datasets and want to clarify some of my doubts. I am running a job that creates a dataset data.ds in my working folder, with the data stored in dsn directories (0-7). What I see is that all the records (1,200 rows) are stored in just one dsn folder, "dsn0", even though the dataset is auto-partitioned. I want to know how a dataset behaves for different partitioning methods, especially Auto.
Sometimes I also see more than one file in dsn0 for the same dataset, with no data in them, like
Auto partitioning looks at the adjacent stages and makes a best "guess" at an appropriate partitioning method.
Personally, I never use "Auto" partitioning when it matters. I always make an explicit conscious decision.
In your example, "Auto" seems to have resulted in severe data skew. If you are not doing a key-based operation and want an even distribution of data, prefer "Round Robin" over "Auto".
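To illustrate the difference, here is a minimal Python sketch (not DataStage code; the skewed key distribution and the simple `hash() % nodes` partitioner are assumptions for illustration): a key-based method sends every row with the same key to the same node, so a dominant key value piles up on one partition, while round robin deals rows out evenly regardless of key.

```python
from collections import Counter

# Hypothetical skewed key distribution: one key value dominates.
keys = ["A"] * 1100 + ["B"] * 100
nodes = 4

# Key-based (hash) partitioning: all rows sharing a key land on one node,
# so the dominant key concentrates 1100 rows on a single partition.
hash_counts = Counter(hash(k) % nodes for k in keys)

# Round robin: rows are dealt to nodes in turn, ignoring the key entirely.
rr_counts = Counter(i % nodes for i in range(len(keys)))

print(sorted(hash_counts.values()))  # heavily skewed
print(sorted(rr_counts.values()))    # perfectly even: [300, 300, 300, 300]
```

The same logic explains why hash partitioning is still the right choice when a downstream stage needs all rows for a key on the same node (joins, aggregations), skew or not.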
You simply don't have enough data to fill one 128 KB block, much less eight of them. DataStage moves data around in buffers of at least this size. That's why they're all in one place.
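The arithmetic behind this claim can be sketched quickly (the 50-byte average record width is an assumption; the 128 KB block size is from the post above):

```python
# DataStage moves data in buffers of at least this size (per the post above).
BLOCK_SIZE = 128 * 1024

rows = 1200
assumed_row_bytes = 50          # hypothetical average record width
total = rows * assumed_row_bytes

# 60,000 bytes: the entire dataset fits comfortably inside one block,
# let alone one block per dsn directory.
print(total, "<", BLOCK_SIZE, total < BLOCK_SIZE)
```

Even with records ten times wider, 1,200 rows would not fill eight 128 KB blocks.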
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hunh? This has "data skew because of bad key choice" written all over it. What's buffer size got to do with where the data ends up being written when auto-partitioned? Can you explain?
For a small enough volume of data only one segment file will be created, at least on an SMP server. I have not verified this result in a cluster or grid environment.
I have a feeling we're talking about two different things. I created a four-node dataset with an integer as the key, hash partitioned, on an SMP system. All four files were created. I sent it three rows: 1, 2, 3. Three of the files had data; one was empty. I did the same with 16 rows. All four were populated, with 2, 7, 3, and 4 rows, as verified through Manager. I don't see the byte size of the data influencing where the data goes. That doesn't make sense to me.
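The three-rows-into-four-nodes result is exactly what key-based partitioning predicts. A toy sketch (a simple modulo stand-in, not DataStage's actual hash function) shows why one of four partition files stays empty when only three distinct key values arrive:

```python
# Toy partitioner: route each integer key to a node by key % nodes.
# This is an illustrative stand-in, not DataStage's real hash algorithm.
nodes = 4
rows = [1, 2, 3]

partitions = {n: [] for n in range(nodes)}
for r in rows:
    partitions[r % nodes].append(r)

print(partitions)  # {0: [], 1: [1], 2: [2], 3: [3]} -- partition 0 is empty
```

With three distinct keys and four partitions, at least one partition must be empty; which one depends only on the hash function, not on the byte size of the data.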
OK, my example is where an upstream stage executes in sequential mode, for example a job that only has a Sequential File stage and a Data Set stage. It's in that kind of case that the parallel placement of data is impacted, even though the Data Set stage (copy operator) is shown as executing "in parallel" and a partitioner icon appears on the link.
I've seen the behavior Ray is talking about when going from a DB2 API stage (sequential) to a transformer (parallel) with a partitioning method of Auto. I assumed it was a case of "Auto" making a bad guess...
I am still wondering about those empty files created for the same dataset in the same dsn folder (example shown in my first post). If anyone can shed some light on it, that would help.