How are Datasets stored?

amitaguptain · Post by **amitaguptain** » Thu Mar 29, 2007 1:15 am

Hi,

I want to know how Datasets are stored on Unix. Also, can we view the data in a Dataset?

What is the primary difference between a Dataset and a Fileset?

dspxlearn · Post by **dspxlearn** » Thu Mar 29, 2007 4:13 am

'Dsxchange' itself is a huge database. This has been discussed many times here.
Let's do a search!

DSguru2B · Post by **DSguru2B** » Thu Mar 29, 2007 7:56 am

This post might be of interest to you.

ray.wurlod · Post by **ray.wurlod** » Thu Mar 29, 2007 3:49 pm

Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bits; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.

The data in a Data Set are stored in one or more files on each node in the current configuration file. These files are stored in the directory or directories identified in the resource disk section of each processing node definition in the current configuration file.

The Data Set control file (whose name ends in ".ds") contains the locations of these files as well as the record schema and sundry other information, most of which can be viewed in the Data Set Management tool.
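To make the layout described above a little more concrete, here is a minimal Python sketch that lists the resource disk directories defined for each node in a parallel configuration file. The sample configuration text, node names and paths are invented for illustration, and the regular expressions assume the common `node "name" { ... resource disk "path" {pools ""} ... }` layout with one entry per line, so check them against your own configuration file before relying on the result.

```python
import re

# Hypothetical two-node configuration; node names and paths are made up.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2a" {pools ""}
    resource disk "/data/ds/node2b" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}
'''

NODE_RE = re.compile(r'node\s+"([^"]+)"')
# Matches "resource disk" only; "resource scratchdisk" lines are skipped.
DISK_RE = re.compile(r'resource\s+disk\s+"([^"]+)"')


def resource_disks(config_text):
    """Map each node name to the resource disk directories listed for it."""
    disks = {}
    current = None
    for line in config_text.splitlines():
        node = NODE_RE.search(line)
        if node:
            current = node.group(1)
            disks[current] = []
            continue
        disk = DISK_RE.search(line)
        if disk and current is not None:
            disks[current].append(disk.group(1))
    return disks


if __name__ == "__main__":
    for node, dirs in resource_disks(SAMPLE_CONFIG).items():
        print(node, dirs)
```

The directories it reports are where the data files of a Data Set would be expected, one or more per node; the ".ds" control file itself lives wherever the job wrote it.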

kumar_s · Post by **kumar_s** » Mon Apr 02, 2007 4:29 am

ray.wurlod wrote: Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bytes; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.

It could have been more efficient if the column information were stored once, as header information for each data file, rather than in each row.

Narayana · Post by **Narayana** » Mon Apr 02, 2007 5:25 am

You can view the data in a Data Set using the Data Set stage, or the Data Set Management tool in the DataStage Manager menu.

ray.wurlod · Post by **ray.wurlod** » Mon Apr 02, 2007 9:09 am

kumar_s wrote:
ray.wurlod wrote: Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bytes; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.
It could have been more efficient if the column information were stored once, as header information for each data file, rather than in each row.

How?

The same row may occur in different Data Sets that use the same record schema, say as the result of re-partitioning, even within the same job. And I fail to see how per-row information could have been stored in any kind of header. Would appreciate an expanded explanation of your theory.

Incidentally, this structure is most easily seen by viewing the control file of a File Set.

kumar_s · Post by **kumar_s** » Mon Apr 02, 2007 6:33 pm

I think I am missing something.

If I check the control file of a File Set, it holds the schema of the file and the node information showing where each corresponding data file resides.
If I check an individual data file, the complete records are there. The records are spread across the data files according to the partitioning algorithm I used. I couldn't find the same record present in more than one data file.
I couldn't find any other per-record information.

Btw, what I was getting at is this: if the partition number and the partition count are constant for the whole file (whether Data Set or File Set), they could have been stored in a header rather than at row level.
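One way to reason about which of the four management values could move into a header is to model them. The Python sketch below is purely illustrative: the `TaggedRow` type, the round-robin partitioner and the way `record_id` is numbered are assumptions made for the example, not a description of the real on-disk format. It tags some rows and then reports which fields hold a single value within every partition.

```python
from collections import defaultdict
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaggedRow:
    partition_number: int
    partition_count: int
    ordinal: int      # position of the row within its partition
    record_id: int    # hypothetical: running number across the input
    payload: str


def round_robin(rows, partition_count):
    """Tag rows as a simple round-robin partitioner might (assumed scheme)."""
    next_ordinal = defaultdict(int)
    tagged = []
    for record_id, payload in enumerate(rows):
        part = record_id % partition_count
        tagged.append(
            TaggedRow(part, partition_count, next_ordinal[part], record_id, payload)
        )
        next_ordinal[part] += 1
    return tagged


def constant_fields_per_partition(tagged):
    """Return the management fields that never vary inside any one partition."""
    by_partition = defaultdict(list)
    for row in tagged:
        by_partition[row.partition_number].append(row)
    names = [f.name for f in fields(TaggedRow) if f.name != "payload"]
    constant = set(names)
    for rows in by_partition.values():
        for name in names:
            if len({getattr(r, name) for r in rows}) > 1:
                constant.discard(name)
    return sorted(constant)


if __name__ == "__main__":
    tagged = round_robin(list("abcdef"), partition_count=2)
    print(constant_fields_per_partition(tagged))
```

Under this model only the partition number and the partition count are constant within one partition's data file. The ordinal and the record ID differ from row to row, so they would still have to travel with each row.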

steele · Post by **steele** » Thu Aug 09, 2007 2:39 pm

Would anyone (Ray) know what role, if any, the column defined as "key" plays in the creation of the dataset? Is there any reason that the key columns should match the hash key for the partition?

Thanks!

ArndW · Post by **ArndW** » Thu Aug 09, 2007 2:58 pm

No, the 'key' attribute has no effect in DataSets.

steele · Post by **steele** » Thu Aug 09, 2007 4:07 pm

That was always my belief but I had an IBM rep inform me otherwise.

Thanks ArndW!

ray.wurlod · Post by **ray.wurlod** » Thu Aug 09, 2007 4:23 pm

As far as I am aware Key is irrelevant in Data Sets. It is pertinent in Lookup File Sets. Please harangue your IBM rep for additional information.

ArndW · Post by **ArndW** » Thu Aug 09, 2007 5:08 pm

I agree with Ray: get the IBM rep to convince you otherwise. Even better, write a simple job that outputs to a Data Set (keep it simple, with just one node) and make a copy of the generated data file, whose location you can discover with the Data Set Management tool. Then re-run the same job with the same data, but specify other column(s) as keys, and see whether the Data Set is identical on disk.
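The comparison step of that experiment can be scripted. Here is a minimal Python sketch, assuming you already know the path of the data file from each run; the helper names `snapshot` and `identical_on_disk` are made up for this example. Keep in mind that the data file name may differ between runs, so pass the actual path from each run.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_file, backup_dir):
    """Copy a data file aside before the job is re-run; return the copy's path."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(data_file).name
    shutil.copy2(data_file, target)
    return target


def identical_on_disk(first, second):
    """True when both files have exactly the same bytes."""
    return sha256_of(first) == sha256_of(second)
```

Typical use: call `snapshot()` on the data file after the first run, change the key columns, re-run the job, then call `identical_on_disk()` on the saved copy and the newly written data file.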