How are Datasets stored?

Posted: Thu Mar 29, 2007 1:15 am
by amitaguptain
Hi,

I want to know how Datasets are stored on Unix. Also, can we view the data in a Dataset?

What is the primary difference between a Dataset and a Fileset?

Posted: Thu Mar 29, 2007 4:13 am
by dspxlearn
'Dsxchange' itself is a huge database. This has been discussed many times here.
Let's do a search!

Posted: Thu Mar 29, 2007 7:56 am
by DSguru2B
This post might be of interest to you.

Posted: Thu Mar 29, 2007 3:49 pm
by ray.wurlod
Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bits; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.
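To put the per-row overhead in perspective, here is a minimal Python sketch of the arithmetic. It assumes the 80-bit figure given above; the function name and the row count are invented for illustration, and the width is left as a parameter because it is not a measured value.

```python
def management_overhead_bytes(row_count, bits_per_row=80):
    """Total bytes of per-row management information in a Data Set.

    bits_per_row defaults to the 80-bit figure from the post above;
    it is an assumption, so it is kept as a parameter.
    """
    return row_count * bits_per_row // 8


# Hypothetical Data Set with ten million rows.
rows = 10_000_000
print(management_overhead_bytes(rows))  # 100000000 bytes, roughly 95 MiB
```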

The data in a Data Set are stored in one or more files on each node in the current configuration file. These files are stored in the directory or directories identified in the resource disk section of each processing node definition in the current configuration file.
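As a rough illustration of where to look for those files, the following Python sketch pulls the resource disk directories out of a configuration file, node by node. The sample configuration, host name and paths are hypothetical, and the parsing is a simple line-based approximation rather than a full parser for the configuration file syntax.

```python
import re

# Hypothetical two-node configuration; host names and paths are invented.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}
'''

# A node definition starts a line; scratchdisk entries are deliberately not matched.
NODE_RE = re.compile(r'^\s*node\s+"([^"]+)"')
DISK_RE = re.compile(r'^\s*resource\s+disk\s+"([^"]+)"')


def resource_disks(config_text):
    """Map each node name to its list of resource disk directories."""
    disks = {}
    current = None
    for line in config_text.splitlines():
        node = NODE_RE.match(line)
        if node:
            current = node.group(1)
            disks[current] = []
            continue
        disk = DISK_RE.match(line)
        if disk and current is not None:
            disks[current].append(disk.group(1))
    return disks


print(resource_disks(SAMPLE_CONFIG))
# {'node1': ['/data/ds/node1'], 'node2': ['/data/ds/node2']}
```

In practice the text would be read from the file named by the APT_CONFIG_FILE environment variable rather than from an inline string.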

The Data Set control file (whose name ends in ".ds") contains the locations of these files as well as the record schema and sundry other information, most of which can be viewed in the Data Set Management tool.

Posted: Mon Apr 02, 2007 4:29 am
by kumar_s
ray.wurlod wrote: Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bytes; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.
It could have been more efficient if the column information were stored once as header information for each data file rather than in each row. :roll:

Posted: Mon Apr 02, 2007 5:25 am
by Narayana
You can view the data in a Data Set using the Data Set stage, or the Data Set Management tool in the DataStage Manager menu.

Posted: Mon Apr 02, 2007 9:09 am
by ray.wurlod
kumar_s wrote:
ray.wurlod wrote: Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bytes; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.
It could have been more efficient if the column information were stored once as header information for each data file rather than in each row. :roll:
How?

The same row may occur in different Data Sets that use the same record schema, say as the result of re-partitioning, even within the same job. And I fail to see how per-row information could have been stored in any kind of header. Would appreciate an expanded explanation of your theory.

Incidentally, this structure is most easily seen by viewing the control file of a File Set.

Posted: Mon Apr 02, 2007 6:33 pm
by kumar_s
I think I am missing something. :roll:

If I check the control file of a File Set, it holds the schema of the file and the node information showing where the corresponding data files reside.
If I check an individual data file, the complete records are there. The records are spread across the data files according to the partitioning algorithm I used. I couldn't find the same record present in different data files.
I couldn't find any other record information.

Btw, what I was getting at is this: if the partition number and the partition count are constant for the whole file (either Data Set or File Set), they could have been stored in a header file rather than at row level.

Posted: Thu Aug 09, 2007 2:39 pm
by steele
Would anyone (Ray :wink: ) know what role, if any, a column defined as "key" plays in the creation of the dataset? Is there any reason that the key columns should match the hash key used for partitioning?

Thanks!

Posted: Thu Aug 09, 2007 2:58 pm
by ArndW
No, the 'key' attribute has no effect in DataSets.

Posted: Thu Aug 09, 2007 4:07 pm
by steele
That was always my belief but I had an IBM rep inform me otherwise.

Thanks ArndW!

Posted: Thu Aug 09, 2007 4:23 pm
by ray.wurlod
As far as I am aware, Key is irrelevant in Data Sets. It is pertinent in Lookup File Sets. Please harangue your IBM rep for additional information.

Posted: Thu Aug 09, 2007 5:08 pm
by ArndW
I agree with Ray - get the IBM rep to convince you otherwise. Even better, write a simple job that outputs to a dataset (keep it simple with just one node) and then make a copy of the generated data file, whose location you can discover with the Data Set Management tool. Then re-run the same job with the same data but specify other column(s) as keys and see whether the dataset is identical on disk.
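The comparison step of that experiment could be sketched in Python as below. This assumes that "identical on disk" means byte-for-byte equality of the data file; the function names and the example paths in the comment are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_on_disk(path_a, path_b):
    """True if the two files have exactly the same bytes (by digest)."""
    return file_digest(path_a) == file_digest(path_b)


# Example call with invented paths: the saved copy from the first run
# versus the data file produced by the second run.
# identical_on_disk("/tmp/run1_copy.dat", "/data/ds/node1/run2.dat")
```

If the two runs yield the same digest, the key setting made no difference to what was written. A differing digest would not by itself prove that the key is responsible, since anything else written into the file that varies between runs would also change it.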