How Datasets are stored?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

amitaguptain
Participant
Posts: 12
Joined: Wed Oct 19, 2005 5:29 am

How Datasets are stored?

Post by amitaguptain »

Hi,

I want to know how Datasets are stored on Unix. Can we view the data in a Dataset?

What is the primary difference between a Dataset and a Fileset?
amie
dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Post by dspxlearn »

DSXchange itself is a huge database, and this has been discussed many times here.
Let's do a search!
Thanks and Regards!!
dspxlearn
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

This post might be of interest to you.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bits: the partition number, the partition count, the ordinal number of the record within the partition, and the record ID.

The data in a Data Set are stored in one or more files on each node in the current configuration file. These files are stored in the directory or directories identified in the resource disk section of each processing node definition in the current configuration file.

The Data Set control file (whose name ends in ".ds") contains the locations of these files as well as the record schema and sundry other information, most of which can be viewed in the Data Set Management tool.
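
To make the link between the configuration file and the data file locations concrete, here is a minimal Python sketch that lists the resource disk directories declared for each node. The sample configuration text, node names and paths are made up for illustration, and the parsing is a simple line-by-line match rather than a full parser for the configuration file syntax.

```python
import re

# Hypothetical configuration file contents; node names and paths are invented.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2a" {pools ""}
    resource disk "/data/ds/node2b" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}
'''

# A node definition starts a line; a resource disk entry names one directory.
NODE_RE = re.compile(r'\s*node\s+"([^"]+)"')
DISK_RE = re.compile(r'\s*resource\s+disk\s+"([^"]+)"')


def resource_disks(config_text):
    """Map each node name to the list of its resource disk directories."""
    disks = {}
    current = None
    for line in config_text.splitlines():
        node = NODE_RE.match(line)
        if node:
            current = node.group(1)
            disks[current] = []
            continue
        disk = DISK_RE.match(line)
        if disk and current is not None:
            disks[current].append(disk.group(1))
    return disks


if __name__ == "__main__":
    for node, dirs in resource_disks(SAMPLE_CONFIG).items():
        print(node, dirs)
```

The directories it reports are where you would expect to find the data files that the ".ds" control file points to. Scratch disk entries are deliberately ignored.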
Last edited by ray.wurlod on Thu Aug 09, 2007 4:21 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

ray.wurlod wrote: Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bytes; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.
It could have been more efficient if the column information were stored once as header information for each data file rather than in each row. :roll:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Narayana
Participant
Posts: 16
Joined: Fri Mar 30, 2007 9:25 am

Post by Narayana »

You can view the data in a Data Set using the Data Set stage, or the Data Set Management tool in the DataStage Manager menu.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

kumar_s wrote:
ray.wurlod wrote: Each row in a Data Set (or File Set) carries an additional four columns of management information totalling 80 bytes; the partition number, the partition count, the ordinal number of the record within the partition and the record ID.
It could have been more efficient if the column information were stored once as header information for each data file rather than in each row. :roll:
How?

The same row may occur in different Data Sets that use the same record schema, say as the result of re-partitioning, even within the same job. And I fail to see how per-row information could have been stored in any kind of header. Would appreciate an expanded explanation of your theory.

Incidentally, this structure is most easily seen by viewing the control file of a File Set.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

I think I am missing something. :roll:

If I check the control file of a File Set, it holds the schema of the file and the node information showing where the corresponding data files reside.
If I check an individual data file, the complete records are there. The records are spread across the data files according to the partitioning algorithm I used. I couldn't find the same record in more than one data file.
I couldn't find any other per-record information.

Btw, what I was getting at is that if the partition number and the partition count are constant for the whole file (either Data Set or File Set), they could have been stored in a header rather than at row level.
steele
Premium Member
Posts: 12
Joined: Thu Jul 14, 2005 2:16 pm

Post by steele »

Would anyone (Ray :wink: ) know what role, if any, the columns defined as "key" play in the creation of the dataset? Is there any reason that they should match the hash key for the partition?

Thanks!
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

No, the 'key' attribute has no effect in DataSets.
steele
Premium Member
Posts: 12
Joined: Thu Jul 14, 2005 2:16 pm

Post by steele »

That was always my belief but I had an IBM rep inform me otherwise.

Thanks ArndW!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

As far as I am aware Key is irrelevant in Data Sets. It is pertinent in Lookup File Sets. Please harangue your IBM rep for additional information.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I agree with Ray - get the IBM rep to convince you otherwise. Even better, write a simple job that outputs to a dataset (keep it simple with just one node) and then make a copy of the generated file, whose location you can discover with the data management tool. Then re-run the same job with the same data but specify other column(s) as keys and see if the dataset is identical on disk.
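
The comparison step of that experiment could be sketched in Python roughly as follows. The function names and the file paths in the usage note are hypothetical; the idea is simply to check whether the data file saved from the first run and the data file produced by the second run are byte-for-byte identical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_on_disk(path_a, path_b):
    """True if the two files have the same size and the same contents."""
    a, b = Path(path_a), Path(path_b)
    if a.stat().st_size != b.stat().st_size:
        return False  # different sizes, no need to hash
    return file_digest(a) == file_digest(b)
```

You would call it as, for example, `same_on_disk("run1_copy.dat", "run2.dat")`, where both names are placeholders for the saved copy and the newly generated data file. Identical files would support the view that the key setting has no effect; if they differ, check first whether something unrelated to the key columns changed between the two runs before drawing a conclusion.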