Dataset, Fileset

tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

Post by tostay2003 »

Can someone explain clearly when to use a dataset and a fileset, and when not to use them?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I assume you are referring to a lookup file set and a data set. Lookup file sets are used in lookup stages only; data sets can be used in lookup stages as well as elsewhere. According to the documentation, the lookup file set gives better lookup performance than a data set. I have done some tests in the past, though, and the performance seemed to be about the same for my small test data sets.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

In my experience Lookup File Sets failed, as I had a high volume of data. So we had to go for File Sets, but performance was very bad :?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

Post by tostay2003 »

I am new to parallel jobs. I found datasets and filesets in the parallel jobs documentation, but I don't know where they are used. They look like sequential files bundled together :D
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

"Looks like sequential files bundled together"
They are. Then again, since UNIX only knows about sequential files, so are databases.
tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

Post by tostay2003 »

Sorry to bother you, I didn't understand that.
tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

Post by tostay2003 »

Do we need to use datasets as intermediate stages before the data is handed to the UNIX server as a sequential file?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The advantage of datasets (or, as you would say, bundled sequential files) is that they are created in such a way that each individual file within the dataset corresponds to a processing node or process in PX, so a dataset is read in parallel. Reading from or writing to a sequential file can only be done by one process at a time, so datasets are inherently more efficient.

You don't need to use datasets as intermediate files; I'm not sure where you got that impression.

My comment regarding sequential files was meant as a general comment on UNIX: there are no ISAM/VSAM or other indexed file constructs in UNIX; everything is a sequential file at the lowest level. That includes databases, although they then have their own internal structures which let indexed operations take place on the file contents.

A DataSet consists of a descriptor file which contains layout and control information and then just points to the <n> sequential files, as you have correctly pointed out. I like the analogy of "bundled sequential files".
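To make the "bundled sequential files" picture concrete, here is a minimal Python sketch. It is only an illustration of the idea: the JSON descriptor, the `.part`/`.desc` file names and the functions `write_bundle` and `read_bundle` are all made up for this example and have nothing to do with the real DataStage on-disk format. It writes rows round-robin into several plain files, records them in a descriptor, and then reads every part with its own worker.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_bundle(rows, columns, n_parts, directory, name="demo"):
    """Write rows round-robin into n_parts files plus a descriptor file.

    The JSON descriptor is a hypothetical stand-in, NOT the DataStage format.
    """
    part_paths = [os.path.join(directory, f"{name}.part{i}") for i in range(n_parts)]
    handles = [open(p, "w", encoding="utf-8") for p in part_paths]
    try:
        for i, row in enumerate(rows):
            # Row i goes to partition i modulo n_parts.
            handles[i % n_parts].write(",".join(row) + "\n")
    finally:
        for h in handles:
            h.close()
    descriptor = os.path.join(directory, f"{name}.desc")
    with open(descriptor, "w", encoding="utf-8") as f:
        # The descriptor holds layout information and points at the part files.
        json.dump({"columns": columns, "parts": part_paths}, f)
    return descriptor


def read_part(path):
    """Read one partition file sequentially."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(",") for line in f]


def read_bundle(descriptor):
    """Read every partition listed in the descriptor, one worker per part."""
    with open(descriptor, encoding="utf-8") as f:
        meta = json.load(f)
    with ThreadPoolExecutor(max_workers=len(meta["parts"])) as pool:
        partitions = list(pool.map(read_part, meta["parts"]))
    return meta["columns"], partitions
```

Calling `write_bundle(rows, ["id", "name"], 3, tempfile.mkdtemp())` and passing the returned descriptor path to `read_bundle` gives back the column names and one list of rows per partition. A single sequential file offers no such split, so one process has to read it from start to end.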
ameyvaidya
Charter Member
Posts: 166
Joined: Wed Mar 16, 2005 6:52 am
Location: Mumbai, India

Post by ameyvaidya »

From the Parallel Job Developers Guide:
What is a file set? DataStage can generate and name exported files,
write them to their destination, and list the files it has generated in a
file whose extension is, by convention, .fs. The data files and the file
that lists them are called a file set. This capability is useful because
some operating systems impose a 2 GB limit on the size of a file and
you need to distribute files among nodes to prevent overruns.
What is a data set? DataStage parallel extender jobs use data sets to
manage data within a job. You can think of each link in a job as
carrying a data set. The Data Set stage allows you to store data being
operated on in a persistent form, which can then be used by other
DataStage jobs.
From what additional info I gather, File Sets are partitioned ASCII sequential files, while Data Sets are partitioned binary files (DataStage internal format?).
Amey Vaidya
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.
- Douglas Adams
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

If you are looking to stage your data to disk, then in order of efficiency you have datasets, filesets and sequential files.
- Datasets are the fastest to save and read, as they are written in the native internal format of a parallel job and retain indexing and partitioning.
- Filesets have the overhead of being converted to ASCII readable text; however, they also have some partitioning benefits. This is a good method if you need to stage, archive and reuse these files in a readable format.
- Sequential files are the least efficient method, as data needs to be repartitioned and written sequentially, and then partitioned again when the files are processed downstream.

Lookup filesets are saved in the format of a lookup table file and are the most efficient at being loaded and used by a lookup stage. So they are an option for staging reference data.
tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

Post by tostay2003 »

Thanks a lot for the reply
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

And datasets and filesets can be used only by DataStage. For the outside world, a sequential file is the only choice.
Until now I have not found a clear use for filesets, since any further use relies on DataStage. A dataset can be chosen for the same purpose and is more efficient.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

That's not quite true. The File Set control file gives the location of the data files, and these are quite readable by humans and outside processes. The point is, however, that the File Set is partitioned, so that some of the data (rows) are on each processing node. That may inconvenience outside applications, but one could script around the inconvenience, particularly on SMP architecture.
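As a rough idea of what "scripting around the inconvenience" could look like, here is a small Python sketch. It assumes a simplified, hypothetical control file with one data file path per line; the real .fs layout is not shown in this thread, so the parsing in `list_data_files` would have to be adapted to it. It also assumes every path is reachable from the machine running the script, which is the SMP case. The function names are made up for the example.

```python
import shutil


def list_data_files(control_path):
    """Return data file paths from a simplified, hypothetical control file.

    Assumption: one path per line, blank lines ignored. This is not the
    actual .fs layout.
    """
    with open(control_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def collect_to_single_file(control_path, output_path):
    """Concatenate every partition file into one sequential file."""
    data_files = list_data_files(control_path)
    with open(output_path, "wb") as out:
        for path in data_files:
            with open(path, "rb") as part:
                # Stream each partition so large files are never held in memory.
                shutil.copyfileobj(part, out)
    return len(data_files)
```

An outside application could then read the single output file without knowing anything about the partitioning. Row order in the result follows the order of the files in the control file, not the original input order.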
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Which is why filesets are so good for archiving: they are relatively fast to write, and if you collect them from each node you have a readable set of data.
MTA
Participant
Posts: 37
Joined: Thu Feb 02, 2006 2:25 pm

Post by MTA »

FYI: From development guide:
DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a lookup stage. Here's how to decide which to use: There are two data sets being combined. One is the primary or driving dataset, sometimes called the left of the join. The other data set(s) are the reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage may thrash because the reference datasets may not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation. So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.
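The difference between the two strategies can be sketched in a few lines of Python. This is a toy illustration, not how the DataStage stages are implemented: the function names are made up, rows are assumed to be `(key, value)` pairs, and reference keys are assumed to be unique. The lookup version holds the whole reference set in an in-memory hash table; the join version sorts both inputs and then merges them in one sequential pass.

```python
def lookup_join(driving, reference):
    """Lookup style: hold the whole reference set in memory as a hash table."""
    table = dict(reference)  # memory use grows with the size of the reference data
    return [(k, v, table[k]) for k, v in driving if k in table]


def sort_merge_join(driving, reference):
    """Join style: sort both inputs by key, then merge in one sequential pass.

    Assumes reference keys are unique. Python's sorted() works in memory here;
    a real engine would spill the sort to disk for large inputs.
    """
    left = sorted(driving, key=lambda r: r[0])
    right = sorted(reference, key=lambda r: r[0])
    out, j = [], 0
    for k, v in left:
        # Advance the reference cursor past keys smaller than the current key.
        while j < len(right) and right[j][0] < k:
            j += 1
        if j < len(right) and right[j][0] == k:
            out.append((k, v, right[j][1]))
    return out
```

Both functions return the same matched rows (in a different order). The point is the access pattern: the lookup probes the table at random positions, which is what pages badly once the table no longer fits in RAM, while the merge only ever moves forward through sorted data.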
M.T.Anwer
The day the child realizes that all adults are imperfect he becomes an adolescent;
the day he forgives them, he becomes an adult; the day he forgives himself, he becomes wise.
-Aiden Nowlan