Dataset, Fileset
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 97
- Joined: Tue Feb 21, 2006 6:45 am
Dataset, Fileset
Can someone explain clearly when to use a dataset and a fileset, and when not to use them?
I assume you are referring to a lookup file set and a data set. Lookup file sets are used in lookup stages only; data sets can be used in lookup stages as well as elsewhere. According to the documentation, the lookup file set gives better lookup performance than a data set; in tests I ran in the past, performance was about the same for my small test data sets.
-
- Participant
- Posts: 97
- Joined: Tue Feb 21, 2006 6:45 am
"Looks like sequential files bundled together"
They are. Then again, since UNIX only knows about sequential files, so are databases.
-
- Participant
- Posts: 97
- Joined: Tue Feb 21, 2006 6:45 am
The advantage of datasets (or, as you would say, bundled sequential files) is that they are created in such a way that each individual file within the dataset corresponds to a processing node or process in PX, so a dataset is read in parallel. Reading from or writing to a sequential file can only be done by one process at a time, so datasets are inherently more efficient.
You don't need to use datasets as intermediate files; I'm not sure where you got that impression.
My comment regarding sequential files was meant as a general comment on UNIX - there are no ISAM/VSAM or other indexed file constructs in UNIX, everything is a sequential file at the lowest level. That includes databases - although they then have their own internal structures which let indexed operations take place on the file contents.
A DataSet consists of a descriptor file which contains layout and control information and then just points to the <n> sequential files, as you have correctly pointed out. I like the analogy of "bundled sequential files".
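To make the "bundled sequential files" analogy concrete, here is a minimal Python sketch. It is not how DataStage stores a data set: the JSON descriptor, the file names and the helper functions are all hypothetical, chosen only to show a small descriptor that points at <n> plain files, with one reader per file.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_bundle(directory: Path, rows: list[str], partitions: int) -> Path:
    """Spread rows round-robin over `partitions` plain files and write a descriptor."""
    segments = []
    for p in range(partitions):
        segment = directory / f"part{p}.dat"
        # Every row is newline-terminated, so an empty partition is an empty file.
        segment.write_text("".join(row + "\n" for row in rows[p::partitions]))
        segments.append(str(segment))
    # Hypothetical descriptor format: layout information plus pointers to the files.
    descriptor = directory / "demo.descriptor.json"
    descriptor.write_text(json.dumps({"columns": ["value"], "segments": segments}))
    return descriptor


def read_segment(path: str) -> list[str]:
    """Read one segment file sequentially."""
    return Path(path).read_text().splitlines()


def read_bundle(descriptor: Path) -> list[list[str]]:
    """Read all segments concurrently; returns one list of rows per partition."""
    segments = json.loads(descriptor.read_text())["segments"]
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(read_segment, segments))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        descriptor = write_bundle(Path(tmp), ["a", "b", "c", "d", "e"], partitions=2)
        print(read_bundle(descriptor))  # one list of rows per partition
```

Each segment is still an ordinary sequential file; the gain comes from having several of them, so several readers can work at once and the rows arrive already partitioned.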
-
- Charter Member
- Posts: 166
- Joined: Wed Mar 16, 2005 6:52 am
- Location: Mumbai, India
From the Parallel Job Developers Guide:
What is a file set? DataStage can generate and name exported files,
write them to their destination, and list the files it has generated in a
file whose extension is, by convention, .fs. The data files and the file
that lists them are called a file set. This capability is useful because
some operating systems impose a 2 GB limit on the size of a file and
you need to distribute files among nodes to prevent overruns.
What is a data set? DataStage parallel extender jobs use data sets to
manage data within a job. You can think of each link in a job as
carrying a data set. The Data Set stage allows you to store data being
operated on in a persistent form, which can then be used by other
DataStage jobs.
From what additional info I gather, filesets are partitioned ASCII sequential files, while datasets are partitioned binary files (DataStage internal format?).
Amey Vaidya<i>
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.</i>
<i>- Douglas Adams</i>
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
If you are looking to stage your data to disk then in order of efficiency you have datasets, filesets and sequential files.
- Datasets are fastest to save and read as they are written in the native internal format of a parallel job and retain indexing and partitioning.
- Filesets have the overhead of being converted to readable ASCII text; however, they also have some partitioning benefits. This is a good method if you need to stage, archive and reuse these files in a readable format.
- Sequential files are the least efficient method, as data needs to be repartitioned and written sequentially, then partitioned again when it is processed downstream.
Lookup filesets are saved in the format of a lookup table file and are the most efficient at being loaded and used by a lookup stage. So they are an option for staging reference data.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
And datasets and filesets can be used only by DataStage. For the outside world, a sequential file is the only choice.
Until now I have not found a clear use for filesets. Since any further use relies on DataStage anyway, a dataset can be chosen for the same purpose, and it is more efficient.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
That's not quite true. The File Set control file gives the location of the data files, and these are quite readable by humans and outside processes. The point is, however, that the File Set is partitioned, so that some of the data (rows) are on each processing node. That may inconvenience outside applications, but one could script around the inconvenience, particularly on SMP architecture.
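As a rough illustration of scripting around the partitioning, the Python sketch below concatenates the data files named in a control file into a single sequential file. It assumes a deliberately simplified, hypothetical control file with one data file path per line; a real .fs control file carries more information, so inspect one from your own installation before adapting this. The function name and file names are made up for the example.

```python
import tempfile
from pathlib import Path


def collect_partitions(control_file: Path, output: Path) -> int:
    """Concatenate the data files listed in control_file into output.

    Assumes one data file path per line (a simplification, not the real
    .fs layout). Returns the number of rows written.
    """
    rows = 0
    with output.open("w") as out:
        for line in control_file.read_text().splitlines():
            data_path = line.strip()
            if not data_path:
                continue  # ignore blank lines in the listing
            with open(data_path) as part:
                for row in part:
                    # Make sure every row ends with exactly one newline.
                    out.write(row if row.endswith("\n") else row + "\n")
                    rows += 1
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "p0.txt").write_text("1,alpha\n3,gamma\n")
        (base / "p1.txt").write_text("2,beta")
        control = base / "demo.fs"
        control.write_text(f"{base / 'p0.txt'}\n{base / 'p1.txt'}\n")
        count = collect_partitions(control, base / "all.txt")
        print(count, (base / "all.txt").read_text())
```

On an SMP machine all the paths are local, so a script like this is enough; on a cluster the data files would first have to be fetched from each node.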
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
Which is why filesets are so good for archiving: they are relatively fast to write, and if you collect them from each node you have a readable set of data.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
FYI: From development guide:
DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a lookup stage. Here's how to decide which to use: There are two data sets being combined. One is the primary or driving dataset, sometimes called the left of the join. The other data set(s) are the reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage may thrash because the reference datasets may not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation. So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.
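The trade-off in that passage can be sketched in Python. This is only an illustration of the two general techniques, not DataStage's implementation: `lookup_join` keeps the entire reference set in an in-memory table, while `sort_merge_join` sorts both inputs and then merges them in a single forward pass. Both function names are hypothetical, rows are assumed to be (key, value) pairs, and reference keys are assumed to be unique.

```python
from operator import itemgetter


def lookup_join(driving, reference):
    """Lookup style: the whole reference set must fit in memory as a table."""
    table = {key: value for key, value in reference}
    return [(key, payload, table[key]) for key, payload in driving if key in table]


def sort_merge_join(driving, reference):
    """Join style: sort both inputs on the key, then merge with sequential access."""
    left = sorted(driving, key=itemgetter(0))
    right = sorted(reference, key=itemgetter(0))
    result, j = [], 0
    for key, payload in left:
        # Advance the reference cursor; it only ever moves forward.
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            result.append((key, payload, right[j][1]))
    return result


if __name__ == "__main__":
    driving = [(3, "c"), (1, "a"), (2, "b"), (4, "d")]
    reference = [(2, "two"), (1, "one"), (3, "three")]
    print(lookup_join(driving, reference))
    print(sort_merge_join(driving, reference))
```

The lookup touches the table at random positions, which is what causes paging once the table outgrows RAM. The merge only walks forward through sorted data, so any I/O it needs stays sequential.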
M.T.Anwer
The day the child realizes that all adults are imperfect he becomes an adolescent;
the day he forgives them, he becomes an adult; the day he forgives himself, he becomes wise.
-Aiden Nowlan