Dataset, Fileset

tostay2003 · Post by **tostay2003** » Thu Mar 09, 2006 8:56 am

Can someone explain clearly when to use a dataset or a fileset, and when not to use them?

ArndW · Post by **ArndW** » Thu Mar 09, 2006 9:29 am

I assume you are referring to a lookup file set and a data set. Lookup file sets are used in lookup stages only; data sets can be used in lookup stages as well as elsewhere. According to the documentation, the lookup file set gives better lookup performance than a data set, although in tests I ran in the past the performance was about the same for my small test data sets.

DSguru2B · Post by **DSguru2B** » Thu Mar 09, 2006 9:36 am

In my experience, lookup file sets failed because I had a high volume of data. So we had to go for file sets instead, but the performance was very bad.

tostay2003 · Post by **tostay2003** » Thu Mar 09, 2006 10:27 am

I am new to parallel jobs. I found dataset and fileset in the parallel jobs documentation, but I don't know where they are used. They look like sequential files bundled together :D

ArndW · Post by **ArndW** » Thu Mar 09, 2006 10:29 am

Looks like sequential files bundled together

They are. Then again, since UNIX only knows about sequential files, so are databases.

tostay2003 · Post by **tostay2003** » Thu Mar 09, 2006 11:15 am

Sorry to bother you, I didn't understand that.

tostay2003 · Post by **tostay2003** » Thu Mar 09, 2006 11:22 am

Do we need to use datasets as intermediate stages before the data is written to the UNIX server as a sequential file?

ArndW · Post by **ArndW** » Thu Mar 09, 2006 11:29 am

The advantage of datasets (or, as you would say, bundled sequential files) is that they are created in such a way that each individual file within the dataset represents a processing node or process in PX, so reading a dataset is done in parallel. Reading from or writing to a sequential file can only be done by one process at a time, so datasets are inherently more efficient.

You don't need to use datasets as intermediate files; I'm not sure where you got that impression.

My comment regarding sequential files was meant as a general comment on UNIX - there are no ISAM/VSAM or other indexed file constructs in UNIX, everything is a sequential file at the lowest level. That includes databases - although they then have their own internal structures which let indexed operations take place on the file contents.

A DataSet consists of a descriptor file which contains layout and control information and then just points to the <n> sequential files, as you have correctly pointed out. I like the analogy of "bundled sequential files".
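The "bundled sequential files" picture can be sketched in Python. This is only an illustration, not DataStage's actual on-disk format: the JSON descriptor layout and the names `write_dataset` and `read_dataset` are assumptions made up for the example. A descriptor records the column layout and points to one plain file per partition, and the reader opens the partitions concurrently.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_dataset(directory, name, columns, rows, partitions):
    """Write rows round-robin into one file per partition, plus a descriptor.

    Hypothetical layout: the descriptor is a JSON file holding the column
    names and the list of partition file paths.
    """
    paths = [os.path.join(directory, f"{name}.part{i}") for i in range(partitions)]
    handles = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        for i, row in enumerate(rows):
            handles[i % partitions].write(json.dumps(row) + "\n")
    finally:
        for handle in handles:
            handle.close()
    descriptor = os.path.join(directory, f"{name}.ds")
    with open(descriptor, "w", encoding="utf-8") as f:
        json.dump({"columns": columns, "files": paths}, f)
    return descriptor


def _read_partition(path):
    """Read one partition file sequentially."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def read_dataset(descriptor):
    """Read the descriptor, then read every partition file concurrently."""
    with open(descriptor, encoding="utf-8") as f:
        meta = json.load(f)
    with ThreadPoolExecutor(max_workers=len(meta["files"])) as pool:
        parts = list(pool.map(_read_partition, meta["files"]))
    return meta["columns"], parts


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    desc = write_dataset(workdir, "customers", ["id", "name"],
                         [[1, "a"], [2, "b"], [3, "c"]], partitions=2)
    print(read_dataset(desc))
```

The result keeps the rows grouped by partition, which mirrors the point that a downstream parallel reader does not have to repartition the data.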

ameyvaidya · Post by **ameyvaidya** » Thu Mar 09, 2006 11:12 pm

From the Parallel Job Developers Guide:

What is a file set? DataStage can generate and name exported files,
write them to their destination, and list the files it has generated in a
file whose extension is, by convention, .fs. The data files and the file
that lists them are called a file set. This capability is useful because
some operating systems impose a 2 GB limit on the size of a file and
you need to distribute files among nodes to prevent overruns.

What is a data set? DataStage parallel extender jobs use data sets to
manage data within a job. You can think of each link in a job as
carrying a data set. The Data Set stage allows you to store data being
operated on in a persistent form, which can then be used by other
DataStage jobs.

From what additional info I gather, filesets are partitioned ASCII sequential files, while datasets are partitioned binary files (DataStage internal format??).

vmcburney · Post by **vmcburney** » Fri Mar 10, 2006 12:03 am

If you are looking to stage your data to disk then in order of efficiency you have datasets, filesets and sequential files.
- Datasets are fastest to save and read as they are written in the native internal format of a parallel job and retain indexing and partitioning.
- Filesets have the overhead of being converted to ASCII readable text; however, they also have some partitioning benefits. This is a good method if you need to stage, archive and reuse these files in a readable format.
- Sequential files are the least efficient method, as the data needs to be repartitioned and written sequentially, and then partitioned again when it is processed downstream.

Lookup filesets are saved in the format of a lookup table file and are the most efficient at being loaded and used by a lookup stage. So they are an option for staging reference data.

tostay2003 · Post by **tostay2003** » Fri Mar 10, 2006 8:19 am

Thanks a lot for the reply

kumar_s · Post by **kumar_s** » Sat Mar 11, 2006 12:59 am

Also, datasets and filesets can be used only by DataStage. For the outside world, a sequential file is the only choice.
So far I have not found a clear use for filesets. Since any further use relies on DataStage anyway, a dataset can be chosen for the same purpose, and it is more efficient.

ray.wurlod · Post by **ray.wurlod** » Sat Mar 11, 2006 6:29 am

That's not quite true. The File Set control file gives the location of the data files, and these are quite readable by humans and outside processes. The point is, however, that the File Set is partitioned, so that some of the data (rows) are on each processing node. That may inconvenience outside applications, but one could script around the inconvenience, particularly on SMP architecture.
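As a rough sketch of "scripting around the inconvenience", the Python below concatenates the data files named in a control file into one sequential file. It assumes a simplified, hypothetical control file with exactly one data file path per line; the real .fs layout may differ and should be checked before adapting this. The function name `collect_fileset` is made up for the example, and all files are assumed to be reachable from one machine (as on SMP).

```python
import shutil


def collect_fileset(control_file, output_file):
    """Concatenate every data file listed in control_file into output_file.

    Assumption: control_file holds one data file path per line
    (a simplified stand-in for a real file set control file).
    """
    with open(control_file, encoding="utf-8") as f:
        paths = [line.strip() for line in f if line.strip()]
    with open(output_file, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                # Stream the partition into the output without loading it whole.
                shutil.copyfileobj(part, out)
    return paths
```

The rows come out in partition order, not in their original order, so an outside application that cares about ordering would still need to sort the collected file.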

vmcburney · Post by **vmcburney** » Sat Mar 11, 2006 6:39 pm

Which is why filesets are so good for archiving: they are relatively fast to write, and if you collect them from each node you have a readable set of data.

MTA · Post by **MTA** » Sat Mar 11, 2006 11:20 pm

FYI: From development guide:
DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a lookup stage. Here's how to decide which to use: There are two data sets being combined. One is the primary or driving dataset, sometimes called the left of the join. The other data set(s) are the reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage may thrash because the reference datasets may not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation. So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.
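To make the contrast in that passage concrete, here is a small Python sketch of the two strategies. It is an illustration of the general techniques (hash lookup versus sort-merge join), not of how the DataStage stages are implemented. The function names are made up, rows are assumed to be `(key, value)` tuples, and reference keys are assumed to be unique.

```python
def lookup_combine(driving, reference):
    """Lookup style: hold the whole reference set in memory as a hash table.

    Fast while the table fits in RAM; this is the structure that starts
    paging when the reference data is too large.
    """
    table = {key: value for key, value in reference}
    return [(key, left, table[key]) for key, left in driving if key in table]


def join_combine(driving, reference):
    """Join style: sort both inputs by key, then merge them in one pass.

    Here the sort is done in memory for brevity; the point is that the merge
    only ever walks forward through both inputs.
    """
    left = sorted(driving, key=lambda row: row[0])
    right = sorted(reference, key=lambda row: row[0])
    out, j = [], 0
    for key, value in left:
        # Advance the reference cursor until it reaches or passes this key.
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            out.append((key, value, right[j][1]))
    return out


if __name__ == "__main__":
    driving = [(2, "x"), (1, "y"), (3, "z"), (2, "w")]
    reference = [(1, "A"), (2, "B"), (4, "D")]
    print(lookup_combine(driving, reference))
    print(join_combine(driving, reference))
```

Both functions return the same matched rows; they differ in row order (the lookup keeps the driving order, the join returns key order) and in what has to be resident in memory at once.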