
Datasets having 30M rows of data

Posted: Fri Feb 24, 2006 1:46 pm
by sonia jacob
Hi,

We have a requirement to read 30M rows from a flat file, look them up against a table of another 30M rows, and load the result to the target table.

We intend to use datasets in Parallel Extender. Hopefully we can partition the process so as to improve performance.

Could anyone comment on the size limitations of using datasets?
Any recommendations on the design of the lookups?

Thank you.

Posted: Fri Feb 24, 2006 1:51 pm
by kumar_s
There shouldn't be any issue with the dataset as long as the maximum file size is within what the OS permits.
But if you are looking at a Lookup File Set with a very large number of rows, it will certainly affect your performance: the fileset has to be loaded into memory, and your physical memory may be exhausted. Vincent has posted a comparison of these sizes; you can search this forum for it. For a large reference object, prefer Join/Merge over Lookup.
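As a rough feel for why physical memory becomes the constraint, here is a back-of-envelope sketch in Python; the 120-byte row width and the 1.3 overhead factor are assumptions for illustration, not DataStage internals:

# Rough estimate of the memory a fully in-memory reference table
# of 30M rows would need; substitute your real record width.
rows = 30_000_000
avg_row_bytes = 120        # assumed: key plus a few lookup columns
overhead_factor = 1.3      # assumed indexing/alignment overhead

approx_bytes = rows * avg_row_bytes * overhead_factor
print(f"~{approx_bytes / 1024**3:.1f} GiB of physical memory per lookup copy")
# About 4.4 GiB with these guesses, before the OS, the rest of the job
# and any other jobs on the node are accounted for.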

IMHO

Posted: Fri Feb 24, 2006 8:17 pm
by vmcburney
If you use a Lookup stage with 30 million rows you will find your job still needs to move most of that reference data into a lookup file; it cannot look up directly against a dataset. The dataset will still be fast in that it is already partitioned and on the nodes. The lookup fileset will be faster, as it is accessed directly by the Lookup stage without having to write anything to disk.

A Join stage or a Change Capture stage are your other options. If you are comparing several fields, Change Capture is easier to use than a lookup, as you don't get any of the null value problems you get from a left outer join lookup.
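To make the null point concrete, here is a tiny plain-Python sketch (not DataStage code; the customer and column names are made up) of the insert/edit/copy classification a Change Capture style comparison gives you, versus the None guard a left outer lookup forces on every downstream rule:

# Reference ("before") rows keyed by customer; "after" rows to compare.
before = {"A1": {"status": "active", "tier": "gold"},
          "A2": {"status": "active", "tier": "silver"}}

after_rows = [{"cust": "A1", "status": "active", "tier": "gold"},
              {"cust": "A3", "status": "new", "tier": "bronze"}]   # A3 has no match

for row in after_rows:
    ref = before.get(row["cust"])     # left-outer style lookup: missing key -> None
    if ref is None:
        change = "insert"             # the NULL case must be handled explicitly
    elif any(row[f] != ref[f] for f in ("status", "tier")):
        change = "edit"
    else:
        change = "copy"
    print(row["cust"], change)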

Posted: Tue Feb 28, 2006 6:49 pm
by DSguru2B
A Lookup stage against 30M rows is not a good idea; the processes will die far too often.
As the DS manuals and Burney say, the Join stage should be your first option for this.
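For what it is worth, the reason the Join stage copes with volumes like this is that, once both inputs are hash-partitioned and sorted on the key, it can merge the two streams without ever holding the full 30M reference rows in memory. A rough plain-Python sketch of that idea (unique keys assumed on the reference input; DataStage does the partitioning and sorting for you):

def merge_join(left, right, key):
    # Stream merge of two inputs already sorted on `key`; only the current
    # row from each side is held in memory, never the whole reference table.
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left, None)
        elif l[key] > r[key]:
            r = next(right, None)
        else:
            yield {**r, **l}          # inner-join match: emit the combined row
            l = next(left, None)      # assumes unique keys on the right input

source = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 4, "amt": 5}]
ref    = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
print(list(merge_join(source, ref, "id")))    # id 4 drops out (inner join)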

Posted: Tue Feb 28, 2006 6:56 pm
by rasi
Using a database table as the lookup is another option you could consider. I have never tried it with 30 million records; I hope someone who has done this can comment.

Re: Datasets having 30M rows of data

Posted: Tue Feb 28, 2006 9:07 pm
by Ultramundane
Could anyone comment on the size limitations of using datasets?
Any recommendations on the design of the lookups?
Just one gotcha that I know about. In version 7.0.1 of DataStage (I believe), Ascential changed the way space is allocated for varchar columns: DataStage will allocate the full space of a varchar(x) column for every record, even if the column is nullable. For instance, if I had a column defined as follows:

COL01 VARCHAR(500) NULL

DataStage would allocate a full 500 bytes for every record even if the average value was only 1 byte. The fix is to define the column as varchar without a given length:

COL01 VARCHAR NULL
or
COL01 VARCHAR NOT NULL

I continue to be plagued by this myself. The varchar(x) issue with datasets produces enormous files, and the size of those files in turn causes huge performance problems.
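To put numbers on it for the 30M-row case in this thread (using the 1-byte average from the example above, and ignoring the small per-value length overhead an unbounded varchar carries):

rows = 30_000_000
declared_len = 500      # bytes reserved per record by VARCHAR(500)
avg_actual = 1          # average bytes of real data per value

print(f"bounded VARCHAR(500): ~{rows * declared_len / 1024**3:.1f} GiB for this one column")
print(f"unbounded VARCHAR:    ~{rows * avg_actual / 1024**2:.1f} MiB of actual data")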