Datasets holding 30M rows of data

Post questions here relating to DataStage Enterprise/PX Edition for such areas as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sonia jacob
Participant
Posts: 122
Joined: Mon Jul 05, 2004 1:33 pm
Location: MA

Datasets holding 30M rows of data

Post by sonia jacob »

Hi,

We have a requirement to read 30M rows from a flat file, look them up against a table that also has 30M rows, and load the result to the target table.

We intend to use datasets in Parallel Extender. Hopefully we can partition the process to improve performance.

Could anyone comment on the size limitations of using datasets?
Any recommendations on the design of the lookups?

Thank you.
Regards
Sonia Jacob
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

There shouldn't be any issue with the dataset as long as its underlying files stay within the maximum file size permitted by the OS.
But if you are planning a Lookup File Set with a very large number of rows, it will certainly affect your performance: the fileset has to be loaded into memory, so your physical memory may be exhausted. Vincent has published a blog post comparing these sizes; you can also search this forum for it. For large reference data, prefer Join/Merge over Lookup (see the rough sketch below).
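To make the memory point concrete, here is a rough Python sketch (not DataStage code; the function names and tiny sample rows are invented for illustration) of the two access patterns: an in-memory lookup that must hold the whole reference set, versus a merge-style join that only streams key-sorted inputs.

# Rough sketch (plain Python, not DataStage) of the two access patterns.

def hash_lookup(stream_rows, reference_rows):
    """What a Lookup / Lookup File Set effectively does: build the whole
    reference set in memory, then probe it row by row."""
    ref = dict(reference_rows)                 # all ~30M reference rows held in RAM
    for key, data in stream_rows:
        yield key, data, ref.get(key)          # misses come back as None

def merge_join(stream_rows, reference_rows):
    """What a Join stage relies on: both inputs key-sorted (and, in PX,
    partitioned the same way), so only the current row of each side is held."""
    ref_iter = iter(reference_rows)
    ref_key, ref_payload = next(ref_iter, (None, None))
    for key, data in stream_rows:
        while ref_key is not None and ref_key < key:
            ref_key, ref_payload = next(ref_iter, (None, None))
        yield key, data, ref_payload if ref_key == key else None

if __name__ == "__main__":
    stream = [(i, f"src{i}") for i in range(0, 10, 2)]        # keys 0, 2, 4, 6, 8
    reference = [(i, f"ref{i}") for i in range(10)]           # keys 0..9
    print(list(hash_lookup(iter(stream), iter(reference))))
    print(list(merge_join(iter(stream), iter(reference))))

In the real job the Join stage gets the streaming behaviour by hash-partitioning and sorting both links on the join key.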

IMHT
Impossible doesn't mean 'it is not possible'; it actually means 'NOBODY HAS DONE IT SO FAR'
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

If you use a Lookup stage with 30 million rows you will find your job still needs to move most of that reference data into a lookup file; it cannot look up directly against a dataset. The dataset will still be fast in that it is already partitioned and sitting on the nodes. The Lookup File Set will be faster, as it is accessed directly by the Lookup stage without having to write anything to disk.

A Join stage or a Change Capture stage are your other options. If you are comparing several fields, Change Capture is easier to use than Lookup, as you don't get any of the null-value problems that come with a left outer join lookup (a rough sketch of the idea follows).
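To illustrate the null point: with a left outer lookup every unmatched row comes back with NULL reference columns, so each field comparison needs a null guard, whereas change capture compares the two key-matched inputs directly and simply labels every row. A hypothetical Python sketch of that delta logic (not the actual Change Capture stage; the sample rows are invented):

# Hypothetical sketch of change-capture style delta detection between two
# inputs ("before" and "after") keyed on the join key.

def change_capture(before, after):
    """Yield (change_code, key, row) where change_code is one of
    'copy', 'edit', 'insert', 'delete'. Dicts keyed on the join key
    are used here purely to keep the example short."""
    for key in sorted(set(before) | set(after)):
        if key not in before:
            yield "insert", key, after[key]
        elif key not in after:
            yield "delete", key, before[key]
        elif before[key] != after[key]:          # compare all value fields at once;
            yield "edit", key, after[key]        # no NULL guards needed on either side
        else:
            yield "copy", key, after[key]

before = {1: ("Smith", "MA"), 2: ("Jones", "TX")}
after  = {1: ("Smith", "CA"), 3: ("Brown", "NY")}
for code, key, row in change_capture(before, after):
    print(code, key, row)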
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

A Lookup stage for looking up 30M rows is not a good idea; the processes will die very often.
As stated in the DS manuals and by Burney, the Join stage should be your first option for this.
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Having a database table as the lookup is another option you can consider. I have never tried it with 30 million records; hopefully someone who has done this can comment.
Regards
Siva

Listening to the Learned

"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan

Re: Datasets holding 30M rows of data

Post by Ultramundane »

Could anyone comment on the size limitations of using datasets?
Any recommendations on the design of the lookups?
Just one gotcha that I know about. In version 7.0.1 of DataStage (I believe), Ascential changed the way they allocate space for VARCHAR columns: DataStage will allocate the full space of a VARCHAR(x) column for every record, even if the column is nullable. For instance, if I had a column defined as follows:

COL01 VARCHAR(500) NULL

DataStage would allocate a full 500 bytes for every record even if the average populated length was only 1 byte. The fix is to define the column as VARCHAR without a given length:

COL01 VARCHAR NULL
or
COL01 VARCHAR NOT NULL

I continue to be plagued by this myself. The VARCHAR(x) issue with datasets causes enormous files and huge performance problems because of the size of those files.
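A quick back-of-the-envelope calculation (my numbers, not from the post) shows the scale of the problem for a 30M-row dataset with one VARCHAR(500) column averaging 1 byte of real data:

# Back-of-the-envelope: fixed-width VARCHAR(500) allocation vs. actual data,
# assuming 30M records and an average populated length of 1 byte.

rows = 30_000_000
declared_len = 500          # bytes reserved per record for VARCHAR(500)
avg_actual_len = 1          # what the column really holds on average

fixed_width_bytes = rows * declared_len
actual_bytes = rows * avg_actual_len

print(f"fixed-width allocation: {fixed_width_bytes / 2**30:.1f} GiB")   # ~14.0 GiB
print(f"actual data:            {actual_bytes / 2**20:.1f} MiB")        # ~28.6 MiB

That is roughly 14 GiB reserved on disk for under 30 MiB of real data, per such column.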