Hi,
We have a requirement to read 30M rows from a flat file, look them up against a table of another 30M rows, and load the result into the target table.
We intend to use datasets in the parallel extender. Hopefully we can partition the process so as to improve performance.
Could anyone comment on the size limitations of using datasets?
Any recommendations on the design of the lookups?
Thank you.
Datasets having 30 M worth of data
-
- Participant
- Posts: 122
- Joined: Mon Jul 05, 2004 1:33 pm
- Location: MA
Regards
Sonia Jacob
There shouldn't be any issue with the dataset as long as the OS permits the maximum file size of the underlying sequential files.
But if you are loading a lookup file set with a very large number of rows, it will certainly affect your performance. You need to take care when loading the file set into memory: your physical memory may get exhausted. Vincent has posted a blog comparing lookup sizes; you can search this forum for it. For a large reference set, prefer Join/Merge over Lookup.
IMHO
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
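To illustrate the advice above, here is a minimal Python sketch of why Join/Merge scales better than Lookup for very large reference data: a lookup builds the entire reference set in memory, while a sorted merge join streams both inputs and holds only the current rows. The function names and record layout are my own illustration, not DataStage internals.

```python
def hash_lookup(source_rows, reference_rows, key):
    """What a Lookup stage effectively does: load the whole
    reference set into memory, then probe it per source row.
    Memory grows with the reference size (30M rows hurts here)."""
    ref = {r[key]: r for r in reference_rows}  # O(reference) memory
    for row in source_rows:
        yield row, ref.get(row[key])

def merge_join(source_rows, reference_rows, key):
    """What a Join stage effectively does: both inputs arrive
    sorted on the key, so we advance two cursors in step.
    Memory stays constant regardless of input size."""
    ref_iter = iter(reference_rows)
    ref = next(ref_iter, None)
    for row in source_rows:
        # advance the reference cursor until it catches up
        while ref is not None and ref[key] < row[key]:
            ref = next(ref_iter, None)
        match = ref if ref is not None and ref[key] == row[key] else None
        yield row, match

# tiny demo: both inputs must already be sorted on "id" for merge_join
src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
refd = [{"id": 1, "x": 10}, {"id": 3, "x": 30}]
print([(r["id"], m["x"] if m else None) for r, m in merge_join(src, refd, "id")])
# → [(1, 10), (2, None), (3, 30)]
```

The trade-off is the sort cost on both inputs, but for two 30M-row streams that is usually far cheaper than spilling an in-memory lookup table.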
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
If you use a Lookup stage with 30 million rows, you will find your job still needs to move most of that reference data into a lookup table. It cannot look up directly against a dataset. The dataset will still be fast in that it is already partitioned and on the nodes. The lookup file set will be faster, as it is accessed directly by the Lookup stage without having to write anything to disk.
A Join stage or a Change Capture stage are your other options. If you are comparing several fields, Change Capture is easier to use than Lookup, as you don't get any of the null-value problems you get from a left outer join lookup.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Having a database table as the lookup source is another option you can consider. I have never tried it with 30 million records; I hope someone who has done this can comment.
Regards
Siva
Listening to the Learned
"The most precious wealth is the wealth acquired by the ear; indeed, of all wealth that wealth is the crown." - Thirukural by Thiruvalluvar
-
- Participant
- Posts: 407
- Joined: Mon Jun 27, 2005 8:54 am
- Location: Walker, Michigan
- Contact:
Re: Datasets having 30 M worth of data
Just one gotcha that I know about. In version 7.0.1 (I believe) of DataStage, Ascential changed the way they allocate space for varchar columns. DataStage will allocate the full space of a varchar(x) column for every record, even if the column is nullable. For instance, say I had a column defined as follows:
COL01 VARCHAR(500) NULL
DataStage would allocate a full 500 bytes for every record, even if the average was only 1 byte. The fix is to define the column as varchar without a given length:
COL01 VARCHAR NULL
or
COL01 VARCHAR NOT NULL
I continue to be plagued by this myself. The varchar(x) issue with datasets causes enormous files, and the file sizes in turn cause huge performance problems.
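A quick back-of-envelope calculation shows why this gotcha matters at 30M rows. The model below is my own simplification (the 4-byte length prefix for unbounded varchar is an assumption, not DataStage's documented on-disk format), but it captures the order of magnitude:

```python
# Fixed-width layout (VARCHAR(500)): every record reserves the declared max.
# Unbounded varchar: store only the actual bytes plus a small length prefix.
DECLARED = 500          # declared VARCHAR(500) width
AVG_ACTUAL = 1          # average bytes actually used per record
ROWS = 30_000_000       # 30M records, as in this thread
LEN_PREFIX = 4          # assumed per-record length header (illustrative)

fixed_bytes = ROWS * DECLARED
unbounded_bytes = ROWS * (AVG_ACTUAL + LEN_PREFIX)

print(f"fixed-width VARCHAR(500): {fixed_bytes / 1e9:.1f} GB")   # → 15.0 GB
print(f"unbounded varchar:        {unbounded_bytes / 1e9:.2f} GB")  # → 0.15 GB
```

So a single mostly-empty bounded varchar column can add ~15 GB to the dataset, which is exactly the "enormous files" effect described above.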