Lookup Dataset sizing
Moderators: chulett, rschirm, roy
I don't know if it's documented anywhere in the standard PX documentation, but I need to estimate the size of a lookup table/dataset created in DataStage. Is there a guide for this?
In the current scenario, I have 29 million rows in PROD for the reference dataset, with each row having 2 varchar columns of length 150, 1 varchar column of length 25, and 2 decimal(38,10) columns for surrogate keys.
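For a back-of-envelope estimate, you can multiply the row count by a worst-case row width. The sketch below is my own arithmetic, not from official PX documentation: it assumes bounded varchar fields are stored at their full declared length plus a small length prefix, and that decimal(38,10) is stored packed at roughly one byte per two digits plus a sign/scale byte. The actual per-field overheads in a dataset may differ.

```python
# Rough sizing for the 29M-row reference dataset described above.
# All overhead figures are assumptions, not documented PX storage formats.

ROWS = 29_000_000

def varchar_bytes(declared_len: int, prefix: int = 4) -> int:
    """Worst-case bytes for a bounded varchar (assumed length prefix)."""
    return declared_len + prefix

def decimal_bytes(precision: int) -> int:
    """Packed-decimal estimate: two digits per byte plus a sign/scale byte."""
    return (precision + 1) // 2 + 1

row_bytes = (
    2 * varchar_bytes(150)   # two varchar(150) columns
    + varchar_bytes(25)      # one varchar(25) column
    + 2 * decimal_bytes(38)  # two decimal(38,10) surrogate keys
)

total_gb = ROWS * row_bytes / 1024**3
print(f"~{row_bytes} bytes/row, ~{total_gb:.1f} GB total before overhead")
```

Under these assumptions each row is a few hundred bytes and the whole reference set lands around 10 GB, which matters later in this thread when a ~7 GB table file fails to map on a 32-bit system.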
Complicating this is the fact that it's a single-node system on 32-bit Windows, so I'm forced to use execution mode = sequential. Apparently there were install-related issues for a multi-node config.
I tried using a Join stage (left outer) with data presorted on both key columns on the input and reference datasets, but the job failed with "DSEE-TFIO" errors like this:
ora_InvHdr_Vdr,0: Write to dataset on [fd 4] failed (Insufficient system resources exist to complete the requested service. ) on node node1, hostname A300SM111
There is also an existing server job that seems to work fine for now, but at forecasted data volumes the 32-bit hash file would reach its storage limit, and we are trying to avoid server jobs in the hope that the single-node issue will be resolved at some point.
A sparse lookup does not seem to be an option, as input data volumes are in the range of 500 K to 1 M rows every morning :)
I'm now trying to use a lookup file set, but that's just a testing exercise, and I'm awaiting results.
rameshrr3 wrote: I have 29 million rows in PROD for the reference dataset
Not a remarkable volume of data.
rameshrr3 wrote: Complicating this is the fact that it's a single-node system on 32-bit Windows, so I'm forced to use execution mode = sequential. Apparently there were install-related issues for a multi-node config.
Forgive me, but I am not conversant with Windoze and EE, so I was not aware of this problem. Is this really an issue?
Running with a single-node configuration should still not be a problem, and a source of 500 K to 1 million rows does not rule out a sparse lookup, especially if the table is in good shape and indexed properly.
The general guidance on when to use a join and when to use a lookup is around a 50:1 ratio of stream to lookup rows. You are well under this, so a lookup will be OK, as will a join. I have seen the guidance state a ratio as high as 100:1, but I think this is a bit much, and I would likely use a join, especially if it is a normal lookup.
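The rule of thumb above can be encoded as a tiny helper. This is a hypothetical sketch, not a DataStage API: reading the "50:1" figure as reference rows per stream row is the interpretation that matches this thread's numbers (29 M reference against roughly 1 M stream is 29:1, "well under" 50:1).

```python
# Hypothetical encoding of the join-vs-lookup rule of thumb from the post.
# The function name and threshold are illustrative assumptions.

def prefer_lookup(reference_rows: int, stream_rows: int,
                  max_ratio: float = 50.0) -> bool:
    """True when the reference set is small enough relative to the stream
    that a normal (in-memory) lookup is a reasonable choice."""
    return reference_rows <= max_ratio * stream_rows

print(prefer_lookup(29_000_000, 1_000_000))   # this thread's 29:1 case
print(prefer_lookup(200_000_000, 1_000_000))  # 200:1 would call for a join
```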
I have not used the lookup file set in many years and don't know anyone using it; I think either a relational lookup or a dataset is the way to go.
Do you have adequate disk and scratch space?
Mike Hester
mhester@petra-ps.com
Well, I have almost 50 GB of free space on the DataStage server disk.
The creation of the lookup file set failed with the following error (I had reduced the varchar columns to 50 bytes each after browsing through the data and using trim/substr functions in the Oracle query):
Lookup_File_Set_0,0: Could not map table file "D:/IBM/InformationServer/Server/Datasets/lookuptable.20100826.ln5f5na (size 7208638312 bytes)": Invalid argument
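For context, here is my own arithmetic on that error, not something from IBM documentation: the file it is trying to memory-map is about 6.7 GiB, which can never fit in one mapping inside a 32-bit process, since the whole 32-bit address space is 4 GiB and a 32-bit Windows process normally gets only about 2 GiB of user-mode address space.

```python
# Why a ~7.2 GB table file cannot be memory-mapped by a 32-bit process.
# The 2 GiB Windows user-space figure is the common default, an assumption here.

FILE_BYTES = 7_208_638_312       # size quoted in the error message
ADDRESS_SPACE_32BIT = 2**32      # 4 GiB: theoretical 32-bit ceiling
WINDOWS_USER_SPACE = 2**31       # ~2 GiB: typical 32-bit Windows user space

print(f"file is {FILE_BYTES / 2**30:.1f} GiB")
print("exceeds full 32-bit address space:", FILE_BYTES > ADDRESS_SPACE_32BIT)
print("exceeds typical Windows user space:", FILE_BYTES > WINDOWS_USER_SPACE)
```

So the "Could not map table file ... Invalid argument" failure is consistent with an address-space limit rather than a shortage of disk space.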
mhester wrote: either a relational lookup or dataset is the way to go.
These options had been tried earlier (before using the join), but encountered the all-too-familiar errors beginning with "Could not map table".
I'm really curious whether there is some limitation with 32-bit Windows that's causing the problem; total free disk space of 50 GB is not indicative of any underlying issue. Even when I just try to write the reference data from the Oracle DB to a File Set stage, the job hangs forever (> 20 minutes in this case) once it is 95% done in terms of rows transferred (which takes around 10 minutes).
Will a peek in the Windows event log help? It looks like the issue will be headed the IBM way by Monday :(
This is my first experience with Windoze and EE too. The Server edition on Windows did not behave in unexpected ways.
I have seen clients successfully use DataStage EE (32-bit) on 16 nodes with 64-bit Sun Solaris (IBM-unsupported at that time) and on Fujitsu hardware (IBM-unsupported for DataStage) with their own proprietary, award-winning disk and network technology (100 GBps NAS), but the rules change when it's Windows, I guess.
Check if this helps:
http://www-01.ibm.com/support/docview.w ... s&lang=all
or check if you can reduce your record length.
Thanks, folks. I did point my admin to the IBM website. In the meanwhile, I'm using one job to create a file set with the reference data, then using that file set with a left outer join in a subsequent job. So far it works (we are still in the load-testing phase)!
The catch with the disk space is that both the scratch and dataset (resource disk) directories are on the same mount point (the D: drive in Windows), so I'm not sure if that's part of the problem.
rameshrr3 wrote: I'm really curious whether there is some limitation with 32-bit Windows that's causing the problem
Well... your "could not map" issue is definitely an artifact of a 32-bit operating system, Windows or UNIX, and one that has bitten me in the butt in the past. Make sure you minimize how much data goes to each partition's lookup; for us that meant intelligent hash partitioning, so each node got only the records it 'needed'.
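A minimal sketch of that hash-partitioning idea, assuming a simple setup it does not describe in detail: route each record by a stable hash of its key so that every node has to map only its own slice of the reference data. The field names, node count, and hash choice here are illustrative, not from the post or from DataStage itself.

```python
# Hypothetical illustration of key-based hash partitioning: the same key
# always lands on the same node, so each node's lookup slice stays small.
import zlib

NODES = 4

def partition_of(key: str, nodes: int = NODES) -> int:
    """Stable partition number for a key (crc32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % nodes

records = [{"invoice_id": f"INV{n:05d}"} for n in range(10)]
buckets = {n: [] for n in range(NODES)}
for rec in records:
    buckets[partition_of(rec["invoice_id"])].append(rec)

for node, recs in sorted(buckets.items()):
    print(f"node {node}: {len(recs)} records")
```

The important property is that both the stream and the reference data are partitioned by the same key and the same function, so matching rows always meet on the same node.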
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers