Lookup Dataset sizing

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Lookup Dataset sizing

Post by rameshrr3 »

I don't know if it's documented anywhere in the standard PX documentation, but I need to estimate the size of a lookup table/dataset created in DataStage. Is there a guide for this?

In the current scenario, I have 29 million rows in PROD for the reference dataset, with each row having two varchar columns of length 150, one varchar column of length 25, and two decimal(38,10) columns for surrogate keys.
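A rough back-of-the-envelope sketch of what that works out to (the per-type storage costs here are my assumptions, not documented figures; varchars carry some length-prefix overhead and decimal(38,10) is stored packed):

    # Rough estimate of the reference dataset size; per-type costs are assumptions.
    rows = 29_000_000
    varchar_bytes = (150 + 4) + (150 + 4) + (25 + 4)  # data plus an assumed 4-byte length prefix
    decimal_bytes = 2 * 20                            # decimal(38,10) packed, ~20 bytes each (assumed)
    record_bytes = varchar_bytes + decimal_bytes      # ~377 bytes per record
    total_gb = rows * record_bytes / 1024**3
    print(f"~{record_bytes} bytes/record, ~{total_gb:.1f} GB total")  # ~10.2 GB

So even before any overhead, this is on the order of 10 GB of reference data.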

Complicating this is the fact that it's a single-node system on 32-bit Windows, so I'm forced to use execution mode = sequential. Apparently there were install-related issues with the multi-node configuration.
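For reference, the kind of multi-node configuration file we could not get working would look roughly like this two-node sketch (the fastname comes from the error message below; the disk and scratch paths are assumptions based on our install directory):

    {
        node "node1"
        {
            fastname "A300SM111"
            pools ""
            resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
        node "node2"
        {
            fastname "A300SM111"
            pools ""
            resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }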

I tried using a Join stage (left outer) with the data presorted on the key columns on both the input and reference datasets, but the job failed with "DSEE-TFIO" errors like this:

ora_InvHdr_Vdr,0: Write to dataset on [fd 4] failed (Insufficient system resources exist to complete the requested service. ) on node node1, hostname A300SM111


There is also an existing server job that seems to work fine for now, but at forecasted data volumes the 32-bit hashed file would reach its storage limit, and we are trying to avoid server jobs in the hope that the single-node issue will be resolved at some point.

A sparse lookup does not seem to be an option, as input volumes are in the range of 500K to 1M rows every morning. :)

I'm now trying to use a Lookup File Set, but that's just a testing exercise; awaiting results. :oops:
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

rameshrr3 wrote: I have 29 million rows in PROD for the reference dataset
Not a remarkable volume of data.

rameshrr3 wrote: Complicating this is the fact that it's a single-node system on 32-bit Windows, so I'm forced to use execution mode = sequential. Apparently there were install-related issues with the multi-node configuration.
Forgive me, but I am not conversant with Windoze and EE, so I was not aware of this problem. Is this really an issue?

Running with a single-node configuration should still not be a problem, and a source of 500K to 1 million rows does not rule out a sparse lookup, especially if the table is in good shape and properly indexed.

The general guidance on when to use a Join and when to use a Lookup is around a 50:1 ratio of stream to lookup data. You are well under this, so the Lookup will be fine, as will a Join. I have seen the guidance state a ratio as high as 100:1, but I think that is a bit much, and I would likely use a Join, especially if it is a normal lookup.

I have not used the Lookup File Set in many years and don't know anyone using it; I think either a relational lookup or a dataset is the way to go.

Do you have adequate disk and scratch space?
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Well, I have almost 50 GB of free space on the DataStage server disk.

The creation of the lookup fileset failed with the following error (I had reduced the varchar columns to 50 bytes each after browsing the data and using TRIM/SUBSTR functions in the Oracle query):


Lookup_File_Set_0,0: Could not map table file "D:/IBM/InformationServer/Server/Datasets/lookuptable.20100826.ln5f5na (size 7208638312 bytes)": Invalid argument

mhester wrote: either a relational lookup or dataset is the way to go.
Those options had been tried earlier (before using the join), but ran into the all-too-familiar errors beginning with "Could not map table".
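For what it's worth, the 7,208,638,312 bytes in that error is several times what a 32-bit process can map into its address space. A quick sanity check, assuming the usual 2 GB per-process limit on 32-bit Windows:

    # The fileset is ~7.2 GB; a 32-bit process maps at most 2 GB by default
    # (3 GB with the /3GB boot switch, which would still be far too small).
    fileset_bytes = 7_208_638_312
    addr_limit = 2**31  # 2,147,483,648 bytes
    print(f"{fileset_bytes / addr_limit:.1f}x the mappable limit")  # ~3.4x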
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

I'm really curious whether there is some limitation with 32-bit Windows that's causing the problem. Total disk free space of 50 GB is not indicative of any underlying issue. Even when I just try to write the reference data from the Oracle DB to a File Set stage, the job hangs forever (> 20 minutes in this case) once it is 95% done in terms of rows transferred (reaching that point takes around 10 minutes).

Will a peek at the Windows event log help? It looks like the issue will be headed IBM's way by Monday. :(

This is my first experience with Windoze and EE, too. The Server edition on Windows did not behave in unexpected ways.

I have seen clients successfully use DataStage EE (32-bit) on 16 nodes on 64-bit Sun Solaris (unsupported by IBM at the time) and on Fujitsu hardware (unsupported by IBM for DataStage) with their own proprietary, award-winning disk+network technology (100 GBps NAS)... but the rules change when it's Windows, I guess.
madhukar
Participant
Posts: 86
Joined: Fri May 20, 2005 4:05 pm

Post by madhukar »

Check if this helps:
http://www-01.ibm.com/support/docview.w ... s&lang=all

Or check if you can reduce your record length.
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Whenever I get a space issue, I add all the mounts to the resource disk and scratch disk space; generally it is around 20 GB, with no issues.

You can overcome the 2 GB file issue by pointing to the mount which has the 50 GB of space you referred to.

Regards
Sreeni
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Thanks, folks. I did point my admin to the IBM website. In the meantime, I'm using a job to create a fileset with the reference data and using that fileset in conjunction with a left outer join in a subsequent job. So far it works (we are still in the load-testing phase)!

The problem with the disk space is that both the scratch and dataset (resource disk) directories are on the same mount point (the D: drive in Windows), so I'm not sure if that's part of the problem.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

rameshrr3 wrote: I'm really curious whether there is some limitation with 32-bit Windows that's causing the problem
Well... your "could not map" issue is definitely an artifact of a 32-bit operating system - Windows or UNIX - and one that has bitten me in the butt in the past. Make sure you minimize how much data goes to each partition's lookup; for us that meant intelligent hash partitioning to ensure each node got only the records it 'needed'.
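To illustrate the idea (a toy sketch only, nothing to do with DataStage's internal partitioner; the node count and key value are hypothetical):

    # Toy illustration of hash partitioning: each key lands on exactly one node,
    # so each node's lookup table holds only ~1/NODES of the reference rows.
    from zlib import crc32

    NODES = 4  # hypothetical node count

    def node_for(key: str) -> int:
        # Any stable hash works; DataStage's actual hash function is internal.
        return crc32(key.encode()) % NODES

    print(node_for("VENDOR-00042"))  # stream and reference rows with this key meet on the same node

With 29M reference rows spread over four nodes, each partition's lookup drops to roughly 7.25M rows.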
-craig

"You can never have too many knives" -- Logan Nine Fingers
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Yes. As I stated before, the one-node system isn't of great help in partitioning; I guess I wouldn't have run into this if I had a few extra 'nodes'. :wink:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Any reason you can't "add" them? Run on more than one node to cut down on the lookup size?
-craig

"You can never have too many knives" -- Logan Nine Fingers
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

chulett wrote: Any reason you can't "add" them?
Well, my admin says there's a config problem on the Windows server and they are probably working with the Infernal Blue Machines on it; further probing was not encouraged by the powers that be. :twisted: