Hashed File Problems with large record set
Hello,
I have a question regarding a possible limitation in hashed files.
We have a job that performs a lookup against a hashed file, and it was tested and works correctly with small amounts of data. The same lookup fails to return any values when the hashed file contains 500,000 records.
Debug mode showed every record as containing NULL values (which is incorrect) when the hashed file was this large.
Pre-load to memory was set to disabled and there are no disk issues.
Does anyone have any ideas about this behaviour?
thanks,
SPA
from SPA_BI
When you say "failed to return any values", do you mean that none of the source records had a hit on the hashed file?
Hashed files have size limitations. 2.2 GB is the limit. But this barrier can be overcome by making it a 64 bit hashed file. Search the forum for more information on "how to".
Make sure your hashed file keys are trimmed and the source keys are also trimmed before doing the lookup.
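For what it's worth, converting an existing dynamic hashed file to 64-bit is normally done from TCL (the UniVerse command shell underneath DataStage). The line below is only a sketch - the file name is an example, and you should check the forum "how to" posts for the exact options on your release:

RESIZE MyHashedFile * * * 64BIT

As I understand it, the asterisks retain the file's current parameters and only switch the addressing to 64-bit, which lifts the 2.2 GB ceiling mentioned above.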
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
thanks for the reply.
I should have pointed out that the hash file is a lookup of itself (after an aggregation stage); so there are no issues with the key matching up.
It works well when I cull the number of records down, but it seems the sheer number of records causes the lookup to fail.
So without the 2.2 GB limit being reached, I'm wondering why the lookup works with a small number of records but not with the large amount.
Could resource demands on the server result in this behaviour?
from SPA_BI
I don't know about anyone else, but I'd appreciate a clarification as to what this statement means:
SPA_BI wrote: The hash file is a lookup of itself.
The 'trim' would help get past the classic 'lookup doesn't work' problem - the keys don't match because of extraneous whitespace in one or both values. People load "A" from one source and try to match it to "A " from another.
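Purely as an illustration (the link and column names here are hypothetical), the trim would go on both sides of the match, i.e. in the Transformer derivations:

Key column written to the hashed file:   Trim(SourceLink.CUST_KEY)
Lookup key expression on the stream link:   Trim(InputLink.CUST_KEY)

Trim() strips the leading and trailing spaces so that "A" and "A " end up as the same key value.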
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Please describe your job more completely. In particular: what is the data type of the lookup key, what is the data type of that column coming out of the Aggregator stage, and what happens to the column in the Aggregator stage (is it grouped, or does it have an aggregate function applied to it)? Mention also whether you have read cache and/or write cache enabled in the Hashed File stages.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SPA_BI wrote: ...The hash file is a lookup of itself...
My thought there is that you might have buffered writes turned on, so that a lookup of a record might not return a value because it hasn't actually been written to the file yet. This could happen more frequently when a lot of writes are done and are buffered. As stated earlier, a more detailed description might help clarify that this is not the problem.
If you attempt to write past the default 2 GB limit (since dynamic files are stored in two OS files, with most of the data written to one of them, the actual limit is not exactly predictable and is slightly over 2 GB), you will get write errors and most likely a corrupted file. With 500,000 records you would need an average record length of over 4,096 bytes to exceed 2 GB - is this the case?
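As a quick sanity check on that arithmetic:

2,147,483,648 bytes / 500,000 records = roughly 4,295 bytes per record

so the rows would have to average over about 4 KB each before the 2 GB boundary comes into play. For a dynamic (type 30) file you can also just look at the sizes of the two operating-system files (DATA.30 and OVER.30) inside the hashed file's directory to see how close you actually are.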
I highly doubt it's the 2.2 GB limit issue here, ArndW. As you noted, the job would abort and at least spit out a message in the log file. None of that is happening. I think your analysis about buffered writes might be it. Let's see what the OP comes back with.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.