Is creating a hashed file or creating an index more efficient?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Is creating a hashed file or creating an index more efficient?

Post by allavivek »

Hi All,

I am a newbie to DataStage. Firstly, this forum is very useful for learning, and thanks to everyone contributing to help people learn and solve issues.

I have a question about the speed and efficiency of identifying a record when there are a large number of records (say, 10,000). Which is better:
1. Creating indexes on the data in the table, or
2. Creating a hash file for the data?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Since it seems we're talking about a lookup in the Server product, consider the two approaches you've listed.

1) Singleton trips across the network for every incoming row in the job to do a SQL select against your target table. Sure, an index 'helps' here but this is (IMHO) a viable solution only for very low volume processing jobs.

2) Local, typically in-memory cached, very efficient equi-join hashed structure.
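
As a rough sketch of what option 2 boils down to, here it is expressed in DataStage BASIC. This is illustration only: in a real Server job you would normally just wire a Hashed File stage to a Transformer reference link rather than code it, and the file name 'CUST_LOOKUP', its key and its columns are invented for the example.

      * Open the hashed file (assumes it is visible in the project account).
      OPEN '', 'CUST_LOOKUP' TO LookupFile ELSE
         Call DSLogFatal('Cannot open hashed file CUST_LOOKUP', 'LookupSketch')
      END

      Key = '10042'                        ;* key value from the incoming row

      * A single keyed read replaces the per-row SQL select of option 1.
      READ Rec FROM LookupFile, Key THEN
         CustName = Rec<1>                 ;* non-key columns come back in one record
         Region   = Rec<2>
      END ELSE
         CustName = ''                     ;* key not found: behave like a failed lookup
         Region   = ''
      END

      CLOSE LookupFile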

ps. 10,000 records doesn't come close to 'huge' in my experience. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Post by allavivek »

Thanks chulett for your reply.

So, if the data is large, say around a million records, is it still efficient to use a hashed file stage to store the data before further processing?

I have another question:

Q. Hashed files store non-key values as dynamic arrays. I want to know how key values are mapped to the non-key values: is it like a tree, with key values as parent nodes pointing to the non-key values, or some other structure?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Still not huge and yes, can still be efficient. :wink:

I'll let others with more intimate knowledge of the inner workings of hashed files respond, but they're not stored as 'dynamic arrays'. No idea who wrote the document I linked (Ken, perhaps?), and I take exception to their use of the word 'small'; since they don't quantify it, it's hard to say just how small 'small' is, but it seems like it might help answer your questions.

You could google around for other resources as well; they are out there.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ps. Welcome to the proper forum for this discussion.
-craig

"You can never have too many knives" -- Logan Nine Fingers
allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Post by allavivek »

Thanks chulett for your response.

And I'll post in the proper forum from next time.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Keys and non-key columns are stored adjacent to each other in the DataStage (UniVerse) hashed file structure. Each page consists of a contiguous set of records. The key value is used to determine the physical location of the page on which the record is stored (using a hashing algorithm, from which the name HASHED FILE is obtained), and the physical record may be retrieved from there. A "standard" hashed file is limited to 2GB of data (which is typically way more than a million records) but, if 64-bit addressing is enabled, then the hashed file can theoretically hold 1.9 million TB of data - though most operating systems limit this to 1TB or 9TB.

Incidentally, records ARE stored as dynamic arrays in DataStage hashed files.
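
To make the "hashed" part concrete, here is a deliberately over-simplified illustration of the key-to-group mapping described above. This is NOT the real UniVerse hashing algorithm, and the modulus and key value are made up for the example.

      * Toy example only: the key hashes to a group (page) number, and the whole
      * record, key plus non-key columns, is stored in that group.
      Modulus = 97                         ;* assumed current number of groups
      Key     = 'CUST10042'                ;* assumed key value

      Total = 0
      For I = 1 To Len(Key)
         Total = Total + Seq(Key[I,1])     ;* Seq() returns the character code
      Next I

      GroupNo = Mod(Total, Modulus) + 1    ;* the record lives somewhere in this group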
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Interesting. Makes sense; it's just that I had read the document I linked to, and there was this passage in it on that point:
The data is stored as a continuous string of characters, with a column separator (@AM system variable). Each column is positional in nature and relative to the columns preceding it. It should become obvious that a hash file is almost the same as a sequential file, meaning columns are positional and referenced by position, not name.
So, is that incorrect, or did I just not put it together with it being a dynamic array? It sure sounds like one; I guess the @AM part just confused me, as I thought they used Field (or Value) marks rather than whatever the heck the 'A' stands for. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The "A" stands for "attribute". "Attribute mark" is a synonym for "field mark". Therefore storage really is a dynamic array in a DataStage hashed file.

The key is separated from the rest of the record by a "segment mark" (Char(255)). On that basis the entire record (key and non-key pieces) can be considered a dynamic array.

It's actually a tad more complex than that, as the record is padded to a whole "word" length - that is, a multiple of the word size (32-bit or 64-bit). And there are forward and backward link pointers within a group (page) as well as a few flag bits.
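
For anyone following along, a small BASIC sketch of the dynamic array being described; the file name, key and column values are invented for the example.

      * The non-key columns are one string separated by attribute (field) marks;
      * the key is held separately and is what READ and WRITE use.
      OPEN '', 'CUST_LOOKUP' TO LookupFile ELSE Call DSLogFatal('Cannot open CUST_LOOKUP', 'DynArraySketch')

      Key = 'CUST10042'

      Rec    = ''
      Rec<1> = 'Acme Pty Ltd'              ;* first non-key column
      Rec<2> = 'Sydney'                    ;* second non-key column
      Rec<3> = '2010-05-01'

      * Internally Rec is now 'Acme Pty Ltd':@AM:'Sydney':@AM:'2010-05-01',
      * so columns are positional, exactly as the quoted passage says.
      City = Field(Rec, @AM, 2)            ;* same value as Rec<2>

      WRITE Rec TO LookupFile, Key         ;* key and record are stored together in the group
      CLOSE LookupFile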
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Cool, thanks for that.
-craig

"You can never have too many knives" -- Logan Nine Fingers