Is creating a hashed file or creating an index more efficient?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Is creating a hashed file or creating an index more efficient?

Post by allavivek »

Hi All,

I am a newbie to DataStage. Firstly, this forum is very useful for learning, and thanks to everyone contributing to help people learn and solve issues.

I have a question about the speed and efficiency of identifying a record when there are a large number of records (say, 10,000). Which is better:
1. Creating indexes on the data in the table, or
2. Creating a hash file for the data?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Since it seems we're talking about a lookup in the Server product, consider the two approaches you've listed.

1) Singleton trips across the network for every incoming row in the job to do a SQL select against your target table. Sure, an index 'helps' here but this is (IMHO) a viable solution only for very low volume processing jobs.

2) Local, typically in-memory cached, very efficient equi-join hashed structure.
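
As a rough sketch of what option 2 boils down to, here it is expressed in DataStage BASIC. This is illustration only: in a real Server job you would normally just wire a Hashed File stage to a Transformer reference link rather than code it, and the file name 'CUST_LOOKUP', its key and its columns are invented for the example.

      * Open the hashed file (assumes it is visible in the project account).
      OPEN '', 'CUST_LOOKUP' TO LookupFile ELSE
         Call DSLogFatal('Cannot open hashed file CUST_LOOKUP', 'LookupSketch')
      END

      Key = '10042'                        ;* key value from the incoming row

      * A single keyed read replaces the per-row SQL select of option 1.
      READ Rec FROM LookupFile, Key THEN
         CustName = Rec<1>                 ;* non-key columns come back in one record
         Region   = Rec<2>
      END ELSE
         CustName = ''                     ;* key not found: behave like a failed lookup
         Region   = ''
      END

      CLOSE LookupFile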

ps. 10,000 records doesn't come close to 'huge' in my experience. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Post by allavivek »

Thanks chulett for your reply.

So, if the data is large, say around a million records, is it still efficient to use a hashed file stage to store the data before further processing?

I have another question:

Q. Hashed files store non-key values as dynamic arrays. I want to know how key values are mapped to the non-key values: is it like a tree, with key values as parent nodes pointing to the non-key values, or some other structure?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Still not huge and yes, can still be efficient. :wink:

I'll let others with more intimate knowledge of the inner workings of hashed files respond, but they're not stored as 'dynamic arrays'. No idea who wrote the document I linked (Ken, perhaps?), and I take exception to their use of the word 'small'; since they don't quantify it, it's hard to say just how small 'small' is, but it seems like it might help answer your questions.

You could google around for other resources as well; they are out there.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ps. Welcome to the proper forum for this discussion.
-craig

"You can never have too many knives" -- Logan Nine Fingers
allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Post by allavivek »

Thanks chulett for your response.

And I'll post in the proper forum from next time.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Keys and non-key columns are stored adjacent to each other in the DataStage (UniVerse) hashed file structure. Each page consists of a contiguous set of records. The key value is used to determine the physical location of the page on which the record is stored (using a hashing algorithm, from which the name HASHED FILE is obtained), and the physical record may be retrieved from there. A "standard" hashed file is limited to 2GB of data (which is typically way more than a million records) but, if 64-bit addressing is enabled, then the hashed file can theoretically hold 1.9 million TB of data - though most operating systems limit this to 1TB or 9TB.

Incidentally, records ARE stored as dynamic arrays in DataStage hashed files.
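
To make the "hashed" part concrete, here is a deliberately over-simplified illustration of the key-to-group mapping described above. This is NOT the real UniVerse hashing algorithm, and the modulus and key value are made up for the example.

      * Toy example only: the key hashes to a group (page) number, and the whole
      * record, key plus non-key columns, is stored in that group.
      Modulus = 97                         ;* assumed current number of groups
      Key     = 'CUST10042'                ;* assumed key value

      Total = 0
      For I = 1 To Len(Key)
         Total = Total + Seq(Key[I,1])     ;* Seq() returns the character code
      Next I

      GroupNo = Mod(Total, Modulus) + 1    ;* the record lives somewhere in this group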
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Interesting. Makes sense; it's just that I had read the document I linked to, and there was this passage in it on that point:
The data is stored as a continuous string of characters, with a column separator (@AM system variable). Each column is positional in nature and relative to the columns preceding it. It should become obvious that a hash file is almost the same as a sequential file, meaning columns are positional and referenced by position, not name.
So, is that incorrect, or did I just not put it together with it being a dynamic array? It sure sounds like one; I guess the @AM part just confused me, as I thought they used Field (or Value) marks rather than whatever the heck the 'A' stands for. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The "A" stands for "attribute". "Attribute mark" is a synonym for "field mark". Therefore storage really is a dynamic array in a DataStage hashed file.

The key is separated from the rest of the record by a "segment mark" (Char(255)). On that basis the entire record (key and non-key pieces) can be considered a dynamic array.

It's actually a tad more complex than that, as the record is padded to a whole "word" length - that is, a multiple of the word size (32-bit or 64-bit). And there are forward and backward link pointers within a group (page) as well as a few flag bits.
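
For anyone following along, a small BASIC sketch of the dynamic array being described; the file name, key and column values are invented for the example.

      * The non-key columns are one string separated by attribute (field) marks;
      * the key is held separately and is what READ and WRITE use.
      OPEN '', 'CUST_LOOKUP' TO LookupFile ELSE Call DSLogFatal('Cannot open CUST_LOOKUP', 'DynArraySketch')

      Key = 'CUST10042'

      Rec    = ''
      Rec<1> = 'Acme Pty Ltd'              ;* first non-key column
      Rec<2> = 'Sydney'                    ;* second non-key column
      Rec<3> = '2010-05-01'

      * Internally Rec is now 'Acme Pty Ltd':@AM:'Sydney':@AM:'2010-05-01',
      * so columns are positional, exactly as the quoted passage says.
      City = Field(Rec, @AM, 2)            ;* same value as Rec<2>

      WRITE Rec TO LookupFile, Key         ;* key and record are stored together in the group
      CLOSE LookupFile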
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Cool, thanks for that.
-craig

"You can never have too many knives" -- Logan Nine Fingers