Hash File, Create File Options max number for min modulus

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Hash File, Create File Options max number for min modulus

Post by wdudek »

It has been suggested to us that creating our hash files with the minimum modulus set to the current size reported by ANALYZE.FILE will improve our performance, since the file will not need to be resized as often as it otherwise would (it grows daily). This seems to have been working for us, but now the modulus suggested by this method is too large to enter into the minimum modulus box. Can I run CREATE.FILE from the Administrator command prompt, or anywhere else, with the larger number? The number being reported is 1,020,721, and the largest number I can enter through the GUI is 999,999.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

First of all, does the need for such a large file really exist? You are talking HUGE, gigabytes in size. In six years of using DataStage in multi-terabyte warehouses, the need for hash files with such a large modulus has never come up; they are useful objects when kept within certain size characteristics. Think of it this way: one LARGE reference object means you first have to populate it, and the bigger it is, the less efficient that is.

If you have a lookup of that size, it follows that you have an equally significant amount of source data to process. If so, YOU HAVE TO USE DIVIDE-AND-CONQUER techniques with instantiated jobs so as to utilize all available CPU power. Using such techniques introduces the ability to separate (partition) source data, and therefore to separate (partition) large reference lookups using the same partitioning strategy. This would allow you to use small, more efficient hash files that are quicker to load. Where you had one large file, you would now have N smaller ones, each containing the subset of rows for its partition of the data.

Some things in your question suggest to me that your premise may actually be flawed. When you state "it grows daily", are you saying this hash file is not a temporary work construct, but actually has data persistency that carries from one run to the next? And that if this hash file gets corrupted or errant data processed into it, you have to do some sort of "refresh" to make it correct? I hope this is not the situation, for you have built a copy of target data within your ETL tool. If it is, you MUST move to a strategy of clearing and populating your lookups with each run.

You can easily scan your source data and stage the distinct list of natural keys in your source dimension and fact data. Put that into an ETL work table within your warehouse. Inner join that table to your large dimension or fact table and write those rows to your hash lookup. Most data warehouses only process a minor fraction of source data, so only a minor fraction of target data has to be within your sandbox. The trick is having just enough to satisfy processing the current load. I suspect you've got the whole enchilada.
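As a rough sketch of that staging pattern (the table and column names below are purely illustrative, and the work-table load itself could be sqlldr, OCI, whatever you have handy):

    -- 1. Load the distinct natural keys scanned from source into ETL_WORK_KEYS.
    -- 2. Let Oracle do the join; the result set feeds the hash lookup load job.
    SELECT d.*
      FROM DIM_CUSTOMER d
      JOIN ETL_WORK_KEYS k
        ON k.CUSTOMER_NK = d.CUSTOMER_NK;

Only the dimension rows whose keys actually appear in today's source data come back, so the hash file stays small.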

Check this out if you haven't already seen it:
viewtopic.php?t=85364

By the way, in answer to your direct question (shudder), you can manually create the hash file as a 64-bit file (snooze while it's running, it will take a while, and make sure you've got the space). You're probably not putting your hash files into an external path either, but if you did, you would see in the job log the command "mkdbfile blah blah blah", and all you do is directly execute that command with the keyword 64BIT added on. I DO NOT RECOMMEND THE GLOBAL 64BIT SETTING in uvconfig for all hash files. :cry:
Last edited by kcbland on Tue Dec 30, 2003 8:33 pm, edited 1 time in total.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Re: Hash File, Create File Options max number for min modulus

Post by ray.wurlod »

wdudek wrote:It has been suggested to us that creating our hash files with the minimum modulus set to the current size reported by ANALYZE.FILE will improve our performance, since the file will not need to be resized as often as it otherwise would (it grows daily). This seems to have been working for us, but now the modulus suggested by this method is too large to enter into the minimum modulus box. Can I run CREATE.FILE from the Administrator command prompt, or anywhere else, with the larger number? The number being reported is 1,020,721, and the largest number I can enter through the GUI is 999,999.
To be pedantic, you're not saving any I/O, you're just time shifting it. Instead of growing while data are being added, you're growing the file before data are added to it.

Do you really need such a large hashed file? You're up around the 2GB limit, so you will need to be really careful. Check out the unsupported utility Hashed File Calculator, which can generate the appropriate command based on average record size and number of records.

Then you can execute the command through the Administrator client's command window, or in a telnet session.
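For what it's worth, the command the calculator produces will be along these lines; the file name and parameters below are only an illustration (including the 64BIT keyword Ken mentioned), so do let the calculator confirm them against your own record sizes:

    CREATE.FILE MyBigLookup DYNAMIC MINIMUM.MODULUS 1020721 64BIT

Running ANALYZE.FILE against the new file afterwards should show the minimum modulus you asked for.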

However, before you do that, scrutinize your design with an intense scruit.
Are there any columns you're loading that are not used in the ETL design? If so, don't load them! Are there any rows you're loading that are not used in your ETL design? If so, don't load them! Suddenly you need a much smaller hashed file, one that may be able to be loaded into a memory cache (which a 2GB+ hashed file definitely cannot).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

Thank you for the replies. To answer the concern about the hash file being used for permanent storage: no, it is not. The scenario is as follows.

We have a UniData database, created by third-party software, in the 30 GB range. The powers that be decided that this was not easily accessible and that it should be available in an Oracle database. Also, in 90 percent of the UniData files there is no update date (that we are aware of) which we could use to limit the data pulled over to only the modified records. And since this is what the people who sign my check want, I am forced to extract almost the entire database nightly.

So, getting down to the point of the hash files: we are using them to normalize the data on the DataStage server, as opposed to the Unix server where the UniData database resides, because our vendor has told us that this is the best way to do it. In most cases we are not doing lookups against the hash file, but only using it to normalize the data, as we try to mimic the table structure of the UniData database in Oracle as much as possible.

I've seen the above-mentioned post, and even have a printed copy lying here on my desk, but unless my limited knowledge of hash files is really wrong, I'm at least not breaking any rules with what I'm trying to do, although at this point I am wondering if I am going in the wrong direction.

Please provide any thoughts or comments on this; this forum has been one of the best places I have found for learning.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Well, you get the "Between a Rock and a Hard Place Award" for 2003.

So, are you pulling from a UniData stage directly into a hashed file stage? Why not normalize coming out of the UniData stage into a sequential file? If the hash file is not needed for normalization purposes, then skip it.

Let's say you have a large source UniData file. You write a job that reads from that file using normalization and streams to a sequential text file. Now, you put a WITH condition into the stage to partition the source data; if the primary key is an integer, do something like "ends with 0". Ten of those jobs, each getting the rows whose key ends with 0 through 9, will stream ten times as much data to your server as a single job would. In addition, once you get the data into sequential file form, you can do some tricks.
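To make the partitioning concrete, the constraint each instance applies would look something like the query below when typed at ECL (the ORDERS file name is made up, and the pattern syntax is worth double-checking against your UniData release); the other nine instances use "...1" through "...9":

    SELECT ORDERS WITH @ID LIKE "...0"

In the job itself you put just the WITH condition into the UniData stage and parameterise the trailing digit, so one job design serves all ten instances.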

You can take today's "0" file and compare it to yesterday's, and pop out the deltas. Do that for all 10 and you have the incremental changes. This is a simple technique, made possible by staging to sequential files. Let me know if you want me to expand on this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

Kenneth,

I like your methodology for finding the incremental changes, and there are many places using the Oracle OCI stage where I could use this to speed up our processing. From what I understand, I would be splitting the data into multiple flat files using an algorithm like the one you mentioned, and could then do something like a binary compare on a line-by-line basis to find the changed records, which would then be passed on to Oracle for inserts/updates.

In regards to the jobs using the large hash files, the destination of the data in DataStage is a text file formatted to work with the Oracle bulk loader. This file is then copied to the Oracle server and a stored procedure is executed that causes sqlldr to run as a process on the Oracle server. Using this methodology we are able to import flat files of 4-10 million records into Oracle in about five minutes apiece. As a note, these are mostly fairly narrow, usually containing about 5 columns. However, I will definitely try normalizing in the UniData stage and going directly into sequential files for the jobs currently using the hash files.
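For the curious, the sqlldr step looks roughly like this (names simplified for illustration; direct path is what keeps a multi-million-row load down to minutes):

    sqlldr userid=etl_user/etl_pwd control=orders.ctl log=orders.log direct=true

    -- orders.ctl, a minimal control file
    LOAD DATA
    INFILE 'orders.dat'
    APPEND
    INTO TABLE STG_ORDERS
    FIELDS TERMINATED BY '|'
    (ORDER_NO, ORDER_DATE, ITEM_NO, QTY, PRICE)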
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

For the delta technique, the comparison can be done several ways. You can use a CRC32 function (search the forum) to compute a checksum for each row and compare it to the previous run's CRC32 value. Or, if you have something like MKS Toolkit, the Unix diff command (or its equivalent) can be used: you sort the two files, run diff with today's file against yesterday's, then pipe the results through sed with a little syntax and voila, out comes anything new or different about today's file.
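A minimal sketch of that second flavour, assuming today's and yesterday's partition-0 extracts are plain text files (the names are illustrative):

    # sort both extracts so diff compares like with like
    sort today_0.txt     > today_0.srt
    sort yesterday_0.txt > yesterday_0.srt

    # lines unique to the first file (today's) are the new/changed rows;
    # diff marks them with "<", which sed strips off
    diff today_0.srt yesterday_0.srt | sed -n 's/^< //p' > deltas_0.txt

Rows that vanished since yesterday show up as "> " lines, so they can be captured the same way if you need deletes.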

Both techniques have their pros and cons, but that's two easy ways to do what you want.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

If your current employer will allow you to modify the dictionary of the source UniData file, then you could also put an "I" descriptor in the dictionary that will write the key value to a temp file any time a record is updated, added or deleted (essentially a trigger). I used this technique when I worked for a major ERP vendor and we were doing something very similar to what you are doing, with both UniData and UniVerse and DS.

If you are interested, I could dig out the particulars and send it to you or post the code here. It is a very efficient method of only processing new, changed or deleted rows.

Regards,

Michael Hester
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

Unfortunately we cannot change the structure of any of the UniData files without voiding the support we receive from the software vendor who provides the application riding on the database, but yes, that sounds like it would have been good for our scenario if we were allowed to do it.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's the beauty of the design; you're NOT changing the structure!
An I-descriptor is metadata describing a virtual field (think of it as a view or as a trigger - this will be clear when you see the technique). Indeed, if your customer is running UniData 6.0 you can use triggers directly. And if they won't even let you do any of that, you can put an I-descriptor in the VOC file! UniData and UniVerse are very flexible environments in this way.

To answer the original question, the maximum modulo for a static hashed file is 8,388,608. There is no documented upper limit for the modulo of a dynamic hashed file; with 64-bit addressing enabled it is of the order of 19 million TB divided by the group size.
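For a rough sense of where that figure comes from (my arithmetic, not a documented limit): 64-bit addressing gives 2^64 bytes, which is roughly 18.4 million TB, so with the default 2 KB group buffer the group count could in principle reach about 2^64 / 2048, or around 9 x 10^15 groups. In practice disk space, not modulus, is the ceiling.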
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

OK, I seem to be a little lost on the I-descriptor in the UniData file. I understand the concept in regards to how we would use this with DataStage, but the details on the UniData side are not that clear to me. Can someone provide some more information on this? We are running UniData 5.2.2.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

An I-descriptor is an entry in a UniData (or UniVerse) file dictionary that describes a virtual column - that is, a column that does not have an actual existence but the value in which is generated when the column name is referred to in a query. Needless to add, virtual columns (also known as I-types or V-types in UniData) are not updateable.
Each I-descriptor looks quite like a D-descriptor (a regular column definition) except that, instead of a location integer in field #2, it contains an expression - exactly the same kind of expression you might use in a DataStage output derivation, except that the DataStage functions, etc., are not available.
More information can be found in Chapter 5 (Creating Virtual Attributes) in the manual Using UniData, which can be downloaded from the IBM web site www-3.ibm.com/software/data/u2/pubs/lib ... a/Udtu.pdf
I-descriptors have to be compiled, which is quite like compiling a DataStage job in that it generates executable opcodes based on the derivation expression.
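Purely to illustrate the shape (the file and attribute names below are invented, and the expression is a trivial concatenation rather than the trigger-style logic Michael described), an I-descriptor viewed in the editor looks something like this:

    ED DICT CUSTOMER FULL.NAME
    0001: I
    0002: FIRST.NAME:" ":LAST.NAME
    0003:
    0004: Full Name
    0005: 30L
    0006: S

Type "I" in line 1, the derivation expression in line 2, then conversion, column heading, format and the single/multi-valued flag. A query such as LIST CUSTOMER FULL.NAME evaluates the expression at report time.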
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wdudek
Participant
Posts: 66
Joined: Mon Dec 08, 2003 10:44 am

Post by wdudek »

Nice source for documentation, although the website is actually at http://www-306.ibm.com/software/data/u2 ... 52unidata/ I have downloaded the documentation and will look into this. If we were to use this methodology, wouldn't we still need to pull all of the UniData data from our database and then use a reference to find the changed/new records? As a side point, our biggest bottlenecks seem to be coming out of UniData.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

As an experiment, you might want to investigate the dynamics of your bottleneck. If your problem is simply the bytes-per-second throughput of a single operation scanning the source UniData file and passing the results to the DataStage query, then look at the resources on the source server. Let's say it's a 4-CPU box: if your single job pulling from that server is choking a single CPU, then I bet four jobs pulling from that server will choke all four CPUs. And if those jobs each pull a different set of rows from the same file, you could potentially increase your net throughput by 4X.

What you want to do is make sure that all of the columns you are pulling ARE A or S attributes in the dictionary. If you're pulling I's or F's or T's, you could have underlying inefficiencies from the get-go. You may wish to search this forum for where I recommended setting up a DataStage account and putting F-pointers in that account pointing to the remote account's data files, but with local account dictionaries. That way you can set up clean dictionaries and not deal with inefficiently written attributes for data that doesn't even exist in the hash file.
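A quick way to check, assuming you can get to the ECL prompt on the source account (file name invented):

    LIST DICT ORDERS

and look at the type of each entry: D (or old-style A/S) attributes point at data that is physically stored, while I or V entries are computed at query time and drag that cost into every extract.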
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Nice thought, but you can't create F pointers from DataStage to UniData. :cry:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.