Hash File

rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Hash File

Post by rasi »

Hi,

Is it advisable to build a hash file lookup with 25 million rows, or to use a database plugin stage instead?

Kindly advise.
Rasi
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Hash file? Yes and no. You have to give a whole lot more information about what you're trying to do.

No because:
(1) skillful optimization will most likely be required
(2) this is so big that it implies persistence. If you corrupt the file, or load data that you then have to remove, you have put yourself in an unfavorable situation.
(3) if your average row size is large, you will either: (a) exceed the 32-bit file size limitation and have to create the file manually as 64-bit; (b) have to limit the number of columns so the data fits within the 32-bit limit; or (c) have to create a partitioned hash file (a distributed file). See the rough sizing sketch at the end of this post.

Yes because:
(1) it makes sense for the unstated situation you're facing
(2) your reference data is already in a hash file coming from a UniVerse source system, so why rebuild another large lookup local to DataStage when you could binary FTP the data files over?

Please give us a lot more information.
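
As a rough sizing check (the 2 GB figure is the standard 32-bit hashed file limit; the overhead comment is an assumption on my part, not a measurement from your system):

    2,147,483,648 bytes (2 GB limit) / 25,000,000 rows  ~ 85 bytes per row
    each record also carries its key plus record and group overhead,
    so the usable data per row is noticeably less than that 85 bytes

In other words, unless your reference rows are very narrow, 25 million of them will push you toward a 64-bit or distributed file.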

Thanks,

Kenneth Bland
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Hi,

The scenario is that we need to join three or more tables, with many filters. This hash file with 25 million rows will be used by other jobs, since the conditions and filters are the same. The hash file has to be re-created every month once the data is loaded into the warehouse. Hope this info helps you advise further.

Cheers
Rasi
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

Hi Rasi,
For the hash file, you need to know that the size limit is 2 GB. That can be a problem for you.
We had this problem with Server, and we now either use PX or change our design to use a temporary table.

With Server, I think the best approach is to test (data growth, performance).

Good Luck

Pey
degraciavg
Premium Member
Posts: 39
Joined: Tue May 20, 2003 3:36 am
Location: Singapore

Post by degraciavg »

If you want to create 64-bit hash files, make sure you set this in uvconfig in DSHOME:

64BIT_FILES 1

This will allow you to exceed the 2 GB limit, but it works ONLY if your platform allows you to create 64-bit files.
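
For reference, here is a minimal sketch of the sequence on a Unix install (the exact paths, and whether you must stop and restart the engine, are assumptions on my part, so check the documentation for your release first):

    # stop the DataStage engine before touching uvconfig
    cd $DSHOME
    bin/uv -admin -stop

    # edit uvconfig and set:  64BIT_FILES 1
    vi uvconfig

    # regenerate the engine configuration, then restart
    bin/uvregen
    bin/uv -admin -start

As far as I know this only affects hashed files created after the change; existing 32-bit files would still need to be resized or rebuilt.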

regards,
vladimir
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Rasi, you still have not given us enough information. To say you are joining three tables is irrelevant; you could say ten and it wouldn't make a difference. We are still guessing at what you are trying to do.

What does this mean: "The scenario is that we need to join three or more tables, with many filters. This hash file with 25 million rows will be used by other jobs, since the conditions and filters are the same"? Are you saying that you have a common lookup that many jobs will share, and you only want to compute the lookup a single time and then leverage the results in other jobs?

If so, then how does this make sense: "The hash file has to be re-created every month once the data is loaded into the warehouse"? Don't you mean BEFORE you load into the warehouse? Why would you create a hash file ONCE the data is loaded into the warehouse?

If you are building a lookup to facilitate a monthly load INTO your warehouse, I understand why the volume of the hash file could be so high. Because your incremental slice of data is at a monthly granularity instead of a daily one, it makes sense that your hash file would be large. However, what is the nature of the hash file? Is it incoming data that needs to be processed and transformed into a staging file? Is it data from the warehouse that you are putting into a hash file to facilitate references prior to loading, as when you check for the existence of a row in the warehouse and then handle it as an insert or an update? Or are you having to do a relational-style lookup, rather than a one-to-one keyed lookup against a hash file?

Your original question uses DB2 as an alternative reference technology, so I'm guessing that some of the reference data is already in a database, probably your target. You see, I'm having to guess because your original question is vague and your follow-up statement shed no light. If you are processing incoming data and referencing the warehouse with a complex query, your question makes more sense. You may be trying to avoid creating a runtime work table of computed results to simplify the complex query that DataStage would issue when referencing through the DB2 plugin. If that's the case, we need to know more about the query's structure to even tell you whether it would work in a hash file. Otherwise, if you want to do relational lookups using UV/ODBC on a 25 million row hash file, our answers would be radically different for that as well.
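
If the goal really is to compute the expensive query once and share the result, one common pattern looks roughly like this (a sketch only, with hypothetical job and file names, not something you have described):

    Job 1 (monthly, after the warehouse load):
        DB2/ODBC stage running the complex join/union query with the filters
        -> Transformer building the lookup key
        -> Hashed File stage writing, say, HF_CUSTOMER_LOOKUP

    Jobs 2..N (the downstream mart jobs):
        input stream -> Transformer with a reference link to HF_CUSTOMER_LOOKUP
        (each job does a keyed, one-to-one lookup, and the complex SQL never runs again)

But whether that works depends entirely on whether the lookups really are keyed, one-to-one reads.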

Please, please, please describe your need.

Kenneth Bland
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Hi Kenneth

Thanks for your reply. Sorry for not explaining in detail.

1. The hash file I am creating is built after the data is loaded into the data warehouse. It is a post-load job that re-creates the Data Mart monthly.
2. The hash file will be used as a common lookup by other jobs.
3. The hash file has to be recreated because customer relationships change and customer records get merged.
4. The query is quite complicated, with several inner joins and unions.
5. The whole point is that I don't want this query to run in every job, which would take ages to finish and use a lot of resources.

Hope this explains it in detail.

Cheers
Siva