DataStage HashFile

Posted: Sat Aug 07, 2004 3:13 am
by Rajendran
Hi,

I have some doubts about the Normalize on / Un-normalized option for the outputs of a Hashed File stage.

- When should we use this option?
- What is the purpose of Normalize on?
- How do we decide which field to normalize on?

Regards,
Rajendran

Posted: Sat Aug 07, 2004 6:00 am
by kduke
If you do a search, Ray has covered this. There are certain databases which can embed a table. UniVerse and UniData are probably the most common, besides COBOL files. In COBOL these are called repeating groups; in UniVerse they are called multivalues, with controlling and dependent associations.

In UniVerse you can store an invoice in one table. The invoice header fields are all single-valued, while the line items are all multivalued. Every line item has a part number, quantity and amount field. The part number is usually considered the controlling field; the other two fields are associated with it. DataStage can normalize these on the fly, making these fields look like they came from a separate table. The dictionary definitions for these fields need to define this relationship properly before DataStage can extract it properly. Most experienced UniVerse, UniData and other developers working with databases that have these capabilities know how to set this up.
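To make the idea concrete, here is a small Python sketch (not DataStage itself; the field names and the helper function are hypothetical) of what normalizing a multivalued invoice record looks like: the header fields repeat on every output row, and the controlling field is paired positionally with its dependent fields.

```python
# A UniVerse-style record: single-valued header fields plus
# multivalued line-item fields, associated by position.
invoice = {
    "invoice_no": "INV-1001",              # single-valued header fields
    "customer": "ACME",
    "part_no":  ["P-10", "P-20", "P-30"],  # controlling multivalued field
    "quantity": [5, 2, 1],                 # dependent fields, associated
    "amount":   [50.00, 19.95, 7.50],      # positionally with part_no
}

def normalize_on(record, controlling, dependents):
    """Yield one flat row per value of the controlling field,
    pairing it positionally with its dependent fields --
    roughly what 'Normalize on' does on the fly."""
    header = {k: v for k, v in record.items()
              if k != controlling and k not in dependents}
    for i, key in enumerate(record[controlling]):
        row = dict(header)
        row[controlling] = key
        for dep in dependents:
            row[dep] = record[dep][i]
        yield row

rows = list(normalize_on(invoice, "part_no", ["quantity", "amount"]))
# Three rows, each repeating invoice_no and customer, one per line item.
```

The output looks as if the line items had come from a separate child table keyed on invoice number, which is exactly the view DataStage presents when you normalize on the controlling field.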

Kenneth Bland has a type 2 slowly changing dimension solution which demonstrates how to do this within DataStage. The performance gains can be thousands of times faster. It is a complicated solution to understand but worth the effort to learn. Similar performance gains were available within UniVerse because an invoice was stored in one record instead of spread out over many tables and many records. This is why COBOL systems could handle thousands of users years ago, when it took Oracle years to catch up.

The downside is that reports have to normalize this data to deal with it. In data warehousing we are concerned about reporting. A star schema is a design to make reports run faster. Creating a star schema is called denormalizing, which is somewhat misleading because no data is stored in repeating groups. It is because we collapse many tables into dimension tables, leaving everything one join away from the fact table. Our denormalizing also sometimes replicates data because of the table designs.
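The collapsing-and-replicating idea above can be sketched in a few lines of Python (the table and field names are made up for illustration): two normalized tables fold into one dimension table, and the parent's attribute is deliberately repeated on each child row.

```python
# Normalized source tables (hypothetical): a category lookup and a
# product table that references it by key.
categories = {1: "Hardware", 2: "Software"}
products = [
    {"product_id": 10, "name": "Widget", "category_id": 1},
    {"product_id": 11, "name": "Gadget", "category_id": 1},
    {"product_id": 12, "name": "App",    "category_id": 2},
]

# Denormalized dimension: category name copied onto every product row,
# so a report joins fact -> product_dim once instead of
# fact -> product -> category.
product_dim = [
    {"product_id": p["product_id"],
     "product_name": p["name"],
     "category_name": categories[p["category_id"]]}  # replicated per product
    for p in products
]
# "Hardware" now appears on two rows -- the deliberate replication
# that makes the star schema one join away from the fact table.
```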

This is why traditional DBAs and data architects design terrible data warehouses. They are taught to normalize data to reduce data replication and create very efficient databases for applications. They have to unlearn these concepts in order to design good data warehouses and good datamarts.

Thank you

Posted: Sat Aug 07, 2004 6:17 am
by Rajendran
kduke wrote: If you do a search Ray has covered this. There are certain databases which can embed a table. [...]

Re: DataStage HashFile

Posted: Sun Aug 08, 2004 3:06 am
by ray.wurlod
Rajendran wrote: I have some doubts about the Normalize on / Un-normalized option for the outputs of a Hashed File stage. [...]
The decision is initially not in your hands. "Normalize on" will not be enabled unless at least one of the columns is defined as multivalued, and this will only ever occur in the UniVerse, UniData or Hashed File stage types.

If it is enabled, then you make your choice from the drop-down list, depending on which of the (one or more) nested tables' data you want to extract or load.

You need to know your data, or at least its structure, to make this decision intelligently.