DataStage HashFile

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS BASIC, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Rajendran
Participant
Posts: 16
Joined: Wed Jul 28, 2004 7:56 am
Location: Dubai

DataStage HashFile

Post by Rajendran »

Hi,

I have some questions about the Normalize on / Un-normalize option on the Outputs of a Hashed File stage.

- When should we use this option?
- What is the purpose of using Normalize on?
- How do we decide which field to normalize on?

Regards,
Rajendran
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

If you do a search, you will find Ray has covered this. Certain databases can embed a table within a table; UniVerse and UniData are probably the most common, besides COBOL files. In COBOL these are called repeating groups; in UniVerse they are called multivalues, with controlling and dependent associations.

In UniVerse you can store an invoice in one table. The invoice header fields are all single-valued, while the line items are all multivalued. Every line item has a part number, quantity and amount field. The part number is usually considered the controlling field, and the other two fields are associated with it. DataStage can normalize these on the fly, making the fields look as if they came from a separate table. The dictionary definitions for these fields need to define the relationship properly before DataStage can extract it properly. Most experienced UniVerse, UniData and other developers working with databases that have these capabilities know how to set this up.
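As a rough sketch of how this looks (all file, field and record names here are invented for illustration, and dictionary conventions vary a little between UniVerse and UniData): suppose a file INVOICES where field 3 is PART.NO, field 4 is QTY and field 5 is AMOUNT, tied together by an association called LINE.ITEMS. The D-type dictionary record for PART.NO would look something like this, with the S/M flag in field 6 and the association name in field 7:

    PART.NO
    001 D
    002 3
    003
    004 Part No
    005 10L
    006 M
    007 LINE.ITEMS

One record then holds the whole invoice, with value marks (shown as "]") separating the line items:

    INV1001
    CUST.NO  = C042
    INV.DATE = 28 JUL 2004
    PART.NO  = P1]P2]P3
    QTY      = 2]5]1
    AMOUNT   = 20.00]50.00]10.00

Normalized on LINE.ITEMS, DataStage presents this as three rows, repeating the single-valued header fields on each:

    INV.NO   CUST.NO  INV.DATE     PART.NO  QTY  AMOUNT
    INV1001  C042     28 JUL 2004  P1       2    20.00
    INV1001  C042     28 JUL 2004  P2       5    50.00
    INV1001  C042     28 JUL 2004  P3       1    10.00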

Kenneth Bland has a type 2 slowly changing dimension solution which demonstrates how to do this within DataStage. The result can be thousands of times faster. It is a complicated solution to understand but worth the effort to learn. Similar performance gains were available within UniVerse because an invoice was stored in one record instead of being spread out over many tables and many records. This is why COBOL systems could handle thousands of users years ago, when it took Oracle years to catch up.

The downside is that reports have to normalize this data in order to use it. In data warehousing we are concerned with reporting, and a star schema is a design that makes reports run faster. Creating a star schema is called denormalizing, which is somewhat misleading because no data is stored in repeating groups; rather, we collapse many tables into dimension tables so that everything is one join away from the fact table. This denormalizing also sometimes replicates data because of the table designs (see the sketch below).
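As a tiny invented example of "one join away" (all table and column names are made up):

    FACT_SALES   (DATE_KEY, CUST_KEY, PART_KEY, INV_NO, QTY, AMOUNT)
    DIM_DATE     (DATE_KEY, CAL_DATE, MONTH, YEAR)
    DIM_CUSTOMER (CUST_KEY, CUST_NO, CUST_NAME, REGION)
    DIM_PART     (PART_KEY, PART_NO, DESCRIPTION, PRODUCT_GROUP)

Every dimension is a single join from the fact table, and an attribute such as REGION is repeated on every customer row instead of being held in its own table; that is the replication mentioned above.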

This is why traditional DBAs and data architects design terrible data warehouses. They are taught to normalize data to reduce replication and create very efficient databases for applications, and they have to unlearn these concepts in order to design good data warehouses and good datamarts.
Mamu Kim
Rajendran
Participant
Posts: 16
Joined: Wed Jul 28, 2004 7:56 am
Location: Dubai

Thank you

Post by Rajendran »

kduke wrote:If you do a search, you will find Ray has covered this. [...]
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Re: DataStage HashFile

Post by ray.wurlod »

Rajendran wrote:- When should we use this option?
- What is the purpose of using Normalize on?
- How do we decide which field to normalize on?
The decision is initially not in your hands. "Normalize on" will not be enabled unless at least one of the columns is defined as multi-valued, and this will only ever occur with the UniVerse, UniData or Hashed File stage types.

If it is enabled, then you make your choice from the drop-down list, depending on which of the (one or more) nested tables' data you want to extract or load.

You need to know your data - or at least its structure - to make this decision intelligently.
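For anyone curious about what the stage is doing at the record level, here is a minimal DataStage BASIC sketch of normalizing one multivalued record by hand (the file name, field positions and record ID are invented, and the Hashed File stage does all of this for you, so this is purely illustrative):

    OPEN "INVOICES" TO F.INV ELSE STOP "Cannot open INVOICES"
    READ REC FROM F.INV, "INV1001" ELSE STOP "Record not found"
    * Field 3 = PART.NO (controlling); fields 4 and 5 = QTY and AMOUNT (dependent)
    NO.VALUES = DCOUNT(REC<3>, @VM)
    FOR V = 1 TO NO.VALUES
       PART = REC<3,V>  ;* value V of the controlling field
       QTY  = REC<4,V>  ;* dependents are read at the same value position
       AMT  = REC<5,V>
       PRINT "INV1001 " : PART : " " : QTY : " " : AMT
    NEXT V

Each pass of the loop yields one normalized row; single-valued header fields would simply be repeated on every row.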
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.