Tuning Fact Table Process
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 26
- Joined: Tue May 24, 2005 6:20 am
- Location: Tallahassee
I have a fact table loading process that is not performing to my expectations. My ETL process takes 33 minutes to load 170,000 rows (3 MB) of data (approx. 85 rows/sec). The data is extracted from and loaded into tables in the same Oracle database (using the DRS stage). The process populates a fact table and has 4 major transformations. Each transformation has 5 reference lookups (used to obtain the correct surrogate keys). Some lookups are performed against hashed files while others are against database tables.
Performance was much worse than this before tuning. The following tuning steps got me to this point.
1. Added IPC stages to force the process to use all available processors
2. Tuned the data extraction SQL and the lookup SQL
3. Added indexes to the lookup tables
4. Tuned the hashed files by adjusting the Minimum Modulus
5. Tuned the Array Size and Transaction Size
In addition to these techniques, I also replaced the source and target tables with Sequential files, and performance improved only marginally.
Below are some particulars for my environment:
- DataStage Server Edition 7.5.1A (Located on Server A)
- Server A (Solaris OS, 4 CPU, 2GB)
- Oracle 10g Database (Located on Server B)
- Server B (Solaris OS, 12 CPU, 24GB, Holds 15 DBs)
Any ideas on how I can improve my performance?
Byron
How are you loading your fact table now? Are you collecting all your data into a flat file in one job after all the lookups and transformations and then loading it into your table in a second job, or is everything in one job without an intermediate Sequential File stage?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Are you watching prstat on both servers to see how your CPU utilization is doing?
Why just some hashed lookups, not all hashed lookups?
What about multiple job instances?
How about DIRECT path bulk-load inserts and a "bulk update" script? (DIRECT path load the updates into a work table, then apply them with a parallel-DML-enabled correlated UPDATE statement.)
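For illustration, the "bulk update" pattern suggested above might be sketched in Oracle SQL roughly as follows; every table and column name here is a placeholder, not anything from the actual job:

Code: Select all
-- Sketch only: assumes changed rows were direct-path loaded into WORK_TABLE,
-- and that FACT_TABLE and WORK_TABLE share the key FACT_KEY.
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(f, 4) */ fact_table f
   SET (f.measure1, f.measure2) =
       (SELECT w.measure1, w.measure2
          FROM work_table w
         WHERE w.fact_key = f.fact_key)
 WHERE EXISTS (SELECT 1
                 FROM work_table w
                WHERE w.fact_key = f.fact_key);

COMMIT;

The win is that one set-based, parallelized UPDATE replaces row-by-row OCI updates from the ETL tool.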
How about just using OCI --> XFM --> SEQ to extract your source data and see the maximum speed? If swapping Sequential File stages in as the output yields very little performance gain, you've confirmed that the OCI lookups or the OCI extract is the limiter.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Thanks for the replies.
I'm monitoring the process with prstat, mpstat, and sar simultaneously. There are no major bottlenecks in memory or CPU utilization; in fact, CPU utilization peaked at 81% on all 4 CPUs during processing.
For SCD Type II, some of the lookups require date range criteria. This cannot be accommodated in a hashed file, and performance was terrible when I attempted it with the UV stage.
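The kind of range lookup involved would look roughly like the sketch below; the table and column names are assumptions (loosely based on the ConcatKey / Eff_Start_Dt / Eff_End_Dt fields visible in the index listing later in this thread), not the actual job's SQL:

Code: Select all
-- Hypothetical SCD Type II surrogate key lookup: find the dimension row
-- whose effective date range covers the transaction date.
SELECT surrogate_key
  FROM d_jrnl_source
 WHERE concat_key = :in_key
   AND :txn_date BETWEEN eff_start_dt AND eff_end_dt;

An equality-only hashed file lookup cannot express the BETWEEN predicate, which is why the UV stage (or a database lookup) gets pulled in for these.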
Prior to making my initial posting, I created a simple job to measure the base performance for my environment.
DRS ------> IPC -----> Xform ------> IPC ------> DRS
The performance was 1,500 rows/sec. Considering the performance metrics I have read in other posts, that is slow as well.
I have not attempted the bulk loaders because it seems my degradation is not due to the extraction or loading events.
Byron
DSguru2B wrote: How are you loading your fact table now? Are you collecting all your data into a flat file in one job after all the lookups and transformations and then loading it into your table in a second job, or is everything in one job without an intermediate Sequential File stage?
Yes, all in one job.
Byron
itcconsulting wrote: For SCD II, some of the lookups require date range criteria. This cannot be accommodated in a hashed file, and performance was terrible when I attempted it with the UV stage.

Sounds like you missed the need to create an index over the key fields so that the UV lookup isn't dog slow. Search the forum for the CREATE.INDEX and BUILD.INDEX syntax for this.
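For reference, the general shape of those commands at the UniVerse TCL prompt is sketched below; the file name is taken from the listing later in this thread, and the choice of ConcatKey as the indexed field is only an assumption:

Code: Select all
>CREATE.INDEX HASH_D_JRNL_SOURCE_LKP2 ConcatKey
>BUILD.INDEX HASH_D_JRNL_SOURCE_LKP2 ConcatKey
>LIST.INDEX HASH_D_JRNL_SOURCE_LKP2 ALL

CREATE.INDEX defines the secondary index on a field, BUILD.INDEX populates it from existing data, and LIST.INDEX confirms the result.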
-craig
"You can never have too many knives" -- Logan Nine Fingers
Below is the output for one of my UV tables. If data existed in the table before creating the index, is an index rebuild required? If so, what is the syntax?
>LIST.INDEX HASH_D_JRNL_SOURCE_LKP2 ALL
Alternate Key Index Summary for file HASH_D_JRNL_SOURCE_LKP2
File........... HASH_D_JRNL_SOURCE_LKP2
Indices........ 1 (0 A-type, 0 C-type, 0 D-type, 0 I-type, 1 SQL, 0 S-type)
Index Updates.. Enabled, No updates pending
Index name Type Build Nulls In DICT S/M Just Unique Field num/I-type
JRNL_SOURCE_IDX SQL Not Reqd Yes Yes S R N ConcatKey;Eff_Start_Dt;Eff_End_Dt;@1:@TM:@2:@TM:@3
Byron
As far as I know, you must always BUILD an index after CREATE'ing it. And, as noted, the syntax can be found by searching the forum - this is one such example.
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It would have been nice to encase your LIST.INDEX between Code tags - in fact, why not edit your previous post to make it so?
BUILD.INDEX is required if there are data in a hashed file on which an index is created. BUILD.INDEX is not required if the hashed file is empty when the index is created.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod wrote: It would have been nice to encase your LIST.INDEX between Code tags - in fact, why not edit your previous post to make it so? BUILD.INDEX is required if there are data in a hashed file on which an index is created. BUILD.INDEX is not required if the hashed file is empty when the index is created.

Thanks Ray. Here is the LIST.INDEX for one of my UV table indexes.
Code: Select all
>LIST.INDEX HASH_D_JRNL_SOURCE_LKP2 ALL
Alternate Key Index Summary for file HASH_D_JRNL_SOURCE_LKP2
File........... HASH_D_JRNL_SOURCE_LKP2
Indices........ 1 (0 A-type, 0 C-type, 0 D-type, 0 I-type, 1 SQL, 0 S-type)
Index Updates.. Enabled, No updates pending
Index name Type Build Nulls In DICT S/M Just Unique Field num/I-type
JRNL_SOURCE_IDX SQL Not Reqd Yes Yes S R N ConcatKey;Eff_Start_Dt;Eff_End_Dt;@1:@TM:@2:@TM:@3
Code: Select all
>BUILD.INDEX HASH_D_JRNL_SOURCE_LKP2 JRNL_SOURCE_IDX;
JRNL_SOURCE_IDX; is not a secondary index field.
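A possible reading of that error, offered tentatively: BUILD.INDEX expects the name of an indexed field (or the keyword ALL), not the SQL index name, and the trailing semicolon would also be taken as part of the argument. Assuming that is the cause, a sketch of the corrected command would be:

Code: Select all
>BUILD.INDEX HASH_D_JRNL_SOURCE_LKP2 ALL

ALL builds every index defined on the file, which sidesteps having to name the underlying indexed field exactly.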
Byron