Gaps (holes) in surr_key generated by Surr_Key_Gen stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

basav_ds
Participant
Posts: 24
Joined: Sun Nov 11, 2007 11:19 pm
Location: Mumbai


Post by basav_ds »

Hi,
I have three jobs: two parallel jobs and one sequence job.

The first parallel job reads employee data from the input source file Emp1.txt, which has 3,940 records. These records are fed into a Surrogate Key Generator stage, where the surrogate key Emp_ID is generated starting from value 1. The output of the Surrogate Key Generator stage is then inserted into the DB2 table 'Employee_Surrogate'.

The second parallel job is identical to the first, the only difference being that the input source file is Emp2.txt; the target table is the same 'Employee_Surrogate'. Emp2.txt has 6,060 records, and the surrogate key Emp_ID is generated starting from the highest value previously generated.
In both jobs, the Surrogate Key Generator stages refer to the same surrogate key source file.

The third job is a sequence job in which parallel jobs 1 and 2 are executed in parallel.

Observed output:
Total number of records inserted into the target table Employee_Surrogate: 10,000 (i.e. 3,940 + 6,060).
Generated surrogate key range: 1 to 10,030.
Hole in the surrogate keys: 3,971 to 4,000 (i.e. 30 values).

Q: Why are holes created in the surrogate key range?
Please suggest a solution to remove these holes, if any.

Thank You.
I never let school interfere with my education
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

This is most likely due to the block size being used. To speed up execution, surrogate keys are generated in blocks instead of going to the state file/database for each and every key. Holes can occur when a job finishes or is stopped and the remaining keys in the current block are left unused.
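A minimal sketch of this behaviour in Python (this is a toy model, not DataStage internals; the block size of 1,000 and the job row counts are assumptions for illustration). Each job draws whole blocks from a shared key source, and whatever is left in the last block when the job ends is never reused:

```python
# Hypothetical simulation of block-based surrogate key allocation.
# BLOCK_SIZE is an assumed value; DataStage's actual block size may differ.

BLOCK_SIZE = 1000

class KeySource:
    """Simulates the shared surrogate key state file."""
    def __init__(self):
        self.next_key = 1

    def allocate_block(self):
        start = self.next_key
        self.next_key += BLOCK_SIZE
        return start, start + BLOCK_SIZE - 1

def run_job(source, n_rows):
    """Generate n_rows keys, fetching whole blocks from the shared source."""
    keys, block = [], iter(())
    for _ in range(n_rows):
        nxt = next(block, None)
        if nxt is None:                       # current block exhausted
            lo, hi = source.allocate_block()  # grab a fresh block
            block = iter(range(lo, hi + 1))
            nxt = next(block)
        keys.append(nxt)
    return keys  # leftover keys in the last block are simply discarded

src = KeySource()
job1 = run_job(src, 3940)   # consumes blocks 1-1000 ... 3001-4000
job2 = run_job(src, 6060)   # starts on a fresh block at 4001

# Job 1 used only keys 1-3940 of its last block, so 3941-4000 become a hole.
print(job1[-1], job2[0])    # 3940 4001
```

In this single-partition toy model the hole is 60 keys wide; with several partitions each holding a partly used block (as in the real job), the hole can be smaller, such as the 30 keys observed.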
Raghavendra
Participant
Posts: 147
Joined: Sat Apr 30, 2005 1:23 am
Location: Bangalore,India

Post by Raghavendra »

If the partitions are not balanced properly, the Surrogate Key Generator stage will generate values with holes in the range.
If you want to generate keys without any holes, you must achieve a perfect balance of records across the partitions; for example, you can use round-robin partitioning while reading the data from the input (provided this partitioning does not affect your functionality).
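This can be sketched as follows (a Python toy model, not DataStage internals; the per-partition block size of 100 and the row counts are invented for illustration). Each partition draws its own blocks of keys, so a skewed split strands unused keys inside the range, while a perfectly balanced round-robin split does not:

```python
# Toy model: each partition of a parallel stage draws whole blocks of keys,
# so an imbalanced partitioning leaves unused keys (holes) in the range.

BLOCK = 100  # assumed per-partition block size

def keys_used(partition_row_counts, block=BLOCK):
    """Return the set of keys consumed when each partition draws blocks."""
    used, next_start = set(), 1
    for rows in partition_row_counts:
        need = rows
        while need > 0:
            blk = range(next_start, next_start + block)
            next_start += block
            take = min(need, block)
            used.update(list(blk)[:take])  # unused tail of the block is lost
            need -= take
    return used

def holes(used):
    """Keys missing between 1 and the highest key actually used."""
    top = max(used)
    return [k for k in range(1, top + 1) if k not in used]

hash_split = keys_used([180, 20])    # skewed split (e.g. bad hash key)
rr_split   = keys_used([100, 100])   # round-robin: perfectly balanced

print(len(holes(hash_split)))  # 20 keys stranded inside the range
print(len(holes(rr_split)))    # 0: balanced partitions leave no holes
```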
Raghavendra
Dare to dream and care to achieve ...
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

It's unnecessary to fill the holes. Surrogate keys inherently have no meaning other than to uniquely identify a row. The fact that they can be non-contiguous does not cause a single data discrepancy or issue. The largest concern is probably running out of numbers when the datatype's maximum is exceeded.

This is a very common topic and an indication of ignorance among the "data architects" who design data warehouses. These "experts" often set the integer datatypes too small, and then gaps become a concern because the maximum value is reached.

The solution is to use a consistent, generously sized integer datatype everywhere and then educate your architects. Suggest they read Kimball's articles and the sections of his books where he more eloquently deflates this non-issue.
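To put the datatype concern in numbers, here is a back-of-envelope sketch (the load rate and waste factor are invented assumptions; the limits are the standard signed 32- and 64-bit maxima). Even if gaps wasted half of all keys, a 64-bit key column would effectively never run out:

```python
# Rough headroom estimate for surrogate key datatypes.
# rows_per_day and waste_factor are illustrative assumptions.

INT_MAX    = 2**31 - 1   # typical INTEGER column maximum
BIGINT_MAX = 2**63 - 1   # typical BIGINT column maximum

rows_per_day = 1_000_000
waste_factor = 2         # assume half of all keys are lost to gaps

def years_until_exhausted(max_key):
    keys_per_year = rows_per_day * 365 * waste_factor
    return max_key / keys_per_year

print(round(years_until_exhausted(INT_MAX), 1))  # 2.9 -- too small
print(years_until_exhausted(BIGINT_MAX) > 1e9)   # True: billions of years
```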
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

+1. A non-issue.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

One solution would be to force the entire job to execute sequentially, which defeats the purpose of having the parallel engine and is, as others have noted, unnecessary. Then again, for only 10,000 rows it would probably perform adequately.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.