
Gaps (Holes) in surr_key generated by Surr_Key_Gen stage

Posted: Fri Feb 29, 2008 1:34 am
by basav_ds
Hi,
I have 3 jobs: 2 parallel jobs and 1 sequence job.

The first parallel job reads employee data from the input source file Emp1.txt, which has 3,940 records. These records are fed into a Surrogate Key Generator stage, where the surrogate key Emp_ID is generated starting from value 1. The output of the Surrogate Key Generator stage is then inserted into the DB2 table 'Employee_Surrogate'.

The second parallel job is identical to the first, the only difference being that the input source file is Emp2.txt, which has 6,060 records; the target table is the same 'Employee_Surrogate'. Here the surrogate key Emp_ID is generated starting from the highest value previously generated. In both jobs, the Surrogate Key Generator stages refer to the same surrogate key source file.

The third job is a sequence job, in which parallel jobs 1 and 2 are executed in parallel.

Observed output:
Total number of records inserted into the target table Employee_Surrogate: 10,000 (i.e. 3,940 + 6,060).
Generated surrogate key range: 1 to 10,030.
Hole in the surrogate keys: 3,971 to 4,000 (i.e. 30 values).

Q: Why are holes created in the surrogate key range?
Please suggest a solution to remove these holes, if any.

Thank You.

Posted: Fri Feb 29, 2008 3:01 am
by ArndW
This is most likely due to the block size being used. To speed up execution, surrogate keys are generated in blocks rather than going back to the file/database for each and every key. Holes can occur when a job stops with some of the keys in its current block still unused.
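
To make that concrete, here is a minimal Python sketch of block-based allocation. The block size of 10, the KeySource class, and the row counts are illustrative assumptions, not the actual Surrogate Key Generator internals; the point is only that each consumer reserves whole blocks from a shared counter, and any keys left over in a partly used block become a permanent gap.

# Minimal sketch of block-based surrogate key allocation.
# BLOCK_SIZE, KeySource and the row counts are assumptions
# for illustration, not DataStage internals.
BLOCK_SIZE = 10

class KeySource:
    """Shared key state: hands out contiguous blocks of keys."""
    def __init__(self):
        self.next_key = 1

    def reserve_block(self):
        start = self.next_key
        self.next_key += BLOCK_SIZE
        return list(range(start, start + BLOCK_SIZE))

def generate_keys(source, row_count):
    """Draw keys for row_count rows, one reserved block at a time."""
    keys, block = [], []
    for _ in range(row_count):
        if not block:
            block = source.reserve_block()
        keys.append(block.pop(0))
    # Any keys still left in 'block' are never handed out again.
    return keys

source = KeySource()
job1 = generate_keys(source, 34)   # uses keys 1..34, block keys 35..40 go unused
job2 = generate_keys(source, 17)   # starts at 41, not 35
print(max(job1), min(job2))        # prints: 34 41 -> a hole of 6 keys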

Posted: Fri Feb 29, 2008 6:56 am
by Raghavendra
If the partitions are not balanced properly, the Surrogate Key Generator stage will generate values with holes in the range.
If you want to generate keys without any holes, you need a perfectly even balance of records across the partitions, or you can use round-robin partitioning while reading the data from the input (provided this partitioning does not affect your functionality).
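
As a rough illustration of the partitioning effect (again assuming a block size of 10 and a 4-way split of the first job's 3,940 rows, neither of which is confirmed in the thread): each partition reserves its own blocks, so every partition whose row count is not a multiple of the block size leaves part of a block unused, and skewed partitions waste more.

# Sketch: per-partition block allocation. BLOCK_SIZE = 10 and the
# 4-way split of 3,940 rows are assumptions for illustration.
BLOCK_SIZE = 10

def wasted_keys(partition_row_counts):
    """Keys left unused when each partition reserves whole blocks."""
    waste = 0
    for rows in partition_row_counts:
        remainder = rows % BLOCK_SIZE
        if remainder:                      # partial block at the end
            waste += BLOCK_SIZE - remainder
    return waste

print(wasted_keys([985, 985, 985, 985]))   # balanced split -> 20 wasted
print(wasted_keys([3937, 1, 1, 1]))        # skewed split   -> 30 wasted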

Posted: Fri Feb 29, 2008 7:59 am
by kcbland
It's unnecessary to fill the holes. Surrogate keys inherently have no meaning other than to uniquely identify a row. The fact that they can be non-contiguous does not cause a single data discrepancy or issue. The largest real concern is running out of numbers by exceeding the data type's maximum.

This is a very common topic and an indication of ignorance on the part of the "data architects" who design data warehouses. These "experts" often set the integer datatypes too small, and then gaps become a concern because the maximum value is reached.

The solution is to use a consistent integer datatype everywhere and then educate your architects. Suggest they read Kimball's articles and sections from his books where he more eloquently deflates this non-issue.
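
To put rough numbers on the data type concern: the limits below are standard two's-complement integer maxima, while the daily load volume and the 1% key-waste figure are purely assumed for illustration.

# Headroom arithmetic for surrogate key data types. The integer
# limits are standard; rows_per_day and waste_factor are assumptions.
INT32_MAX = 2**31 - 1            # 2,147,483,647
INT64_MAX = 2**63 - 1            # 9,223,372,036,854,775,807

rows_per_day = 10_000_000        # assumed load volume
waste_factor = 1.01              # assume 1% of keys lost to holes

days_int32 = INT32_MAX / (rows_per_day * waste_factor)
days_int64 = INT64_MAX / (rows_per_day * waste_factor)
print(f"32-bit key exhausted in ~{days_int32:,.0f} days")       # ~213 days
print(f"64-bit key exhausted in ~{days_int64 / 365:,.0f} years")  # billions of years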

Posted: Fri Feb 29, 2008 8:17 am
by chulett
+1. A non-issue.

Posted: Fri Feb 29, 2008 3:33 pm
by ray.wurlod
One solution would be to force the entire job to execute sequentially, which defeats the purpose of having the parallel engine and is, as others have noted, unnecessary. Then again, for only 10,000 rows it would probably perform adequately.
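
For completeness, a tiny sketch of why fully sequential allocation leaves no gaps; a single consumer drawing one key at a time is an assumption standing in for a sequential run.

# Sketch: one sequential consumer, no block reservation. Every
# allocated key is used, so the sequence stays contiguous.
def sequential_keys(row_counts):
    next_key, runs = 1, []
    for rows in row_counts:          # jobs run one after another
        runs.append(list(range(next_key, next_key + rows)))
        next_key += rows
    return runs

job1, job2 = sequential_keys([3940, 6060])
print(job1[-1], job2[0], job2[-1])   # prints: 3940 3941 10000 -> no holes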