Gaps (holes) in surr_key generated by Surr_Key_Gen stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

basav_ds
Participant
Posts: 24
Joined: Sun Nov 11, 2007 11:19 pm
Location: Mumbai


Post by basav_ds »

Hi,
I have three jobs: two parallel jobs and one sequence job.

The first parallel job reads employee data from the input source file Emp1.txt, which has 3,940 records. These records are fed into a Surrogate Key Generator stage, where the surrogate key Emp_ID is generated starting from value 1. The output of the Surrogate Key Generator stage is then inserted into the DB2 table 'Employee_Surrogate'.

The second parallel job is identical to the first, the only difference being that the input source file is Emp2.txt; the target table is the same 'Employee_Surrogate'. Emp2.txt has 6,060 records, and the surrogate key Emp_ID is generated starting from the highest value previously generated.
In both jobs, the Surrogate Key Generator stages refer to the same surrogate key source file.

The third job is a sequence job in which parallel jobs 1 and 2 are executed in parallel.

Observed output:
Total number of records inserted into the target table Employee_Surrogate: 10,000 (i.e. 3,940 + 6,060).
Generated surrogate key range: 1 to 10,030.
Hole in the surrogate keys: 3,971 to 4,000 (i.e. 30 values).

Q: Why are holes created in the surrogate key range?
Please suggest a solution to remove these holes, if any.

Thank You.
I never let school interfere with my education
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

This is most likely due to the block size being used. To speed up execution, surrogate keys are generated in blocks instead of going to the state file/database for each and every key. Holes can occur when a job finishes or is stopped and the remaining keys in the current block are left unused.
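A minimal sketch of this behaviour in Python (this is a toy model, not DataStage internals; the block size of 1,000 and the job row counts are assumptions for illustration). Each job draws whole blocks from a shared key source, and whatever is left in the last block when the job ends is never reused:

```python
# Hypothetical simulation of block-based surrogate key allocation.
# BLOCK_SIZE is an assumed value; DataStage's actual block size may differ.

BLOCK_SIZE = 1000

class KeySource:
    """Simulates the shared surrogate key state file."""
    def __init__(self):
        self.next_key = 1

    def allocate_block(self):
        start = self.next_key
        self.next_key += BLOCK_SIZE
        return start, start + BLOCK_SIZE - 1

def run_job(source, n_rows):
    """Generate n_rows keys, fetching whole blocks from the shared source."""
    keys, block = [], iter(())
    for _ in range(n_rows):
        nxt = next(block, None)
        if nxt is None:                       # current block exhausted
            lo, hi = source.allocate_block()  # grab a fresh block
            block = iter(range(lo, hi + 1))
            nxt = next(block)
        keys.append(nxt)
    return keys  # leftover keys in the last block are simply discarded

src = KeySource()
job1 = run_job(src, 3940)   # consumes blocks 1-1000 ... 3001-4000
job2 = run_job(src, 6060)   # starts on a fresh block at 4001

# Job 1 used only keys 1-3940 of its last block, so 3941-4000 become a hole.
print(job1[-1], job2[0])    # 3940 4001
```

In this single-partition toy model the hole is 60 keys wide; with several partitions each holding a partly used block (as in the real job), the hole can be smaller, such as the 30 keys observed.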
Raghavendra
Participant
Posts: 147
Joined: Sat Apr 30, 2005 1:23 am
Location: Bangalore,India

Post by Raghavendra »

If the partitions are not balanced properly, the Surrogate Key Generator stage will generate values with holes in the range.
If you want to generate keys without any holes, you must achieve a perfect balance of records across the partitions; for example, you can use round-robin partitioning while reading the data from the input (provided this partitioning does not affect your functionality).
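This can be sketched as follows (a Python toy model, not DataStage internals; the per-partition block size of 100 and the row counts are invented for illustration). Each partition draws its own blocks of keys, so a skewed split strands unused keys inside the range, while a perfectly balanced round-robin split does not:

```python
# Toy model: each partition of a parallel stage draws whole blocks of keys,
# so an imbalanced partitioning leaves unused keys (holes) in the range.

BLOCK = 100  # assumed per-partition block size

def keys_used(partition_row_counts, block=BLOCK):
    """Return the set of keys consumed when each partition draws blocks."""
    used, next_start = set(), 1
    for rows in partition_row_counts:
        need = rows
        while need > 0:
            blk = range(next_start, next_start + block)
            next_start += block
            take = min(need, block)
            used.update(list(blk)[:take])  # unused tail of the block is lost
            need -= take
    return used

def holes(used):
    """Keys missing between 1 and the highest key actually used."""
    top = max(used)
    return [k for k in range(1, top + 1) if k not in used]

hash_split = keys_used([180, 20])    # skewed split (e.g. bad hash key)
rr_split   = keys_used([100, 100])   # round-robin: perfectly balanced

print(len(holes(hash_split)))  # 20 keys stranded inside the range
print(len(holes(rr_split)))    # 0: balanced partitions leave no holes
```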
Raghavendra
Dare to dream and care to achieve ...
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

It's unnecessary to fill the holes. Surrogate keys inherently have no meaning other than to uniquely identify a row. The fact that they can be non-contiguous does not cause a single data discrepancy or issue. The largest concern is probably running out of numbers when the datatype's maximum is exceeded.

This is a very common topic and an indication of ignorance among the "data architects" who design data warehouses. These "experts" often set the integer datatypes too small, and then gaps become a concern because the maximum value is reached.

The solution is to use a consistent, generously sized integer datatype everywhere and then educate your architects. Suggest they read Kimball's articles and the sections of his books where he more eloquently deflates this non-issue.
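To put the datatype concern in numbers, here is a back-of-envelope sketch (the load rate and waste factor are invented assumptions; the limits are the standard signed 32- and 64-bit maxima). Even if gaps wasted half of all keys, a 64-bit key column would effectively never run out:

```python
# Rough headroom estimate for surrogate key datatypes.
# rows_per_day and waste_factor are illustrative assumptions.

INT_MAX    = 2**31 - 1   # typical INTEGER column maximum
BIGINT_MAX = 2**63 - 1   # typical BIGINT column maximum

rows_per_day = 1_000_000
waste_factor = 2         # assume half of all keys are lost to gaps

def years_until_exhausted(max_key):
    keys_per_year = rows_per_day * 365 * waste_factor
    return max_key / keys_per_year

print(round(years_until_exhausted(INT_MAX), 1))  # 2.9 -- too small
print(years_until_exhausted(BIGINT_MAX) > 1e9)   # True: billions of years
```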
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

+1. A non-issue.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

One solution would be to force the entire job to execute sequentially, which defeats the purpose of having the parallel engine and is, as others have noted, unnecessary. Then again, for only 10,000 rows it would probably perform adequately.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.