Sample stage Percent Option


Althaf6553
Participant
Posts: 64
Joined: Wed Sep 26, 2007 6:52 am
Location: Syracuse ,NY

Sample stage Percent Option

Post by Althaf6553 »

I am clear on the Sample stage's Period method,
but I am confused by the Sample stage's Percent option.

What I understand is that it sends X percent of the records to the output, based on a seed value (random number generator).

My job is: Sequential File -> Sample stage -> Sequential File (I am using a two-node configuration file).

I have 10 records in the input. When I set Percent to 50 and the seed value to 3, I get 3 records in the output,
and when I set Percent to 70 and the seed value to 100, I get 9 records in the output.

Can anyone please help me understand what is actually happening here?
Althaf
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Re: Sample stage Percent Option

Post by jwiles »

10 records is too small a quantity to give any useful results with random sampling, as 1 record = 10% of your source data. I would recommend at least 1,000 to 10,000 records as a starting point if you wish to better see how Sample is working.

The Percent option keeps approximately x percent of the records passing through each instance of the operator (each partition). The operator can only guarantee an approximate value, not the exact percentage, because it does not know the total number of records in your data set (or the number of records flowing through each partition). So, instead of exactly 70%, you may keep 70.5% or 69.5%, and the deviation shrinks as the total number of records being processed grows (you saw +/- 20 percentage points in your tests with only 10 records).
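To see how the deviation depends on input size, here is a minimal Python sketch (not DataStage code). It assumes round-robin partitioning across two partitions and one random generator per partition, seeded with the seed plus the partition number. Both are my assumptions for illustration, and the name kept_count is hypothetical.

```python
import random


def kept_count(total_records, percent, partitions=2, seed=100):
    """Count records kept when each partition samples independently."""
    kept = 0
    for p in range(partitions):
        # Assumption: each partition has its own generator derived from the seed.
        rng = random.Random(seed + p)
        # Round-robin split: partition p receives records p, p + partitions, ...
        records_in_partition = len(range(p, total_records, partitions))
        for _ in range(records_in_partition):
            # Each record is decided on its own; no partition knows the total.
            if rng.random() * 100 < percent:
                kept += 1
    return kept


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        k = kept_count(n, percent=70)
        print(f"{n:>7} records in, {k} kept ({100 * k / n:.1f}%)")
```

With 10 records, every kept or dropped record moves the result by 10 percentage points, so large swings are expected. With larger inputs the kept percentage should settle close to the requested 70.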

Which records are kept is determined by a random number sequence generated from the seed value provided in the options. The operator probably compares each generated random number against a threshold value (this is how I've accomplished it outside of DataStage in the past).
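The threshold approach can be sketched in Python as follows. This is only an illustration of the general technique, not the Sample stage's actual implementation, which I have not seen. The function name sample_percent and the use of Python's random module are assumptions.

```python
import random


def sample_percent(records, percent, seed):
    """Keep each record when a seeded random draw falls below the threshold."""
    rng = random.Random(seed)   # same seed -> same sequence of draws
    threshold = percent / 100.0
    # rng.random() returns a float in [0.0, 1.0), so 0 keeps nothing
    # and 100 keeps everything.
    return [rec for rec in records if rng.random() < threshold]


if __name__ == "__main__":
    rows = list(range(1, 11))   # 10 input records, as in the question
    print(sample_percent(rows, percent=50, seed=3))
    print(sample_percent(rows, percent=70, seed=100))
```

Because the draws depend only on the seed, rerunning with the same seed and the same input order keeps the same records, while changing the seed changes which records (and how many) are kept.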