Selecting random records from a dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Bostonian
Premium Member
Posts: 33
Joined: Fri Nov 20, 2009 3:32 pm

Selecting random records from a dataset

Post by Bostonian »

Hi All,
First of all, congratulations on the Parallel forum catching up with the Server forum in number of posts! That is a milestone in its own right.

I have a peculiar scenario and am still thinking through the best approach to adopt.

I have a dataset containing more than 250 MM records, sorted and hash-partitioned on a key. I want to select 500 K records from it at random. I tried using random partitioning on the output, but it does not give the desired result: it picks the same records in every run, which makes me think random partitioning is not actually random from run to run. I cannot use the Sample stage directly because I don't know what percentage of rows to specify, since the input count varies on every run. Is there a way to reshuffle the records randomly and pick 500 K of them?

We are using 4 nodes on this box.
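Outside DataStage, the classic answer to "fixed-size random sample from an input of unknown length" is reservoir sampling. A minimal Python sketch for illustration only (not DataStage syntax; the function name is my own):

```python
import random

def reservoir_sample(stream, k):
    """Pick k records uniformly at random from a stream of unknown length.

    Keeps the first k records, then replaces a random slot with
    decreasing probability as more records arrive.
    """
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)
        else:
            j = random.randint(0, i)  # inclusive upper bound
            if j < k:
                reservoir[j] = record
    return reservoir

# e.g. 500 records sampled from a 10,000-record stream
sample = reservoir_sample(range(10_000), 500)
```

Each input record ends up in the sample with equal probability k/n, and the input count never needs to be known in advance.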
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Use the Sample stage. Provide a job parameter for the skip and period properties. Set the number of records (per node) to 125000 (or, if you prefer, use a downstream Head stage). Generate the job parameter values in a controlling sequence using Rnd() functions.
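Ray's skip/period scheme is systematic sampling: skip a randomly generated offset, then emit every period-th row. A rough Python analogue (the function name and parameters are my own, not DataStage properties):

```python
import random

def periodic_sample(records, period, skip=None):
    """Systematic sample: skip a random offset, then take every period-th record.

    Regenerating `skip` on each run is what makes the sample differ
    between runs, mirroring the Rnd()-generated job parameter.
    """
    if skip is None:
        skip = random.randrange(period)  # random starting offset per run
    return records[skip::period]

rows = list(range(2_000_000))
picked = periodic_sample(rows, period=4)
# len(picked) == 500_000 for any skip in 0..3
```

A downstream cap (Ray's Head stage, or the Sample stage's records-per-node limit) then trims the result to an exact count when the input size varies.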
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bostonian
Premium Member
Posts: 33
Joined: Fri Nov 20, 2009 3:32 pm

Post by Bostonian »

Thank you, Ray. What is the 'skip' property? I don't see it in the Sample stage.

I tested with 10 MM records, setting the number of records per node to 125,000. I observed that a value generated with the Rnd() function doesn't yield 500,000 records consistently, but when the period is set to 15 I get exactly 500,000 records. I also used Random partitioning when reading the input dataset. Would you recommend a different partitioning method?
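For the record, the arithmetic behind that result, assuming the 10 MM rows split roughly evenly across the 4 nodes:

```python
# Why period 15 plus a 125,000-records-per-node cap yields exactly
# 500,000 from a 10 MM input on 4 nodes (even partitioning assumed).
total_rows = 10_000_000
nodes = 4
per_node = total_rows // nodes           # 2,500,000 rows per node
period = 15
cap = 125_000                            # max records sampled per node

sampled_before_cap = per_node // period  # ~166,666 candidates per node
per_node_sample = min(sampled_before_cap, cap)  # the cap binds: 125,000
total_sample = per_node_sample * nodes   # 500,000

# The cap binds whenever period <= per_node // cap (here 20),
# so any period up to 20 would also give exactly 500,000.
```

In other words, period 15 is not special: the per-node limit of 125,000 is what produces the exact count, as long as the period leaves at least 125,000 candidates on each node.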