Selecting random records from a dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Bostonian
Premium Member
Posts: 33
Joined: Fri Nov 20, 2009 3:32 pm

Selecting random records from a dataset

Post by Bostonian »

Hi All,
First of all, congratulations on the Parallel forum catching up with the Server forum in number of posts! That is a milestone in its own right.

I have a peculiar scenario and am still thinking through the best approach to adopt.

I have a dataset containing more than 250 MM records, sorted and hash-partitioned on a key. I want to select 500 K records from it at random. I tried using random partitioning on the output, but it does not give the desired result: it picks the same records in every run, which makes me think random partitioning is not actually random from run to run. I cannot use the Sample stage directly because I don't know what percentage of rows to specify, since the input count varies on every run. Is there a way to reshuffle the records randomly and pick 500 K of them?

We are using 4 nodes on this box.
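Outside DataStage, the classic answer to "fixed-size random sample from an input of unknown length" is reservoir sampling. A minimal Python sketch for illustration only (not DataStage syntax; the function name is my own):

```python
import random

def reservoir_sample(stream, k):
    """Pick k records uniformly at random from a stream of unknown length.

    Keeps the first k records, then replaces a random slot with
    decreasing probability as more records arrive.
    """
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)
        else:
            j = random.randint(0, i)  # inclusive upper bound
            if j < k:
                reservoir[j] = record
    return reservoir

# e.g. 500 records sampled from a 10,000-record stream
sample = reservoir_sample(range(10_000), 500)
```

Each input record ends up in the sample with equal probability k/n, and the input count never needs to be known in advance.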
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Use the Sample stage. Provide a job parameter for the skip and period properties. Set the number of records (per node) to 125000 (or, if you prefer, use a downstream Head stage). Generate the job parameter values in a controlling sequence using Rnd() functions.
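Ray's skip/period scheme is systematic sampling: skip a randomly generated offset, then emit every period-th row. A rough Python analogue (the function name and parameters are my own, not DataStage properties):

```python
import random

def periodic_sample(records, period, skip=None):
    """Systematic sample: skip a random offset, then take every period-th record.

    Regenerating `skip` on each run is what makes the sample differ
    between runs, mirroring the Rnd()-generated job parameter.
    """
    if skip is None:
        skip = random.randrange(period)  # random starting offset per run
    return records[skip::period]

rows = list(range(2_000_000))
picked = periodic_sample(rows, period=4)
# len(picked) == 500_000 for any skip in 0..3
```

A downstream cap (Ray's Head stage, or the Sample stage's records-per-node limit) then trims the result to an exact count when the input size varies.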
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bostonian
Premium Member
Posts: 33
Joined: Fri Nov 20, 2009 3:32 pm

Post by Bostonian »

Thank you, Ray. What is the 'skip' property? I don't see it in the Sample stage.

I tested with 10 MM records, setting the number of records per node to 125,000. I observed that a value generated with the Rnd() function doesn't yield 500,000 records consistently, but when the period is set to 15 I get exactly 500,000 records. I also used Random partitioning when reading the input dataset. Would you recommend a different partitioning method?
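For the record, the arithmetic behind that result, assuming the 10 MM rows split roughly evenly across the 4 nodes:

```python
# Why period 15 plus a 125,000-records-per-node cap yields exactly
# 500,000 from a 10 MM input on 4 nodes (even partitioning assumed).
total_rows = 10_000_000
nodes = 4
per_node = total_rows // nodes           # 2,500,000 rows per node
period = 15
cap = 125_000                            # max records sampled per node

sampled_before_cap = per_node // period  # ~166,666 candidates per node
per_node_sample = min(sampled_before_cap, cap)  # the cap binds: 125,000
total_sample = per_node_sample * nodes   # 500,000

# The cap binds whenever period <= per_node // cap (here 20),
# so any period up to 20 would also give exactly 500,000.
```

In other words, period 15 is not special: the per-node limit of 125,000 is what produces the exact count, as long as the period leaves at least 125,000 candidates on each node.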