Selecting random records from a dataset
Posted: Sat Sep 11, 2010 11:18 am
Hi All,
First of all, congratulations on the number of posts on Parallel jobs catching up with the Server ones! This is indeed a milestone in its own right.
I have a peculiar scenario and am still working out the best approach to adopt.
I have a dataset containing more than 250 million records, sorted and hash-partitioned on a key. I want to select 500,000 records from it at random. I tried using random partitioning on the output, but this does not give the desired result: it picks the same records on every run, which makes me think that random partitioning is not actually random from run to run. I can't use the Sample stage either, because the input count varies from run to run, so I don't know what percentage of rows to specify. Is there a way to reshuffle the records randomly and then pick 500,000 of them?
We are using 4 nodes on this box.
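To make the requirement concrete: what I am after behaves like reservoir sampling, which picks a fixed number of records uniformly at random from an input whose size is not known in advance. Below is a minimal Python sketch of that idea, not DataStage code; the function name `reservoir_sample` and its parameters are made up for illustration, and it assumes a single input stream rather than 4 partitions. Leaving `seed` unset means a different sample on each run, which is the behaviour I am not getting today.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return k records chosen uniformly at random from a stream.

    Hypothetical helper for illustration only. The total record count
    does not need to be known up front. If the stream has fewer than
    k records, all of them are returned.
    """
    # seed=None lets Python seed from the OS, so each run differs.
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1) by
            # overwriting a randomly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Small stand-in for the real input: pick 5 of 1,000 row numbers.
    print(reservoir_sample(range(1000), 5))
```

In my case `k` would be 500,000 and the stream would be the full dataset, read once.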