Generating dummy data between a range sequentially

jreddy · Post by **jreddy** » Thu Sep 06, 2007 12:54 pm

My requirement is to populate a dimension table with values from 0.01 to 80.00.
To implement this, my job has a Column Generator stage with the column defined as decimal(4,2), and in the generator properties I have set the following:
Initial value: 0.01
Increment: 0.01
Limit: 80.00

Since the Column Generator stage needs an input link, I just used a Row Generator stage with the number of records set to 1000 (an arbitrary number that I thought was larger than the number of rows expected).

The output seems to have data from 0.01 to 4.98 and then restarts at 0.01 again, so I set the mode to sequential, but the output still has values from 0.01 to 9.99 and then restarts again.

Is there some other setup required in the Column Generator stage to make sure the numbers are generated in sequence with the right increment? I have not set any values for Level or Vector.

Or is there another way to implement this requirement?

Thanks in advance for all your suggestions.

ray.wurlod · Post by **ray.wurlod** » Thu Sep 06, 2007 4:28 pm

There are 8000 values between 0.01 and 80.00 if the increment is 0.01. You need to set your rows to generate property appropriately.
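The row count can be checked with exact decimal arithmetic. This is a minimal sketch in Python rather than DataStage, using the initial value, increment and limit from the original post; the variable names are illustrative only.

```python
from decimal import Decimal

# Values taken from the generator properties in the original post.
initial = Decimal("0.01")
increment = Decimal("0.01")
limit = Decimal("80.00")

# Number of values in the inclusive range initial..limit.
row_count = int((limit - initial) / increment) + 1
print(row_count)  # 8000
```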

jreddy · Post by **jreddy** » Fri Sep 07, 2007 5:19 am

Thanks Ray. I extended the number of rows to 10000, because when I set it to 8000 I missed the last value, 80.00. I figured that since I set the upper limit to 80.00, even if it generated more rows, they would not be processed.

I did get all the values from 0.01 to 80.00, but I am getting duplicates for some values. I added a Sort and a Remove Duplicates stage between the Column Generator and the Dataset, but the duplicates still remain.

Any suggestions on why that might be happening?

jreddy · Post by **jreddy** » Fri Sep 07, 2007 5:28 am

Actually, what I did to get rid of the duplicates was to set the 'Allow duplicates' option to False in the Sort stage, and it all worked fine.

Thanks for your advice, Ray. I appreciate it.

jreddy · Post by **jreddy** » Fri Feb 15, 2008 2:48 pm

There is a new problem with this same job. I have the Row Generator and Column Generator operating in sequential mode (initial value: 0, increment: 0.01), and then I remove the duplicates generated.

But now I have realised that a couple of values are consistently missing. For this job, which generates data between 0 and 80, these values are missing every time I run it:

0.14, 17.9, 72.12

I am unable to figure out why these are missing. I noticed that the Column Generator itself is not generating these 3 values. I am running with a 2-node configuration.

Has anyone had a similar problem before, and does anyone have a suggestion on how to make sure all values are generated?

Thanks in advance

kcbland · Post by **kcbland** » Fri Feb 15, 2008 3:06 pm

It's probably a silly math issue with the internal algorithms. Maybe the partitioning logic is doing something stupid like using floating point.

Have you considered generating integer values and then dividing by 100 afterwards to get back to the scale you want?

jreddy · Post by **jreddy** » Fri Feb 15, 2008 3:43 pm

Thanks Kenneth,

I still can't understand why, but doing what you suggested made my job work.

It must be some silly math algorithm issue, as you said.

Thanks

kcbland · Post by **kcbland** » Fri Feb 15, 2008 4:23 pm

I'm sure Ray can give you the specific reason, but the idea is that when dealing with decimal values you run into floating point precision. 1/7 is one example of a value whose decimal expansion repeats forever:

0.14285714285714285714285714285714

Notice a pattern? So when you say .14 a

goes off above my head. My guess is that some partitioning algorithm used somehow drops these rows because they're infinite series and not nice numbers.
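The floating point effect guessed at above can be illustrated outside DataStage. The sketch below is plain Python, and it assumes nothing about how the generator operator works internally (the thread only speculates that floats are involved). It shows that the binary double nearest to 0.01 is not exactly 0.01, and compares repeated addition with floats against exact decimals. The helper name `accumulate` is hypothetical.

```python
from decimal import Decimal

# Exact value of the binary double that Python stores for the literal 0.01.
print(Decimal(0.01))

def accumulate(step, count):
    """Return the running totals obtained by adding step to zero count times."""
    totals = []
    total = step - step  # zero of the same type as step
    for _ in range(count):
        total += step
        totals.append(total)
    return totals

float_totals = accumulate(0.01, 8000)
decimal_totals = accumulate(Decimal("0.01"), 8000)

# Count the steps where the float running total is not the double nearest to n/100.
drifted = sum(1 for n, t in enumerate(float_totals, start=1) if t != n / 100)
print(drifted, "of 8000 float running totals differ from n/100")

# The exact decimal accumulation ends on the limit with no drift.
print(decimal_totals[-1])
```

Any value that drifts this way can round or truncate to a neighbouring two-decimal value when it is converted to decimal(4,2), which would produce a duplicate and a gap at the same time. Whether that is what happens inside the Column Generator is an assumption, not something confirmed in this thread.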

If volume isn't a consideration you could have tried one node processing to remove the partitioning from the equation.

ray.wurlod · Post by **ray.wurlod** » Fri Feb 15, 2008 5:30 pm

Generate integers 1 through 8000 and divide by 100 downstream.
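As a rough illustration of that approach, here is a sketch in Python rather than DataStage. The function name and its parameters are hypothetical; in the job itself the division would be done in a downstream stage such as a Transformer.

```python
from decimal import Decimal

def generate_dimension_values(first=1, last=8000, scale=100):
    """Generate the integers first..last and divide each by scale exactly."""
    two_places = Decimal("0.01")
    return [
        (Decimal(n) / Decimal(scale)).quantize(two_places)
        for n in range(first, last + 1)
    ]

values = generate_dimension_values()
print(len(values), values[0], values[-1])

# The three values reported missing earlier in the thread are all present.
for wanted in ("0.14", "17.90", "72.12"):
    print(wanted, Decimal(wanted) in values)
```

Because the counter is an integer, each step is exact, and the only non-integer operation is a single division per row, so no error accumulates from one row to the next.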

The problem probably is related to internal storage of floating point numbers but I am unable (and, indeed, unwilling) to devote time to investigating more closely. As well, I'd probably need source code for the generator operator, which I don't have. Why not ask the question of the vendor?