
Surrogate Key stage not working

Posted: Mon Oct 29, 2012 11:05 pm
by vamsi.4a6
I updated the state file value to 100. Then I ran the job below:

sequentialfile---->surrogateKeystage-->sequentialfileoutput

Output I am getting:
col1,col2,col3
"5","e","101"
"1","a","1101"
"6","f","2101"
"4","d","2102"
"2","b","3101"
"3","c","3102"


Expected output:
col1,col2,col3
1,a,101
2,b,102
3,c,103
4,d,104
5,e,105
6,f,106

Col3 is my surrogate key column.

Posted: Tue Oct 30, 2012 12:30 am
by Poovalingam
I think your question is why the generated keys are not in order? I think you are using a 4-node APT configuration file, so DataStage created the surrogate keys in 4 different sequences.

If you need the surrogate keys in sequence, you may need to run the stage in sequential mode. I have not worked much with the Surrogate Key stage, so it is better to wait for another expert to comment.

Cheers,
Poova.

Posted: Tue Oct 30, 2012 1:37 am
by jerome_rajan
That's not the issue here. Col3 is the surrogate key column. The problem here looks two-fold.

1. The SK is generated with a different pattern than just a one-up.
2. The data looks all jumbled up

Posted: Tue Oct 30, 2012 2:23 am
by Poovalingam
Parallelism is the cause of both issues. As I said, running in sequential mode will resolve your problem, but we will lose the parallelism. What is the problem if the surrogate keys do not follow the same pattern? As I understand it, a surrogate key is just a key and carries no meaning, so it can follow a different pattern. DataStage will ensure that the same key is not generated in later runs.

Cheers,
Poova.

Posted: Tue Oct 30, 2012 2:09 pm
by jwiles
Is the expected output what is actually required? Or to ask in another way: Is the actual output incorrect for the business rules being implemented?

Poova's analysis is correct...the output looks like it does because: 1) the SKG stage is running in parallel and 2) the block size within SKG is probably set to 1000. Within a partition, SKG is assigning keys from a block of numbers in order, which matches the output posted: p0--101, 102; p1--1101, 1102; p2--2101, 2102; p3--3101, 3102.
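To make the block arithmetic concrete, here is a rough Python sketch (not DataStage code). It assumes a block size of 1000, a first key of 101 (state file value 100), and that each of the 4 partitions takes exactly one block in partition order. The split of rows across partitions is hypothetical, inferred from the output posted above, and the function name assign_keys is made up for illustration. The real stage's block allocation may differ in detail.

```python
# Simplified model of per-partition key blocks; all names are illustrative.
BLOCK_SIZE = 1000  # assumed block size
FIRST_KEY = 101    # assumed: state file value 100, so the next key is 101


def assign_keys(partitions, first_key=FIRST_KEY, block_size=BLOCK_SIZE):
    """Give partition p the block starting at first_key + p * block_size."""
    keyed = []
    for p, rows in enumerate(partitions):
        block_start = first_key + p * block_size
        for offset, row in enumerate(rows):
            # Keys count up by one within a partition's own block.
            keyed.append((*row, block_start + offset))
    return keyed


# Hypothetical distribution of the six input rows over four partitions,
# chosen to match the keys seen in the posted output.
partitions = [
    [(5, "e")],
    [(1, "a")],
    [(6, "f"), (4, "d")],
    [(2, "b"), (3, "c")],
]

keyed = assign_keys(partitions)
for row in keyed:
    print(row)
```

Under these assumptions the sketch reproduces the keys 101, 1101, 2101, 2102, 3101, 3102 from the original post.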

Why "jumbled up"? SeqFile writes rows out in the order they arrive at the stage. When running in parallel, you're not guaranteed which partition will deliver its data first, so output order is not guaranteed to match input order unless you specifically write the job to guarantee it, either by running it sequentially (as suggested) or by sorting and collecting the rows so the output order matches the input order. This job does none of that, based upon the description given.
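As a rough illustration of the sequential alternative (again plain Python, not DataStage code), the sketch below sorts the rows on col1 and then hands out keys from a single counter. The function name sequential_keys and the starting key 101 are assumptions made for the example.

```python
from itertools import count


def sequential_keys(rows, first_key=101):
    """Sort rows on their first column, then number them from one counter."""
    keys = count(first_key)
    return [(*row, next(keys)) for row in sorted(rows)]


# Rows in the arrival order seen in the posted output.
arrived = [(5, "e"), (1, "a"), (6, "f"), (4, "d"), (2, "b"), (3, "c")]

result = sequential_keys(arrived)
for row in result:
    print(row)
```

With one counter and sorted input, the keys come out as 101 through 106 in col1 order, which is the expected output from the original post. The cost is that key generation no longer runs in parallel.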

Regards,