How to assign key values for duplicated records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
reddysrc
Participant
Posts: 39
Joined: Mon May 30, 2005 3:19 pm

How to assign key values for duplicated records

Post by reddysrc »

Hi,

I have millions of records in my source data, including duplicates. I need to assign a key value to every record, with the same key value given to duplicate records: one ID per name, regardless of address. It should be done in one job.

Could you let me know how this is possible?

Here is the source data:

Name   Address           Zip
John   123 west st, VA   23233
steve  122 E st, CA      92804
John   89 NE st, NC      55123
steve  44 main st, MN    45123
John   66 old st, WI     53186

Above is my source data; I need to convert it into the file below, in one job:

ID    Name   Address           Zip
1001  John   123 west st, VA   23233
2002  steve  122 E st, CA      92804
1001  John   89 NE st, NC      55123
2002  steve  44 main st, MN    45123
1001  John   66 old st, WI     53186

John should have ID 1001 for all addresses.
Steve should have ID 2002 for all addresses.


Thanks
Vemi
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Source Data

Code:

Key   Address
John  123 west st,VA 23233 
steve 122 E st , CA 92804 
John  89 NE st,NC 55123 
steve 44 main st,MN 45123 
John  66 old st,WI 53186 
Source data ----> Copy stage ----> two links

First link: keep the data as it is.

Second link: Aggregator on the key (here, Name) ----> Surrogate Key Generator


Then join both outputs on the key.

But this will only work for a one-time load. Otherwise, create a dimension table with the surrogate key and the key on file, then load the data as you would load a star schema.
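The Copy / Aggregator / Surrogate Key Generator / Join flow above can be sketched in plain Python (illustrative only; in DataStage the stages do this in parallel, and this sketch hands out sequential IDs starting at 1001 rather than the 1001/2002 of the example):

```python
# One link keeps the rows as-is; the other reduces to distinct keys,
# generates a surrogate key per key, and the results are joined back.
rows = [
    ("John",  "123 west st, VA", "23233"),
    ("steve", "122 E st, CA",    "92804"),
    ("John",  "89 NE st, NC",    "55123"),
    ("steve", "44 main st, MN",  "45123"),
    ("John",  "66 old st, WI",   "53186"),
]

# "Aggregator": collapse to the distinct key values (here, Name).
distinct_names = sorted({name for name, _, _ in rows})

# "Surrogate Key Generator": one new ID per distinct key.
key_map = {name: 1001 + i for i, name in enumerate(distinct_names)}

# "Join": attach the generated ID back onto every original row.
output = [(key_map[name], name, addr, zip_) for name, addr, zip_ in rows]
for rec in output:
    print(rec)
```

Every duplicate of a name lands on the same ID because the join key is the name itself.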
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Hi,

You can also achieve this using a Sort stage. In the Sort stage, sort the rows by Name and enable the key change column option. The output of the Sort stage will then include one extra field (the key change column). Pass this output to a Transformer stage; based on the value of the key change column, you can generate an ID for the rows using stage variables.



i/p -----> Sort -----> Transformer -----> o/p
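A plain-Python sketch of that Sort + Transformer pattern (illustrative only, not DataStage syntax): the key change flag is 1 on the first row of each Name group and 0 otherwise, and the "stage variable" bumps the ID only when the flag is 1, so the generated IDs come out sequential (1001, 1002, ...).

```python
rows = [
    ("John",  "123 west st, VA", "23233"),
    ("steve", "122 E st, CA",    "92804"),
    ("John",  "89 NE st, NC",    "55123"),
    ("steve", "44 main st, MN",  "45123"),
    ("John",  "66 old st, WI",   "53186"),
]
rows.sort(key=lambda r: r[0])           # Sort stage: sort by Name

current_id = 1000                       # stage variable holding the last ID
prev_name = None
output = []
for name, addr, zip_ in rows:
    key_change = 1 if name != prev_name else 0  # the extra key change column
    if key_change:
        current_id += 1                 # new group -> new surrogate key
    output.append((current_id, name, addr, zip_))
    prev_name = name
```

After sorting, all duplicates of a name are adjacent, so they all pick up the same `current_id`.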
Nag
shaimil
Charter Member
Posts: 37
Joined: Fri Feb 28, 2003 5:37 am
Location: UK

Post by shaimil »

If time is not an issue, then you could simply use a server job and a hash file.

1. Read the data.

2. Look up the key in the hash file and pull back the surrogate key, if it is there. Obviously it will be empty to start with.

3. If the key is not found in the lookup, assign a new surrogate key and write the key field and surrogate key to the hash file.

4. Output the data.

Just be careful with the hash file settings; read the documentation. From memory, you need to disable "lock for updates" on the lookup and avoid write caching on output to the hash file.
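The four steps above can be sketched in plain Python, with a dict standing in for the hash file that persists the Name-to-key pairs (illustrative only; IDs here are sequential from 1001, and the locking/caching caveats have no equivalent in a single-process sketch):

```python
hash_file = {}   # stands in for the hash file; empty to start with
next_id = 1000

def assign_key(name):
    """Look up the name; on a miss, assign a new key and write it back."""
    global next_id
    if name not in hash_file:           # step 2/3: lookup miss
        next_id += 1
        hash_file[name] = next_id       # write key + surrogate to "hash file"
    return hash_file[name]              # step 2: lookup hit

rows = [
    ("John",  "123 west st, VA", "23233"),
    ("steve", "122 E st, CA",    "92804"),
    ("John",  "89 NE st, NC",    "55123"),
]
output = [(assign_key(name), name, addr, zip_) for name, addr, zip_ in rows]
```

Because the lookup and the write go to the same store, later duplicates always hit the key their first occurrence created.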
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

shaimil wrote:If time is not an issue, then you could simply use a server job and a hash file.
Why use a server job if it does not give better results?
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Post Reply