HI,
here i have millions of records in my source data with duplicate records.
i need to assign key values for all records, same key values for duplicate recods.
One ID for ONE name , no matter of address
it should be done in one job.
could you let me know how it is possible.
her is the source data
Name; Address; zip
John 123 west st,VA 23233
steve 122 E st , CA 92804
John 89 NE st,NC 55123
steve 44 main st,MN 45123
John 66 old st,WI 53186
above is my source data,
i need to convert into below file. in one job.
ID Name; Address; zip
1001 John 123 west st,VA 23233
2002 steve 122 E st , CA 92804
1001 John 89 NE st,NC 55123
2002 steve 44 main st,MN 45123
1001 John 66 old st,WI 53186
John sould have 1001 ID for all addresses
Steve should have 2002 ID for all addresses.
Thanks
Vemi
How to Assign key Values for duplicated records
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
Source Data
Source date ---->copy----> two links
first link keep data as it is.
second link ->aggregation on key (here Name)---> surrogate key generator
Then join both outputs on key.
But this will work only for one time load. Else create a dimension table with surrogate key and key on file. Then load data as you would load a star schema.
Code: Select all
Key Address
John 123 west st,VA 23233
steve 122 E st , CA 92804
John 89 NE st,NC 55123
steve 44 main st,MN 45123
John 66 old st,WI 53186
first link keep data as it is.
second link ->aggregation on key (here Name)---> surrogate key generator
Then join both outputs on key.
But this will work only for one time load. Else create a dimension table with surrogate key and key on file. Then load data as you would load a star schema.
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
Genius may have its limitations, but stupidity is not thus handicapped.
Hi ,
You can also achieve by using a sort stage.In sort stage sort the rows by the name and include the option keycolumn change .So , output of this sort stage will give you one extra field ( changekeycolumn) .Transfer this output to transformer stage .Based on the value of the changekeycolumn you can generate id for the rows using stage variables .
i/p-----> Sort ----> Transformer ----> o/p
You can also achieve by using a sort stage.In sort stage sort the rows by the name and include the option keycolumn change .So , output of this sort stage will give you one extra field ( changekeycolumn) .Transfer this output to transformer stage .Based on the value of the changekeycolumn you can generate id for the rows using stage variables .
i/p-----> Sort ----> Transformer ----> o/p
Nag
If time is not an issue and then you could simply use a server job and hash file.
1. Read data
2. Lookup up key in hash file and pull back surrogate key, if there. Obviously will be empty to start
3. If not in lookup assign key and write Key field and surrogate key to hash file.
4. Output data.
Just be careful of the hash file settings. Read the documentation. From memory you need to disable lock for updates on lookup and avoid write cache on output to hash file.
1. Read data
2. Lookup up key in hash file and pull back surrogate key, if there. Obviously will be empty to start
3. If not in lookup assign key and write Key field and surrogate key to hash file.
4. Output data.
Just be careful of the hash file settings. Read the documentation. From memory you need to disable lock for updates on lookup and avoid write cache on output to hash file.
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI