How to assign key values for duplicated records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
reddysrc
Participant
Posts: 39
Joined: Mon May 30, 2005 3:19 pm

How to assign key values for duplicated records

Post by reddysrc »

Hi,

I have millions of records in my source data, including duplicates. I need to assign a key value to every record, with the same key value given to duplicate records: one ID per name, regardless of address. It should be done in one job.

Could you let me know how this is possible?

Here is the source data:

Name   Address           Zip
John   123 west st, VA   23233
steve  122 E st, CA      92804
John   89 NE st, NC      55123
steve  44 main st, MN    45123
John   66 old st, WI     53186

Above is my source data; I need to convert it into the file below, in one job:

ID    Name   Address           Zip
1001  John   123 west st, VA   23233
2002  steve  122 E st, CA      92804
1001  John   89 NE st, NC      55123
2002  steve  44 main st, MN    45123
1001  John   66 old st, WI     53186

John should have ID 1001 for all addresses.
Steve should have ID 2002 for all addresses.


Thanks
Vemi
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Source Data

Code:

Key   Address
John  123 west st,VA 23233 
steve 122 E st , CA 92804 
John  89 NE st,NC 55123 
steve 44 main st,MN 45123 
John  66 old st,WI 53186 
Source data ----> Copy stage ----> two links

First link: keep the data as it is.

Second link: Aggregator on the key (here, Name) ----> Surrogate Key Generator


Then join both outputs on the key.

But this will only work for a one-time load. Otherwise, create a dimension table with the surrogate key and the key on file, then load the data as you would load a star schema.
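The Copy / Aggregator / Surrogate Key Generator / Join flow above can be sketched in plain Python (illustrative only; in DataStage the stages do this in parallel, and this sketch hands out sequential IDs starting at 1001 rather than the 1001/2002 of the example):

```python
# One link keeps the rows as-is; the other reduces to distinct keys,
# generates a surrogate key per key, and the results are joined back.
rows = [
    ("John",  "123 west st, VA", "23233"),
    ("steve", "122 E st, CA",    "92804"),
    ("John",  "89 NE st, NC",    "55123"),
    ("steve", "44 main st, MN",  "45123"),
    ("John",  "66 old st, WI",   "53186"),
]

# "Aggregator": collapse to the distinct key values (here, Name).
distinct_names = sorted({name for name, _, _ in rows})

# "Surrogate Key Generator": one new ID per distinct key.
key_map = {name: 1001 + i for i, name in enumerate(distinct_names)}

# "Join": attach the generated ID back onto every original row.
output = [(key_map[name], name, addr, zip_) for name, addr, zip_ in rows]
for rec in output:
    print(rec)
```

Every duplicate of a name lands on the same ID because the join key is the name itself.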
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Hi,

You can also achieve this using a Sort stage. In the Sort stage, sort the rows by Name and enable the key change column option. The output of the Sort stage will then include one extra field (the key change column). Pass this output to a Transformer stage; based on the value of the key change column, you can generate an ID for the rows using stage variables.



i/p -----> Sort -----> Transformer -----> o/p
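A plain-Python sketch of that Sort + Transformer pattern (illustrative only, not DataStage syntax): the key change flag is 1 on the first row of each Name group and 0 otherwise, and the "stage variable" bumps the ID only when the flag is 1, so the generated IDs come out sequential (1001, 1002, ...).

```python
rows = [
    ("John",  "123 west st, VA", "23233"),
    ("steve", "122 E st, CA",    "92804"),
    ("John",  "89 NE st, NC",    "55123"),
    ("steve", "44 main st, MN",  "45123"),
    ("John",  "66 old st, WI",   "53186"),
]
rows.sort(key=lambda r: r[0])           # Sort stage: sort by Name

current_id = 1000                       # stage variable holding the last ID
prev_name = None
output = []
for name, addr, zip_ in rows:
    key_change = 1 if name != prev_name else 0  # the extra key change column
    if key_change:
        current_id += 1                 # new group -> new surrogate key
    output.append((current_id, name, addr, zip_))
    prev_name = name
```

After sorting, all duplicates of a name are adjacent, so they all pick up the same `current_id`.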
Nag
shaimil
Charter Member
Posts: 37
Joined: Fri Feb 28, 2003 5:37 am
Location: UK

Post by shaimil »

If time is not an issue, then you could simply use a server job and a hash file.

1. Read the data.

2. Look up the key in the hash file and pull back the surrogate key, if it is there. Obviously it will be empty to start with.

3. If the key is not found in the lookup, assign a new surrogate key and write the key field and surrogate key to the hash file.

4. Output the data.

Just be careful with the hash file settings; read the documentation. From memory, you need to disable "lock for updates" on the lookup and avoid write caching on output to the hash file.
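The four steps above can be sketched in plain Python, with a dict standing in for the hash file that persists the Name-to-key pairs (illustrative only; IDs here are sequential from 1001, and the locking/caching caveats have no equivalent in a single-process sketch):

```python
hash_file = {}   # stands in for the hash file; empty to start with
next_id = 1000

def assign_key(name):
    """Look up the name; on a miss, assign a new key and write it back."""
    global next_id
    if name not in hash_file:           # step 2/3: lookup miss
        next_id += 1
        hash_file[name] = next_id       # write key + surrogate to "hash file"
    return hash_file[name]              # step 2: lookup hit

rows = [
    ("John",  "123 west st, VA", "23233"),
    ("steve", "122 E st, CA",    "92804"),
    ("John",  "89 NE st, NC",    "55123"),
]
output = [(assign_key(name), name, addr, zip_) for name, addr, zip_ in rows]
```

Because the lookup and the write go to the same store, later duplicates always hit the key their first occurrence created.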
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

shaimil wrote:If time is not an issue, then you could simply use a server job and a hash file.
Why use a server job if it does not give better results?
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Post Reply