1) I am doing a lookup against a reference table that has 1 row, but from the source I get 100 million records.
As I understand it, the Lookup stage just passes the source records through, and since the lookup table has only one row, there should be no memory issue. I believe the Lookup stage has about 2 GB of space available.
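To illustrate why I expect memory not to be a problem, here is a minimal sketch in plain Python (not DataStage). It assumes the reference data is held in memory while source records stream through one at a time. The names `lookup_stream`, `reference`, and `source` are made up for illustration only.

```python
def lookup_stream(source, reference, key_index=0):
    """Yield each source record with the matching reference value appended.

    Only `reference` is held in memory; source records are read one by one.
    """
    for record in source:
        # Unmatched keys get None here; a real job could reject or drop them.
        yield record + (reference.get(record[key_index]),)


# Hypothetical one-row reference table.
reference = {"A": "Active"}

# A generator stands in for the large source; nothing is materialized up front.
source = (("A" if i % 2 == 0 else "B", i) for i in range(6))

result = list(lookup_stream(source, reference))
```

In this sketch memory use depends on the size of the reference data, not on the number of source records.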
2) I need to get the max date for each combination of 3 keys.
Example columns: SSN, Policy ID, Last name, Trns DT
The data contains multiple transactions for the same SSN, Policy ID and Last name, but I only need the latest one (the record with the most recent Trns DT).
Procedure:
From the source I will get around 80 million records.
Step 1: I sort all the records with a Sort stage on SSN, Policy ID, Last name, Trns DT, in that order.
Step 2: After the Sort stage, I use a Remove Duplicates stage with these keys:
SSN, Policy ID, Last name
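The logic of the two steps can be sketched in plain Python (not DataStage) as below. The sample records are invented, and the sketch assumes Trns DT is stored in a format that sorts chronologically (ISO `YYYY-MM-DD` here). Note that it only returns the latest record because the date sort is ascending and the last record of each key group is kept.

```python
from itertools import groupby
from operator import itemgetter

# Invented sample rows: (SSN, Policy ID, Last name, Trns DT)
records = [
    ("111-11-1111", "P1", "Smith", "2020-01-05"),
    ("111-11-1111", "P1", "Smith", "2021-03-10"),
    ("222-22-2222", "P2", "Jones", "2019-07-01"),
    ("111-11-1111", "P1", "Smith", "2020-11-30"),
]

group_key = itemgetter(0, 1, 2)    # SSN, Policy ID, Last name
sort_key = itemgetter(0, 1, 2, 3)  # group keys plus Trns DT

# Step 1: sort on the three keys, then on Trns DT ascending.
sorted_records = sorted(records, key=sort_key)

# Step 2: within each key group, keep only the last (latest) record.
latest = [list(group)[-1] for _, group in groupby(sorted_records, key=group_key)]
```

If the sort on Trns DT were descending instead, the first record of each group would be the one to keep.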
I tested this with a small amount of data and it works for me.
Is there anything else I should take care of, such as partitioning and the number of nodes?