Page 1 of 1

Logic Issue

Posted: Tue May 14, 2013 12:36 pm
by sriec12
I want to cross check my concepts with experts :)

1) I am doing lookup with a table which has 1 row but from source I get 100 M records.

Well, According my understanding the lookup stage just passes the sources information and since lookup has one row, there will be no memory issue. For lookup stage there will be 2GB space.

2) I need to get max date and populate 3 keys.

Ex: SSN, Policy ID, Last name , Trns DT

Here I have data in such a way that there will be multiple transactions. But I need to get latest data.

Procedure:

From source I will get around 80 million records:
Step 1: I will sort all the records with sort stage in order SSN, Policy ID, Last name , Trns DT

Step 2: After Sort stage , I use duplicate stage and keys are
SSN, Policy ID, Last name

I tested with small amount of data it works for me.


Is there anything else I should care ? Like Partition and Nodes ?

Posted: Wed May 15, 2013 12:10 am
by chandra.shekhar@tcs.com
Whatever you have told, it looks fine.
Instead of using Sort and Remove Duplicate stage, you can use only Transformer stage to achieve the same.

Posted: Wed May 15, 2013 1:35 am
by chulett
The data would still need to be sorted for the Transformer logic to work correctly.

What role is the lookup playing in all this? I'm guessing it is providing some kind of static data for the run but you didn't clarify. And the real test would be a full load, have you done that yet?

Posted: Wed May 15, 2013 3:55 am
by ray.wurlod
If you're doing a lookup against a single row table you're effectively returning a constant. You could generate this in the Transformer stage or in a Column Generator stage.

Posted: Wed May 15, 2013 8:49 am
by sriec12
1) Regarding Lookup Stage its not static value.

Its a weekly load, I need to load the as of date for every week. In this case

is it fine to do Lookup for one row against 80M records ?

I apologize for the confusion

Posted: Wed May 15, 2013 1:46 pm
by chulett
sriec12 wrote:1) Regarding Lookup Stage its not static value. Its a weekly load, I need to load the as of date for every week.
So as I said, static over the course of the run. Sure that's "fine" as it is cached into and read from memory but why not simply pass that in as a job parameter?

Posted: Thu May 16, 2013 7:39 am
by sriec12
@chulett --- So is it fine to use lookup 1 row aginast 80M records

Well, Its a great point......................It can be created as job parameter.

Reason for why its not parameterized

1) This is a fact job, very few jobs need this as of date.

Posted: Thu May 16, 2013 9:20 am
by chulett
Odd reason in my book, we have a ton of "single job" parameters. We use them for data needed during a run that changes on a daily basis without regard to how many jobs may (or may not) need each one.