Page 1 of 1

Alternate logic required ?

Posted: Tue Jun 04, 2013 2:40 am
by chandra.shekhar@tcs.com
Hi All,

I am preparing a job for a new requirement in our Project.
I have to find out the Top 20 Customers for each Branch.
Now my source is a Seq File(~150 million rows) and it contains 4 Amount columns, Customer No and Branch No.
I have to add these Amount columns then find out the customers with most Amount balance.
So I have to use two Transformers :( 1st for addition of 4 Amount columns and nullability check and 2nd for filtering out top 20 customers.

In the 2nd transformer, I am doing hash partition on Branch No and then sorting on Branch asc, Total Amount desc.
Then using a stage variable where I generate a counter and finally only those rows will go the output where the count is <= 20.

Any alternate logics will be appreciated.

Posted: Tue Jun 04, 2013 5:34 am
by ray.wurlod
An alternate logic would be to use a ranking function in the presentation layer (e.g. a query) rather than using an ETL tool to perform a non-ETL task.

Posted: Tue Jun 04, 2013 5:40 am
by chandra.shekhar@tcs.com
Thanks Ray,
But I cant use a rank function since my source is a Seq File.
Is there any way where I can eliminate a transformer ?
Or is addition of columns possible in some other stage ??

Posted: Tue Jun 04, 2013 7:20 am
by chulett
Is it working for you right now? I don't see any burning reason to try to eliminate anything if it is.

Posted: Tue Jun 04, 2013 8:31 am
by chandra.shekhar@tcs.com
No Craig, the job is running fine.
Just wanted to know if there's anything which I may be missing.
Additionally I wanted tune the job for performance.
(As I said my source will have 150+ million records)
Also is addition of amount columns possible in transformer only ?
Or Modify stage can also do it ?

Posted: Tue Jun 04, 2013 9:22 am
by chulett
Do you have a 'performance problem' right now? Sure, a Modify stage can be used to 'add' columns. And FYI - you haven't stated what your target is.

Posted: Wed Jun 05, 2013 12:09 am
by chandra.shekhar@tcs.com
Yeah Craig, My job completes in around 1.5 hrs, and I need to make it finish with in 30 min.
Can you give me the logic that can be used in Modify stage ? I have checked but didnt get the solution.
And my target is a DB2 table.

Posted: Wed Jun 05, 2013 2:37 am
by jerome_rajan
A UNIX script may be? And then you can use the "External Source' to read the data.

Posted: Wed Jun 05, 2013 7:08 am
by chulett
Not sure what logic you'd need that's not in the documentation:
Modify stage: Options category: Specification

This is a statement with one of the following the forms:
* DROP columnname [, columnname]
* KEEP columnname [, columnname]
* new_columnname [:new_type] = [explicit_conversion_function] old_columnname
Note the last one.

Posted: Wed Jun 05, 2013 1:09 pm
by mobashshar
One options is to use Unix script in before job subroutine and get the total of all 4 amount column.
For this you can use something like awk '$6+=$5+=$4+=$3' where $n is the column number with amount fields.
This will remove your first transformer. You will end up using one transformer to filter out top 20 customer per branch.
Hope this helps.

Posted: Thu Jun 06, 2013 12:16 am
by chandra.shekhar@tcs.com
@Craig,
I also found the same but I think the last logic is for conversion only, not for addition.
found the list of conversions logic from the same document but there was nothing for addition.

@mobashshar
I will try that, but what if any of the columns has null value.
Then may be total amount value value will be null.
I dont know how the addition works in UNIX.

Posted: Thu Jun 06, 2013 12:27 am
by mobashshar
Check it. Your total amount will not be null in case if one or more amount fields are null. It will only add the numeric values.