Alternate logic required ?
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
Alternate logic required ?
Hi All,
I am preparing a job for a new requirement in our Project.
I have to find out the Top 20 Customers for each Branch.
Now my source is a Seq File(~150 million rows) and it contains 4 Amount columns, Customer No and Branch No.
I have to add these Amount columns then find out the customers with most Amount balance.
So I have to use two Transformers 1st for addition of 4 Amount columns and nullability check and 2nd for filtering out top 20 customers.
In the 2nd transformer, I am doing hash partition on Branch No and then sorting on Branch asc, Total Amount desc.
Then using a stage variable where I generate a counter and finally only those rows will go the output where the count is <= 20.
Any alternate logics will be appreciated.
I am preparing a job for a new requirement in our Project.
I have to find out the Top 20 Customers for each Branch.
Now my source is a Seq File(~150 million rows) and it contains 4 Amount columns, Customer No and Branch No.
I have to add these Amount columns then find out the customers with most Amount balance.
So I have to use two Transformers 1st for addition of 4 Amount columns and nullability check and 2nd for filtering out top 20 customers.
In the 2nd transformer, I am doing hash partition on Branch No and then sorting on Branch asc, Total Amount desc.
Then using a stage variable where I generate a counter and finally only those rows will go the output where the count is <= 20.
Any alternate logics will be appreciated.
Thanx and Regards,
ETL User
ETL User
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
No Craig, the job is running fine.
Just wanted to know if there's anything which I may be missing.
Additionally I wanted tune the job for performance.
(As I said my source will have 150+ million records)
Also is addition of amount columns possible in transformer only ?
Or Modify stage can also do it ?
Just wanted to know if there's anything which I may be missing.
Additionally I wanted tune the job for performance.
(As I said my source will have 150+ million records)
Also is addition of amount columns possible in transformer only ?
Or Modify stage can also do it ?
Last edited by chandra.shekhar@tcs.com on Wed Jun 05, 2013 5:52 am, edited 1 time in total.
Thanx and Regards,
ETL User
ETL User
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
A UNIX script may be? And then you can use the "External Source' to read the data.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Not sure what logic you'd need that's not in the documentation:
Note the last one.Modify stage: Options category: Specification
This is a statement with one of the following the forms:
* DROP columnname [, columnname]
* KEEP columnname [, columnname]
* new_columnname [:new_type] = [explicit_conversion_function] old_columnname
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 91
- Joined: Wed Apr 20, 2005 7:59 pm
- Location: U.S.
One options is to use Unix script in before job subroutine and get the total of all 4 amount column.
For this you can use something like awk '$6+=$5+=$4+=$3' where $n is the column number with amount fields.
This will remove your first transformer. You will end up using one transformer to filter out top 20 customer per branch.
Hope this helps.
For this you can use something like awk '$6+=$5+=$4+=$3' where $n is the column number with amount fields.
This will remove your first transformer. You will end up using one transformer to filter out top 20 customer per branch.
Hope this helps.
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
@Craig,
I also found the same but I think the last logic is for conversion only, not for addition.
found the list of conversions logic from the same document but there was nothing for addition.
@mobashshar
I will try that, but what if any of the columns has null value.
Then may be total amount value value will be null.
I dont know how the addition works in UNIX.
I also found the same but I think the last logic is for conversion only, not for addition.
found the list of conversions logic from the same document but there was nothing for addition.
@mobashshar
I will try that, but what if any of the columns has null value.
Then may be total amount value value will be null.
I dont know how the addition works in UNIX.
Thanx and Regards,
ETL User
ETL User
-
- Participant
- Posts: 91
- Joined: Wed Apr 20, 2005 7:59 pm
- Location: U.S.