Page 1 of 2

need a help on job design

Posted: Mon Jan 05, 2009 6:51 am
by dr.murthy
hi all,

could you please share your thoughts for job design of the following reqirement
mine source data is

ID,LOC,PHONE,COUNTRY
1,Home,1234,IND
1,Work,0012,IND
2,Home,0345,AUS
3,Work,0234,JAP
4,Work,1357,USA
4,Home,5678,USA

i need an out put like

ID,HOME_PHONE,WORK_PHONE,COUNTRY
1,1234,0012,IND
2,0345,NULL,AUS
3,NULL,0234,JAP
4,5678,1357,USA


ANY SUGGESTIONS????

Posted: Mon Jan 05, 2009 7:20 am
by dhanashreepanse
1. Split your input file into 2 streams based on the LOC (home or work). This will split into 3 rows each. Have all the columns except LOC.

2. Do a Full outer join on the 2 streams thus generated. Keys should be ID and COUNTRY.

Posted: Mon Jan 05, 2009 11:10 am
by Sainath.Srinivasan
Split the phone into two columns and pass through aggregator to max

Posted: Mon Jan 05, 2009 4:11 pm
by ray.wurlod
Will the country ever be different for the same value of ID?

Posted: Mon Jan 05, 2009 11:52 pm
by dr.murthy
ray.wurlod wrote:Will the country ever be different for the same value of ID? ...

yes ray country code would be different .different id's having same country code.

Posted: Tue Jan 06, 2009 12:37 am
by jhmckeever
Just to clarify: You're stating that a specific ID value may be associated with more than one COUNTRY? (In your example data, each ID has only a single COUNTRY associated to it)

Posted: Tue Jan 06, 2009 2:19 am
by dr.murthy
jhmckeever wrote:Just to clarify: You're stating that a specific ID value may be associated with more than one COUNTRY? (In your example data, each ID has only a single COUNTRY associated to it)
yes, here id means it's an emp_id field,so in each COUNTRY n number of emp_id's are available.COUNTRY it's not an key field

Posted: Tue Jan 06, 2009 3:09 am
by priyadarshikunal
dr.murthy wrote:
jhmckeever wrote:Just to clarify: You're stating that a specific ID value may be associated with more than one COUNTRY? (In your example data, each ID has only a single COUNTRY associated to it)
yes, here id means it's an emp_id field,so in each COUNTRY n number of emp_id's are available.COUNTRY it's not an key field
So Which Country (Home or work) you are planning to put in country field (Output) if they can be different for same Emp ID.

Posted: Tue Jan 06, 2009 4:17 am
by dr.murthy
So Which Country (Home or work) you are planning to put in country field (Output) if they can be different for same Emp ID.[/quote]

hi ,
please find my source data format MY SOURCE DATA IS

EMPID,LOC,PHONE,COUNTRY
1,Home,1234,IND
2,Home,0345,IND
3,Work,0234,JAP
4,Work,1357,USA
1,Work,0012,IND
4,Home,5678,USA

I NEED AN OUT PUT SOME THING LIKE

EMPID,HOME_PHONE,WORK_PHONE,COUNTRY
1,1234,0012,IND
2,0345,NULL,IND
3,NULL,0234,JAP

in source data some empid's doesnot have neither home nor work phone numbers.so i design the job some thing like

SEQFILE---->SWITCH----->


IN SWITCH I SPLITTED THE DATA BASED ON LOC MEANS HOME REC GOES TO ONE LINK AND WORK RECORDS GOES TO SOME OTHER LINK
FINALLY I JOINED THEM BOTH BASED ON EMPID AND RENAME THE FIELD NAMES.SUPPOSE IF THE RECORDS COMMING FROM LOC LINK THEN I RENAMED THE PHONE FIELD AS LOC_PHONE SIMILARLY IF THE RECORDS COMMING FROM HOME LINK IT WAS RENAMED AS HOME_PHONE

O/P GIVES ONLY FOUR FIELDS

EMPID,HOME_PHONE,WORK_PHONE,COUNTRY

BUT ISSUE IS IF SOME RECORDS DOESNOT HAVE HOME PHONE OR WORK PHONE WE ARE NOT ABLE TO HANDLE THOSE RECORDS

Posted: Tue Jan 06, 2009 4:22 am
by dhanashreepanse
First time when you split your file, capture all such records that do not have HOME PHONE OR WORK PHONE in a separate file.

In the end you can append this file to the actual output file.

Posted: Tue Jan 06, 2009 5:25 am
by priyadarshikunal
dr.murthy wrote:So Which Country (Home or work) you are planning to put in country field (Output) if they can be different for same Emp ID.
hi ,
please find my source data format MY SOURCE DATA IS

EMPID,LOC,PHONE,COUNTRY
1,Home,1234,IND
2,Home,0345,IND
3,Work,0234,JAP
4,Work,1357,USA
1,Work,0012,IND
4,Home,5678,USA

I NEED AN OUT PUT SOME THING LIKE

EMPID,HOME_PHONE,WORK_PHONE,COUNTRY
1,1234,0012,IND
2,0345,NULL,IND
3,NULL,0234,JAP

in source data some empid's doesnot have neither home nor work phone numbers.so i design the job some thing like

SEQFILE---->SWITCH----->


IN SWITCH I SPLITTED THE DATA BASED ON LOC MEANS HOME REC GOES TO ONE LINK AND WORK RECORDS GOES TO SOME OTHER LINK
FINALLY I JOINED THEM BOTH BASED ON EMPID AND RENAME THE FIELD NAMES.SUPPOSE IF THE RECORDS COMMING FROM LOC LINK THEN I RENAMED THE PHONE FIELD AS LOC_PHONE SIMILARLY IF THE RECORDS COMMING FROM HOME LINK IT WAS RENAMED AS HOME_PHONE

O/P GIVES ONLY FOUR FIELDS

EMPID,HOME_PHONE,WORK_PHONE,COUNTRY

BUT ISSUE IS IF SOME RECORDS DOESNOT HAVE HOME PHONE OR WORK PHONE WE ARE NOT ABLE TO HANDLE THOSE RECORDS[/quote]

Do you really need the records without any details.
If yes then pass the records having null values in the location field to third link and perform a full outer join on in all three links. But this depends on the requirement as if a value in number is present and location is not present then where to put that number.

EMPID,LOC,PHONE,COUNTRY
1,Home,1234,IND
1,Work,0345,USA

then what should be the output or you are sure as this will not happen.

Posted: Tue Jan 06, 2009 5:43 am
by dr.murthy
yes i need all records,for example in my source data empid =2 doesnot have a work phone so i need a output record with work_phone is null.

but in my job design iam not able to capture those records in full outer join.can you please tell me is any other way to design he job.

Posted: Tue Jan 06, 2009 10:34 am
by priyadarshikunal
dr.murthy wrote:yes i need all records,for example in my source data empid =2 doesnot have a work phone so i need a output record with work_phone is null.

but in my job design iam not able to capture those records in full outer join.can you please tell me is any other way to design he job.

Code: Select all

If yes then pass the records having null values in the location field to third link and perform a full outer join on in all three links. But this depends on the requirement as if a value in number is present and location is not present then where to put that number. 
As posted in other thread (Your current design). It may fail if data comes like.

EMPID,LOC,PHONE,COUNTRY
1,Home,1234,IND
1,,,IND

as you will get two records which might not be needed.

I cannot explain each and everything. Give the problem a thought based on all the suggestions you have got. Also think on the two probable resolutions you have in this thread.

One More thing is you haven't replied to my question in previous post mentioned in last few lines. :evil:

Posted: Tue Jan 06, 2009 11:09 pm
by dr.murthy
hi,
i got a solution finally, i develop a job but it seems very complicated.i had to use four lookups in my job.

as you specified
EMPID,LOC,PHONE,COUNTRY
1,Home,1234,IND
1,,,IND

if the empid =1 doesnot have a work phone number in source only one record is comming like
EMPID,LOC,PHONE,COUNTRY
1,Home,1234,IND
but in output i need work phone number is null FORempid=1

like
EMPID,WRK_PHN,HOME_PHN,LOC
1, NULL ,1234,,IND
IS THERE ANY SIMPLE WAY TO DESIGN THIS JOB

Posted: Wed Jan 07, 2009 1:16 am
by ray.wurlod
No need to shout! (All capitals = shouting)

There is a very simple solution, to which others have alluded and which requires no lookups at all.

The simple design is to use data sorted by ID, assemble the record in a Transformer stage populating the two phone numbers into separate fields with appropriate null handling, and using stage variables to remember the previous record. The output from the Transformer stage is then run through a single Aggregator stage which outputs the Last of each group; the data are grouped on ID. No lookups are required.

This solution is right for server job, because you posted in the server job forum. If it really is a parallel job, substitute a Remove Duplicates stage for the Aggregator stage and make sure that the data are hash or modulus partitioned by ID.