Scenerio with datastage limitations
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 148
- Joined: Thu Apr 10, 2008 12:47 am
Scenerio with datastage limitations
Hi All,
i have situation here, not sure whether parellel job can handle this.
file_name column is actual file name and record is there correspondance record, now i have to capture each data seperatly in respective file given below also there order should be retrieved as it present in input file
input file
file_name record
FILE1 Record1
FILE1 Record2
FILE1 Record3
FILE2 Record21
FILE2 Record22
FILE2 Record23
FILE3 Record31
FILE3 Record32
FILE3 Record33
FILE4 Record41
FILE4 Record42
FILE4 Record43
output should be equal to number of file present in file name column
so here file_name column have total 4 file FILE1...FILE4 so total 4 output should be there.
e.g
File1--file name
Record_column----column name
Record1
Record2
Record3
File2
Record_column
Record21
Record22
Record23
File3
Record_column
Record31
Record32
Record33
File4
Record_column
Record41
Record42
Record43
now i have to designe keeping in mind that number of files can be changed in some run i will get 20 files or 19 or may 21.
also each file will have 10000 corespond data.
regards,
Devesh
i have situation here, not sure whether parellel job can handle this.
file_name column is actual file name and record is there correspondance record, now i have to capture each data seperatly in respective file given below also there order should be retrieved as it present in input file
input file
file_name record
FILE1 Record1
FILE1 Record2
FILE1 Record3
FILE2 Record21
FILE2 Record22
FILE2 Record23
FILE3 Record31
FILE3 Record32
FILE3 Record33
FILE4 Record41
FILE4 Record42
FILE4 Record43
output should be equal to number of file present in file name column
so here file_name column have total 4 file FILE1...FILE4 so total 4 output should be there.
e.g
File1--file name
Record_column----column name
Record1
Record2
Record3
File2
Record_column
Record21
Record22
Record23
File3
Record_column
Record31
Record32
Record33
File4
Record_column
Record41
Record42
Record43
now i have to designe keeping in mind that number of files can be changed in some run i will get 20 files or 19 or may 21.
also each file will have 10000 corespond data.
regards,
Devesh
-
- Participant
- Posts: 46
- Joined: Fri Oct 06, 2006 6:36 am
- Location: Tucson
Hi Devesh
Hi Devesh,
I have tried your scenario.The desired output can be achived through the help of a datastage job sequence and a parallel job. The steps are as below-
In job Sequence-
1. Execute command1-
cat PathName/FileName | nawk -F"|" '{print $1}' | sort -u | wc -l
The above command will open the file and take the distinct count of the first column.In your case it is 4.
This count will be the end value for the start loop activity and will determine the number of times the loop should run.
Each time it wiil pass a new FileName as parameter and that will helps us to crate a new file.
2.UserVariable Activity1-
Here we are formatting the output of the execute command
trim(Convert(@FM,"", Execute_Command1Output))
3.StartLoopActivity-
From=1
Step=1
To=#UserVariable Activity1#
4.ExecuteCommand2-
Command
cat PathName/FileName | nawk -F"|" '{print $1}'| sort -u|sed -n
Parameters
#StartLoop_Activity_58.$Counter#p
Where #StartLoop_Activity_58.$Counter#=The counter of the loop.
The above command returns the #FileName# as each loop runs.
Eg- cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 1(#StartLoop_Activity_58.$Counter#)p will give
File1
cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 2(#StartLoop_Activity_58.$Counter#)p will give
File2
etc
5.UserVariable Activity2-
Here we are formatting the output of the esecute command
trim(Convert(@FM,"", Execute_Command2Output))
6.JobActivity-
for calling the job and passing the filename as parameter
The Filename parameter of the job=UserVariable Activity2(in the Expression Grid)
7. Endloop Activity- Completing the loop and mapping it back to start loop activity.
In datastage parallel Job-
A parrallel job is designed for creating multiple file.
seqfile------->filterstage------->seqfile
1.In seq file we can read the source file.
2. in filter stage we can pass the below transformation-
Col1 like '#FileName#'
where Col1 is the first column carrying the file name and the #FileName# is the parameter getting passed from the Job Sequence.
We will map only the second column to the target which should be the contents of the file.
3. we are createing the file with the #FileName#.txt where #FileName# we are getting from the JobSequence as parameter.
Now the loop runs in the sequence and will call the job with the #FileName# parameter. Once the job runs the filter stage will filter out the records for the respective #FileName# from the source file and create separate files in the target for each group of values.
Please let me know if you have any clarification.
I have tried your scenario.The desired output can be achived through the help of a datastage job sequence and a parallel job. The steps are as below-
In job Sequence-
1. Execute command1-
cat PathName/FileName | nawk -F"|" '{print $1}' | sort -u | wc -l
The above command will open the file and take the distinct count of the first column.In your case it is 4.
This count will be the end value for the start loop activity and will determine the number of times the loop should run.
Each time it wiil pass a new FileName as parameter and that will helps us to crate a new file.
2.UserVariable Activity1-
Here we are formatting the output of the execute command
trim(Convert(@FM,"", Execute_Command1Output))
3.StartLoopActivity-
From=1
Step=1
To=#UserVariable Activity1#
4.ExecuteCommand2-
Command
cat PathName/FileName | nawk -F"|" '{print $1}'| sort -u|sed -n
Parameters
#StartLoop_Activity_58.$Counter#p
Where #StartLoop_Activity_58.$Counter#=The counter of the loop.
The above command returns the #FileName# as each loop runs.
Eg- cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 1(#StartLoop_Activity_58.$Counter#)p will give
File1
cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 2(#StartLoop_Activity_58.$Counter#)p will give
File2
etc
5.UserVariable Activity2-
Here we are formatting the output of the esecute command
trim(Convert(@FM,"", Execute_Command2Output))
6.JobActivity-
for calling the job and passing the filename as parameter
The Filename parameter of the job=UserVariable Activity2(in the Expression Grid)
7. Endloop Activity- Completing the loop and mapping it back to start loop activity.
In datastage parallel Job-
A parrallel job is designed for creating multiple file.
seqfile------->filterstage------->seqfile
1.In seq file we can read the source file.
2. in filter stage we can pass the below transformation-
Col1 like '#FileName#'
where Col1 is the first column carrying the file name and the #FileName# is the parameter getting passed from the Job Sequence.
We will map only the second column to the target which should be the contents of the file.
3. we are createing the file with the #FileName#.txt where #FileName# we are getting from the JobSequence as parameter.
Now the loop runs in the sequence and will call the job with the #FileName# parameter. Once the job runs the filter stage will filter out the records for the respective #FileName# from the source file and create separate files in the target for each group of values.
Please let me know if you have any clarification.
Re: Scenerio with datastage limitations
Hi ,devesh_ssingh wrote:...capture each data seperatly in respective file given below also there order should be retrieved as it present in input file
Had found this on dsxchange itself.
In a parallel job use an External_target stage and use the below code in the program box.
Code: Select all
cd /your_directory;awk '{field1=index($0,",");print substr($0,field1+1)>substr($0,1,field1-1)}'
Thanks,
Chetan.C
-
- Participant
- Posts: 148
- Joined: Thu Apr 10, 2008 12:47 am
@dvpawankumar---i already have the scripts but end user support team lack knowledge of unix and forcing us to achieve with DS jobs.
@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.
@bhasds - Thanks for detailed solution, much appreciated !!!!
will try and let you know if stuck somewhere.
@chetan.c ---thanks for your feedback. will try by this method only.
Regards!!!!
Devesh
@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.
@bhasds - Thanks for detailed solution, much appreciated !!!!
will try and let you know if stuck somewhere.
@chetan.c ---thanks for your feedback. will try by this method only.
Regards!!!!
Devesh
You said as much in someone else's Folder stage post and some suggestions were made there as to how to solve that. Did you try to follow up on them?devesh_ssingh wrote:@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
For the first question, you can concatenate the ".txt" to the filename column within a transformer.
The code does this.
field1=index($0,","); Find out the position of "," which I'm assuming as the delimiter between Filename and the record($0 the entire row).
substr($0,field1+1) Extracting the second field from position of ","+1
> Redirect to filename
substr($0,1,field1-1) Extract filename from beginning of the string until position of ","-1.
Thanks,
Chetan.C
The code does this.
field1=index($0,","); Find out the position of "," which I'm assuming as the delimiter between Filename and the record($0 the entire row).
substr($0,field1+1) Extracting the second field from position of ","+1
> Redirect to filename
substr($0,1,field1-1) Extract filename from beginning of the string until position of ","-1.
Thanks,
Chetan.C
An issue I foresee with the external target is that it executes against every line so your redirect should really use ">>" and you need to ensure the file doesn't exist/is empty prior to running.
Alternatively, run the command supplied via a sequence command stage if source is already in the format you require (or direct via command prompt if OS level control is acceptable)
Alternatively, run the command supplied via a sequence command stage if source is already in the format you require (or direct via command prompt if OS level control is acceptable)