Scenerio with datastage limitations

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
devesh_ssingh
Participant
Posts: 148
Joined: Thu Apr 10, 2008 12:47 am

Scenerio with datastage limitations

Post by devesh_ssingh »

Hi All,

i have situation here, not sure whether parellel job can handle this.

file_name column is actual file name and record is there correspondance record, now i have to capture each data seperatly in respective file given below also there order should be retrieved as it present in input file

input file

file_name record
FILE1 Record1
FILE1 Record2
FILE1 Record3
FILE2 Record21
FILE2 Record22
FILE2 Record23
FILE3 Record31
FILE3 Record32
FILE3 Record33
FILE4 Record41
FILE4 Record42
FILE4 Record43


output should be equal to number of file present in file name column
so here file_name column have total 4 file FILE1...FILE4 so total 4 output should be there.
e.g

File1--file name
Record_column----column name
Record1
Record2
Record3

File2
Record_column
Record21
Record22
Record23


File3
Record_column
Record31
Record32
Record33


File4
Record_column
Record41
Record42
Record43

now i have to designe keeping in mind that number of files can be changed in some run i will get 20 files or 19 or may 21.
also each file will have 10000 corespond data.

regards,
Devesh
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

You should be able to accomplish this using the Folder Stage from the Server pallete. It has been discussed many times here in the forum, and the Folder stage is documented in the IS documentation.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
dvpawankumar
Participant
Posts: 46
Joined: Fri Oct 06, 2006 6:36 am
Location: Tucson

Post by dvpawankumar »

Hi Devesh,

Quick question. Is there any specific reason that you want to acieve this using only datastage. If not, this can be acieved by a simple unix script. Think of and let us know.
Thanks and Regards,
Pavan.
bhasds
Participant
Posts: 79
Joined: Thu May 27, 2010 1:49 am

Hi Devesh

Post by bhasds »

Hi Devesh,


I have tried your scenario.The desired output can be achived through the help of a datastage job sequence and a parallel job. The steps are as below-

In job Sequence-

1. Execute command1-

cat PathName/FileName | nawk -F"|" '{print $1}' | sort -u | wc -l

The above command will open the file and take the distinct count of the first column.In your case it is 4.

This count will be the end value for the start loop activity and will determine the number of times the loop should run.

Each time it wiil pass a new FileName as parameter and that will helps us to crate a new file.

2.UserVariable Activity1-

Here we are formatting the output of the execute command

trim(Convert(@FM,"", Execute_Command1Output))

3.StartLoopActivity-

From=1
Step=1
To=#UserVariable Activity1#

4.ExecuteCommand2-

Command

cat PathName/FileName | nawk -F"|" '{print $1}'| sort -u|sed -n

Parameters

#StartLoop_Activity_58.$Counter#p

Where #StartLoop_Activity_58.$Counter#=The counter of the loop.

The above command returns the #FileName# as each loop runs.

Eg- cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 1(#StartLoop_Activity_58.$Counter#)p will give

File1

cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 2(#StartLoop_Activity_58.$Counter#)p will give

File2

etc


5.UserVariable Activity2-

Here we are formatting the output of the esecute command

trim(Convert(@FM,"", Execute_Command2Output))

6.JobActivity-

for calling the job and passing the filename as parameter

The Filename parameter of the job=UserVariable Activity2(in the Expression Grid)

7. Endloop Activity- Completing the loop and mapping it back to start loop activity.


In datastage parallel Job-

A parrallel job is designed for creating multiple file.


seqfile------->filterstage------->seqfile

1.In seq file we can read the source file.

2. in filter stage we can pass the below transformation-

Col1 like '#FileName#'

where Col1 is the first column carrying the file name and the #FileName# is the parameter getting passed from the Job Sequence.

We will map only the second column to the target which should be the contents of the file.

3. we are createing the file with the #FileName#.txt where #FileName# we are getting from the JobSequence as parameter.


Now the loop runs in the sequence and will call the job with the #FileName# parameter. Once the job runs the filter stage will filter out the records for the respective #FileName# from the source file and create separate files in the target for each group of values.


Please let me know if you have any clarification.
chetan.c
Participant
Posts: 112
Joined: Tue Jan 17, 2012 2:09 am
Location: Bangalore

Re: Scenerio with datastage limitations

Post by chetan.c »

devesh_ssingh wrote:...capture each data seperatly in respective file given below also there order should be retrieved as it present in input file
Hi ,
Had found this on dsxchange itself.
In a parallel job use an External_target stage and use the below code in the program box.

Code: Select all

cd /your_directory;awk '{field1=index($0,",");print substr($0,field1+1)>substr($0,1,field1-1)}'
Just tried it out,it should give you the expected result.

Thanks,
Chetan.C
devesh_ssingh
Participant
Posts: 148
Joined: Thu Apr 10, 2008 12:47 am

Post by devesh_ssingh »

@dvpawankumar---i already have the scripts but end user support team lack knowledge of unix and forcing us to achieve with DS jobs.

@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.


@bhasds - Thanks for detailed solution, much appreciated !!!!
will try and let you know if stuck somewhere.

@chetan.c ---thanks for your feedback. will try by this method only.


Regards!!!!
Devesh
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

devesh_ssingh wrote:@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.
You said as much in someone else's Folder stage post and some suggestions were made there as to how to solve that. Did you try to follow up on them?
-craig

"You can never have too many knives" -- Logan Nine Fingers
vamsi.4a6
Participant
Posts: 334
Joined: Sun Jan 22, 2012 7:06 am
Contact:

Post by vamsi.4a6 »

@chetan.c


I am getting required required ouput and Could you please clarify below two doubts

1)I need output with .txt extension?

2)You did not mention any filename in the unix code.May I know How it will work and could you please explain the algorthim in few steps what you mentioned in the code?
chetan.c
Participant
Posts: 112
Joined: Tue Jan 17, 2012 2:09 am
Location: Bangalore

Post by chetan.c »

For the first question, you can concatenate the ".txt" to the filename column within a transformer.

The code does this.
field1=index($0,","); Find out the position of "," which I'm assuming as the delimiter between Filename and the record($0 the entire row).
substr($0,field1+1) Extracting the second field from position of ","+1
> Redirect to filename
substr($0,1,field1-1) Extract filename from beginning of the string until position of ","-1.


Thanks,
Chetan.C
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

An issue I foresee with the external target is that it executes against every line so your redirect should really use ">>" and you need to ensure the file doesn't exist/is empty prior to running.

Alternatively, run the command supplied via a sequence command stage if source is already in the format you require (or direct via command prompt if OS level control is acceptable)
Post Reply