Scenerio with datastage limitations

devesh_ssingh · Post by **devesh_ssingh** » Thu Jul 19, 2012 10:36 am

Hi All,

i have situation here, not sure whether parellel job can handle this.

file_name column is actual file name and record is there correspondance record, now i have to capture each data seperatly in respective file given below also there order should be retrieved as it present in input file

input file

file_name record
FILE1 Record1
FILE1 Record2
FILE1 Record3
FILE2 Record21
FILE2 Record22
FILE2 Record23
FILE3 Record31
FILE3 Record32
FILE3 Record33
FILE4 Record41
FILE4 Record42
FILE4 Record43

output should be equal to number of file present in file name column
so here file_name column have total 4 file FILE1...FILE4 so total 4 output should be there.
e.g

File1--file name
Record_column----column name
Record1
Record2
Record3

File2
Record_column
Record21
Record22
Record23

File3
Record_column
Record31
Record32
Record33

File4
Record_column
Record41
Record42
Record43

now i have to designe keeping in mind that number of files can be changed in some run i will get 20 files or 19 or may 21.
also each file will have 10000 corespond data.

regards,
Devesh

jwiles · Post by **jwiles** » Thu Jul 19, 2012 1:21 pm

You should be able to accomplish this using the Folder Stage from the Server pallete. It has been discussed many times here in the forum, and the Folder stage is documented in the IS documentation.

Regards,

dvpawankumar · Post by **dvpawankumar** » Thu Jul 19, 2012 3:29 pm

Hi Devesh,

Quick question. Is there any specific reason that you want to acieve this using only datastage. If not, this can be acieved by a simple unix script. Think of and let us know.

bhasds · Post by **bhasds** » Fri Jul 20, 2012 1:39 am

Hi Devesh,

I have tried your scenario.The desired output can be achived through the help of a datastage job sequence and a parallel job. The steps are as below-

In job Sequence-

1. Execute command1-

cat PathName/FileName | nawk -F"|" '{print $1}' | sort -u | wc -l

The above command will open the file and take the distinct count of the first column.In your case it is 4.

This count will be the end value for the start loop activity and will determine the number of times the loop should run.

Each time it wiil pass a new FileName as parameter and that will helps us to crate a new file.

2.UserVariable Activity1-

Here we are formatting the output of the execute command

trim(Convert(@FM,"", Execute_Command1Output))

3.StartLoopActivity-

From=1
Step=1
To=#UserVariable Activity1#

4.ExecuteCommand2-

Command

cat PathName/FileName | nawk -F"|" '{print $1}'| sort -u|sed -n

Parameters

#StartLoop_Activity_58.$Counter#p

Where #StartLoop_Activity_58.$Counter#=The counter of the loop.

The above command returns the #FileName# as each loop runs.

Eg- cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 1(#StartLoop_Activity_58.$Counter#)p will give

File1

cat dsxchange | nawk -F"|" '{print $1}'| sort -u|sed -n 2(#StartLoop_Activity_58.$Counter#)p will give

File2

etc

5.UserVariable Activity2-

Here we are formatting the output of the esecute command

trim(Convert(@FM,"", Execute_Command2Output))

6.JobActivity-

for calling the job and passing the filename as parameter

The Filename parameter of the job=UserVariable Activity2(in the Expression Grid)

7. Endloop Activity- Completing the loop and mapping it back to start loop activity.

In datastage parallel Job-

A parrallel job is designed for creating multiple file.

seqfile------->filterstage------->seqfile

1.In seq file we can read the source file.

2. in filter stage we can pass the below transformation-

Col1 like '#FileName#'

where Col1 is the first column carrying the file name and the #FileName# is the parameter getting passed from the Job Sequence.

We will map only the second column to the target which should be the contents of the file.

3. we are createing the file with the #FileName#.txt where #FileName# we are getting from the JobSequence as parameter.

Now the loop runs in the sequence and will call the job with the #FileName# parameter. Once the job runs the filter stage will filter out the records for the respective #FileName# from the source file and create separate files in the target for each group of values.

Please let me know if you have any clarification.

chetan.c · Post by **chetan.c** » Fri Jul 20, 2012 7:51 am

devesh_ssingh wrote:...capture each data seperatly in respective file given below also there order should be retrieved as it present in input file

Hi ,
Had found this on dsxchange itself.
In a parallel job use an External_target stage and use the below code in the program box.

Code: Select all

cd /your_directory;awk '{field1=index($0,",");print substr($0,field1+1)>substr($0,1,field1-1)}'

Just tried it out,it should give you the expected result.

Thanks,
Chetan.C

devesh_ssingh · Post by **devesh_ssingh** » Sun Jul 22, 2012 3:19 pm

@dvpawankumar---i already have the scripts but end user support team lack knowledge of unix and forcing us to achieve with DS jobs.

@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.

@bhasds - Thanks for detailed solution, much appreciated !!!!
will try and let you know if stuck somewhere.

@chetan.c ---thanks for your feedback. will try by this method only.

Regards!!!!
Devesh

chulett · Post by **chulett** » Sun Jul 22, 2012 11:05 pm

devesh_ssingh wrote:@jwiles--i have tried with folder stage but output is only picking the last corresponds records only.

You said as much in someone else's Folder stage post and some suggestions were made there as to how to solve that. Did you try to follow up on them?

vamsi.4a6 · Post by **vamsi.4a6** » Mon Jul 23, 2012 12:34 am

@chetan.c

I am getting required required ouput and Could you please clarify below two doubts

1)I need output with .txt extension?

2)You did not mention any filename in the unix code.May I know How it will work and could you please explain the algorthim in few steps what you mentioned in the code?

chetan.c · Post by **chetan.c** » Mon Jul 23, 2012 2:48 am

For the first question, you can concatenate the ".txt" to the filename column within a transformer.

The code does this.
field1=index($0,","); Find out the position of "," which I'm assuming as the delimiter between Filename and the record($0 the entire row).
substr($0,field1+1) Extracting the second field from position of ","+1
> Redirect to filename
substr($0,1,field1-1) Extract filename from beginning of the string until position of ","-1.

Thanks,
Chetan.C

Kryt0n · Post by **Kryt0n** » Mon Jul 23, 2012 4:29 pm

An issue I foresee with the external target is that it executes against every line so your redirect should really use ">>" and you need to ensure the file doesn't exist/is empty prior to running.

Alternatively, run the command supplied via a sequence command stage if source is already in the format you require (or direct via command prompt if OS level control is acceptable)

DSXchange

Scenerio with datastage limitations

Scenerio with datastage limitations

Hi Devesh

Re: Scenerio with datastage limitations