Routines to load data in to a sequential file
-
- Participant
- Posts: 26
- Joined: Thu Feb 03, 2005 1:05 am
Hi all,
For instance, the sequential file contains 10 records:
1
2
3(***end of report***)
4
5
6(***end of report***)
7
8
9
10(***end of report***)
As shown above, records 1 to 3 should go into a separate file (for example, file1.txt), records 4 to 6 into another separate file, and so on. Records should keep being written to the current file until ***end of report*** is found, and the next record should then start a new file.
In my scenario the record count may be 1000 or more.
Comments appreciated.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Create a stage variable in a Transformer stage to count the number of times the string has been seen.
Use constraint expressions to direct output accordingly. Rows go to the first file (output link) if the count is 0, to the second file if the count is 1, and so on.
For large numbers of files this approach is tedious. You could write everything to one file then use an after-job subroutine to write separate files. But what is the file naming convention?
Code:
SUBROUTINE ManyOutputFiles(SourceFile, ErrorCode)

   DEFFUN OpenTextFile(FileName, OpenMode, WriteMode, Logging) CALLING "DSU.OpenTextFile"

$INCLUDE UNIVERSE.INCLUDE FILEINFO.H
$UNDEFINE TESTING

   ErrorCode = 0
   ReadCount = 0
   LineCount = 0
   SeparatorCount = 0

* Open source file for reading.
   hSource = OpenTextFile((SourceFile), "R", "A", "Y")
   If Not(FileInfo(hSource, FINFO$IS.FILEVAR))
   Then
      ErrorCode = 1
      GoTo MainExit
   End

* Initialize file name counter component. Open first file for writing.
   Counter = 1
   FileName = "MyFile_" : Fmt(Counter,"R%4")
   hTarget = OpenTextFile(FileName, "W", "O", "Y")
   If Not(FileInfo(hTarget,FINFO$IS.FILEVAR))
   Then
      ErrorCode = 1
      GoTo MainExit
   End

* Main loop reads lines from source file.
* The comparison below assumes the separator occupies a line by itself.
   Loop
   While ReadSeq Line From hSource
      ReadCount += 1
      If Line = "(***end of report***)"
      Then
* Close current file, generate new name and open new file for writing.
         SeparatorCount += 1
         CloseSeq hTarget
         Counter += 1
         FileName = "MyFile_" : Fmt(Counter,"R%4")
         hTarget = OpenTextFile(FileName, "W", "O", "Y")
         If Not(FileInfo(hTarget,FINFO$IS.FILEVAR))
         Then
            ErrorCode = 1
            GoTo MainExit
         End
      End
      Else
* Not a separator; just write line to current output file.
         WriteSeq Line To hTarget
         Then
            LineCount += 1
         End
      End
   Repeat

MainExit:
   CloseSeq hSource
   CloseSeq hTarget

$IFDEF TESTING
   Message = "Rows read by routine = " : ReadCount
   Message<-1> = "Separators found = " : SeparatorCount
   Message<-1> = "Lines written by routine = " : LineCount
   Message<-1> = "Number of output files = " : Counter
   Call DSLogInfo(Message, "Testing ManyOutputFiles routine")
$ENDIF

RETURN
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 26
- Joined: Thu Feb 03, 2005 1:05 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Try this Routine also!!!
Hi dsquestion,
You could try this also!!!
TestEOF is your input filename. FILEPATH is the directory.
openseq FILEPATH:'/TestEOF' to filehandle else
call DSLogFatal('cannot open TestEOF','TestEOF')
end
Cmd = 'cd ':FILEPATH:'/; wc -l ':'TestEOF'
Call DSExecute("UNIX",Cmd,Output,RetCode)
Call DSLogInfo('command=':Output ,"TestEOF")
tempStr=''
ind=0
fileNum =0
If RetCode = 0 then
NoOfLines = Field(Output," ",1)
Call DSLogInfo('Number of Lines=':NoOfLines ,"Count")
For N=1 TO NoOfLines
readseq A from filehandle else Ans=0
ind = count(A , "###end of report###")
fileNum = fileNum +ind
Call DSLogInfo('ind =':ind ,"Count")
if ind =0 then
tempStr := A:char(10)
Call DSLogInfo('tempStr =':tempStr,"Count")
end else
temp1 = ereplace(A,"###end of report###","")
tempStr := temp1
FileName = 'File':fileNum :'.txt'
Cmd2 = 'cd ':FILEPATH:'/; touch ':FileName :';'
Call DSExecute("UNIX",Cmd2,Output2,RetCode2)
Call DSLogInfo('command=':Cmd2 ,"TestEOF")
Openseq FILEPATH:'/':FileName to tempFile Locked then
WRITESEQ tempStr TO tempFile then Ans = 1
CLOSESEQ tempFile
tempStr=''
end Else
Ans = 0
end
end
Next N
Ans = 1
End
Else
Ans = 0
End
closeseq filehandle
Convert this routine to job control for better performance.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The benefits of documentation, and other grumbles
I invite anyone to compare the two pieces of code. One is indented, spaced and includes comments to explain what is happening in each section. The other lacks these features.
The second piece of code includes invocation of DSLogFatal(), which results in a job aborting. The first does not.
The first will work on any platform; the second uses UNIX-specific commands.
The second unnecessarily processes the source file twice: once (with wc) to get the line count, and a second time to process the lines in a counted loop. The first uses an uncounted loop, exiting when EOF is encountered (that is, when the ReadSeq statement fails to read another line).
The second opens and closes the output file for each input line read (very poor performance) and overwrites anything already in that file, meanwhile accumulating the total file in a string variable. The first opens the output file once, and appends to it, only closing it when the trigger to switch to a new output file name is encountered in the input.
The second continues to process within the For..Next loop even if the ReadSeq statement fails to read another line (its Else clause does not exit from the loop). What will be output in this case?
Which would you prefer to maintain?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ray, I can't disagree as to which code is more lucid. But I would suggest there is an alternative that makes better use of Unix's capabilities (after all, dsquestion has identified that the server is Unix). The following solution is based on an awk script which, to my mind, is cleaner than having to delve into creating a DataStage solution.
Caveat: this is off the top of my head. If it isn't quite right, I'll amend it after I can try it at work.
Undoubtedly someone will suggest a perl script here to do something similar, though more concisely!
Code:
awk '
BEGIN {
    # Set the file number and build the first output file name.
    file_number = 1
    output_to = "file" file_number ".txt"
}
$0 == "(***end of report***)" {
    # Close the finished file, then increment the file number and build the next name.
    close(output_to)
    file_number++
    output_to = "file" file_number ".txt"
    # Don't output the end of report line to the file.
    next
}
{
    # Output the line to the named file.
    print > output_to
}' your_input_file_here
David
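The sample data in the original question shows the marker on the same line as the last record of each report (for example 3(***end of report***)), whereas the awk script above expects the marker on a line of its own. The following is only a sketch of a variant that copes with either layout. The function name split_reports and the optional output prefix argument are hypothetical, chosen for illustration; the marker text is taken from the question.

```shell
# Hypothetical helper: split_reports INPUT_FILE [PREFIX]
# Writes PREFIX1.txt, PREFIX2.txt, ... into the current directory.
# Assumption: the marker is either appended to the last record of a
# report or sits on a line of its own.
split_reports() {
    awk -v prefix="${2:-file}" '
    BEGIN {
        marker = "(***end of report***)"
        file_number = 1
        output_to = prefix file_number ".txt"
    }
    {
        # index() is a plain substring search, so the parentheses and
        # asterisks in the marker need no regex escaping.
        pos = index($0, marker)
        if (pos == 0) {
            print > output_to
            next
        }
        # Keep any record text that precedes the marker.
        if (pos > 1)
            print substr($0, 1, pos - 1) > output_to
        # This report is finished: close its file and move to the next name.
        close(output_to)
        file_number++
        output_to = prefix file_number ".txt"
    }' "$1"
}
```

For example, split_reports your_input_file_here report_ would write report_1.txt, report_2.txt and so on. Closing each file as soon as its report ends keeps only one output file open at a time, which matters when the input holds many reports.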
-
- Participant
- Posts: 26
- Joined: Thu Feb 03, 2005 1:05 am
dsquestion wrote:
Hi All,
Thanks Ray, Ramesh DHL and D for your valuable inputs. Every piece of code that has been posted works well in its own way when tested.
Once again I thank you all for your valuable suggestions.

Hi Ray,
Thanks for your comments!!
I believe this is a knowledge-sharing area where we can share our ideas on different aspects. It is in the poster's interest, and is the poster's job, to tailor the code according to their requirements. There is no point in asking for a code comparison.
-Ramesh
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
We'll have to agree to disagree there. I took the opportunity to highlight more efficient versus less efficient practices. People are always banging on about performance - this is one area that is often neglected.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.