Routines to load data in to a sequential file

dsquestion · Post by **dsquestion** » Wed Nov 30, 2005 1:10 am

Hi all,

For instance, the sequential file contains 10 records:

1
2
3(***end of report***)
4
5
6(***end of report***)
7
8
9
10(***end of report***)

As shown above, records 1 to 3 should go into a separate file (such as file1.txt), records 4 to 6 into another separate file, and so on. Records should keep going into the current file until ***end of report*** is found, after which a new file is started.
In my scenario the record count may be 1000 or more.

Comments appreciated.

ray.wurlod · Post by **ray.wurlod** » Wed Nov 30, 2005 1:43 am

Create a stage variable in a Transformer stage to count the number of times the string has been seen.

Use constraint expressions to direct output accordingly. Rows go to the first file (output link) if the count is 0, to the second file if the count is 1, and so on.
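
A minimal sketch of that routing logic, written in Python rather than DataStage purely to illustrate the idea: a counter plays the role of the stage variable, and each row is tagged with the index of the output link it would go to. The name `route_rows` and the exact-match test on the marker are assumptions made for the illustration, not part of any DataStage API.

```python
END_MARKER = "(***end of report***)"  # marker text taken from the question


def route_rows(rows, marker=END_MARKER):
    """Yield (link_index, row) pairs; link_index is the number of markers seen so far."""
    seen = 0  # plays the role of the stage variable
    for row in rows:
        if row == marker:
            seen += 1  # the marker row itself is not passed to any output
            continue
        yield seen, row


rows = ["1", "2", "3", END_MARKER, "4", "5", "6", END_MARKER, "7"]
routed = list(route_rows(rows))
```

Rows tagged 0 would go to the first output link, rows tagged 1 to the second, and so on, which is why the approach becomes tedious when the number of files is large.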

For large numbers of files this approach is tedious. You could write everything to one file then use an after-job subroutine to write separate files. But what is the file naming convention?

Code: Select all

SUBROUTINE ManyOutputFiles(SourceFile, ErrorCode)
DEFFUN OpenTextFile(FileName, OpenMode, WriteMode, Logging) CALLING "DSU.OpenTextFile"
$INCLUDE UNIVERSE.INCLUDE FILEINFO.H
$UNDEFINE TESTING

ErrorCode = 0
ReadCount = 0
LineCount = 0
SeparatorCount = 0

* Open source file for reading.
hSource = OpenTextFile((SourceFile), "R", "A", "Y")
If Not(FileInfo(hSource, FINFO$IS.FILEVAR)) 
Then
   ErrorCode = 1
   GoTo MainExit
End

* Initialize file name counter component.  Open first file for writing.
Counter = 1
FileName = "MyFile_" : Fmt(Counter,"R%4")
hTarget = OpenTextFile(FileName, "W", "O", "Y")
If Not(FileInfo(hTarget,FINFO$IS.FILEVAR)) 
Then 
   ErrorCode = 1
   GoTo MainExit
End

* Main loop reads lines from source file
Loop
While ReadSeq Line From hSource

   ReadCount += 1

   If Line = "(***end of report***)"
   Then

      * Close current file, generate new name and open new file for writing.
      SeparatorCount += 1
      CloseSeq hTarget
      Counter += 1
      FileName = "MyFile_" : Fmt(Counter,"R%4")
      hTarget = OpenTextFile(FileName, "W", "O", "Y")
      If Not(FileInfo(hTarget,FINFO$IS.FILEVAR)) 
      Then 
         ErrorCode = 1
         GoTo MainExit
      End

   End
   Else

      * Not a separator; just write line to current output file.
      WriteSeq Line To hTarget
      Then
         LineCount += 1
      End

   End

Repeat

MainExit:
CloseSeq hSource
CloseSeq hTarget

$IFDEF TESTING
   Message = "Rows read by routine = " : ReadCount
   Message<-1> = "Separators found = " : SeparatorCount
   Message<-1> = "Lines written by routine = " : LineCount
   Message<-1> = "Number of output files = " : Counter
   Call DSLogInfo(Message, "Testing ManyOutputFiles routine")
$ENDIF

RETURN
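
For readers without a DataStage environment to hand, the same single-pass idea can be sketched in Python. This is an illustration under stated assumptions, not a translation of the routine: `split_reports` is a hypothetical name, the `MyFile_0001` naming simply mirrors the counter format used above, and the sketch opens each output file lazily, so a marker on the last line does not leave an empty trailing file (the routine above opens the next file as soon as it sees a marker).

```python
import os
import tempfile

END_MARKER = "(***end of report***)"  # marker text taken from the question


def split_reports(source_path, target_dir, prefix="MyFile_"):
    """Split source_path into numbered files, one per report; return the names written."""
    written = []
    counter = 1
    target = None
    try:
        with open(source_path) as source:
            for line in source:  # uncounted loop: stops at end of file
                text = line.rstrip("\n")
                if text == END_MARKER:
                    # Marker: close the current file and move to the next name.
                    if target is not None:
                        target.close()
                        target = None
                    counter += 1
                    continue
                if target is None:
                    # Open the output only when there is a line to write.
                    name = f"{prefix}{counter:04d}"
                    target = open(os.path.join(target_dir, name), "w")
                    written.append(name)
                target.write(text + "\n")
    finally:
        if target is not None:
            target.close()
    return written


def demo():
    """Run the splitter on a small sample and return {file name: contents}."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "source.txt")
        with open(src, "w") as f:
            f.write("\n".join(["1", "2", END_MARKER, "3", END_MARKER]) + "\n")
        result = {}
        for name in split_reports(src, work):
            with open(os.path.join(work, name)) as f:
                result[name] = f.read()
        return result
```

Calling `demo()` writes a five-line sample into a temporary directory and returns the contents of each file produced.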

dsquestion · Post by **dsquestion** » Wed Nov 30, 2005 2:07 am

Hi Ray,

Thanks for the solution.
There is no standard naming convention for the files, so that is not an issue. Let me try your solution.

ray.wurlod · Post by **ray.wurlod** » Wed Nov 30, 2005 4:14 am

You'll need to search for the OpenTextFile function - it may have been posted as OpenSequentialFile.

rameshDHL · Post by **rameshDHL** » Sat Dec 03, 2005 4:54 am

Hi dsquestion,

You could try this also!!!
TestEOF is your input filename. FILEPATH is the directory.

openseq FILEPATH:'/TestEOF' to filehandle else
call DSLogFatal('cannot open TestEOF','TestEOF')
end

Cmd = 'cd ':FILEPATH:'/; wc -l ':'TestEOF'
Call DSExecute("UNIX",Cmd,Output,RetCode)
Call DSLogInfo('command=':Output ,"TestEOF")
tempStr=''
ind=0
fileNum =0
If RetCode = 0 then
NoOfLines = Field(Output," ",1)
Call DSLogInfo('Number of Lines=':NoOfLines ,"Count")
For N=1 TO NoOfLines
readseq A from filehandle else Ans=0
ind = count(A , "###end of report###")
fileNum = fileNum +ind
Call DSLogInfo('ind =':ind ,"Count")

if ind =0 then
tempStr := A:char(10)

Call DSLogInfo('tempStr =':tempStr,"Count")

end else
temp1 = ereplace(A,"###end of report###","")
tempStr := temp1
FileName = 'File':fileNum :'.txt'

Cmd2 = 'cd ':FILEPATH:'/; touch ':FileName :';'
Call DSExecute("UNIX",Cmd2,Output2,RetCode2)
Call DSLogInfo('command=':Cmd2 ,"TestEOF")
Openseq FILEPATH:'/':FileName to tempFile Locked then
WRITESEQ tempStr TO tempFile then Ans = 1
CLOSESEQ tempFile
tempStr=''
end Else
Ans = 0
end

end
Next N
Ans = 1

End

Else

Ans = 0

End

closeseq filehandle

Convert this routine to job control so that performance will be good.
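
The routine above looks for the marker inside each record rather than as a line of its own, which matches the sample data in the question (for example `3(***end of report***)`). A hedged Python sketch of that variant follows. `group_records` is a hypothetical helper, and it uses the marker text from the question rather than the `###end of report###` string in the routine.

```python
END_MARKER = "(***end of report***)"  # marker text taken from the question


def group_records(lines, marker=END_MARKER):
    """Return a list of groups; a line containing the marker ends its group."""
    groups, current = [], []
    for line in lines:
        if marker in line:
            record = line.replace(marker, "")  # strip the marker, keep the record
            if record:
                current.append(record)
            groups.append(current)
            current = []
        else:
            current.append(line)
    if current:
        groups.append(current)  # trailing records with no closing marker
    return groups


sample = ["1", "2", "3" + END_MARKER, "4", "5", "6" + END_MARKER, "7"]
groups = group_records(sample)
```

Each inner list would then be written to its own file. Keeping the grouping separate from the file I/O makes the marker handling easy to check on its own.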

ray.wurlod · Post by **ray.wurlod** » Sat Dec 03, 2005 9:11 pm

I invite anyone to compare the two pieces of code. One is indented, spaced and includes comments to explain what is happening in each section. The other lacks these features.

The second piece of code includes invocation of DSLogFatal(), which results in a job aborting. The first does not.

The first will work on any platform; the second uses UNIX-specific commands.

The second unnecessarily processes the source file twice - once (with wc) to get the line count, and a second time to process the lines in a counted loop. The first uses an uncounted loop, exiting when EOF is encountered (that is, when the ReadSeq statement fails to read another line).

The second opens and closes the output file for each input line read (very poor performance) and overwrites anything already in that file, meanwhile accumulating the total file in a string variable. The first opens the output file once, and appends to it, only closing it when the trigger to switch to a new output file name is encountered in the input.

The second continues to process within the For..Next loop even if the ReadSeq statement fails to read another line (its Else clause does not exit from the loop). What will be output in this case?

Which would you prefer to maintain?

djm · Post by **djm** » Sun Dec 04, 2005 2:05 am

Ray, I can't disagree as to which code is more lucid. But I would suggest there is an alternative that makes better use of Unix's capabilities (after all, dsquestion has identified that the server is Unix). The following solution is based on an awk script which, to my mind, is cleaner than having to delve into creating a DataStage solution.

Caveat: this is off the top of my head. If it isn't quite right, I'll amend it after I can try it at work.

Code: Select all

awk '
BEGIN \
{
    # Set the file number and build the first output file name.
  file_number = 1;
  output_to = "file" file_number ".txt"
}

($0 == "(***end of report***)") \
{
    # Close the finished file so awk does not run out of open file handles.
  close(output_to);
    # Increment the file number and build the next file name.
  file_number++;
  output_to = "file" file_number ".txt"
    # Don't output the end of report line to the file.
  next;
}
{
    # Output the line to the named file.
  print > output_to;
}' your_input_file_here

Undoubtedly someone will suggest a perl script here to do something similar, though more concisely!

David

dsquestion · Post by **dsquestion** » Tue Dec 06, 2005 6:59 am

Hi All,

Thanks Ray, Ramesh DHL and D for your valuable inputs. Every piece of code that has been posted works well in its own way when tested.

Once again I thank you all for your valuable suggestions.

rameshDHL · Post by **rameshDHL** » Tue Dec 06, 2005 12:05 pm

dsquestion wrote:Hi All,

Thanks Ray, Ramesh DHL and D for your valuable inputs. Every piece of code that has been posted works well in its own way when tested.

Once again I thank you all for your valuable suggestions.

Hi Ray,

Thanks for your comments!!

I believe this is a knowledge-sharing area where we can share ideas from different perspectives. It is up to each poster to tailor the code to their own requirements. There is no point in asking for a code comparison.

-Ramesh

ray.wurlod · Post by **ray.wurlod** » Tue Dec 06, 2005 1:41 pm

We'll have to agree to disagree there. I took the opportunity to highlight more efficient versus less efficient practices. People are always banging on about performance - this is one area that is often neglected.
