I had a discussion with Sandy about her need to read a sequential file in BASIC. She needs to split a source file into an indeterminate number of output files based on a column of data within each row. I'm going to share an after-job routine I wrote that splits a load file into multiple output files based on the YearMonth of a business date within each row. Since target tables are usually partitioned on a business date, in month ranges, this routine is directly applicable to partitioned loading.
I'll briefly summarize the Oracle DBA design work for data warehouses here. Large fact tables are partitioned into month ranges. ALL indexes are local, meaning that no index spans data outside of its own partition. There is no global primary key; the primary key index is local, and must include the partitioning column as part of the primary key definition.
That being said, when you DIRECT path load a table like this from a single file, the database must determine, for every row, which partition it belongs in. In addition, after all rows have been loaded, the database then incrementally maintains the indexes within each partition, which is a rather lengthy process because it all has to be done as a single exercise. If you instead pre-sort your data into one file per partition, you can run N background sqlldr DIRECT path loads, and since each load only manages the work for its own partition, you gain a high degree of parallelism and performance.
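To illustrate that last point, here is a minimal sketch (not part of the routine below) of how the per-partition files could be swept into background DIRECT path loads from DataStage BASIC. The directory, control file, and userid are placeholders, and in practice each file's control file would name its own target partition.
Code: Select all
* Hypothetical sketch only: launch one background sqlldr DIRECT path load per split file.
* LoadDirectory, the control file, and the userid string are placeholders.
LoadDirectory = "/data/loads"
Call DSExecute("UNIX", "ls ":LoadDirectory:"/*.ins", FileList, ReturnCode)
For FileNo = 1 To DCOUNT(FileList, @FM)
LoadFile = TRIM(FileList<FileNo>)
If LoadFile # "" Then
UNIXcmd = "nohup sqlldr userid=etl/etl@dw control=order_details.ctl"
UNIXcmd := " data=":LoadFile:" direct=true > ":LoadFile:".log 2>&1 &"
Call DSExecute("UNIX", UNIXcmd, ScreenOutput, ReturnCode)
End
Next FileNo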
The routine I'm supplying has some tricks to learn from simple observation. It pretty much runs the gamut: opening a sequential file, reading a hash file into arrays in memory, and dynamically creating output sequential files. My last client uses this on a large table that required partitioned loading. I set up the ETL control table for managing partitions and their loading mechanisms. This particular table is used to map months to partitions, and a partition can live in a main or a history table, as well as have other features.
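For orientation, here is roughly what the routine expects to find in the stgETL_CTRL_PARTITIONS hashed file it reads below. The values are invented for illustration; the key is read as month then logical table name, and the month value must match the YYYY-MM string the routine later derives from each row's business date.
Code: Select all
Key: Month   LogicalTable      Rec: PhysicalTable        Partition          ReadOnly
2004-01      ORDER_DETAILS          ORDER_DETAILS_HIST   ORD_DTL_2004_01    Y
2004-02      ORDER_DETAILS          ORDER_DETAILS_MAIN   ORD_DTL_2004_02    N
2004-03      ORDER_DETAILS          ORDER_DETAILS_MAIN   ORD_DTL_2004_03    N
With data like the above, January's rows would land in ORD_DTL_2004_01-READONLY.ins and be left alone by the auto-loader, while the other months split out into their own load files.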
Without further ado...
Code: Select all
FunctionName = "FileSplitterByMonth"
*
* Dimension to hold the number of output file handles (1 per month)
DIM OutputFileHandles(100)
*
* InputArg looks like:
* WorkHashFileDirectory,/pathtofile/filename,{PARTITION/blank}, date column ordinal location,{DEBUG}
*
*
CommonHashFileDirectory = TRIM(FIELD(InputArg,",",1))
InputFileName = TRIM(FIELD(InputArg,",",2))
OutputFileBaseName = FIELD(InputFileName, "/", 1, DCOUNT(InputFileName,"/")-1)
UsePartitionName = (TRIM(FIELD(InputArg,",",3)) = "PARTITION")
DateColumnLocation = TRIM(FIELD(InputArg,",",4))
DEBUGGING = (TRIM(FIELD(InputArg,",",5)) = "DEBUG")
WorkFileName = InputFileName:".bak"
*
* Pull the table name from the filename
*
TableToProcess = UPCASE(TRIM(FIELD(FIELD(InputFileName, ".", 1), "/", DCOUNT(InputFileName,"/"))))
*
* Look at the extension; it's either ".ins" or ".upd"
*
LoadType = TRIM(FIELD(InputFileName, ".", 2))
*
* Make a backup copy of original file
*
UNIXcmd = "cp ":InputFileName:" ":WorkFileName
If DEBUGGING Then Call DSLogInfo("UNIX command: ":UNIXcmd, FunctionName)
Call DSExecute("UNIX", UNIXcmd, ScreenOutput, SystemReturnCode)
If DEBUGGING Then Call DSLogInfo("UNIX output: ":UNIXcmd:@AM:ScreenOutput, FunctionName)
If SystemReturnCode > 0 Then
Call DSLogFatal("The script [":UNIXcmd:"] had a SystemReturnCode of [":SystemReturnCode:"]", FunctionName)
End
Txt = ""
Txt<-1> = "CommonHashFileDirectory: ":CommonHashFileDirectory
Txt<-1> = "InputFileName: ":InputFileName
Txt<-1> = "UsePartitionName: ":UsePartitionName
Txt<-1> = "TableToProcess: ":TableToProcess
Txt<-1> = "LoadType: ":LoadType
Txt<-1> = "WorkFileName: ":WorkFileName
If DEBUGGING Then Call DSLogInfo(Txt, FunctionName)
OutputFileDates = ""
OutputFileNames = ""
OutputFileHandlePointers = ""
OutputFileNamesUniq = ""
OutputFileHandleCount = 0
*
* Read the staging hash file that contains the contents of warehouse table ETL_CTRL_PARTITIONS.
* This table's primary key is table name and YYYYMM. Every fact table is partitioned by the business date
* month of the fact. All inserts will be DIRECT path loaded into the named partition after separation.
* Every partition can be marked by DBAs as read-only in this table. We still create
* a load file for each such month, but add "-READONLY" to the filename so that it will not be grabbed by the
* auto-loader that sweeps all load files into partitions in parallel.
*
OPENPATH CommonHashFileDirectory:"/stgETL_CTRL_PARTITIONS" TO F.ETL_CTRL_PARTITIONS Else
Call DSLogFatal("Unable to open stgETL_CTRL_PARTITIONS", FunctionName)
End
OPENSEQ WorkFileName TO InputFileHandle Then
If DEBUGGING Then Call DSLogInfo("Splitting file: ":InputFileName, FunctionName)
SELECT F.ETL_CTRL_PARTITIONS
*
* Load the table into memory. Build a series of arrays to hold months and file names. Since a
* table can have a logical name "ORDER_DETAILS" but span a mainline and a history table
* ("ORDER_DETAILS_MAIN" and "ORDER_DETAILS_HIST"), we include both a logical and a physical name
* for the table that holds that month's partition. A month may be in the main table today, but be
* moved next week via transportable tablespace/partition exchange into the history table. All
* subsequent loads need to be directed there.
*
cntr = 0
Done = @FALSE
Loop
Readnext ETL_CTRL_PARTITIONSkey Else Done = @TRUE
Until Done Do
Read row From F.ETL_CTRL_PARTITIONS,ETL_CTRL_PARTITIONSkey Then
YYYYMM = FIELD(ETL_CTRL_PARTITIONSkey, @TM, 1)
LogicalTableName = FIELD(ETL_CTRL_PARTITIONSkey, @TM, 2)
IF LogicalTableName = TableToProcess Then
PhysicalTableName = UPCASE(TRIM(row<1>))
PartitionName = UPCASE(TRIM(row<2>))
Readonly = row<3>
Begin Case
Case Readonly = "Y"
FileNameToUse = PartitionName:"-READONLY.":LoadType
Case UsePartitionName
FileNameToUse = PartitionName:".":LoadType
Case @TRUE
FileNameToUse = PhysicalTableName:".":LoadType
*
* If using DataStage to do loads, we need to make sure the files are there for the jobs
*
UNIXcmd = "touch ":OutputFileBaseName:"/":FileNameToUse
If DEBUGGING Then Call DSExecute("UNIX", UNIXcmd, ScreenOutput, SystemReturnCode)
End Case
cntr += 1
OutputFileDates<cntr> = YYYYMM
OutputFileNames<cntr> = FileNameToUse
End
End
Repeat
*
* Parse the pipe-delimited data file. Get the business date (YYYY-MM-DD) from each row and work out
* which file it goes into. Find that file's handle and write the row to it. If the month doesn't have
* an open file handle yet, go to the subroutine that creates and opens one.
*
Done = @FALSE
OutputFileHandleCount = 0
Loop
Until Done Do
READSEQ line From InputFileHandle Then
DateCheck = FIELD(line, "|", DateColumnLocation)
* The business date may arrive as YYYY-MM-DD; strip any dashes before slicing out year and month
DateCheck = EREPLACE(DateCheck, "-", "")
YYYYMM = DateCheck[1,4]:"-":DateCheck[5,2]
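* Note: a month that is not present in the control data fails the Locate (there is no Else branch),
* so that row is silently skipped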
Locate YYYYMM IN OutputFileDates SETTING PTR Then
OutputFileHandlePtr = OutputFileHandlePointers<PTR>
If OutputFileHandlePtr = "" Then Gosub CreateOutputFileHandle
WRITESEQ line TO OutputFileHandles(OutputFileHandlePtr) Else
Call DSLogFatal("Could not update file", FunctionName)
End
End
End Else
Done = @TRUE
End
Repeat
For HandlePtr = 1 To OutputFileHandleCount
WEOFSEQ OutputFileHandles(HandlePtr)
CLOSESEQ OutputFileHandles(HandlePtr)
Next HandlePtr
CLOSESEQ InputFileHandle
End Else
Call DSLogWarn("Unable to open file: ":WorkFileName, FunctionName)
End
ErrorCode = 0
Goto Fini
CreateOutputFileHandle:
*
* OutputFileNames<PTR> does not have a handle, so open the file else create it then open it
*
ValidOpen = @FALSE
NewFileName = OutputFileBaseName:"/":OutputFileNames<PTR>
OPENSEQ NewFileName TO OutputFileHandle Then
ValidOpen = @TRUE
End Else
If DEBUGGING Then Call DSLogInfo("Creating output file: ":NewFileName, FunctionName)
CREATE OutputFileHandle Then ValidOpen = @TRUE
End
If NOT(ValidOpen) Then
Call DSLogFatal("Could not create file: ":NewFileName, FunctionName)
End
If DEBUGGING Then Call DSLogInfo("Created output file: ":NewFileName, FunctionName)
OutputFileHandleCount += 1
OutputFileHandles(OutputFileHandleCount) = OutputFileHandle
OutputFileHandlePtr = OutputFileHandleCount
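*
* Several months can share one physical output file (e.g. every month that lives in the history
* table), so point all months with this file name at the newly opened handle
*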
For i = 1 To cntr
If OutputFileNames<i> = OutputFileNames<PTR> Then OutputFileHandlePointers<i> = OutputFileHandlePtr
Next i
Return
Fini:
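For completeness, the routine runs as an after-job subroutine, with everything packed into the single InputArg (the job's Input Value field). A call for the table used in the examples above might look like the following; the directory, file name, and column ordinal are invented for illustration:
Code: Select all
/dsproj/hashfiles,/data/loads/order_details.ins,PARTITION,14,DEBUG
That would split /data/loads/order_details.ins into one file per business-date month, named after the partition each month belongs to (with "-READONLY" appended for months the DBAs have locked), ready for the parallel DIRECT path loads described earlier.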