Extract data from a file with a different number of fields in each row
Hi,
I have a situation where I have to extract data from a file whose rows contain independent sets of fields.
ex:
row1 -> rec_type, id,name,qualification
row2 -> rec_type,country,state,place,location
row3 -> rec_type,num_of_rows,file_date,file_time
Depending on the rec_type, I have to load different tables: if rec_type = 'H' (header), the row should go to Table1; if rec_type = 'D', the row should be inserted into Table2; and if rec_type = 'T', the row has to go to Table3.
H -> Header
D -> Detail
T -> Trailer
Please let me know what options should be used in Data Stage
Read the sequential file with a Sequential File stage, with 5 columns specified in the metadata.
Drag it into a Transformer and create three output links pointing to the three different tables.
In the constraint of each link, specify the condition on the rec_type.
And you are done.
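The routing the Transformer does here can be sketched in Python: read each row, inspect the first field (the rec_type), and send it down one of three links. This is only an illustration of the constraint logic; the table names and function name are hypothetical, not from the post.

```python
# Sketch of the Transformer routing: one input, three constrained outputs.
def route_rows(lines):
    """Split a mixed H/D/T file into three row lists (Table1/Table2/Table3)."""
    targets = {"H": [], "D": [], "T": []}
    for line in lines:
        fields = line.rstrip("\n").split(",")
        rec_type = fields[0].strip()
        if rec_type in targets:        # the constraint on each output link
            targets[rec_type].append(fields)
    return targets
```

Rows whose rec_type matches none of the constraints simply fall through, just as they would with three constrained links and no reject link.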
Success consists of getting up just one more time than you fall.
Read the file with two columns only - the first as RecType and the second as AllCols. The second should have type VarChar with a length equal to the maximum record length across H, D and T.
Then in the Transformer, depending on the RecType, create 3 links. Each of these links will carry the column AllCols. On these links, put a Row Splitter stage. This stage will create the required columns from the AllCols input column.
So for a header row, the Row Splitter will take AllCols as input and should create rec_type, id, name, qualification as output.
Follow the same for all and you will get what you want - i.e. H, D and T hitting 3 diff tables.
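The two-column read plus Row Splitter idea can be sketched like this: column 1 is the RecType, column 2 (AllCols) is the rest of the line, and the splitter maps AllCols onto the named columns for that record type. The column names come from the example rows earlier in the thread; the function and dictionary names are illustrative.

```python
# Per-record-type layouts, taken from the example rows in the question.
LAYOUTS = {
    "H": ["id", "name", "qualification"],
    "D": ["country", "state", "place", "location"],
    "T": ["num_of_rows", "file_date", "file_time"],
}

def split_record(line):
    """Two-column read (RecType, AllCols), then Row Splitter on AllCols."""
    rec_type, _, all_cols = line.rstrip("\n").partition(",")
    names = LAYOUTS[rec_type]
    values = all_cols.split(",")
    return rec_type, dict(zip(names, values))
```

Because each record type gets its own link before the split, every splitter only ever sees rows with a consistent field count.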
Regards,
S. Kirtikumar.
I think I should have stressed reading the data from the file more than the logic which follows after reading it.
The issue I am facing now is reading data from a file whose rows consist of different numbers of fields.
H -> 7 Fields
D -> 4 Fields
T -> 8 Fields
Please let me know if there are any suggestions on how to implement this logic.
You could read the sequential file with a Sequential File stage with one column. Set this column to have a delimiter of 000 and a quote of 000.
Connect it to a Transformer and derive the first field of each row into a stage variable.
Have 3 outputs to sequential files, each containing the one row, and set the constraints for each output to be the appropriate value of the stage variable.
Assuming your input is comma delimited:
eg Stage Variable: RecType = input.col[',',1,1]
Constraint for Header output = RecType='H'
That should work (I think).
I should mention that the output files should retain the original format of the input file, but will be separated into header, trailer and detail.
You can then use each file as required.
Hi All,
There is some more logic to be implemented in this job.
1. Need to check if both Trailer (T) and Header (H) records are present in the file.
2. The Detail (D) records should be processed only if the file consists of both Header and Trailer records.
3. The Trailer will have date, time and ID fields (in addition to the summary fields) which will have to be concatenated as ID_date_time and this value has to be used as a separate field in each Detail (D) row.
Please give me ideas to achieve this solution.
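The three requirements above can be sketched together: reject the whole file unless both a header and a trailer are present, and stamp each detail row with an ID_date_time value built from the trailer. The trailer field positions (T, id, date, time, then summary fields) are an assumption for illustration only, since the thread gives two different trailer layouts.

```python
# Sketch: validate H/T presence, then append ID_date_time to each D row.
def process(lines):
    rows = [l.rstrip("\n").split(",") for l in lines]
    header = next((r for r in rows if r[0] == "H"), None)
    trailer = next((r for r in rows if r[0] == "T"), None)
    if header is None or trailer is None:
        return []                                  # reject the whole file
    # Assumed trailer layout: T, id, date, time, ...summary fields
    stamp = "_".join([trailer[1], trailer[2], trailer[3]])  # ID_date_time
    return [r + [stamp] for r in rows if r[0] == "D"]
```

In a job, the empty-result branch would correspond to aborting or routing everything to a reject link rather than loading the details.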
Are there any indicators which link a header record to a trailer record for each set? i.e. is the ID field you mention an identifier for each subset of records?
Or is it based solely on the input order, i.e. are you expecting an input of:
H,info
D,info
D,info
...
T,info
Hi Ray/Shane,
I have just finished working on this job.
The logic I have followed:
1. Read all the columns in a row as a single string into a Transformer (as suggested in the earlier posts by Shane).
2. Use a stage variable to get the record type (the first field in each string record is an identifier which tells us whether the row is a Header/Detail/Trailer).
3. Then I used two Sort stages and a hash file stage after the Transformer. The constraints used are:
a. RecType = 'D' for the 1st Sort
b. RecType = 'T' for the 2nd Sort
c. RecType = 'T' for the hash file stage
4. Join the hash file stage and the Detail Sort data by Source File (passed as a parameter to the job) and insert into Table1 (a constraint ensures that data will be inserted only if there is an entry in the hash file).
5. I used the Sort stage to accumulate data in the buffer rather than processing rows one at a time. This way I can avoid processing detail records before processing the Trailer record.
I hope I have made my ideas clear.
Please correct me if I have to follow a much better approach.
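Step 4 above, the hash-file lookup keyed by source file, can be sketched as follows. The trailer row is written to a lookup keyed on the source file name, and each detail row is kept only if a trailer entry exists for that file; otherwise it would go down a reject link. Names here are illustrative.

```python
# Sketch of the hash-file join: keep a detail row only if the trailer
# for its source file was seen (i.e. there is an entry in the hash file).
def join_details(source_file, details, trailer_lookup):
    """details: list of field lists; trailer_lookup: {source_file: trailer_row}."""
    loaded, rejected = [], []
    for row in details:
        if source_file in trailer_lookup:   # constraint: entry in hash file
            loaded.append(row)
        else:
            rejected.append(row)
    return loaded, rejected
```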
By "Sorter" do you mean a Sort stage?
For efficiency I'd be inclined to pre-process this file to count the rows beginning with "H" and the rows beginning with "T" to make sure that they're the same. This could be done quite quickly in a shell script. Of course, it does not guarantee that there are matching pairs of header and trailer records, but it does let you abandon processing quickly where the counts are not identical.
Since you have not specified your required output, it is impossible to determine whether the remainder of your approach is optimal or not. What do you mean by "buffer"? What happens to the detail records?
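Ray's pre-check could be sketched like this (he suggests a shell script; a Python equivalent is shown here for illustration only): count the rows beginning with "H" and with "T", and abandon processing when the counts differ.

```python
# Sketch of the pre-processing check: equal numbers of header ("H") and
# trailer ("T") rows are required before the main job runs.
def header_trailer_counts_match(lines):
    h = sum(1 for line in lines if line.startswith("H"))
    t = sum(1 for line in lines if line.startswith("T"))
    return h == t
```

As Ray notes, equal counts do not guarantee matching header/trailer pairs, but an unequal count lets you reject the file immediately.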
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Lucky wrote: "I hope I have made my ideas clear. Please correct me if I have to follow a much better approach."
Not really that clear, unfortunately.
I am assuming that you are writing the Trailer record to a hash file and then using some sort of key value to match that trailer record to a detail record. If there is no match, no records are written to your target?
I am also assuming that this gives the desired result?
Something like:

    SeqFile ----> TRF ----> SeqFileDetail ----> TRF ----> Target
                   |                             ^
                   |                             |
                   V                             |
                  HshFileTrailer ----------------+
Where does the header information come in? Do you require the header and trailer information for purposes other than identifying the start and end of a data set?
Ray,
Yes, I meant the Sort stage. I agree with your suggestion about the performance issues if we use a Sort stage, but I am very new to DataStage, still getting used to it and still exploring the options available in DS. I have worked on Informatica, so I am using that knowledge to create jobs in DataStage. Having said that, each tool is different in its own way, as you know.
The important thing is that the file will have only one Header and one Trailer record. So I have to make sure that when I process a file, it contains a header and a trailer with detail records in between. Otherwise I should not process the file and should reject the entire file.
Can you please let me know how we can pre-process the file to check for the header and trailer records?
Shane: Your pictorial representation is absolutely correct as far as processing the detail records is concerned, except that there should be a Sort stage before the SeqFileDetail stage, and yes, this gives me the desired result. If there is no match (the match is done on SRCE_FILE), I write these records to the reject file. I do not need the header information except to check that the file contains Header and Trailer records. But I do need the Trailer information for reconciliation purposes, as it contains the summary information for the Detail records. The trailer record in a file will go to Table2 and all the valid detail records to Table1.
Purpose of checking the Header and Trailer records: to make sure all the detail records are in a single file.
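The reconciliation against the trailer's summary information could be sketched as below. It assumes the trailer layout from the example rows at the top of the thread (T, num_of_rows, file_date, file_time); the actual summary fields may differ.

```python
# Sketch: reconcile the loaded detail count against the trailer's
# num_of_rows summary field (assumed to be the trailer's second field).
def reconcile(detail_rows, trailer_row):
    expected = int(trailer_row[1])
    return len(detail_rows) == expected
```

A mismatch here would indicate that detail records were lost or rejected and the load for that file should be investigated.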