Flattening a fixed width file
Moderators: chulett, rschirm, roy
There are several approaches you can take. What you need to determine first is how you differentiate between your 3 record types. Is the first field of each record type a single digit, "1", "2" or "3"?
If you have such a field then this problem is easily solved using the CFF stage, since this is a typical COBOL type of data layout.
In that case the CFF stage is your answer; it will automatically read each record type correctly. The documentation does only a barely acceptable job of describing how to do this, but it should get you started.
I am trying with a sample job and below is my schema.
record
{record_delim='\n', delim=none, quote=none, binary, ascii, native_endian, charset='ISO-8859-1'}
(
complex_flat_file_1_record_id:string[2] {position=0,link,};
complex_flat_file_1_record_type:tagged {reference=complex_flat_file_1_record_id,position=0} (
RECORD_1:subrec {tagcase=01} (
RecordID1:string[2];
SSN:string[9];
Name:string[30];
);
NEWRECORD:subrec {tagcase=02} (
RecordID2:string[2];
AccountNumber:string[30];
Balance:string[20];
);
NEWRECORD2:subrec {tagcase=03} (
RecordID3:string[2];
Occupation:string[30];
Income:string[20];
);
);
)
While running the job I am getting the warning messages below.
Complex_Flat_File_1,0: Tag value: 01 does not match any tagcase for tagged field "complex_flat_file_1_record_type"
Complex_Flat_File_1,0: Tag value: 02 does not match any tagcase for tagged field "complex_flat_file_1_record_type"
I have searched the forum and found some posts regarding this, but I am not able to understand the solution. Any help is highly appreciated.
Regards,
Kannan
Have you tried adding quotes to the column values in the CFF stage, and does that make a difference?
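For reference, a sketch of what the quoting change looks like in the schema above, so the tagcase values are compared as strings against the string[2] tag field (only the first subrecord is shown; this is an assumption about the intended fix, not a tested schema):

```
complex_flat_file_1_record_id:string[2] {position=0, link};
complex_flat_file_1_record_type:tagged {reference=complex_flat_file_1_record_id, position=0} (
    RECORD_1:subrec {tagcase='01'} (
        RecordID1:string[2];
        SSN:string[9];
        Name:string[30];
    );
    ...
);
```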
Hi,
Thank you, after adding quotes it worked fine when each record type is on a separate row.
Example:
<Row1>01123456789Joe Test 1
<Row2>02Account1 100.00 2
<Row3>03News Anchor 52000.00 3
When there is more than one record type in a row, it reads only the first record type and not the others.
Example:
<Row1>01123456789Joe Test 102Account1
<Row2> 100.00 203News Anchor
<Row3> 52000.00 3
Is it possible to read the above type of record?
Regards,
Kannan
You need to change your file either to have only one record type per row, or to remove the line breaks (with sed or similar) and change the CFF definition as well.
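Because the record types are fixed-width, the reflow can also be scripted outside DataStage. A minimal Python sketch, assuming the layouts in the sample schema (type 01 is 2+9+30 = 41 bytes, types 02 and 03 are 2+30+20 = 52 bytes; the lengths and the `reflow` helper name are assumptions read off that schema, not part of the original job):

```python
# Reflow a fixed-width stream so each record starts on its own line.
# Record lengths are derived from the sample schema: these are assumptions.
RECORD_LENGTHS = {"01": 41, "02": 52, "03": 52}

def reflow(data: str) -> list[str]:
    """Split concatenated fixed-width records into one record per entry."""
    # First drop the line breaks that currently fall mid-record.
    stream = data.replace("\r", "").replace("\n", "")
    records = []
    pos = 0
    while pos < len(stream):
        tag = stream[pos:pos + 2]          # first two chars identify the record type
        length = RECORD_LENGTHS.get(tag)
        if length is None:
            raise ValueError(f"Unknown record type {tag!r} at offset {pos}")
        records.append(stream[pos:pos + length])
        pos += length
    return records
```

Writing the returned records back out one per line gives a file the CFF stage can read with `record_delim='\n'` as before.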
Thanks, ArndW.
I have one more question.
There are some fields defined as DECIMAL in the CFF stage, and when I try to view the data or run the job I get the warning message below.
Field "RECORD_LENGTH" has import error and no default value; data: {0 8 4}, at offset: 0
where RECORD_LENGTH is the first field and defined as DECIMAL(4).
If I change the field definition from DECIMAL to DISPLAY NUMERIC then I can view the data and run the job successfully.
Any suggestion?
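The symptom is consistent with an encoding mismatch: in COBOL-style layouts, DECIMAL typically means packed decimal (COMP-3), while DISPLAY NUMERIC means one ASCII digit per byte, so a field declared as packed cannot import data that was actually written as character digits. A Python sketch contrasting the two encodings for the same value (both helper functions are illustrative assumptions, not part of any DataStage API):

```python
def to_display_numeric(value: int, digits: int) -> bytes:
    """Display numeric: one ASCII digit per byte (unsigned form)."""
    return str(value).zfill(digits).encode("ascii")

def to_packed_decimal(value: int, digits: int) -> bytes:
    """Packed decimal (COMP-3): two digits per byte, sign nibble last."""
    s = str(value).zfill(digits)
    nibbles = [int(c) for c in s] + [0xC]   # 0xC = positive sign
    if len(nibbles) % 2:
        nibbles.insert(0, 0)                # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
```

The same 4-digit value occupies 4 bytes of ASCII digits in display form but only 3 bytes of binary nibbles in packed form, so a DECIMAL field reading display-numeric bytes sees garbage and raises the import error above.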
Regards,
Kannan
Kannan
There is probably a copy stage in the stream that has been removed at runtime by the parallel engine. When that happens the link you see in the job design no longer exists and therefore no record count can be provided.
FYI: The CFF stage, when your job is compiled, is converted into multiple operators based upon the options and record layouts within CFF. You can see this by looking at the Generated OSH after you compile your job.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.