Flattening a fixed width file
Moderators: chulett, rschirm, roy
There are several approaches you can take. What you need to determine first is how you differentiate between your 3 record types. Is the first field of each record type a single digit, "1", "2" or "3"?
If you have such a field then this problem is easily solved using the CFF stage, since this is a typical COBOL type of data layout.
In that case the CFF stage is your answer; it will automatically read each record type correctly. The documentation does only a barely acceptable job of describing how to do this, but it should get you started.
I am trying with a sample job and below is my schema.
record
{record_delim='\n', delim=none, quote=none, binary, ascii, native_endian, charset='ISO-8859-1'}
(
complex_flat_file_1_record_id:string[2] {position=0,link,};
complex_flat_file_1_record_type:tagged {reference=complex_flat_file_1_record_id,position=0} (
RECORD_1:subrec {tagcase=01} (
RecordID1:string[2];
SSN:string[9];
Name:string[30];
);
NEWRECORD:subrec {tagcase=02} (
RecordID2:string[2];
AccountNumber:string[30];
Balance:string[20];
);
NEWRECORD2:subrec {tagcase=03} (
RecordID3:string[2];
Occupation:string[30];
Income:string[20];
);
);
)
While running the job I am getting the warning messages below.
Complex_Flat_File_1,0: Tag value: 01 does not match any tagcase for tagged field "complex_flat_file_1_record_type"
Complex_Flat_File_1,0: Tag value: 02 does not match any tagcase for tagged field "complex_flat_file_1_record_type"
I have searched the forum and found some posts regarding this, but I am not able to understand the solution. Any help is highly appreciated.
Regards,
Kannan
Have you tried adding quotes to the column values in the CFF stage, and does that make a difference?
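For reference, a sketch of what the quoting change looks like in the schema above, so the tagcase values are compared as strings against the string[2] tag field (only the first subrecord is shown; this is an assumption about the intended fix, not a tested schema):

```
complex_flat_file_1_record_id:string[2] {position=0, link};
complex_flat_file_1_record_type:tagged {reference=complex_flat_file_1_record_id, position=0} (
    RECORD_1:subrec {tagcase='01'} (
        RecordID1:string[2];
        SSN:string[9];
        Name:string[30];
    );
    ...
);
```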
Hi,
Thank you, after adding quotes it worked fine when each record type is on a separate row.
Example:
<Row1>01123456789Joe Test 1
<Row2>02Account1 100.00 2
<Row3>03News Anchor 52000.00 3
When there is more than one record type in a row, it reads only the first record type and not the others.
Example:
<Row1>01123456789Joe Test 102Account1
<Row2> 100.00 203News Anchor
<Row3> 52000.00 3
Is it possible to read the above type of record?
Regards,
Kannan
You need to change your file either to have only one record type per row, or to remove the line breaks (with sed or similar) and change the CFF definition as well.
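Because the record types are fixed-width, the reflow can also be scripted outside DataStage. A minimal Python sketch, assuming the layouts in the sample schema (type 01 is 2+9+30 = 41 bytes, types 02 and 03 are 2+30+20 = 52 bytes; the lengths and the `reflow` helper name are assumptions read off that schema, not part of the original job):

```python
# Reflow a fixed-width stream so each record starts on its own line.
# Record lengths are derived from the sample schema: these are assumptions.
RECORD_LENGTHS = {"01": 41, "02": 52, "03": 52}

def reflow(data: str) -> list[str]:
    """Split concatenated fixed-width records into one record per entry."""
    # First drop the line breaks that currently fall mid-record.
    stream = data.replace("\r", "").replace("\n", "")
    records = []
    pos = 0
    while pos < len(stream):
        tag = stream[pos:pos + 2]          # first two chars identify the record type
        length = RECORD_LENGTHS.get(tag)
        if length is None:
            raise ValueError(f"Unknown record type {tag!r} at offset {pos}")
        records.append(stream[pos:pos + length])
        pos += length
    return records
```

Writing the returned records back out one per line gives a file the CFF stage can read with `record_delim='\n'` as before.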
Thanks, ArndW.
I have one more question.
There are some fields defined as DECIMAL in the CFF stage, and when I try to view the data or run the job I get the warning message below.
Field "RECORD_LENGTH" has import error and no default value; data: {0 8 4}, at offset: 0
where RECORD_LENGTH is the first field and defined as DECIMAL(4).
If I change the field definition from DECIMAL to DISPLAY NUMERIC then I can view the data and run the job successfully.
Any suggestion?
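The symptom is consistent with an encoding mismatch: in COBOL-style layouts, DECIMAL typically means packed decimal (COMP-3), while DISPLAY NUMERIC means one ASCII digit per byte, so a field declared as packed cannot import data that was actually written as character digits. A Python sketch contrasting the two encodings for the same value (both helper functions are illustrative assumptions, not part of any DataStage API):

```python
def to_display_numeric(value: int, digits: int) -> bytes:
    """Display numeric: one ASCII digit per byte (unsigned form)."""
    return str(value).zfill(digits).encode("ascii")

def to_packed_decimal(value: int, digits: int) -> bytes:
    """Packed decimal (COMP-3): two digits per byte, sign nibble last."""
    s = str(value).zfill(digits)
    nibbles = [int(c) for c in s] + [0xC]   # 0xC = positive sign
    if len(nibbles) % 2:
        nibbles.insert(0, 0)                # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
```

The same 4-digit value occupies 4 bytes of ASCII digits in display form but only 3 bytes of binary nibbles in packed form, so a DECIMAL field reading display-numeric bytes sees garbage and raises the import error above.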
Regards,
Kannan
Kannan
There is probably a copy stage in the stream that has been removed at runtime by the parallel engine. When that happens the link you see in the job design no longer exists and therefore no record count can be provided.
FYI: The CFF stage, when your job is compiled, is converted into multiple operators based upon the options and record layouts within CFF. You can see this by looking at the Generated OSH after you compile your job.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.