Schema File for Fixed width Files

bart12872 · Post by **bart12872** » Wed Nov 27, 2013 12:07 pm

It should.
Develop step by step.
First step: validate your schema file without RCP, to make sure that you read your file correctly.
You can use import table definition and look at the parallel layout it generates.
Then build your job with RCP; it should work.

eph · Post by **eph** » Thu Nov 28, 2013 4:41 am

Hi,

First, you cannot use UTF-8 for a fixed-width file. UTF-8 is not a fixed-length encoding: it uses 1 to 4 bytes per character. Your final_delim, for instance, is encoded in 2 bytes, while ASCII characters are encoded in 1. You should use another encoding (or convert the file).
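A quick way to see the variable width is to encode a few characters and count the bytes. This is only a sketch in Python, outside DataStage; the sample characters and the two 5-character field values are chosen for illustration and do not come from the original file.

```python
# Count how many bytes each sample character needs in UTF-8.
samples = ["A", "\u00bf", "\u00e9", "\u20ac", "\U0001F600"]
utf8_widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, width in utf8_widths.items():
    print(f"U+{ord(ch):04X} -> {width} byte(s)")

# Two hypothetical values for the same 5-character field.
field_ascii = "HELLO"
field_accented = "H\u00c9LLO"

# Same character count, different byte count once encoded as UTF-8.
ascii_bytes = len(field_ascii.encode("utf-8"))
accented_bytes = len(field_accented.encode("utf-8"))
print(ascii_bytes, accented_bytes)
```

A reader that cuts records at fixed byte offsets would therefore drift by one byte after the accented value.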

See this technote from IBM

Eric

ArndW · Post by **ArndW** » Thu Nov 28, 2013 4:58 am

It is true that a UTF-8 file may have variable-length characters, but it does not necessarily contain any multibyte values. The standard LATIN-1 characters are mapped to a single byte, making UTF-8 compatible with ASCII for those characters.

This means that UTF-8 can be used for sequential files that contain only single-byte characters, otherwise it may not be used for fixed-length data.
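One way to check whether a given file falls into the safe case is to scan it for bytes at or above 0x80, since every byte of a multibyte UTF-8 sequence has its high bit set. The sketch below is in Python, not DataStage; the function name and the sample records are hypothetical.

```python
def multibyte_line_numbers(data: bytes) -> list:
    """Return the 1-based numbers of lines containing any byte >= 0x80."""
    flagged = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        # Pure 7-bit ASCII never produces a byte with the high bit set.
        if any(byte >= 0x80 for byte in line):
            flagged.append(number)
    return flagged


# Hypothetical sample: the second record contains one non-ASCII letter.
sample = "ABC   123\nSE\u00d1OR 456\nXYZ   789\n".encode("utf-8")
flagged = multibyte_line_numbers(sample)
print(flagged)
```

An empty result means every character occupies one byte, so byte offsets and character offsets coincide.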

Addendum: I initially thought that the turned question mark was mapped to the single byte 0xBF, but that is its Unicode code point; in UTF-8 that character is encoded as 2 bytes, as previously noted. I believe that, since this character is not part of the fixed-length data, the file will still be read correctly.

ArndW · Post by **ArndW** » Thu Nov 28, 2013 7:27 am

eph - Thanks for the correction, you are right. I incorrectly stated that the standard LATIN-1 characters are represented in 1 byte.

I should have said that the first group of LATIN-1 (0x00 through 0x7F, i.e. 7-bit) is one byte; those are the initial ASCII characters A-Z, a-z, 0-9, and a basic set of punctuation symbols. The rest of the LATIN-1 set (0x80 through 0xFF) gets mapped to more than 1 byte.
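The split between the two halves of LATIN-1 can be verified directly. A small Python sketch (not DataStage) that encodes every code point in each range and collects the byte counts:

```python
# 7-bit range: code points 0x00 through 0x7F.
ascii_widths = {len(chr(cp).encode("utf-8")) for cp in range(0x00, 0x80)}

# Upper half of LATIN-1: code points 0x80 through 0xFF.
upper_latin1_widths = {len(chr(cp).encode("utf-8")) for cp in range(0x80, 0x100)}

print(ascii_widths, upper_latin1_widths)
```

Each set holds a single value: one byte for the 7-bit range and two bytes for the upper half.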

prahul4all · Post by **prahul4all** » Mon Sep 08, 2014 5:16 am

http://www.idatastage.com/schemafile-in-datstage/

kduke · Post by **kduke** » Mon Sep 08, 2014 5:47 am

I don't know who owns idatastage.com but they used an IBM trademark as part of their name. Not smart.

hsahay · Post by **hsahay** » Thu Feb 05, 2015 3:48 pm

A UTF-8 file can be converted to a UTF-16LE file by reading each entire line of the UTF-8 file as one field and then writing it out as UTF-16LE.

Then use a schema file for UTF-16LE, as below:

// Fixed width file
record
{record_delim='\n',final_delim=none,delim=none,quote=none,charset='UTF-16LE'}
(
ServiceName:USTRING[40];
ServiceUrl:USTRING[40];
OrigTitle:USTRING[35];
Junk:USTRING[35];
OrigWriter:USTRING[35];
OrigArtist:USTRING[35];
UseType:USTRING[2];
PerfType:USTRING[2];
PerfSttDt:USTRING[8];
Duration:USTRING[4];
Plays:USTRING[9];
)
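To make the layout concrete, here is a Python sketch (not DataStage) that re-encodes a UTF-8 line as UTF-16LE and slices it with the field widths from the schema above. It assumes each width counts 16-bit code units and that no field contains a surrogate pair; the helper names and the sample values are hypothetical.

```python
# Field names and widths copied from the schema above.
FIELDS = [
    ("ServiceName", 40), ("ServiceUrl", 40), ("OrigTitle", 35), ("Junk", 35),
    ("OrigWriter", 35), ("OrigArtist", 35), ("UseType", 2), ("PerfType", 2),
    ("PerfSttDt", 8), ("Duration", 4), ("Plays", 9),
]
BYTES_PER_UNIT = 2  # assumption: one 16-bit code unit per character
RECORD_UNITS = sum(width for _, width in FIELDS)


def utf8_line_to_utf16le(line: bytes) -> bytes:
    """Re-encode one UTF-8 line (without its newline) as UTF-16LE."""
    return line.decode("utf-8").encode("utf-16-le")


def split_record(record: bytes) -> dict:
    """Slice a UTF-16LE record into fields by fixed byte offsets."""
    expected = RECORD_UNITS * BYTES_PER_UNIT
    if len(record) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(record)}")
    fields = {}
    offset = 0
    for name, width in FIELDS:
        end = offset + width * BYTES_PER_UNIT
        fields[name] = record[offset:end].decode("utf-16-le")
        offset = end
    return fields


# Hypothetical sample values, padded with spaces to the schema widths.
values = ["Radio \u00bfQu\u00e9?", "http://example.invalid", "Title", "",
          "Writer", "Artist", "01", "02", "20131127", "0215", "42"]
line = "".join(v.ljust(w) for v, (_, w) in zip(values, FIELDS))

record = utf8_line_to_utf16le(line.encode("utf-8"))
parsed = split_record(record)
print(parsed["ServiceName"].rstrip(), parsed["PerfSttDt"])
```

The length check is what fails first if a character outside the Basic Multilingual Plane appears, because such a character takes two code units.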

However, now I have another problem; please see my other post.

UPDATE - The other problem I ran into (as explained in the link just above this line) was that I was unable to remove the BOM character from the beginning of the file. I could not find a schema file property that would let me do that. But I found a workaround, which is to set the 'STRIP BOM' property to TRUE inside the sequential file stage. The schema file does not override a property defined in the stage unless that property is also defined in the schema file.
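For anyone who preprocesses the file outside DataStage instead, the BOM can be dropped before the file reaches the job. A minimal Python sketch, assuming a UTF-16LE file; the function name is hypothetical.

```python
import codecs


def strip_utf16le_bom(data: bytes) -> bytes:
    """Remove a leading UTF-16LE byte order mark (FF FE) if present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):]
    return data


with_bom = codecs.BOM_UTF16_LE + "AB".encode("utf-16-le")
without_bom = strip_utf16le_bom(with_bom)

# Decoding with "utf-16-le" keeps the BOM as a U+FEFF character,
# which is why it shows up at the start of the first field.
first_char = with_bom.decode("utf-16-le")[0]
print(without_bom, repr(first_char))
```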

eph · Post by **eph** » Mon Feb 09, 2015 6:46 am

Hi,

As I can see there http://en.wikipedia.org/wiki/UTF-16, UTF-16 is still a variable-length encoding, so you might end up having the same kind of problem. Of course, it will be OK if you only have characters encoded in one 16-bit "code unit" and not two. At least that's my understanding.
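The one-versus-two code unit distinction can be checked with a short Python sketch (not DataStage); the helper name and the sample characters are chosen for illustration.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_units = utf16_code_units("\u00bf")          # inside the Basic Multilingual Plane
astral_units = utf16_code_units("\U0001F600")   # outside it, needs a surrogate pair
print(bmp_units, astral_units)
```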

Eric

hsahay · Post by **hsahay** » Mon Feb 09, 2015 11:20 am

Eric - I am a little confused on the subject.

Actually, I implemented the UTF-8 to UTF-16 job after reading the article below, which says:

"UTF-16 encoding uses fixed 2-byte character codes to represent characters."

Here is the link to the article -
https://datastagetips.wordpress.com/201 ... f-8-parse/

But then I read in another forum that

UTF16 (UCS2) - Uses 2 bytes to 4 bytes for each symbol. So it's not really fixed 2 byte encoding as the previous article suggested.

and

UTF32 (UCS4) - uses 4 bytes always for each symbol.

So now I am wondering whether it would be better to convert the files from multibyte UTF-8 to 4-byte UTF-32 and then use the schema file?

eph · Post by **eph** » Tue Feb 17, 2015 4:57 am

According to this site http://unicode.org/faq/utf_bom.html#gen6, it should be safer to use UTF-32, since it always uses 4 bytes per code point.
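A short Python sketch (not DataStage) of what that fixed width buys: with UTF-32, the Nth character always starts at byte offset 4 * N. The sample text is made up for illustration.

```python
BYTES_PER_CODE_POINT = 4

samples = ["A", "\u00bf", "\u20ac", "\U0001F600"]
utf32_widths = {len(ch.encode("utf-32-le")) for ch in samples}

# Fixed offsets: pick the third character straight from the byte string.
text = "".join(samples)
encoded = text.encode("utf-32-le")
start = 2 * BYTES_PER_CODE_POINT
third = encoded[start:start + BYTES_PER_CODE_POINT].decode("utf-32-le")

# The endian-neutral "utf-32" codec prepends a 4-byte BOM when encoding,
# so an explicit byte order avoids an extra leading character.
with_bom_length = len("A".encode("utf-32"))
print(utf32_widths, third, with_bom_length)
```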

Eric

DSXchange
