Schema File for Fixed-Width Files
Hi,
First, you cannot safely use UTF-8 for a fixed-width file. UTF-8 is not a fixed-length encoding: each character takes 1 to 4 bytes. Your final_delim, for instance, is encoded in 2 bytes, while ASCII characters are encoded in 1. You should use another encoding (or convert the file).
See this technote from IBM
Eric
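To make the point about variable byte widths concrete, here is a minimal Python sketch (outside DataStage). The sample strings are made up for illustration and are not taken from the original file.

```python
def utf8_width(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Hypothetical four-character field values.
ascii_field = "ABCD"
accented_field = "ABC\u00e9"  # ends with e-acute, which is outside ASCII

print(utf8_width(ascii_field))     # 4
print(utf8_width(accented_field))  # 5
print(utf8_width("\u00bf"))        # 2 (inverted question mark)
```

Both fields are four characters long, but they occupy a different number of bytes, so byte offsets into a UTF-8 record cannot be relied on once non-ASCII data appears.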
It is true that a UTF-8 file may have variable-length characters, but it does not necessarily contain any multibyte values. The standard LATIN-1 characters are mapped to a single byte, making UTF-8 compatible with ASCII for those characters.
This means that UTF-8 can be used for sequential files that contain only single-byte characters; otherwise it may not be used for fixed-length data.
Addendum - I initially thought that the inverted question mark was mapped to the single byte 0xBF, but that is its Unicode code point (U+00BF); in UTF-8 that character is encoded as 2 bytes, as previously noted. I believe that since this character is not part of the fixed-length data, the file will still be read correctly.
eph - Thanks for the correction, you are right. I incorrectly stated that the standard LATIN-1 characters are encoded in 1 byte.
I should have said that the first group of LATIN-1 (0x00 through 0x7F, i.e. 7-bit) is one byte each; those are the original ASCII characters: A-Z, a-z, 0-9, and a basic set of punctuation symbols. The rest of the LATIN-1 set (0x80 through 0xFF) is encoded in UTF-8 as 2 bytes.
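The boundary can be checked with a short Python sketch (again outside DataStage) that encodes every LATIN-1 code point as UTF-8 and groups them by encoded length. The helper name is just for illustration.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# LATIN-1 covers code points 0x00 through 0xFF.
one_byte = [cp for cp in range(0x100) if utf8_len(cp) == 1]
two_byte = [cp for cp in range(0x100) if utf8_len(cp) == 2]

print(hex(one_byte[0]), hex(one_byte[-1]))  # 0x0 0x7f
print(hex(two_byte[0]), hex(two_byte[-1]))  # 0x80 0xff

# The inverted question mark: one byte in LATIN-1, two bytes in UTF-8.
print("\u00bf".encode("latin-1"))  # b'\xbf'
print("\u00bf".encode("utf-8"))    # b'\xc2\xbf'
```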
A UTF-8 file can be converted to a UTF-16LE file by reading each entire line of the UTF-8 file as one field and then writing it out as UTF-16LE.
Then use a schema file for UTF-16LE like the one below:
// Fixed width file
record
{record_delim='\n',final_delim=none,delim=none,quote=none,charset='UTF-16LE'}
(
ServiceName:USTRING[40];
ServiceUrl:USTRING[40];
OrigTitle:USTRING[35];
Junk:USTRING[35];
OrigWriter:USTRING[35];
OrigArtist:USTRING[35];
UseType:USTRING[2];
PerfType:USTRING[2];
PerfSttDt:USTRING[8];
Duration:USTRING[4];
Plays:USTRING[9];
)
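For anyone who wants to sanity-check a record against this layout outside DataStage, here is a rough Python sketch that slices an already decoded line using the field names and widths from the schema above. It counts decoded characters, not bytes, and it is only an approximation of what the stage does, not its actual implementation.

```python
from itertools import accumulate

# Field names and widths copied from the schema above.
FIELDS = [
    ("ServiceName", 40), ("ServiceUrl", 40), ("OrigTitle", 35), ("Junk", 35),
    ("OrigWriter", 35), ("OrigArtist", 35), ("UseType", 2), ("PerfType", 2),
    ("PerfSttDt", 8), ("Duration", 4), ("Plays", 9),
]
RECORD_WIDTH = sum(width for _, width in FIELDS)  # 245 characters


def parse_record(line: str) -> dict:
    """Split one decoded record (without its trailing newline) into fields."""
    if len(line) != RECORD_WIDTH:
        raise ValueError(f"expected {RECORD_WIDTH} characters, got {len(line)}")
    ends = list(accumulate(width for _, width in FIELDS))
    starts = [0] + ends[:-1]
    return {name: line[s:e] for (name, _), s, e in zip(FIELDS, starts, ends)}


print(RECORD_WIDTH)  # 245
```

A line read from the file would first be decoded (for example with `raw.decode("utf-16-le")`) and stripped of its newline before being passed to `parse_record`.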
However, now I have another problem; please see my other post.
UPDATE - The other problem I ran into (as explained in the link just above this line) was that I was unable to remove the BOM character from the beginning of the file. I could not find a schema file property that would let me do that, but I found a workaround: set the 'STRIP BOM' property to TRUE in the Sequential File stage. The schema file does not override anything defined in the stage unless it is also defined in the schema file.
Last edited by hsahay on Mon Feb 09, 2015 11:23 am, edited 1 time in total.
vishal
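For reference, the BOM mentioned in the update above is the two bytes 0xFF 0xFE at the start of a UTF-16LE file. The Python sketch below shows what removing it amounts to at the byte level; it is an illustration only and makes no claim about how the stage property is implemented. The function name and sample data are made up.

```python
import codecs


def strip_utf16le_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-16LE byte order mark if one is present."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):]
    return raw


# Hypothetical file content: a BOM followed by the text "AB".
raw = codecs.BOM_UTF16_LE + "AB".encode("utf-16-le")
text = strip_utf16le_bom(raw).decode("utf-16-le")
print(text)  # AB
```

Without the strip, the BOM would decode as the extra character U+FEFF at the front of the first record and shift every field in it by one position.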
Hi,
As I can see at http://en.wikipedia.org/wiki/UTF-16, UTF-16 is still a variable-length encoding, so you might end up with the same kind of problem. Of course, it will be OK if all your characters are encoded in a single 16-bit "code unit" and not two. At least that's my understanding.
Eric
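A small Python sketch of the code-unit point: characters in the Basic Multilingual Plane take one 16-bit unit in UTF-16, while characters beyond it take two (a surrogate pair). The sample characters are chosen only for illustration.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


print(utf16_units("A"))           # 1
print(utf16_units("\u00bf"))      # 1 (inverted question mark)
print(utf16_units("\U0001D11E"))  # 2 (musical G clef, outside the BMP)
```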
Eric - I am a little confused on the subject.
Actually, I implemented the UTF-8 to UTF-16 job after reading the article below, which says:
"UTF-16 encoding uses fixed 2-byte character codes to represent characters."
Here is the link to the article:
https://datastagetips.wordpress.com/201 ... f-8-parse/
But then I read in another forum that
UTF16 (UCS2) - uses 2 to 4 bytes for each symbol. So it is not really a fixed 2-byte encoding, as the previous article suggested.
and
UTF32 (UCS4) - always uses 4 bytes for each symbol.
So now I am wondering whether it would be better to convert the files from multibyte UTF-8 to 4-byte UTF-32 and then use the schema file?
vishal
According to http://unicode.org/faq/utf_bom.html#gen6, it should be safer to use UTF-32, since every character is a fixed 4 bytes long.
Eric
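As a quick check of the fixed-width property, the Python sketch below encodes a made-up three-character field as UTF-32LE and reads the second character back by byte offset alone. The "-le" codec variant is used so that Python does not prepend a BOM.

```python
def utf32_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-32LE (no BOM)."""
    return len(text.encode("utf-32-le"))


# Hypothetical field mixing ASCII, a LATIN-1 character and a non-BMP character.
field = "A\u00bf\U0001D11E"
encoded = field.encode("utf-32-le")

print(utf32_len(field))  # 12, i.e. 4 bytes per code point

# The character at index 1 always sits at bytes 4 through 7.
second = encoded[4:8].decode("utf-32-le")
print(second == "\u00bf")  # True
```

Note that the guarantee is per Unicode code point; text that uses combining marks can still show one visible character built from several code points.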