Schema File for Fixed width Files

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

bart12872
Participant
Posts: 82
Joined: Fri Jan 19, 2007 5:38 pm

Post by bart12872 »

It should.
Develop step by step.
First step: validate your schema file without RCP, and make sure that you read your file correctly.
You can use Import Table Definition and look at the parallel layout it generates.
Then build your job with RCP; it should work.
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

First, you cannot use UTF8 for a fixed-width file. UTF8 is not a fixed-length encoding (it uses 1 to 4 bytes per character). Your final_delim, for instance, is encoded on 2 bytes, while ASCII characters are encoded on 1. You should provide another encoding (or convert the file).
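
To make the byte-count point concrete, here is a small Python sketch. Python is not part of DataStage; it is used here only to count bytes, and the sample strings are made up for illustration.

```python
# Illustration only: the UTF-8 byte length depends on which characters are used.
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))

ascii_field = "ABC"     # 3 characters, all ASCII
accented_field = "Aé¿"  # 3 characters, two of them outside ASCII

print(utf8_len(ascii_field))     # 3
print(utf8_len(accented_field))  # 5
```

Two fields with the same number of characters can therefore occupy a different number of bytes, which is why byte-based fixed widths stop lining up.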

See this technote from IBM

Eric
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

It is true that a UTF-8 file may have variable-length characters, but it does not necessarily contain any multibyte values. The standard LATIN-1 characters are mapped to a single byte, making UTF-8 compatible with ASCII for those characters.

This means that UTF-8 can be used for sequential files that contain only single-byte characters; otherwise it cannot be used for fixed-length data.

Addendum - I initially thought that the turned question mark (¿) was mapped as the single byte 0xBF, but that is its Unicode code point (U+00BF); in UTF-8 that character is encoded as 2 bytes, as previously noted. I believe that, since this character is not part of the fixed-length data, the file will still be read correctly.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

eph - Thanks for the correction, you are right. I incorrectly stated that the standard LATIN-1 characters are encoded in 1 byte.

I should have said that the first group of LATIN-1 (0x00 through 0x7F, i.e. 7-bit) is one byte; those are the initial ASCII characters A-Z, a-z, 0-9, and a basic set of punctuation symbols. The rest of the LATIN-1 set (0x80 through 0xFF) gets mapped to more than one byte.
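
A quick way to check where the one-byte range ends is to encode every LATIN-1 code point and group the results by length. This is a Python sketch for illustration only; the helper name utf8_width is made up.

```python
# Group the LATIN-1 code points (0x00-0xFF) by their UTF-8 encoded length.
def utf8_width(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))

one_byte = [cp for cp in range(0x100) if utf8_width(cp) == 1]
two_byte = [cp for cp in range(0x100) if utf8_width(cp) == 2]

print(hex(one_byte[0]), hex(one_byte[-1]))  # 0x0 0x7f
print(hex(two_byte[0]), hex(two_byte[-1]))  # 0x80 0xff

# The turned question mark: one byte in LATIN-1, two bytes in UTF-8.
print("¿".encode("latin-1"), "¿".encode("utf-8"))
```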
prahul4all
Participant
Posts: 5
Joined: Wed Mar 26, 2008 1:06 am
Location: trivandrum

Re: Schema File for Fixed width Files

Post by prahul4all »

kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I don't know who owns idatastage.com but they used an IBM trademark as part of their name. Not smart.
Mamu Kim
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

A UTF8 file can be converted to a UTF16LE file by reading each entire line of the UTF8 file as one field and then writing it out as UTF16LE.

Then use a schema file for UTF16LE, as below:

// Fixed width file
record
{record_delim='\n',final_delim=none,delim=none,quote=none,charset='UTF-16LE'}
(
ServiceName:USTRING[40];
ServiceUrl:USTRING[40];
OrigTitle:USTRING[35];
Junk:USTRING[35];
OrigWriter:USTRING[35];
OrigArtist:USTRING[35];
UseType:USTRING[2];
PerfType:USTRING[2];
PerfSttDt:USTRING[8];
Duration:USTRING[4];
Plays:USTRING[9];
)
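
As a rough cross-check outside DataStage, the same layout can be sliced in Python. This is only a sketch: the field names and widths are copied from the schema above, the sample values are invented, and it assumes every character fits in one 16-bit code unit (no surrogate pairs).

```python
# Field names and widths copied from the schema above (widths are in characters).
FIELDS = [
    ("ServiceName", 40), ("ServiceUrl", 40), ("OrigTitle", 35), ("Junk", 35),
    ("OrigWriter", 35), ("OrigArtist", 35), ("UseType", 2), ("PerfType", 2),
    ("PerfSttDt", 8), ("Duration", 4), ("Plays", 9),
]
RECORD_CHARS = sum(width for _, width in FIELDS)  # 245 characters per record

def parse_record(raw: bytes) -> dict:
    """Decode one UTF-16LE record (without its newline) and slice it by width."""
    text = raw.decode("utf-16-le")
    # Assumption: no characters outside the Basic Multilingual Plane.
    if len(text) != RECORD_CHARS:
        raise ValueError(f"expected {RECORD_CHARS} characters, got {len(text)}")
    record, pos = {}, 0
    for name, width in FIELDS:
        record[name] = text[pos:pos + width].rstrip()
        pos += width
    return record

# Made-up sample record, padded with spaces to the fixed widths.
sample_values = ["Radio ¿Qué?", "http://example.com", "Song", "", "Writer",
                 "Artist", "01", "02", "20150209", "0215", "12"]
sample = "".join(v.ljust(w) for v, (_, w) in zip(sample_values, FIELDS))
row = parse_record(sample.encode("utf-16-le"))
print(row["ServiceName"], row["PerfSttDt"], row["Plays"])
```

Under that assumption each record is 245 characters, i.e. 490 bytes in UTF-16LE, plus the record delimiter.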


However, now I have another problem; please see my other post.

UPDATE - The other problem I ran into (as explained in the link just above this line) was that I was unable to remove the BOM character from the beginning of the file. I could not find a schema file property that would let me do that. But I found a workaround, which is to set the 'STRIP BOM' property to TRUE inside the sequential file stage. The schema file does not override anything that is defined in the stage unless it is also defined in the schema file.
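
For reference, this is what the BOM looks like at the byte level. The Python sketch below only illustrates the idea of dropping a leading UTF-16LE BOM; it is not how the stage property is implemented, and the helper name strip_bom is made up.

```python
import codecs

def strip_bom(data: bytes) -> bytes:
    """Drop a leading UTF-16LE byte order mark if one is present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):]
    return data

with_bom = codecs.BOM_UTF16_LE + "AB".encode("utf-16-le")
print(with_bom)             # b'\xff\xfeA\x00B\x00'
print(strip_bom(with_bom))  # b'A\x00B\x00'
```

If the BOM is left in place, it counts as an extra character at the start of the first record and shifts every fixed-width field in that record.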
Last edited by hsahay on Mon Feb 09, 2015 11:23 am, edited 1 time in total.
vishal
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

As I can see at http://en.wikipedia.org/wiki/UTF-16, UTF-16 is still a variable-length encoding, so you might end up having the same kind of problem. Of course, it will be OK if you only have characters encoded on one 16-bit "code unit" and not two. At least that's my understanding :)
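
A short Python sketch of the one-versus-two code unit case (illustration only; the helper name utf16_units is made up):

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed for text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

print(utf16_units("é"))           # 1
print(utf16_units("\U0001F600"))  # 2 (a surrogate pair)
```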

Eric
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Eric - I am a little confused on the subject.

Actually, I implemented the UTF-8 to UTF-16 job after reading the article below, which says:

"UTF-16 encoding uses fixed 2-byte character codes to represent characters."

Here is the link to the article -
https://datastagetips.wordpress.com/201 ... f-8-parse/

But then I read in some other forum that

UTF16 (UCS2) - uses 2 or 4 bytes for each symbol. So it's not really a fixed 2-byte encoding as the previous article suggested.

and

UTF32 (UCS4) - always uses 4 bytes for each symbol.

So now I am wondering if it would be better to convert the files from multi-byte UTF-8 to 4-byte UTF-32 and then use the schema file?
vishal
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

According to this site, http://unicode.org/faq/utf_bom.html#gen6, it should be safer to use UTF-32, since it is fixed at 4 bytes per code point.
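
The three encodings can be compared side by side with a few sample characters. This is a Python sketch for illustration; the sample characters and the helper name encoded_sizes are arbitrary choices.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in each of the three encodings."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

samples = ["A", "¿", "€", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the UTF-32 column stays constant across all four samples. Note that this holds per code point; text that uses combining marks can still need more than one code point for what a reader sees as a single character.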

Eric