
How to generate a file in UTF-8 format

Posted: Fri Jul 01, 2016 12:40 am
by jpraveen
Hi

I am generating a fixed-width flat file and I need it in UTF-8 format.

I changed the NLS map to UTF-8 in the sequential stage and also in the job properties.

But when I check the file on the Unix box, it is reported as us-ascii.

I used the command below to check the file format on Unix:

file -bi <FF1>

Output:
text/plain; charset=us-ascii


Can you let me know how to generate a file in UTF-8 format?

Posted: Fri Jul 01, 2016 4:15 am
by vinothkumar
You can generate the file in ASCII and convert it to UTF-8 using the iconv command on Unix.

iconv -f ascii -t utf-8 f1.txt -o f1.utf8.txt
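For a file that holds only ASCII characters, this conversion should leave the bytes unchanged. Here is a small Python sketch of the same round trip; the record is a made-up fixed-width line, not data from the actual job.

```python
# Hypothetical fixed-width record containing only 7-bit ASCII characters.
original = b"CUST0001  SMITH     "

# Same idea as `iconv -f ascii -t utf-8`: decode as ASCII, re-encode as UTF-8.
converted = original.decode("ascii").encode("utf-8")

# For ASCII-only input the two byte strings are identical.
print(converted == original)
```

So a tool that inspects the converted file sees exactly the same bytes as before.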

Posted: Fri Jul 01, 2016 7:45 am
by ArndW
Does the file you are checking actually contain any characters that don't map to the single-byte character set? Otherwise you will always get this value from the "file" command.
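To illustrate why, here is a rough Python sketch of how a content-based guess can work. It is not the actual algorithm of the "file" command, and the function name guess_charset and the sample records are made up for the example.

```python
def guess_charset(data: bytes) -> str:
    """Rough, simplified stand-in for a content-based charset guess."""
    # Every byte below 0x80 means nothing distinguishes the data from ASCII.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # Otherwise, see whether the bytes form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"


# Hypothetical fixed-width records, both written out as UTF-8.
ascii_only = "JOSE      0001".encode("utf-8")
with_accent = "JOS\u00c9      0001".encode("utf-8")  # contains an accented E

print(guess_charset(ascii_only))
print(guess_charset(with_accent))
```

Both records were encoded as UTF-8, but only the one with a character outside the ASCII range can be recognised as such from its bytes.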

Posted: Fri Jul 01, 2016 9:04 am
by UCDI
Correct me if I am wrong, but I thought UTF-8 is "one of several" extended ASCII sets, that is, bytes 0-127 are "ASCII" and 128-255 are mapped for "non-English" characters.

If you don't use any chars over 127, I am not sure that any tool can tell the difference (??) between them, assuming we are talking about a pure text file without markup or extensions or some other way to differentiate?

Again, I could be wrong, so I am half asking here...

Posted: Fri Jul 01, 2016 12:59 pm
by Mike
UTF-8 is an encoding of the Unicode character set in which each character takes from 1 to 4 bytes. ASCII characters are encoded in UTF-8 the same as they are in ASCII.

So us-ascii is essentially a subset of UTF-8.

Your us-ascii file is also a UTF-8 file.

Mike
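A quick way to see the 1 to 4 byte range is to encode a few characters in Python. This is only a sketch, and the sample characters are arbitrary picks.

```python
# Arbitrary sample characters: A, e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)

# An ASCII character produces the same single byte in both encodings.
print("A".encode("utf-8") == "A".encode("ascii"))
```

Only characters outside the ASCII range take more than one byte, which is why a file without them is byte-for-byte valid in both encodings.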

Posted: Fri Jul 01, 2016 1:16 pm
by chulett
Yes but isn't there some sort of a magic (maybe 4 byte) header on UTF-8 files? I recently had an issue where a particular set of files would come in either format and my tool when set to UTF-8 could read either without issue but when set to US-ASCII would barf on a UTF-8 file, adding some "garbage" characters to the first field.

Posted: Fri Jul 01, 2016 2:32 pm
by Mike
Seems like a UTF-8 file could come with an optional 3-byte BOM (byte order mark, the bytes EF BB BF).

But that is no guarantee that it is a UTF-8 file.

I think if you're expecting a UTF-8 file, getting a us-ascii file should be no problem.

If you're expecting a us-ascii file, getting a UTF-8 file with a BOM is going to be a problem even if everything after the BOM is ASCII.

Mike
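Here is a small Python sketch of both cases. The field value "ID001" is made up, and Latin-1 stands in for whatever single-byte character set a reader might be using.

```python
import codecs

text = "ID001"  # hypothetical first field, ASCII characters only

# "utf-8-sig" writes the 3-byte BOM in front of the UTF-8 data.
with_bom = text.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# A strict ASCII reader rejects the BOM bytes outright.
try:
    with_bom.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A single-byte reader turns the BOM into three junk characters
# glued to the front of the first field.
as_latin1 = with_bom.decode("latin-1")

# A BOM-aware UTF-8 reader accepts the file with or without the BOM.
from_bom_file = with_bom.decode("utf-8-sig")
from_ascii_file = text.encode("ascii").decode("utf-8-sig")

print(has_bom, ascii_ok, repr(as_latin1), from_bom_file, from_ascii_file)
```

The three extra characters in the Latin-1 result are the kind of "garbage" that ends up in the first field when a BOM is read with a single-byte map.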

Posted: Fri Jul 01, 2016 2:58 pm
by chulett
That mirrors my experience.