Page 1 of 1

Posted: Wed Jun 07, 2017 6:36 am
by Mike
UTF8 and bounded varchars come with the sort of baggage that you're seeing. That's due to UTF8 having a variable number of bytes per character. You should switch to nvarchar (or the equivalent of adding Unicode extended property to your varchar).

Also if you look closely at the output that you think is fine, you will see you only have 29 characters. With nvarchar you would have 30 "complete" characters.

Mike

Posted: Wed Jun 07, 2017 7:38 am
by lindatrgeo
Thanks Mike for the response.

In fact, we were using NVARCHAR initially. In that case, DS reads the special character as single byte. When applying SUBSTR [1,30], it extracted 31 bytes of data. But the target Oracle table has size restriction of 30 bytes. So the record was getting rejected by DB for exceeding the length.

While using VARCHAR, the byte size goes in sync with DB.

Posted: Wed Jun 07, 2017 8:11 am
by Mike
As you see, bytes and characters have a variable ratio when you are dealing with UTF8. Characters will range in size from 1 to 4 bytes. Your 30 byte column in Oracle could potentially hold as few as 7 and as many as 30 complete characters. If you don't stick to character semantics, you run the risk of truncating a string in the middle of a multi-byte character.

Posted: Wed Jun 07, 2017 8:51 am
by chulett
I've never understood why so many people who deal with multi-byte characters stick to BYTE semantics in Oracle. Why not alter that column in the table to be VARCHAR2(30 CHAR) instead?

Posted: Wed Jun 07, 2017 9:16 am
by lindatrgeo
Thanks Mike and Craig !

@Mike, That's exactly what happened in the scenario I explained. The string got truncated in the middle of multi byte character. As a workaround, I added a check to see if the 30th character is alphanumeric. If not I extract only the 29 characters to avoid and loading issue.

@Craig, My target system is a legacy one, and they don't allow any change in their DB :(

Posted: Wed Jun 07, 2017 7:41 pm
by ray.wurlod
Just for completeness, the substring functions work with characters, not with bytes.