
Data with Embedded Carriage Returns

Posted: Thu May 18, 2006 12:50 pm
by rwierdsm
OK, folks, I've spent about an hour and a half browsing all the posts and haven't found anything that gives me a solution to the particular task I've been set.

Next week I will be receiving a peculiar file that will contain several fields that can be up to 32K wide. These fields can hold almost anything, including special characters, hidden characters and, most frightening, carriage returns in all formats.

Trouble is, I have to preserve the format of these input fields, so, no removing carriage returns! Yikes!

The guys providing the source have cooked up a field delimiter character combination to separate the fields, as follows: {~^|^~} Yikes again! (still trying to determine if that character combination is suggestive of anything?!?!)

They haven't yet given any thought to the end-of-line indicator, so I'm trying to cook up something to suggest, but I have yet to come up with an idea :idea: .

Any suggestions? Also what are the downstream implications? I have to create a flat file as input to the job that actually hits the DB (Teradata in this case), so how am I going to create a load file with these embedded carriage returns? :(

Thought I'd post this now so some ideas can percolate before my deadline.

Rob Wierdsma

Posted: Thu May 18, 2006 1:01 pm
by chulett
The Sequential File stage has a column option for 'Contains Terminators', something you should check out. :wink:

Good luck with that delimiter.

Posted: Thu May 18, 2006 1:09 pm
by rwierdsm
Yessss, Craig, I have seen that, buuuttttt.....

How do I determine which carriage return is actually the correct one?

Thinking through this a bit more, I guess it would have to follow my last delimiter, i.e. I would have to specify that the last column is followed by the delimiter and the 'real' carriage return. Does that sound about right? ....except that the source file also has multiple record types with varying numbers of columns (forgot to tell you that part)....

Craig, will this setting capture all types of carriage return? i.e. DOS (x0D0A), HEX (x0A) and the ^M I've seen in some posts (which I have no idea about).

Thanks,

Rob W

Posted: Thu May 18, 2006 1:36 pm
by DSguru2B
If you know the number of columns actually present in the file, you could read the whole file in as a single row. Use the Field() function, or the Ereplace() function, to deal with that martian delimiter, and after you find n delimiters (where n is your column count), throw in a carriage return.
Just a thought.
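For illustration only, that technique could be sketched like this in Python (the real job would use DataStage BASIC's Field()/Ereplace(); the delimiter is the one from the thread, the data and function name are made up):

```python
# Sketch of the idea above: read the whole file as one string, split on
# the custom delimiter only, and regroup every n fields into one record,
# so carriage returns embedded inside a field are left untouched.
DELIM = "{~^|^~}"

def rebuild_records(raw, n_cols):
    """Group the delimited fields into records of n_cols columns each."""
    fields = raw.split(DELIM)
    return [fields[i:i + n_cols] for i in range(0, len(fields), n_cols)]

# Two 3-column records; the second field carries an embedded CR/LF.
raw = "a{~^|^~}line1\r\nline2{~^|^~}c{~^|^~}d{~^|^~}e{~^|^~}f"
records = rebuild_records(raw, 3)
```

The embedded \r\n survives inside the second field because only the custom delimiter is ever treated as a boundary.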

Posted: Thu May 18, 2006 1:38 pm
by chulett
This isn't about determining which carriage return is the 'right one'. All that setting controls is whether 'terminators' inside the field are considered data or whether the record ends at that point... and what it considers 'terminators' is based on the 'Line Termination' style you select in the stage.
Rob wrote:....except that the source file also has multiple record types with varying numbers of columns (forgot to tell you that part)....
Ah... that does make it more interesting, that's for sure. You'll need to read it as long long varchar field and then bust it up appropriately based on the record type.

Are you on a Windows server like your first post indicates?
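The 'read it as one long varchar and bust it up by record type' idea could be sketched like this; note the record-type codes and column counts below are invented for illustration, since the thread never gives the real layouts:

```python
# Hypothetical sketch: after splitting the file on the custom delimiter,
# treat the first field of each record as a type code and look up how
# many columns follow it. The codes/counts here are NOT from the thread.
COLS_BY_TYPE = {"HDR": 3, "DTL": 5, "TRL": 2}  # invented layouts

def split_by_record_type(fields):
    """Consume a flat list of fields, using each record's leading type
    code to decide how many columns belong to that record."""
    records, i = [], 0
    while i < len(fields):
        rtype = fields[i]
        n = COLS_BY_TYPE[rtype]          # columns after the type code
        records.append(fields[i:i + 1 + n])
        i += 1 + n
    return records
```

This only works if every record really does lead with a reliable type code, which would need to be confirmed with the source team.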

Posted: Thu May 18, 2006 1:56 pm
by rwierdsm
chulett wrote:
Are you on a Windows server like your first post indicates?
Yup.

Posted: Thu May 18, 2006 2:01 pm
by rwierdsm
DSguru2B wrote:If you know the number of columns actually present in the file, you could read the whole file in as a single row. Use the Field() function, or the Ereplace() function, to deal with that martian delimiter, and after you find n delimiters (where n is your column count), throw in a carriage return.
Just a thought.
Hey, Guru.

We actually used that technique in a previous exercise; however, the varying number of columns per record type makes it a bit more interesting. Also, I'm hesitant to replace the 'martian' delimiter with anything else that might actually show up in the data. I've been thinking about substituting @FM, which translates to (xFE). That value is about as likely to show up in the data as the 'martian' combination.
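The @FM substitution described above might look like this in Python; the xFE byte standing in for DataStage's @FM is from the post, but the function name is illustrative:

```python
DELIM = "{~^|^~}"
FM = "\xfe"  # single-byte stand-in for DataStage's @FM (xFE)

def to_field_marks(raw):
    """Replace the multi-character 'martian' delimiter with a single
    xFE byte, on the assumption that xFE never occurs in the data."""
    return raw.replace(DELIM, FM)
```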

Rob

Posted: Thu May 18, 2006 2:02 pm
by chulett
Just curious because of your ^M comment. You probably aren't familiar with them; they show up in a UNIX environment when a DOS file is sent to UNIX in 'binary' format. The terminators are not converted, so the 0A is seen as the terminator and thus the preceding 0D (^M) is seen as data.

Not something you'll need to deal with.
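A quick demonstration of the ^M effect described above, sketched in Python:

```python
# A DOS file moved to UNIX in 'binary' mode keeps both bytes of its
# x0D0A terminators. Splitting on x0A alone (as UNIX tools do) leaves
# the x0D byte behind as data -- the ^M shown by editors like vi.
dos_bytes = b"first\r\nsecond\r\n"
unix_lines = dos_bytes.split(b"\n")
# each non-empty entry now ends with the leftover b"\r" byte
```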

Posted: Thu May 18, 2006 7:26 pm
by rwierdsm
OK, thanks for the additional info Craig.

Rob

Posted: Thu May 18, 2006 8:27 pm
by ray.wurlod
Have you considered using a Complex Flat File stage, which has the built-in capacity for dealing with multiple record formats?

Or pre-processing your file, adding sufficient delimiter characters so that every record format has the same number of columns?

Posted: Fri May 19, 2006 9:35 am
by rwierdsm
Ray,

We have used CFF in previous projects.

My problem is
1. doing some pre-processing in terms of validation, and
2. telling the CFF stage where one field ends and another starts, as well as where one record ends and the next one starts.

Rob

Posted: Fri May 19, 2006 9:27 pm
by kcbland
You'll need to beg for an end-of-record terminator, because a CR/LF can't serve as a true end-of-record marker when it can also appear embedded within your custom-delimited data.

The Sequential File stage will fail to recognize a row because you won't be able to declare a single-character column delimiter, and the embedded CR/LF will mess up the logic that determines whether a column of data is quoted and whether the CR/LF should be kept as part of the record instead of ending it prematurely.

If you can get a custom end-of-record terminator, consider writing a custom program (even DS BASIC will work) to read and parse the source file into discrete columns. A hashed file is perfect for storing the results, as the CR/LF can persist in the data. You can use block reads instead of sequential record reads to handle the data file.
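Assuming the extract team can be talked into a custom end-of-record marker, the parse step described above might look like this; the EOR string below is purely hypothetical:

```python
DELIM = "{~^|^~}"
EOR = "{~^EOR^~}"  # hypothetical negotiated end-of-record terminator

def parse(raw):
    """Split on the EOR marker first, then on the field delimiter, so
    CR/LF bytes inside a field never end a record."""
    return [rec.split(DELIM) for rec in raw.split(EOR) if rec]
```

A DS BASIC version of the same loop could write each parsed record to a hashed file, where the embedded CR/LF persists harmlessly in the data.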

Posted: Sat May 20, 2006 6:19 pm
by rwierdsm
Hmmm.....

Some interesting ideas here Kenneth.

Thanks.

Rob W

Posted: Thu May 25, 2006 1:38 pm
by rwierdsm
What if I was to get the extract development team to use ascii(xFE) for field delimiters and ascii(xFF) for EOR markers?

This would bump into the values for @FM and @IM, but if I stay away from hashed files, would I be OK?

Just kicking around other ideas....
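For what it's worth, the xFE/xFF scheme proposed above would parse trivially; a Python sketch (the byte values are from the post, the function name is made up):

```python
def parse_fe_ff(data):
    """Split a byte stream on xFF record terminators, then split each
    record on xFE field delimiters. Embedded CR/LF bytes pass through
    untouched. These bytes collide with DataStage's @FM/@IM marks,
    hence the caveat about staying away from hashed files."""
    records = [r for r in data.split(b"\xff") if r]
    return [r.split(b"\xfe") for r in records]
```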