Data with Embedded Carriage Returns

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Data with Embedded Carriage Returns

Post by rwierdsm »

OK, Folks, I've spent about an hour and a half browsing all the posts and have not found anything that gives me a solution to the particular task I have been set.

Next week I will be receiving a peculiar file that will contain several fields that can be up to 32k wide. These fields can hold almost anything including special characters, hidden characters and most frightening, carriage returns of all formats.

Trouble is, I have to preserve the format of these input fields, so, no removing carriage returns! Yikes!

The guys providing the source have cooked up a field delimiter character combination to separate the fields, as follows: {~^|^~} Yikes again! (still trying to determine if that character combination is suggestive of anything?!?!)

They haven't yet given any thought to the end-of-line indicator, so I'm trying to cook up something to suggest, but I have yet to get an idea :idea: .

Any suggestions? Also what are the downstream implications? I have to create a flat file as input to the job that actually hits the DB (Teradata in this case), so how am I going to create a load file with these embedded carriage returns? :(

Thought I'd post this now so some ideas can percolate before my deadline.

Rob Wierdsma
Toronto, Canada
bartonbishop.com
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The Sequential File stage has a column option for 'Contains Terminators', something you should check out. :wink:

Good luck with that delimiter.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

Yessss, Craig, I have seen that, buuuttttt.....

How do I determine which carriage return is actually the correct one?

Thinking through this a bit more, I guess it would have to follow my last delimiter, i.e. I would have to specify that the last column is followed by the delimiter and the 'real' carriage return. Does that sound about right? ....except that the source file also has multiple record types with varying numbers of columns (forgot to tell you that part)....
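To sketch what I mean (Python just for illustration; this assumes DOS-style CR/LF endings and that the 'real' terminator is the delimiter immediately followed by one):

```python
# Illustrative sketch only: treat the martian delimiter immediately followed
# by CR/LF as the one true end-of-record marker, so that bare carriage
# returns inside a field are left untouched.
DELIM = "{~^|^~}"
RECORD_END = DELIM + "\r\n"  # assumed compound terminator

def split_records(raw: str):
    """Split the raw stream into records, preserving embedded CR/LFs."""
    return [rec for rec in raw.split(RECORD_END) if rec]

raw = "a{~^|^~}b\r\nstill b{~^|^~}\r\nc{~^|^~}d{~^|^~}\r\n"
print(split_records(raw))
```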

Craig, will this setting capture all types of carriage return? i.e. DOS (x0D0A), UNIX (x0A) and the ^M I've seen in some posts (which I have no idea about).

Thanks,

Rob W
Rob Wierdsma
Toronto, Canada
bartonbishop.com
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

If you know the number of columns that are actually present in the file, then you could read the whole file as a single row. Use the Field function, or the Ereplace function, to deal with that martian delimiter, and after you find n delimiters (where n is your column count), throw in a carriage return.
Just a thought.
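In Python terms, just to sketch the idea (the real job would use Field/Ereplace in DS BASIC):

```python
# Sketch: treat the whole file as one stream of fields, split on the custom
# delimiter, and regroup every n fields into one record -- any carriage
# returns inside a field simply come along for the ride.
DELIM = "{~^|^~}"

def rebuild_records(raw, n_cols):
    fields = raw.split(DELIM)
    return [fields[i:i + n_cols] for i in range(0, len(fields), n_cols)]

raw = "a{~^|^~}b\r\nstill b{~^|^~}c{~^|^~}d{~^|^~}e{~^|^~}f"
print(rebuild_records(raw, 3))
```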
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

This isn't about determining which carriage return is the 'right one'. All that setting controls is whether 'terminators' inside the field are considered data or if the record ends at that point... and what it considers 'terminators' are based on what 'Line Termination' style you select in the stage.
Rob wrote:....except that the source file also has multiple record types with varying numbers of columns (forgot to tell you that part)....
Ah... that does make it more interesting, that's for sure. You'll need to read each record as one long varchar field and then bust it up appropriately based on the record type.

Are you on a Windows server like your first post indicates?
-craig

"You can never have too many knives" -- Logan Nine Fingers
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

chulett wrote:
Are you on a Windows server like your first post indicates?
Yup.
Rob Wierdsma
Toronto, Canada
bartonbishop.com
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

DSguru2B wrote:If you know the number of columns that are actually present in the file, then you could read the whole file as a single row. Use the Field function, or the Ereplace function, to deal with that martian delimiter, and after you find n delimiters (where n is your column count), throw in a carriage return.
Just a thought.
Hey, Guru.

We actually used that technique in a previous exercise; however, the varying number of columns per record type makes it a bit more interesting. Also, I'm hesitant to replace the 'martian' delimiter with anything else that might actually show up in the data. I've been thinking about substituting @FM, which translates to x'FE'. That value is about as likely to show up in the data as the 'martian' combination.
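A sketch of that substitution (Python just to illustrate; x'FE' is @FM's byte value in DS BASIC):

```python
# Replace the multi-character "martian" delimiter with the single byte
# x'FE' (the value of @FM), giving downstream stages a one-character
# delimiter that is about as unlikely to appear in real data.
MARTIAN = b"{~^|^~}"
FM = b"\xfe"  # @FM's byte value

def to_single_byte_delim(record: bytes) -> bytes:
    return record.replace(MARTIAN, FM)

print(to_single_byte_delim(b"a{~^|^~}b{~^|^~}c"))
```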

Rob
Rob Wierdsma
Toronto, Canada
bartonbishop.com
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just curious because of your ^M comment. You probably aren't familiar with them as they show up in a UNIX environment when a DOS file is sent to UNIX in 'binary' format. The terminators are not converted, so the 0A is seen as the terminator and thus the preceding 0D (^M) is seen as data.

Not something you'll need to deal with.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

OK, thanks for the additional info Craig.

Rob
Rob Wierdsma
Toronto, Canada
bartonbishop.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Have you considered using a Complex Flat File stage, which has the built-in capacity for dealing with multiple record formats?

Or pre-processing your file, adding sufficient delimiter characters so that every record format has the same number of columns?
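That second suggestion might look something like this (Python sketch; the record types and column counts here are made up for illustration):

```python
# Pad every record type out to the widest layout by appending empty
# delimited fields, so a single column definition covers all record types.
DELIM = "{~^|^~}"
COLS_BY_TYPE = {"HDR": 3, "DTL": 7, "TRL": 2}  # hypothetical record types
MAX_COLS = max(COLS_BY_TYPE.values())

def pad_record(rec_type, fields):
    missing = MAX_COLS - COLS_BY_TYPE[rec_type]  # delimiters to add
    return DELIM.join(fields + [""] * missing)

print(pad_record("TRL", ["TRL", "999"]))
```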
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

Ray,

We have used CFF in previous projects.

My problem is
1. doing some pre-processing in terms of validation, and
2. telling the CFF where one field ends and another starts, as well as where one record ends and the next one starts.

Rob
Rob Wierdsma
Toronto, Canada
bartonbishop.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You'll need to beg for a true end-of-record terminator, because a CR/LF embedded within your custom-delimited data can't be distinguished from one that actually ends the record.

The Sequential stage will fail to recognize a row because you won't be able to declare a single-character column delimiter, and that will mess up the logic that determines whether a column of data is quoted and whether a CR/LF should be kept as part of the record instead of ending it prematurely.

If you can get a custom end-of-record terminator, consider writing a custom program (even DS BASIC will work) to read and parse the source file into discrete columns. A hashed file is perfect for storing the results, as the CR/LF can persist in the data. You can use block reads instead of sequential record reads to handle the data file.
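A sketch of that block-read approach (Python standing in for DS BASIC here; the end-of-record marker is hypothetical):

```python
# Read the file in fixed-size binary blocks rather than line by line, and
# split records only on the agreed custom terminator -- embedded CR/LFs
# pass through untouched.
EOR = b"{~^EOR^~}"  # hypothetical custom end-of-record marker

def read_records(path, block_size=65536):
    buf = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            buf += block
            *records, buf = buf.split(EOR)  # keep the tail for next block
            yield from records
    if buf:
        yield buf  # last record had no trailing terminator
```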
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

Hmmm.....

Some interesting ideas here Kenneth.

Thanks.

Rob W
Rob Wierdsma
Toronto, Canada
bartonbishop.com
rwierdsm
Premium Member
Posts: 209
Joined: Fri Jan 09, 2004 1:14 pm
Location: Toronto, Canada
Contact:

Post by rwierdsm »

What if I were to get the extract development team to use ascii x'FE' for field delimiters and x'FF' for EOR markers?

This would bump into the values for @FM and @IM, but if I stay away from hashed files, would I be OK?

Just kicking around other ideas....
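Parsing that layout would then be simple (Python sketch, assuming x'FE' between fields and x'FF' after each record):

```python
# Split records on the single byte x'FF' and fields on x'FE'; neither byte
# occurs in plain ASCII data, though they are exactly @IM and @FM, so this
# data would have to stay out of hashed files.
FD = b"\xfe"   # field delimiter (same byte value as @FM)
EOR = b"\xff"  # end-of-record marker (same byte value as @IM)

def parse(stream: bytes):
    records = stream.rstrip(EOR).split(EOR)
    return [rec.split(FD) for rec in records]

print(parse(b"a\xfeb\r\nmore b\xfec\xffd\xfee\xfef\xff"))
```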
Rob Wierdsma
Toronto, Canada
bartonbishop.com