Join heterogenous data

miamichillin99 · Post by **miamichillin99** » Wed Dec 01, 2004 11:37 am

Let me start by saying thanks to the MODS and Admin of this great site. I'm new to DS and this site is going to help me a lot. Great Job!!!

No onto my Q:
I'm trying to join two data sources and retrieve the differences. for example:

Code: Select all

DataSource1
EmployeeID            Name
1                            xx
2                            yy

DataSource2
EmplyeeID             Name
1                           xx
2                           yy
3                           zz

The join would return only the record for EmployeeId #3. Also to add fuel to the fire DataSource1 is a sequential file and DataSource2 is an ODBC stage. So far I've created a sequential file going into a Transformer and the ODBC going to a hashed file and into the same Transformer. I'm trying to make the Transformer handle the join logic but it's not working.

If anyone has any ideas I would really appreciate it.

chucksmith · Post by **chucksmith** » Wed Dec 01, 2004 11:55 am

So you set the output constraint to the link variable:

Code: Select all

DataSource2.NotFound

ray.wurlod · Post by **ray.wurlod** » Wed Dec 01, 2004 2:15 pm

Welcome aboard! :D

Since this is a server job, you really should have posted it on the server job forum.

The most obvious approach would be to load the contents of the text file into a hashed file with two key columns and no non-key columns.

The job to isolate the differences then extracts rows from your database (ODBC) table, performs a lookup (in a Transformer stage) against that hashed file, and sends the original row out of the Transformer only if the lookup succeeds (that is, if the reference input link's NOTFOUND variable is true). You handle this using a constraint expression on the output link from the Transformer stage.

miamichillin99 · Post by **miamichillin99** » Wed Dec 01, 2004 3:51 pm

Thanks chuck and ray. I'm really learning a lot here. I had the lookup going to the ODBC when it was suppose to go to the sequential file. I also added a hashed file to be created from the sequential file for faster lookups. I then added the correct NOTFOUND logic to the Transformer's contraints and now by looking at the rows moving from stage to stage it looks to be working better. But it's not perfect. I'm getting an error going from my FTP plugin (thats the sequential file) to my hashed file:

Code: Select all

IncrementalOperations..HAOldOperations.IDENT2: Unbalanced or unescaped quote character
At row 12457, link "FTPOut", while processing column "DESCRIPTION"
Unbalanced or unescaped quote character

I've looked at the sequential file at row 12457 and found nothing strange. Then when I look at the data in the "View Data..." of the Hashed File stage Inputs tab I see that the data is not in the same order as the sequential file before the FTP plug in. I have the file name in the has stage set to the same for the input and output. Is this correct or am I suppose to point the input to the incoming FTP filename?

So my question is how do I know what record is causing the problem. Is it record 12457 of the sequential file or is it record 12457 of the re-ordered hash file?

Sorry about posting in the wrong place. I'll get it right next time. Mods feel free to move this post to the correct forum.
[/code]

chucksmith · Post by **chucksmith** » Wed Dec 01, 2004 4:32 pm

First, hash file row order does not resemble the order of your input data. It can be ordered on output, but for a lookup it does not matter.

I guess that one of the descriptions has an odd number of quote characters in it.

In your FTP stage, what are the settings on the Output / Properties tab? Fixed width? Quote character? Column delimiter? Escape character?

Does you actual input file have quoted columns. Perhaps you should try setting the Quote character to 000.

miamichillin99 · Post by **miamichillin99** » Wed Dec 01, 2004 4:42 pm

Hi chuck.

My setting are all default for the one's you asked:
Fixed width = No
Quote char = "
delimiter = ,
escape = \ (I also tried leaving this blank)

My sequential file does have quoted columns and is comma delimited.

chucksmith · Post by **chucksmith** » Wed Dec 01, 2004 4:57 pm

I suggest you do a test.

Code: Select all

FTP ---> SEQ

Tell the ftp stage that you have one unknown or varchar column, and set your quote character and column delimiter character to 000.

Let's see what actually appears in the sequential file.

You could also telnet to the location of the source file and use vi (or head, tail, and od) to look at the questionable row. I still think it has an embedded quote in the description text.

miamichillin99 · Post by **miamichillin99** » Wed Dec 01, 2004 5:32 pm

Chuck,

using the debug with breakpoints and watches set up I was able to find the bad record. There was a quote (") in the middle of a column in the sequential file. Actually I found a few of them which are there because the file is comma delimited but has quotes around each column. This file is created using the Microsoft ODBC text driver and it places them by default around each column.

Thanks for all your help.
:D

miamichillin99 · Post by **miamichillin99** » Thu Dec 02, 2004 11:26 am

Has anyone encountered the handling of quotes incoming from an ODBC source stage. I have an ODBC connection that will create a sequential file in the end. But some CHAR fields have quotes inside of them and it's creating a file with quotes which then cannot be used as an input to a hashed file like below:

Code: Select all

EmployeeID, Name
"12345","Dave "Matthews""

As you'll notice Matthews is in quotes but it's part of the entire second column, Name. Any way around Ascential thinking that the quotes within the column Name don't mean the field is ending. I thought Ascential would know that in a comma delimited file the column doesn't end until you hit the comma.

kcbland · Post by **kcbland** » Thu Dec 02, 2004 12:08 pm

Make sure you use the "Get SQL Info" button on the ODBC stage so that it properly registers the information about the ODBC driver in use. You'll see the Quote character text box right next to it probably change values now that it knows more about the ODBC driver.

miamichillin99 · Post by **miamichillin99** » Thu Dec 02, 2004 12:17 pm

kc,

Thanks for your input. Unfortunately I had already done that. I did it again anyways and it left the quote character as a double quote("). I still have the sequential file as I posted previously. It still has quotes within the fields.

chucksmith · Post by **chucksmith** » Thu Dec 02, 2004 1:11 pm

I think you are back to trying to tell DataStage to ignore quotes when reading the file, and then using

Code: Select all

Trim(YourColumn,'"',"A")

to strip out the quotes. You still could run into problems with commas within strings. By the way, that second argument in the Trim() is single quote, double quote, single quote.

miamichillin99 · Post by **miamichillin99** » Thu Dec 02, 2004 5:03 pm

Chuck,

Good to hear from you again. Actually I did try the QUOTE = 000 but just like you said now my problem is comma's inside the fields. So it's a constant problem. Either with commas or double quotes. I didn't try the Trim function though, because I would like to keep the fields as they come from the ODBC. So if they have a double quote or a comma I want to keep those characters in the resulting sequential file. How can I accomplish this?

chucksmith · Post by **chucksmith** » Thu Dec 02, 2004 5:41 pm

Let me see if I understand the job. Does it look like this:

Code: Select all

ODBC ---> XFR ---> SEQ 2 ---> XFR ---> HASH
                               |
                               V
                   SEQ 1 ---> XFR ---> SEQ 3

Where ODBC is DataSource2 and SEQ 1 is DataSource1?

If so, could you try this:

Code: Select all

ODBC ---> XFR ---> HASH
                    |
                    V
        SEQ 1 ---> XFR ---> SEQ 3

If you eliminate SEQ 2, the problem is gone. Otherwise, try changing the quote character in SEQ 2 to something other than double quote, perhaps a pipe.

If my picture is wrong, please revise it.

DSXchange

Join heterogenous data

Join heterogenous data

quotes within a field