Teradata multiset tables and duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Teradata multiset tables and duplicates

Post by bcarlson »

I have 2 purposes for this posting. One is to let people know about an issue we are seeing in case they are encountering the same issue. The second is to glean whatever information is available about this from others that have already worked their way through it.

We have discovered a bug/flaw in how DataStage loads duplicates into a target multiset table. I am wondering if anyone else has noticed this and if so, if you figured out a way around it or if there are any patches available that could fix this. And yes, we are currently working with IBM... but they are pushing to upgrade to v8 and we don't have time, $ or bandwidth to do this yet.

Here is the problem: The source data has duplicates that need to be loaded into the target table. The table is defined as multiset to allow the duplicates to be loaded (tables designated as SET will not allow dups, MULTISET will). What we have seen is that the duplicates are dropped automatically as the target table is loaded, with no notification of the dropped records.

According to IBM and Teradata the issue is a combination of things. Here's how it all works:
  * DataStage uses FastLoad to write data to Teradata tables.
  * FastLoad requires the target table to be empty.
  * To ensure an empty table, DataStage creates a "ghost" table that is a DDL copy of the actual target.
  * The data is loaded to the ghost table, then copied (via SQL) to the target table.

The problem is that the ghost table is created as a SET table, not a MULTISET table, no matter what the target is. As a result, the duplicates are dropped as they are loaded to the ghost table.
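The SET/MULTISET distinction can be sketched in DDL. The names below are made up for illustration (the actual generated ghost-table name will differ):

```sql
-- Hypothetical target table: MULTISET, so full-row duplicates are allowed.
CREATE MULTISET TABLE db1.tgt_txn (
    acct_id   INTEGER,
    txn_amt   DECIMAL(18,2)
) PRIMARY INDEX (acct_id);

-- The ghost/work table behaves as if it were created like this:
CREATE SET TABLE db1.orch_work_txn (
    acct_id   INTEGER,
    txn_amt   DECIMAL(18,2)
) PRIMARY INDEX (acct_id);
-- A SET table silently discards full-row duplicates on insert,
-- so duplicate rows never survive to be copied into db1.tgt_txn.
```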

Does anyone know of a way to force the ghost table to be created correctly (SET if target is SET, MULTISET if target is MULTISET)?

For now, we are just adding a record counter/surrogate key to keep the records unique. What other workarounds have people used? I'd rather not have to add a "useless" key that has to be dropped later on, especially when the volumes are high - one column can add up to a lot of unnecessary wasted space.
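A sketch of that workaround's effect (table and column names are hypothetical; in the job the counter value itself would come from something like a surrogate key generator or row counter):

```sql
-- Illustrative only: a load-sequence column makes otherwise identical
-- rows distinct, so duplicate-row elimination (whether by a SET ghost
-- table or by the load utility) no longer discards them.
CREATE MULTISET TABLE db1.tgt_txn_seq (
    load_seq  BIGINT,           -- populated with a per-row counter in the job
    acct_id   INTEGER,
    txn_amt   DECIMAL(18,2)
) PRIMARY INDEX (acct_id);
-- The load_seq column carries no business meaning and has to be
-- dropped or ignored downstream - the space cost Brad mentions.
```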

Thanks!

Brad.
It is not that I am addicted to coffee, it's just that I need it to survive.
sas
Charter Member
Posts: 13
Joined: Sun Mar 05, 2006 12:37 am

Re: Teradata multiset tables and duplicates

Post by sas »

Just try using Teradata Multiload stage. It may work for you.


bcarlson wrote: We have discovered a bug/flaw in how DataStage loads duplicates into a target multiset table. [...] Does anyone know of a way to force the ghost table to be created correctly (SET if target is SET, MULTISET if target is MULTISET)?
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

We are intentionally not using the Multiload stage.

The MultiLoad stage is limited to 256 fields, and we also had some performance issues with it. Our rule has been to use the Enterprise stage for whatever database is our target, because the Enterprise stage tends to get better support from Ascential/IBM and the database vendors, tends to perform better, and is guaranteed to be a parallel stage within DataStage.

Brad.
It is not that I am addicted to coffee, it's just that I need it to survive.
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

Create another multiset table with the same structure as the target table, load that table first in truncate mode, and then copy the data from it to the target table (the same thing DataStage does in its last step, as you mentioned in your post: the data is loaded to the ghost table, then copied via SQL to the target table).
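A sketch of this staging approach (db1.stg_txn and db1.tgt_txn are hypothetical names):

```sql
-- 1. Pre-create a MULTISET staging table matching the target's structure.
CREATE MULTISET TABLE db1.stg_txn AS db1.tgt_txn WITH NO DATA;

-- 2. Point the job at db1.stg_txn (truncate mode), then copy in SQL.
--    An INSERT-SELECT preserves duplicate rows into a MULTISET target.
INSERT INTO db1.tgt_txn
SELECT * FROM db1.stg_txn;
```

Note that if the job still loads the staging table via FastLoad, the utility's own duplicate-row elimination may apply before this copy ever runs.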
toshea
Participant
Posts: 79
Joined: Thu Aug 14, 2008 6:46 pm

Post by toshea »

I don't think this is an IBM problem. It is normal FastLoad behavior to silently drop full-row duplicates, regardless of whether the target is a SET or MULTISET table. You would see the same behavior if you used the external FastLoad utility. Try creating a temporary MULTISET table as suggested above, and you will probably see that duplicate rows are still dropped by the FastLoad interface that Teradata Enterprise uses.
nishadkapadia
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

Currently, DataStage provides no options to change any settings on the orch_xxx work tables. An alternative is to write the data to a sequential file and then invoke an external MultiLoad script to insert into the Teradata table; this can be faster even after accounting for the I/O of writing a sequential file for large data volumes.

As mentioned earlier, the FastLoad utility's behaviour is to drop full-row duplicates without capturing them in an error table.
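A minimal sketch of such an external MultiLoad script, assuming a FastLoad-format data file and hypothetical object names (a production script would also need error-table and checkpoint handling):

```
.LOGTABLE db1.ml_txn_log;
.LOGON tdpid/user,password;
.BEGIN IMPORT MLOAD TABLES db1.tgt_txn;
.LAYOUT txn_layout;
  .FIELD acct_id * INTEGER;
  .FIELD txn_amt * DECIMAL(18,2);
.DML LABEL ins_txn;
  INSERT INTO db1.tgt_txn VALUES (:acct_id, :txn_amt);
.IMPORT INFILE txn.dat LAYOUT txn_layout APPLY ins_txn;
.END MLOAD;
.LOGOFF;
```

Unlike FastLoad, MultiLoad inserts into a populated MULTISET table without eliminating duplicate rows.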
Post Reply