Teradata multiset tables and duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Teradata multiset tables and duplicates

Post by bcarlson »

I have 2 purposes for this posting. One is to let people know about an issue we are seeing in case they are encountering the same issue. The second is to glean whatever information is available about this from others that have already worked their way through it.

We have discovered a bug/flaw in how DataStage loads duplicates into a target multiset table. I am wondering if anyone else has noticed this and if so, if you figured out a way around it or if there are any patches available that could fix this. And yes, we are currently working with IBM... but they are pushing to upgrade to v8 and we don't have time, $ or bandwidth to do this yet.

Here is the problem: The source data has duplicates that need to be loaded into the target table. The table is defined as multiset to allow the duplicates to be loaded (tables designated as SET will not allow dups, MULTISET will). What we have seen is that the duplicates are dropped automatically as the target table is loaded, with no notification of the dropped records.

According to IBM and Teradata the issue is a combination of things. Here's how it all works:
  * DataStage uses FastLoad to write data to Teradata tables.
  * FastLoad requires the target table to be empty.
  * To ensure an empty table, DataStage creates a "ghost" table that is a DDL copy of the actual target.
  * The data is loaded to the ghost table, then copied (via SQL) to the target table.

The problem is that the ghost table is created as a SET table, not a MULTISET table, no matter what the target is. As a result, the duplicates are dropped as they are loaded to the ghost table.
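The SET/MULTISET distinction can be sketched in DDL. The names below are made up for illustration (the actual generated ghost-table name will differ):

```sql
-- Hypothetical target table: MULTISET, so full-row duplicates are allowed.
CREATE MULTISET TABLE db1.tgt_txn (
    acct_id   INTEGER,
    txn_amt   DECIMAL(18,2)
) PRIMARY INDEX (acct_id);

-- The ghost/work table behaves as if it were created like this:
CREATE SET TABLE db1.orch_work_txn (
    acct_id   INTEGER,
    txn_amt   DECIMAL(18,2)
) PRIMARY INDEX (acct_id);
-- A SET table silently discards full-row duplicates on insert,
-- so duplicate rows never survive to be copied into db1.tgt_txn.
```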

Does anyone know of a way to force the ghost table to be created correctly (SET if target is SET, MULTISET if target is MULTISET)?

For now, we are just adding a record counter/surrogate key to keep the records unique. What other workarounds have people used? I'd rather not have to add a "useless" key that has to be dropped later on, especially when the volumes are high - one column can add up to a lot of unnecessary wasted space.
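A sketch of that workaround's effect (table and column names are hypothetical; in the job the counter value itself would come from something like a surrogate key generator or row counter):

```sql
-- Illustrative only: a load-sequence column makes otherwise identical
-- rows distinct, so duplicate-row elimination (whether by a SET ghost
-- table or by the load utility) no longer discards them.
CREATE MULTISET TABLE db1.tgt_txn_seq (
    load_seq  BIGINT,           -- populated with a per-row counter in the job
    acct_id   INTEGER,
    txn_amt   DECIMAL(18,2)
) PRIMARY INDEX (acct_id);
-- The load_seq column carries no business meaning and has to be
-- dropped or ignored downstream - the space cost Brad mentions.
```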

Thanks!

Brad.
It is not that I am addicted to coffee, it's just that I need it to survive.
sas
Charter Member
Posts: 13
Joined: Sun Mar 05, 2006 12:37 am

Re: Teradata multiset tables and duplicates

Post by sas »

Just try using Teradata Multiload stage. It may work for you.


bcarlson wrote: We have discovered a bug/flaw in how DataStage loads duplicates into a target multiset table. [...] Does anyone know of a way to force the ghost table to be created correctly (SET if target is SET, MULTISET if target is MULTISET)?
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

We are intentionally not using the Multiload stage.

The MultiLoad stage is limited to 256 fields, and we also had some performance issues with it. Our rule has been to use the Enterprise stage for whatever database is our target, because the Enterprise stage tends to get better support from Ascential/IBM and the database vendors, tends to perform better, and is guaranteed to be a parallel stage within DataStage.

Brad.
It is not that I am addicted to coffee, it's just that I need it to survive.
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

Create another multiset table with the same structure as the target table, load that table first in truncate mode, and then copy the data from it to the target table (the same thing DataStage does in its last step, as you mentioned in your post: the data is loaded to the ghost table, then copied via SQL to the target table).
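A sketch of this staging approach (db1.stg_txn and db1.tgt_txn are hypothetical names):

```sql
-- 1. Pre-create a MULTISET staging table matching the target's structure.
CREATE MULTISET TABLE db1.stg_txn AS db1.tgt_txn WITH NO DATA;

-- 2. Point the job at db1.stg_txn (truncate mode), then copy in SQL.
--    An INSERT-SELECT preserves duplicate rows into a MULTISET target.
INSERT INTO db1.tgt_txn
SELECT * FROM db1.stg_txn;
```

Note that if the job still loads the staging table via FastLoad, the utility's own duplicate-row elimination may apply before this copy ever runs.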
toshea
Participant
Posts: 79
Joined: Thu Aug 14, 2008 6:46 pm

Post by toshea »

I don't think this is an IBM problem. It is normal FastLoad behavior to silently drop full-row duplicates, regardless of whether the target is a SET or MULTISET table. You would see the same behavior if you used the external FastLoad utility. Try creating a temporary MULTISET table as suggested above, and you will probably see that duplicate rows are still dropped by the FastLoad interface that Teradata Enterprise uses.
nishadkapadia
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

Currently, DataStage provides no options to change any settings on the orch_xxx work tables. An alternative is to write the data to a sequential file and then invoke an external MultiLoad script to insert into the Teradata table; this can be faster even after accounting for the I/O of writing a sequential file for large data volumes.

As mentioned earlier, the FastLoad utility's behaviour is to drop full-row duplicates without capturing them in an error table.
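A minimal sketch of such an external MultiLoad script, assuming a FastLoad-format data file and hypothetical object names (a production script would also need error-table and checkpoint handling):

```
.LOGTABLE db1.ml_txn_log;
.LOGON tdpid/user,password;
.BEGIN IMPORT MLOAD TABLES db1.tgt_txn;
.LAYOUT txn_layout;
  .FIELD acct_id * INTEGER;
  .FIELD txn_amt * DECIMAL(18,2);
.DML LABEL ins_txn;
  INSERT INTO db1.tgt_txn VALUES (:acct_id, :txn_amt);
.IMPORT INFILE txn.dat LAYOUT txn_layout APPLY ins_txn;
.END MLOAD;
.LOGOFF;
```

Unlike FastLoad, MultiLoad inserts into a populated MULTISET table without eliminating duplicate rows.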
Post Reply