Reading Table through DB2 Enterprise Stage

opdas · Post by **opdas** » Mon Apr 03, 2006 2:43 am

Hi,
I want to use DB2 Enterprise Stage to read data from a specified table and use Partiton Table Property so that the execution is parallel.
Can anyone tell me what shall I specify for the property?

kumar_s · Post by **kumar_s** » Mon Apr 03, 2006 6:59 am

You mean Preserve partition flag or the type of partition?

ray.wurlod · Post by **ray.wurlod** » Mon Apr 03, 2006 4:02 pm

The Partition Table property allows you to specify a table that provides the partitioning algorithm. If you are selecting from a partitioned table, you can use the same table name as the Partition Table property as the table name from which you are selecting.

You can always click on Help to learn more about individual properties; they've gone to a great deal of work to provide this on-line help.

opdas · Post by **opdas** » Mon Apr 03, 2006 10:16 pm

I'm using the same table name as Partition Table and it gives me multiple records for that.In my case 8 ties the data(we are working on 8 nodes)

ray.wurlod wrote:The Partition Table property allows you to specify a table that provides the partitioning algorithm. If you are selecting from a partitioned table, you can use the same table name as the Partition Ta ...

kumar_s · Post by **kumar_s** » Mon Apr 03, 2006 10:30 pm

If you specify a table name, the partition methond is derived from the given table. By default if it is not mentiond it will follow sequential mode.

ray.wurlod · Post by **ray.wurlod** » Tue Apr 04, 2006 1:42 am

Looks like you're selecting from a non-partitioned table using a partitioning algorithm, hence you're getting all records on each node - effectively Entire partitioning.

kumar_s · Post by **kumar_s** » Tue Apr 04, 2006 3:38 am

I guess selecting the same table for Partitioning table forces to Entire partition.

bcarlson · Post by **bcarlson** » Wed Apr 05, 2006 4:43 pm

When we pull from a partitioned table, we use the nodenumber function and current node option in DB2:

Query

Code: Select all

select customer_id, cust_name
from mydb.customer_tb
where nodenumber(customer_id) = current node

Partitioning Table option: mydb.customer_tb

The partitioning table option tells DataStage to retrieve the data in parallel, but if you do not include the 'nodenumber() = current node' clause in your where clause, you will (as you have seen) pull 100% of the data over each partition. That is, if you have 100 rows and 10 partitions, your output will end up being num_partitions * num_rows = 10,000 rows.

I think the field used in the nodenumber() function can be any field, but we usually use the partitioning key field.

If you run the query with no nodenumber clause and no partitioning table specified your output will accurate, but it will take longer to retrieve the data. DB2 will run your query in parallel within the database, but will pool all the data into one stream to return to DataStage.

The other option to try is using the Table query instead. You can specify the select list and a where clause. When executed, DataStage automatically creates a query just like my example above with the nodenumber() = current node and it will also automatically use a partitioning table to achieve a parallel read from the database.

By the way, if you are running user-defined SQL with a join, you can still use the nodenumber() clause. We usually pick the table and partitioning key that'drives' the query. For example, if you are retrieving customer addresses, you may have a query like the following:

Code: Select all

select 
    c.customer_id, 
    c.cust_name, 
    a.street_addr,
    a.city_nm,
    a.state_cd,
    a.zip_9_cd
from mydb.customer_tb c,
    mydb.cust_address_tb a
where c.customer_id = a.customer_id
and nodenumber(c.customer_id) = current node

Partitioning table option: mydb.customer_tb

In this example, The customer table is driving the query, so we use that to partition the data retrieval.

Hope this helps!

Brad.

opdas · Post by **opdas** » Thu Apr 13, 2006 1:12 am

Brad,
That was great..,it helped me fetching data 4 times as fast.
I would like to add here that you also can use :

Code: Select all

where dbpartitionnum(partition_column_name) = current dbpartitionnum for fetch only

in place of

Code: Select all

where nodenumber(partition_column_name) = current node

it works better with DB2v8

opdas · Post by **opdas** » Thu Apr 13, 2006 2:55 am

Hi All,

How to achieve the same thing in Oracle Enterprise Stage??

skcwong · Post by **skcwong** » Mon Jun 26, 2006 5:09 pm

bcarlson wrote:When we pull from a partitioned table, we use the nodenumber function and current node option in DB2:

Query
Code: Select all
select customer_id, cust_name
from mydb.customer_tb
where nodenumber(customer_id) = current node
Code: Select all
select 
    c.customer_id, 
    c.cust_name, 
    a.street_addr,
    a.city_nm,
    a.state_cd,
    a.zip_9_cd
from mydb.customer_tb c,
    mydb.cust_address_tb a
where c.customer_id = a.customer_id
and nodenumber(c.customer_id) = current node
Brad.

I wonder why the DS documents do not mention this? (ie, using nodenumber or dbpartitionnum)

chulett · Post by **chulett** » Mon Jun 26, 2006 5:50 pm

Perhaps because it's not a DataStage feature, but rather a database feature?

miwinter · Post by **miwinter** » Wed Jan 17, 2007 8:37 am

A couple of queries on this topic
=====================

What if a table is partitioned on more than one column....?

You don't seem to be able to issue:

where dbpartitionnum(partition_column_name1,partition_column_name2) = current dbpartitionnum

*************

And also..... what if you have a SQL query joining two or more tables in the DB2EE stage? Is it still possible to use the above in this instance, in some way?

Cheers guys

bcarlson · Post by **bcarlson** » Wed Jan 17, 2007 9:12 am

Not sure about the dbpartitionnum question. Will have to look into that.

As for the join, we do that all the time - but the tables joined in the SQL are partitioned the same. We just pick one table to use when specifying the partitioning out the output.

If the partitioning is different, we usually have 2 SQL reads and join within DataStage where we can repartition at will.

Brad.