Reading Table through DB2 Enterprise Stage
Moderators: chulett, rschirm, roy
Reading Table through DB2 Enterprise Stage
Hi,
I want to use DB2 Enterprise Stage to read data from a specified table and use Partiton Table Property so that the execution is parallel.
Can anyone tell me what shall I specify for the property?
I want to use DB2 Enterprise Stage to read data from a specified table and use Partiton Table Property so that the execution is parallel.
Can anyone tell me what shall I specify for the property?
Om Prakash
"There are things that are known, and there are things that are unknown, and in between there are doors"
"There are things that are known, and there are things that are unknown, and in between there are doors"
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The Partition Table property allows you to specify a table that provides the partitioning algorithm. If you are selecting from a partitioned table, you can use the same table name as the Partition Table property as the table name from which you are selecting.
You can always click on Help to learn more about individual properties; they've gone to a great deal of work to provide this on-line help.
You can always click on Help to learn more about individual properties; they've gone to a great deal of work to provide this on-line help.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I'm using the same table name as Partition Table and it gives me multiple records for that.In my case 8 ties the data(we are working on 8 nodes)
ray.wurlod wrote:The Partition Table property allows you to specify a table that provides the partitioning algorithm. If you are selecting from a partitioned table, you can use the same table name as the Partition Ta ...
Om Prakash
"There are things that are known, and there are things that are unknown, and in between there are doors"
"There are things that are known, and there are things that are unknown, and in between there are doors"
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Looks like you're selecting from a non-partitioned table using a partitioning algorithm, hence you're getting all records on each node - effectively Entire partitioning.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
When we pull from a partitioned table, we use the nodenumber function and current node option in DB2:
Query
Partitioning Table option: mydb.customer_tb
The partitioning table option tells DataStage to retrieve the data in parallel, but if you do not include the 'nodenumber() = current node' clause in your where clause, you will (as you have seen) pull 100% of the data over each partition. That is, if you have 100 rows and 10 partitions, your output will end up being num_partitions * num_rows = 10,000 rows.
I think the field used in the nodenumber() function can be any field, but we usually use the partitioning key field.
If you run the query with no nodenumber clause and no partitioning table specified your output will accurate, but it will take longer to retrieve the data. DB2 will run your query in parallel within the database, but will pool all the data into one stream to return to DataStage.
The other option to try is using the Table query instead. You can specify the select list and a where clause. When executed, DataStage automatically creates a query just like my example above with the nodenumber() = current node and it will also automatically use a partitioning table to achieve a parallel read from the database.
By the way, if you are running user-defined SQL with a join, you can still use the nodenumber() clause. We usually pick the table and partitioning key that'drives' the query. For example, if you are retrieving customer addresses, you may have a query like the following:
Partitioning table option: mydb.customer_tb
In this example, The customer table is driving the query, so we use that to partition the data retrieval.
Hope this helps!
Brad.
Query
Code: Select all
select customer_id, cust_name
from mydb.customer_tb
where nodenumber(customer_id) = current node
The partitioning table option tells DataStage to retrieve the data in parallel, but if you do not include the 'nodenumber() = current node' clause in your where clause, you will (as you have seen) pull 100% of the data over each partition. That is, if you have 100 rows and 10 partitions, your output will end up being num_partitions * num_rows = 10,000 rows.
I think the field used in the nodenumber() function can be any field, but we usually use the partitioning key field.
If you run the query with no nodenumber clause and no partitioning table specified your output will accurate, but it will take longer to retrieve the data. DB2 will run your query in parallel within the database, but will pool all the data into one stream to return to DataStage.
The other option to try is using the Table query instead. You can specify the select list and a where clause. When executed, DataStage automatically creates a query just like my example above with the nodenumber() = current node and it will also automatically use a partitioning table to achieve a parallel read from the database.
By the way, if you are running user-defined SQL with a join, you can still use the nodenumber() clause. We usually pick the table and partitioning key that'drives' the query. For example, if you are retrieving customer addresses, you may have a query like the following:
Code: Select all
select
c.customer_id,
c.cust_name,
a.street_addr,
a.city_nm,
a.state_cd,
a.zip_9_cd
from mydb.customer_tb c,
mydb.cust_address_tb a
where c.customer_id = a.customer_id
and nodenumber(c.customer_id) = current node
In this example, The customer table is driving the query, so we use that to partition the data retrieval.
Hope this helps!
Brad.
Brad,
That was great..,it helped me fetching data 4 times as fast.
I would like to add here that you also can use :
in place of
it works better with DB2v8
That was great..,it helped me fetching data 4 times as fast.
I would like to add here that you also can use :
Code: Select all
where dbpartitionnum(partition_column_name) = current dbpartitionnum for fetch only
Code: Select all
where nodenumber(partition_column_name) = current node
I wonder why the DS documents do not mention this? (ie, using nodenumber or dbpartitionnum)bcarlson wrote:When we pull from a partitioned table, we use the nodenumber function and current node option in DB2:
QueryCode: Select all
select customer_id, cust_name from mydb.customer_tb where nodenumber(customer_id) = current node
Code: Select all
select c.customer_id, c.cust_name, a.street_addr, a.city_nm, a.state_cd, a.zip_9_cd from mydb.customer_tb c, mydb.cust_address_tb a where c.customer_id = a.customer_id and nodenumber(c.customer_id) = current node
Brad.
A couple of queries on this topic
=====================
What if a table is partitioned on more than one column....?
You don't seem to be able to issue:
where dbpartitionnum(partition_column_name1,partition_column_name2) = current dbpartitionnum
*************
And also..... what if you have a SQL query joining two or more tables in the DB2EE stage? Is it still possible to use the above in this instance, in some way?
Cheers guys
=====================
What if a table is partitioned on more than one column....?
You don't seem to be able to issue:
where dbpartitionnum(partition_column_name1,partition_column_name2) = current dbpartitionnum
![Question :?:](./images/smilies/icon_question.gif)
*************
And also..... what if you have a SQL query joining two or more tables in the DB2EE stage? Is it still possible to use the above in this instance, in some way?
Cheers guys
Not sure about the dbpartitionnum question. Will have to look into that.
As for the join, we do that all the time - but the tables joined in the SQL are partitioned the same. We just pick one table to use when specifying the partitioning out the output.
If the partitioning is different, we usually have 2 SQL reads and join within DataStage where we can repartition at will.
Brad.
As for the join, we do that all the time - but the tables joined in the SQL are partitioned the same. We just pick one table to use when specifying the partitioning out the output.
If the partitioning is different, we usually have 2 SQL reads and join within DataStage where we can repartition at will.
Brad.