Help in Job design

Post questions here relating to DataStage Enterprise/PX Edition in areas such as parallel job design, parallel data sets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Help in Job design

Post by pandeesh »

Hi,

I have 2 datasets.
The source dataset contains around 22,215,332 records.
The reference dataset contains 21,532 records.
I am using the Lookup stage and loading into a Teradata table using the Teradata Connector stage.

Is this the correct design? It currently takes more than 45 minutes to complete.

Is there any way to improve this?

Will a Join be faster than a Lookup in this case?

For loading large volumes, would the Teradata Enterprise stage be a better choice?

Please help me.

Thanks
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Lookup stage seems reasonable if the smaller Data Set is the reference data and its rows are not exceedingly wide; that is, if the reference Data Set can be loaded into memory.
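As a rough sizing sketch (the per-row width here is an assumption, not a figure from this thread):

Code:

21,532 reference rows x ~200 bytes/row ≈ 4.3 MB

Even at a few KB per row, the reference data would only occupy tens of MB, so an in-memory lookup should be comfortable.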

Temporarily replace the target stage with a Peek stage to isolate which part of the job (the lookup or the Teradata load) is the slow part.
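To get per-operator numbers without redesigning the job, the standard PX reporting environment variables can help; as a sketch (set them as job parameters or at the project level in Administrator):

Code:

APT_PM_PLAYER_TIMING=True   # report CPU time per player process in the job log
APT_RECORD_COUNTS=True      # report record counts per operator at end of run

With these set, a run against the Peek stage should show whether the time goes into the Data Set read, the Lookup, or (once restored) the Teradata load.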
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

I believe reading from the source data set itself takes a long time.
Whatever data comes out of the lookup is inserted immediately.
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't "believe". Check. Test.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Yes, Ray. Whatever record passes the Lookup stage gets loaded to the Teradata table.
The problem is reading records from the dataset.
For 22 million records in the source, how long should it take?
What's a reasonable time?
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What is the source? If it's a Data Set and there's no repartitioning, it should be very fast. On the other hand if it's a database view that's based on a correlated subquery with joins to huge tables, then it's going to be very slow.
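If the source is indeed a Data Set, you can also inspect it from the command line with orchadmin; as a sketch (the path is a placeholder and the exact flags vary by version):

Code:

# show the schema and per-partition layout of the descriptor file
orchadmin describe -p /data/source.ds

Badly skewed partitions would mean one node does most of the reading, which can make a Data Set read look slow.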
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

The source is a data set.
How can I make sure whether there is any repartitioning or not?
In the source Data Set stage, partitioning is set to Auto.
pandeeswaran
srinivas.g
Participant
Posts: 251
Joined: Mon Jun 09, 2008 5:52 am

Post by srinivas.g »

What partitioning did you use when generating the source dataset?
Use that same partition type in your current job instead of Auto.
Srinu Gadipudi
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

That data set was generated in a previous job.
In the previous job, 2 data sets are joined using hash partitioning and the result is written to this data set. On the target Data Set stage I used Auto partitioning only,
but on the preceding Join stage I used hash partitioning.
Since I used Auto on the target Data Set stage, here also I am using Auto on the source dataset.
Should I use hash or Auto partitioning in the current job?

Thanks
pandeeswaran
suse_dk
Participant
Posts: 93
Joined: Thu Aug 11, 2011 6:18 am
Location: Denmark

Post by suse_dk »

If you want to know whether your job is repartitioning, you can dump the score into the log, or check the row counts on each partition in the monitor.
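The usual switch for the score dump is $APT_DUMP_SCORE; the resulting log entry lists every operator and the partitioner on every link. A roughly-shaped excerpt (not from this job; cust_id is a placeholder key):

Code:

APT_DUMP_SCORE=True

main_program: This step has 2 datasets:
ds0: {... eOther(APT_HashPartitioner { key={ value=cust_id } })#>eCollectAny ...}

A named partitioner such as APT_HashPartitioner on a link is a repartition; "eSame" means records flow straight through on their existing partitions.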

The key you used for the hash partitioning in the previous job: is that the same key you need to use for your join in the current job?
If yes, then you can most likely avoid both a repartition and a sort in the current job by explicitly defining this in the job, as sketched below.
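A sketch of the stage settings that usually achieves this (option names as they appear in the PX GUI, from memory):

Code:

Previous job, target Data Set:  Partitioning = Hash on the join/lookup key
Current job, input link:        Partitioning = Same  (keep existing partitions)
                                Sort key mode = "Don't sort (previously sorted)"

"Same" stops the engine from repartitioning, and marking the key as previously sorted avoids a redundant tsort being inserted into the score.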
- Susanne
Arun Reddy
Participant
Posts: 5
Joined: Wed Nov 02, 2011 9:08 pm
Location: Hyderabad

Post by Arun Reddy »

Hi pandeesh,

U said in previous job u used hash partition..after that for target dataset u use same partion it wil give best performance then auto ..and continue that dataset as a source dataset in current job u said .. i think it will work..
Arun
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If repartitioning is occurring, the job design will show a "bow tie" link marker icon on the link.
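For reference, the other link marker icons in Designer are (from memory, so treat as approximate):

Code:

fan-out         sequential-to-parallel (data being partitioned)
fan-in          parallel-to-sequential (data being collected)
bow tie         parallel-to-parallel repartition
parallel lines  Same partitioning (no redistribution)

Link marking can be toggled from the Designer toolbar.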
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Arun Reddy wrote: Hi pandeesh,

U said in previous job u used hash partition..after that for target dataset u use same partion it wil give best performance then auto ..and continue that dataset as a source dataset in current job u said .. i think it will work..
U is one of our posters. U has had no involvement in this thread.

The second-person personal pronoun in English is spelled "you".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

This is my first job:

Code:


Dataset1 ----\
              join ------> target data set
Dataset2 ----/

On Data Set 1 and Data Set 2, hash partitioning is used,
but on the target data set I used Auto partitioning.
In the second job I use this target data set as the source;
there also I used Auto partitioning.
Set up this way, it works fine.
When I changed the partitioning on the target data set in the first job to hash, the first job still worked fine.
In the second job I also used hash partitioning on the source data set,
but the second job failed due to lookup failures.
Second job design:

Code:



Source data set --------> Lookup stage ------> Teradata Connector
                               /
                              /
        Reference data set
Any thoughts on this?
pandeeswaran