Help in Job design

Post questions here relating to DataStage Enterprise/PX Edition in areas such as parallel job design, parallel data sets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Help in Job design

Post by pandeesh »

Hi,

I have 2 datasets.
The source dataset contains around 22,215,332 records.
The reference dataset contains 21,532 records.
I am using the Lookup stage and loading into a Teradata table using the Teradata Connector stage.

Is this the correct design? It currently takes more than 45 minutes to complete.

Is there any way to improve this?

Will a Join be faster than a Lookup in this case?

For loading large volumes, would the Teradata Enterprise stage be a better choice?

Please help me.

Thanks
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Lookup stage seems reasonable if the smaller Data Set is the reference data and its rows are not exceedingly wide; that is, if the reference Data Set can be loaded into memory.
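As a rough sizing sketch (the per-row width here is an assumption, not a figure from this thread):

Code:

21,532 reference rows x ~200 bytes/row ≈ 4.3 MB

Even at a few KB per row, the reference data would only occupy tens of MB, so an in-memory lookup should be comfortable.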

Temporarily replace the target stage with a Peek stage to isolate which part of the job (the lookup or the Teradata load) is the slow part.
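To get per-operator numbers without redesigning the job, the standard PX reporting environment variables can help; as a sketch (set them as job parameters or at the project level in Administrator):

Code:

APT_PM_PLAYER_TIMING=True   # report CPU time per player process in the job log
APT_RECORD_COUNTS=True      # report record counts per operator at end of run

With these set, a run against the Peek stage should show whether the time goes into the Data Set read, the Lookup, or (once restored) the Teradata load.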
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

I believe reading from the source data set itself takes a long time.
Whatever data comes out of the lookup is inserted immediately.
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Don't "believe". Check. Test.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Yes, Ray. Whatever record passes the Lookup stage gets loaded to the Teradata table.
The problem is reading records from the dataset.
For 22 million records in the source, how long should it take?
What's a reasonable time?
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What is the source? If it's a Data Set and there's no repartitioning, it should be very fast. On the other hand if it's a database view that's based on a correlated subquery with joins to huge tables, then it's going to be very slow.
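If the source is indeed a Data Set, you can also inspect it from the command line with orchadmin; as a sketch (the path is a placeholder and the exact flags vary by version):

Code:

# show the schema and per-partition layout of the descriptor file
orchadmin describe -p /data/source.ds

Badly skewed partitions would mean one node does most of the reading, which can make a Data Set read look slow.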
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

The source is a data set.
How can I make sure whether there is any repartitioning or not?
In the source Data Set stage, partitioning is set to Auto.
pandeeswaran
srinivas.g
Participant
Posts: 251
Joined: Mon Jun 09, 2008 5:52 am

Post by srinivas.g »

What partitioning did you use when generating the source dataset?
Use that same partition type in your current job instead of Auto.
Srinu Gadipudi
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

That data set was generated in a previous job.
In the previous job, 2 data sets are joined using hash partitioning and the result is written to this data set. On the target Data Set stage I used Auto partitioning only,
but on the preceding Join stage I used hash partitioning.
Since I used Auto on the target Data Set stage, here also I am using Auto on the source dataset.
Should I use hash or Auto partitioning in the current job?

Thanks
pandeeswaran
suse_dk
Participant
Posts: 93
Joined: Thu Aug 11, 2011 6:18 am
Location: Denmark

Post by suse_dk »

If you want to know whether your job is repartitioning, you can dump the score into the log, or check the row counts on each partition in the monitor.
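The usual switch for the score dump is $APT_DUMP_SCORE; the resulting log entry lists every operator and the partitioner on every link. A roughly-shaped excerpt (not from this job; cust_id is a placeholder key):

Code:

APT_DUMP_SCORE=True

main_program: This step has 2 datasets:
ds0: {... eOther(APT_HashPartitioner { key={ value=cust_id } })#>eCollectAny ...}

A named partitioner such as APT_HashPartitioner on a link is a repartition; "eSame" means records flow straight through on their existing partitions.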

The key you used for the hash partitioning in the previous job: is that the same key you need to use for your join in the current job?
If yes, then you can most likely avoid both a repartition and a sort in the current job by explicitly defining this in the job, as sketched below.
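A sketch of the stage settings that usually achieves this (option names as they appear in the PX GUI, from memory):

Code:

Previous job, target Data Set:  Partitioning = Hash on the join/lookup key
Current job, input link:        Partitioning = Same  (keep existing partitions)
                                Sort key mode = "Don't sort (previously sorted)"

"Same" stops the engine from repartitioning, and marking the key as previously sorted avoids a redundant tsort being inserted into the score.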
- Susanne
Arun Reddy
Participant
Posts: 5
Joined: Wed Nov 02, 2011 9:08 pm
Location: Hyderabad

Post by Arun Reddy »

Hi pandeesh,

U said in previous job u used hash partition..after that for target dataset u use same partion it wil give best performance then auto ..and continue that dataset as a source dataset in current job u said .. i think it will work..
Arun
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If repartitioning is occurring, the job design will show a "bow tie" link marker icon on the link.
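For reference, the other link marker icons in Designer are (from memory, so treat as approximate):

Code:

fan-out         sequential-to-parallel (data being partitioned)
fan-in          parallel-to-sequential (data being collected)
bow tie         parallel-to-parallel repartition
parallel lines  Same partitioning (no redistribution)

Link marking can be toggled from the Designer toolbar.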
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Arun Reddy wrote: Hi pandeesh,

U said in previous job u used hash partition..after that for target dataset u use same partion it wil give best performance then auto ..and continue that dataset as a source dataset in current job u said .. i think it will work..
U is one of our posters. U has had no involvement in this thread.

The second-person personal pronoun in English is spelled "you".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

This is my first job:

Code:


Dataset1 ----\
              join ------> target data set
Dataset2 ----/

On Data Set 1 and Data Set 2, hash partitioning is used,
but on the target data set I used Auto partitioning.
In the second job I use this target data set as the source;
there also I used Auto partitioning.
Set up this way, it works fine.
When I changed the partitioning on the target data set in the first job to hash, the first job still worked fine.
In the second job I also used hash partitioning on the source data set,
but the second job failed due to lookup failures.
Second job design:

Code:



Source data set --------> Lookup stage ------> Teradata Connector
                               /
                              /
        Reference data set
Any thoughts on this?
pandeeswaran