Join stage takes a long time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Join stage takes a long time

Post by pdntsap »

I am designing parallel jobs to convert business logic from SAS to DataStage in order to improve the total run time. In one of the jobs we have two flat files as input, each with about 40,000 records. A Join stage (inner join) combines them and produces about 1,120,000 records. That join output is then inner joined with another flat file of 30,000 records. The second join produces more than 800 million records, and the job takes about seven hours to complete. A similar job completes in 15 minutes in SAS. The record counts are similar in SAS and DataStage, and each record has about 40 columns, with 6-8 columns used as join keys. Some of the key columns are nullable, which drives the output record count much higher than the input record counts.

I need to start understanding the business logic so that duplicates can be eliminated to improve the job performance. Any other suggestions to improve the job performance?

Thanks.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Several things to look at here:

1) First: What environment are you running DataStage in? Your initial description says "OS: Windows" but includes "Additional info: Server: AIX". The OS line is meant to be the OS your DataStage instance is running in.

2) Describe your job design. Are you properly partitioning and sorting your data prior to the joins? What partitioning options are you using?

3) What are your business rules concerning rows with null keys? Can you remove records with null keys before they go into a join and later merge them back into the main stream?
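A rough sketch of that split/join/funnel pattern in plain Python (not DataStage syntax; in a job this would typically be a Filter or Transformer splitting the stream, the Join, and a Funnel bringing the rows back together). The column names here are made up purely for illustration:

    # Hypothetical rows: dicts with a join key "acct" that may be None (null).
    left = [{"acct": "A1", "amt": 10}, {"acct": None, "amt": 20}]
    right = [{"acct": "A1", "region": "EAST"}, {"acct": None, "region": "WEST"}]

    # Split: rows with null join keys bypass the join entirely.
    left_ok = [r for r in left if r["acct"] is not None]
    left_null = [r for r in left if r["acct"] is None]

    # Inner join only the rows with usable keys.
    joined = [
        {**l, **r}
        for l in left_ok
        for r in right
        if r["acct"] is not None and l["acct"] == r["acct"]
    ]

    # Funnel: merge the null-key rows back into the main stream
    # (what to do with their missing columns is a business decision).
    output = joined + left_null
    print(output)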

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

1)
The OS line is meant to be the OS your DataStage instance is running in
The OS is AIX 5.3.

2) My job design:

A sequential file input stage feeds into a Transformer that filters on some conditions and then feeds into the first Join stage. A second sequential file stage serves as the right link of that join. The output of the first Join stage feeds into the second Join stage, and a third and final sequential file stage serves as the right link of the second join. The output of the second Join stage is written to a sequential file stage.
Are you properly partitioning and sorting your data prior to the joins?
I am not sorting the data before the joins, so I will add a Sort stage after the two sequential file stages and the transformer stage and sort the records on the join keys.
I may not be using the right partitioning options either. I set the Preserve Partitioning option to Clear in both Join stages. What would be the best partitioning option?

3)
What are your business rules concerning rows with null keys?
I am not sure about the business logic yet as all I have access to now is the SAS code.
Can you remove records with null keys before they go into a join and later merge them back into the main stream?
I can definitely try this option, but I am afraid it might produce different results. My understanding is that rows whose key values are null are ignored when the join is performed, so I would guess that removing them and then adding them back should produce the same output. Is that right?

Thanks.
sandeepgs
Participant
Posts: 87
Joined: Wed Jul 02, 2008 12:22 am

Re: Join stage takes a long time

Post by sandeepgs »

If I have understood your design correctly, it looks like what is shown below.

1st step:

File1 ----\
           Join1
File2 ----/

2nd step:

Join1 ----\
           Join2 ---- OutputFile
File3 ----/

Join Conditions:

1. The data should be sorted on the join key columns, and hash partitioning should be applied on those same keys.

2. If there are many duplicates on the join key columns, the join effectively becomes a CROSS join within each key group, which will degrade performance. So make sure there are no unintended duplicates on the join key columns.

3. The join key columns should not be nullable.

4. It is better to use a File Set or a Data Set to improve the performance of the job, as they allow the data to be read in parallel.

Check the above parameters and redesign the job.
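As a rough illustration of why point 1 matters (plain Python, not DataStage; the node count and key values are invented): with hash partitioning on the join keys, all rows sharing a key value land on the same node, so each node can sort and join its own partition independently.

    import zlib

    # Toy illustration of hash partitioning on a join key across 4 nodes.
    NUM_NODES = 4

    def node_for(key):
        # Stable hash of the key value, modulo the node count.
        return zlib.crc32(str(key).encode()) % NUM_NODES

    rows = [{"cust": "C1"}, {"cust": "C2"}, {"cust": "C1"}, {"cust": "C3"}]

    partitions = {n: [] for n in range(NUM_NODES)}
    for row in rows:
        partitions[node_for(row["cust"])].append(row)

    for node, part in partitions.items():
        print(node, part)  # both "C1" rows land in the same partition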

Thanks
Sandeepgs
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Yes Sandeep. Your understanding of my job design is right.
1. Data Should be sorted on the key columns that were joined and should be applied with hash partitioning.
I added sort stages after the sequential file input stages. The first join produces the same record count as before adding the sort stages, but the record count from the second join stage drops considerably compared to the count without the sort stages. Any thoughts on why this might be happening?

2.
So need to make sure that there are no duplicates on the key column that were joined.
The join key columns are not unique, and since I am not yet sure about the business logic, I am following the SAS code to make sure I get the same record counts and data in both SAS and DataStage.

3.
The joining key columns should be not nullable
Some key columns are nullable, and because I don't know the business logic I am not able to remove them.

Thanks.
sandeepgs
Participant
Posts: 87
Joined: Wed Jul 02, 2008 12:22 am

Post by sandeepgs »

Hi,

Ok that sounds good.

1. Use a link sort with hash partitioning instead of a separate Sort stage.

2. On the reduced volume of data: a sort will never reduce the volume of data, so recheck the options you have specified.

3. The join key columns are not unique, so this will certainly result in a cross join.

4. Nullable columns cannot be part of a join key, as the engine never knows what to join them to, and this will result in a full table scan for no reason.

5. Use a configuration file with four or more nodes to improve the processing of the data.


The job will run slowly because of the drawbacks mentioned in points 3 and 4.

So, as mentioned, the business should take a decision on that.

But for now, change the design as suggested in my first post, plus points 1 and 5 in this post, and re-execute the job.

Thanks,
Sandeep
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Use a link sort with hash partition instead of using a sort stage.
I removed the Sort stage and used hash partitioning and sorting in the Join stage itself. Now I get the right record count, but the time taken is similar to the time taken before the hash partition and sort. So I guess the only way to improve performance is to handle the null key columns and duplicate key columns.
Use a four node or more than that configuration file to improve the processing of data.
The configuration file is set up to make use of four nodes.
It is better to use File set or a data set to improve the performance of the job as they allow parallel read of data.
A Data Set is stored in a DataStage-specific format that is not human readable. I believe File Sets are human readable; is that right? I have not tried this option yet, but I believe the nulls and duplicates are the issue here.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

As you are not performing a join within a database, you don't have to worry about "full table scans" in your job. That is a database-specific function and is not applicable to a DataStage job.

Personally, I prefer a Sort stage rather than the link sort option. Both use the same sort operator and generate the same logic for the same options, but the stage is more configurable. If you were not getting the same results, something was probably not set correctly in how you originally implemented the Sort stage in your job.

Your SAS code defines your business logic; however, it sounds like you don't have a specifications document describing what the SAS program does.

You will need to determine how your SAS code handles the Null keys when performing joins and then replicate that within your DataStage job:
- There may be logic specifically looking for null values (whatever is meant by null: Empty String '' perhaps?)
- There may be no specific logic, in which case SAS itself has some sort of default method for handling rows with one or more null join keys. That means learning what SAS does in those cases and then replicating it in some manner within DataStage. Don't assume that DataStage's Join stages operate exactly the same as SAS's merge statement.
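One way to see why this matters for the row counts: the two common conventions, "null keys match other null keys" versus "null keys never match anything", give different outputs from the same data. A tiny Python sketch with made-up rows (it does not claim to show what either SAS or DataStage actually does):

    # Two possible conventions for joining on a nullable key (None = null).
    left = [{"k": None, "a": 1}, {"k": "X", "a": 2}]
    right = [{"k": None, "b": 3}, {"k": "X", "b": 4}]

    def join(nulls_match):
        out = []
        for l in left:
            for r in right:
                if l["k"] == r["k"] and (nulls_match or l["k"] is not None):
                    out.append({**l, **r})
        return out

    print(len(join(nulls_match=True)))   # 2 rows: null matches null, X matches X
    print(len(join(nulls_match=False)))  # 1 row: only X matches X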

If you have duplicates (multiple rows in a data stream with identical key values) in both join inputs you will get a product of those duplicates (think 2*2=4) on output. Duplicates could include your null values.
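A quick way to sanity-check an output size like the 800 million rows before running the job: an inner join produces left_count * right_count rows for each key value, so the totals grow very quickly when keys repeat. A minimal Python sketch with invented counts (substitute the real key frequencies from your two inputs):

    from collections import Counter

    # Invented key frequencies for the two join inputs.
    left_keys  = ["K1"] * 1000 + ["K2"] * 5
    right_keys = ["K1"] * 800  + ["K2"] * 3

    left_counts, right_counts = Counter(left_keys), Counter(right_keys)

    # Expected inner-join output size: sum over keys of left * right counts.
    expected = sum(left_counts[k] * right_counts[k] for k in left_counts)
    print(expected)  # 1000*800 + 5*3 = 800,015 rows from just two key values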

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ArunaDas_Maharana
Participant
Posts: 42
Joined: Thu Dec 11, 2008 11:07 am

Post by ArunaDas_Maharana »

My suggestion would be: do not touch the code without business knowledge. You might get the sort working temporarily, but in the long run something might break.

Take a step back, understand the business and the nullability, decide on the key columns, confirm with someone (preferably a functional analyst), and from there proceed to configure DataStage.
Thanks,
Aruna
sandeepgs
Participant
Posts: 87
Joined: Wed Jul 02, 2008 12:22 am

Post by sandeepgs »

Hi,

Yes, if you are concerned about the format in which the data is stored, then use a File Set.

As I mentioned in point 4 of my previous post, even if you use hash partitioning, a null value will never be assigned a hash value. So at the DataStage engine level the data cannot be located by its hash value, and this will result in a full scan of the records to find a match.

Jwiles,

I hope I have answered your concern.

So if your design doesn't show any improvement, the only way forward is to avoid the cross join and the null values; with all of the design standards applied, the job should perform better.