Sorting big volume data

tj_user
Premium Member
Posts: 22
Joined: Tue Jan 20, 2015 2:22 am

Sorting big volume data

Post by tj_user »

Hi,
I have a stream with, for example, 150 million rows and 150 columns.
My process is split across 4 jobs.
Every job ends by writing to a data set.
In every job I use sorts, because I need to join on a different key and I do many calculations.
The Sort stages kill my jobs.
Is it a good idea to end every job with a direct insert into a table that has a clustered index created on the key expected by the next job?
Which is more optimal, a data set or a table with a good clustered index?

Regards
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

Which is more optimal, a data set or a table with a good clustered index?
It depends. The processing capacity of DataStage vs. the database server, available memory, I/O capacity, latency, and network architecture are some of the factors that will affect performance.

Any gains from not sorting in DataStage may be negated by the time it takes to load the tables, build the indexes, etc.

Can you give some more details on the design of the jobs? What are the sort keys (and what is their format)? Are the four jobs using the same sort key? How big is the reference data? How many nodes are you running this operation on? How much RAM is available on the DataStage server? Are you doing any aggregations? Anything else that might be relevant?
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

See these guidelines for tuning sort operations. There are also quite a few discussions on this topic here on dsxchange.
tj_user
Premium Member
Posts: 22
Joined: Tue Jan 20, 2015 2:22 am

Post by tj_user »

Hi,
My project looks like this:
Input: 75 million rows, 150 columns
-> Sort by ID
-> Filter
-> AGG Max(AID) group by ID (16 million rows, 2 columns)
-> AGG2 Max(AID) group by ID (16 million rows, 2 columns)
-> ALL rows, sorted on another key, AID (75 million)
-> Join AGG with ALL by AID
-> Join AGG2 with ALL by AID
-> Join ALL (ALL is 18 million now) with another input by BID (4 million rows)
-> Join ALL with another input by CID (10 million)
-> Join ALL with another input by DID (1 million)
Every time I have a different key to join on, I need to sort on the input link or add a new Sort stage.
How can I optimize this process?
Best regards
Last edited by tj_user on Mon Jul 13, 2015 1:41 pm, edited 1 time in total.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Why do you need your data to be sorted on the same key all the time? Or are those keys different?
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
tj_user
Premium Member
Posts: 22
Joined: Tue Jan 20, 2015 2:22 am

Post by tj_user »

Hi,
These keys are different:
AID, BID, CID, DID and so on.
Regards
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

Input: 75 million rows, 150 columns
1 -> Sort by ID
2 -> Filter
3 -> AGG Max(AID) group by ID (16 million rows, 2 columns)
4 -> AGG2 Max(AID) group by ID (16 million rows, 2 columns)
5 -> ALL rows, sorted on another key, AID (75 million)
6 -> Join AGG with ALL by AID
7 -> Join AGG2 with ALL by AID
8 -> Join ALL (ALL is 18 million now) with another input by BID (4 million rows)
9 -> Join ALL with another input by CID (10 million)
10 -> Join ALL with another input by DID (1 million)
The first thing that jumps out at me: is it possible to swap steps 1 and 2, i.e. filter before sorting? If the input for #1 comes from a database, then perform the filtering in the database.

Typically, aggregations work faster in a database than in DataStage. If the source for #3 and #4 is a database, then perform the aggregations in the database.

I am assuming that the input for steps #3 and #4 is "16 million rows, 2 columns". What is the volume of output from these steps after aggregation? If it is "reasonable", then it may be possible to use a Lookup instead of a Join in steps #6 and #7, making the sorts in #1 and #5 unnecessary.
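
To illustrate the idea (a rough Python sketch only, not DataStage syntax; the field names are made up): a lookup keeps the small aggregated reference in memory and probes it row by row, so neither the big stream nor the reference needs to be sorted.

# Rough illustration only (Python, not DataStage). A "lookup" builds a small
# in-memory table from the aggregated reference and probes it per input row,
# so no sort of the 75 million-row stream is required.
def max_aid_by_id(rows):
    # AGG: Max(AID) group by ID, done in one pass with a hash table
    agg = {}
    for r in rows:
        if r["ID"] not in agg or r["AID"] > agg[r["ID"]]:
            agg[r["ID"]] = r["AID"]
    return agg

def lookup_join(big_stream, agg):
    # Probe the small hash table for every row of the big stream
    for r in big_stream:
        yield {**r, "MAX_AID": agg.get(r["ID"])}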

What is the width of the "other input" in #8 through #10?
------

On a side note, what is the data type of AID, BID, CID and DID? (Highly speculative) If it is "Decimal(n,0)", then validate that the data is actually getting distributed across the various nodes. See this post.
tj_user
Premium Member
Posts: 22
Joined: Tue Jan 20, 2015 2:22 am

Post by tj_user »

Hi,
The input is a dataset; it goes to the filter, which then feeds 3 streams:
ALL, AGG1 and AGG2.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Ok a few questions:

1. If your input is datasets, how are they populated? Is that data coming from a DB (ie can you easily sort your data prior to writing to the dataset?)
2. What is your filter actually doing? I assume it splits into different metrics of the AID? Else why are your steps 3 and 4 doing the same thing (Max(AID))?
3. What sort of joins are you doing to your other inputs? Inner, Outer?

EDIT:
4: And when you say it kills your job, what exactly is happening?
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

Also, can you tell us the following about your DataStage setup:
- Topology ... SMP/MPP/Grid.
- Memory ... installed vs. usually available. "Usually available" can be checked in the Operations Console (if installed).
- What is the width of the "other input" in #8 through #10 above, i.e. number of columns, data types, etc.? I feel that #10, and possibly #8, may be candidates for a 'lookup'.
tj_user
Premium Member
Posts: 22
Joined: Tue Jan 20, 2015 2:22 am

Post by tj_user »

Hi,
Answers to your questions:
1. The data is coming from another job, which ends in a dataset.
2. It is filtered by a date, and the conditions are different; steps 3 and 4 are different (my mistake, I have edited my post).
3. All joins are left outer joins.
4. At the beginning, that one sort takes a lot of time - that is what I mean by "kill" :)

Is it a good idea to keep in the MAIN stream only the keys that I will use for the joins in the next steps, make a copy of all columns and rows, and connect it back at the end using a surrogate key created earlier?
It should be easier to sort, for example, 10 columns by 1 key than 150 columns by 1 key.
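
Roughly what I mean (a quick Python sketch of the idea only, not DataStage; the field names are invented): sort and join only a narrow key stream plus a generated surrogate key, and reattach the wide 150-column payload by that key at the very end.

# Idea only (Python sketch, not DataStage; field names are invented).
# Do the heavy sorting/joining on a narrow key stream plus a surrogate key SK,
# keep the wide payload aside, and reattach it by SK at the end of the job.
wide_rows = [
    {"ID": 2, "AID": 30, "BID": 7, "more": "...the other ~146 columns..."},
    {"ID": 1, "AID": 10, "BID": 5, "more": "...the other ~146 columns..."},
]

keys = [{"SK": i, "ID": r["ID"], "AID": r["AID"], "BID": r["BID"]}
        for i, r in enumerate(wide_rows)]           # narrow stream
payload = {i: r for i, r in enumerate(wide_rows)}   # wide columns, untouched

keys.sort(key=lambda r: r["AID"])                   # sorting 4 columns, not 150
# ... joins/aggregations on the narrow stream would go here ...

final = [{**payload[k["SK"]], **k} for k in keys]   # reattach wide columns last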

ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

tj_user wrote: 1. The data is coming from another job, which ends in a dataset.
Is it possible to sort the data in the job that is creating that dataset, i.e. can you sort the data in the previous job?
tj_user wrote: 4. At the beginning, that one sort takes a lot of time - that is what I mean by "kill" :)
Are you sure that it is that first sort that is taking the time? Can you create a test job where you just sort that input dataset and write it to a Peek or Copy stage, just to get a metric on how long it takes?
tj_user wrote: Is it a good idea to keep in the MAIN stream only the keys that I will use for the joins in the next steps, make a copy of all columns and rows, and connect it back at the end using a surrogate key created earlier?
It should be easier to sort, for example, 10 columns by 1 key than 150 columns by 1 key.
Maybe - give it a try.
tj_user
Premium Member
Posts: 22
Joined: Tue Jan 20, 2015 2:22 am

Post by tj_user »

Hi,
I moved the sort after the filter and it decreased the running time of the job.
Now I am planning to reduce the columns in the joins to only the ones needed, and send the rest to the end of the job using a Copy stage.
I am also going to merge these two jobs into one job, to avoid writing to a dataset and then reading from it again.
I think it can help cut the execution time.
Regards
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

tj_user wrote: I moved the sort after the filter and it decreased the running time of the job.
Probably - most likely because you are sorting less data.
tj_user wrote: Now I am planning to reduce the columns in the joins to only the ones needed, and send the rest to the end of the job using a Copy stage.
I am also going to merge these two jobs into one job, to avoid writing to a dataset and then reading from it again.
I think it can help cut the execution time.
OK, here is my opinion on these matters.

1. Splitting the data and decreasing the columns in the sort will probably help, but you will need to perform a further join downstream to put the data back together, so you are also increasing the resources required by the job.

2. By merging your jobs together you are more than likely going to consume more resources, as the combined job will still need to perform the same sorts as before, plus it will now also include whatever the other job requires.

3. Generally speaking, the ideal approach is to minimise the number of sorts and re-partitions of data that your job performs. Without knowing what your data is, how the different keys interact, and what else the job is doing, it's hard to provide any meaningful solution. Chances are that with all your aggregations and joins there have been many sorts inserted, either by yourself or by DataStage, and all of these use up resources. All I can suggest is to sort/partition as early as possible and, if you can (this will depend on your data requirements), not re-sort/re-partition the data downstream (you may need to set the NO SORT INSERT environment variable). This will generally help greatly with your performance, though again, each job is different. A rough illustration of why preserving one sort order pays off is in the sketch below.
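
As a loose analogy (Python only, not DataStage; a one-to-one inner join purely for illustration): once both streams are already sorted on the join key, a merge join is a single sequential pass and needs no further sort, which is why keeping one key order flowing through the job is cheaper than re-sorting at every stage.

# Loose analogy in Python (not DataStage): a merge join over two streams that
# are ALREADY sorted on the join key is one sequential pass - no extra sort is
# needed as long as the key order is preserved upstream. One-to-one inner join
# only, purely for illustration.
def merge_join(left, right, key):
    right = iter(right)
    r = next(right, None)
    for l in left:
        while r is not None and r[key] < l[key]:
            r = next(right, None)
        if r is not None and r[key] == l[key]:
            yield {**l, **r}

left  = [{"AID": 1, "x": "a"}, {"AID": 2, "x": "b"}, {"AID": 4, "x": "c"}]
right = [{"AID": 2, "y": 10}, {"AID": 3, "y": 20}, {"AID": 4, "y": 30}]
print(list(merge_join(left, right, "AID")))   # rows for AID 2 and 4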
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

I second the vote for Shane's #1 answer. Split the data up front, then join at the end.

Test and see if it saves any time.