
join stage for huge data

Posted: Mon Jun 29, 2015 12:32 am
by wuruima
Our business requirement is to join 2 files on 2 keys, key_a and key_b;
each file contains more than 10,000,000 records.

Is there any good way to tune the performance?

Posted: Mon Jun 29, 2015 1:31 am
by priyadarshikunal
Hash partition, sort, and join. Do you have any particular question regarding the Join stage? 10 million rows is not huge, IMO.
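The "hash partition, sort and join" pattern can be sketched in Python. This is purely illustrative, not DataStage code: the key names key_a/key_b come from the original post, and all function names here are hypothetical. The point is that matching rows must land in the same partition (hash on the full composite key) before each partition is sorted and merge-joined independently.

```python
# Illustrative sketch (NOT DataStage code) of a parallel sort-merge join:
# hash-partition both inputs on the composite key so matching rows land in
# the same partition, sort each partition, then merge-join partition by
# partition. key_a/key_b are from the post; the rest is hypothetical.
from collections import defaultdict

KEY = lambda row: (row["key_a"], row["key_b"])

def hash_partition(rows, n_parts):
    """Route each row to a partition by the hash of its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(KEY(row)) % n_parts].append(row)
    return parts

def merge_join(left, right):
    """Inner join of two lists that are already sorted on KEY."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = KEY(left[i]), KEY(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # find the run of equal keys on each side, emit the cross product
            i2, j2 = i, j
            while i2 < len(left) and KEY(left[i2]) == kl:
                i2 += 1
            while j2 < len(right) and KEY(right[j2]) == kl:
                j2 += 1
            out.extend({**l, **r} for l in left[i:i2] for r in right[j:j2])
            i, j = i2, j2
    return out

def parallel_join(left_rows, right_rows, n_parts=4):
    lp = hash_partition(left_rows, n_parts)
    rp = hash_partition(right_rows, n_parts)
    result = []
    for p in range(n_parts):  # each partition is sorted and joined on its own
        result.extend(merge_join(sorted(lp[p], key=KEY),
                                 sorted(rp[p], key=KEY)))
    return result
```

Because both links are hash-partitioned on the same full (key_a, key_b) pair, each partition can be sorted and joined independently, which is what makes the parallel join scale; the expensive step is the per-partition sort, not the merge itself.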

Posted: Mon Jun 29, 2015 1:54 am
by ArndW
The Join stage itself is small, efficient, and very fast. What takes time when processing large amounts of data is the sorting and partitioning that must happen before the join can do its job.

If possible, do your sorting where it is quickest - that can be in the initial data select or within DataStage.

Posted: Mon Jun 29, 2015 8:18 pm
by wuruima
Yes, it takes a long time to sort the input links. Is there a way to tune the sorting?

Posted: Tue Jun 30, 2015 1:40 am
by ArndW
If your database is on another server and has spare capacity, then sort on your SELECT; otherwise use the sort stage and look into the settings you can specify on that stage.
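The Sort stage settings mostly trade memory for disk I/O: with a bounded memory limit, a disk-based sort writes sorted runs and then stream-merges them. A minimal sketch of that idea, assuming fixed-size in-memory chunks standing in for the stage's memory limit (the function name and chunk mechanism are illustrative, not the actual tsort implementation):

```python
# Hypothetical sketch of a bounded-memory (external) sort: sort fixed-size
# runs in memory, then stream-merge the runs. chunk_size stands in for a
# sort stage's memory limit; larger chunks mean fewer runs to merge.
import heapq

def external_sort(rows, key, chunk_size):
    runs = []
    for start in range(0, len(rows), chunk_size):
        # each run fits in "memory" and is sorted independently
        runs.append(sorted(rows[start:start + chunk_size], key=key))
    # heapq.merge streams the pre-sorted runs in order without re-sorting
    return list(heapq.merge(*runs, key=key))
```

This is also why sorting in the database SELECT can win: if the rows arrive already ordered on the join keys, the engine only needs to verify the order instead of paying for the runs and the merge at all.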

Posted: Tue Jun 30, 2015 6:43 am
by chulett
If sorting large files is something you need to do a lot and the DataStage sort doesn't seem fast enough for you, you could look into leveraging a 3rd party "high speed" sort package like SyncSort or CoSORT. From what I recall the latter has a DataStage module.

Posted: Tue Jun 30, 2015 10:51 am
by rkashyap
If you decide to sort in the database, then to prevent DataStage from adding a tsort operator, either add a Sort stage with the option 'Don't sort, previously sorted' or add the environment variable APT_SORT_INSERTION_CHECK_ONLY (set to True) to the job.

Posted: Wed Jul 01, 2015 2:06 am
by ArndW
Adding the environment variable APT_SORT_INSERTION_CHECK_ONLY is only one of the options available; the preferred method is indeed to add what I call a "dummy" Sort stage that sets the appropriate "don't sort" option. One still needs to make sure that the data is correctly partitioned when running in parallel.