As a business requirement, we need to join two files on two keys, key_a and key_b.
Each file contains more than 10,000,000 records.
Are there any good ideas for tuning the performance?
join stage for huge data
wuruimao
The join stage is a small and efficient stage that is very, very fast. What can take time when processing large amounts of data is the sorting and partitioning that needs to take place in order for the join to do its job.
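To see why the join itself is cheap once the sorting is done, here is a minimal sketch of a two-key sort-merge join, which is essentially what the Join stage does: a single linear pass over both pre-sorted inputs. The field names key_a and key_b come from the question; the row shape and function name are illustrative, not DataStage internals.

```python
def merge_join(left, right):
    """Inner join of two lists of dicts, both pre-sorted on (key_a, key_b)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk = (left[i]["key_a"], left[i]["key_b"])
        rk = (right[j]["key_a"], right[j]["key_b"])
        if lk < rk:
            i += 1          # left row has no match yet; advance left
        elif lk > rk:
            j += 1          # right row has no match yet; advance right
        else:
            # Keys match: emit this left row against every right row
            # sharing the same key, then move to the next left row.
            k = j
            while k < len(right) and (right[k]["key_a"], right[k]["key_b"]) == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out
```

Note that the join only ever moves forward through each input, so it runs in linear time and constant memory; all the real cost is in getting the data sorted beforehand.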
If possible, do your sorting where it is quickest - that can be in the initial data select or within DataStage.
If your database is on another server and has spare capacity, then sort on your SELECT; otherwise use the sort stage and look into the settings you can specify on that stage.
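Pushing the sort to the database just means adding an ORDER BY on the join keys to the source query, so the rows arrive already in join-key order. A hedged sketch, using an in-memory SQLite database to stand in for the real source; the table and column names are invented:

```python
import sqlite3

# Stand-in source table; a real job would read from the actual database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (key_a INTEGER, key_b INTEGER, payload TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?, ?)",
                 [(2, 1, "c"), (1, 2, "b"), (1, 1, "a")])

# The database does the sorting; rows come back in join-key order,
# so the downstream join can consume them without a separate sort step.
rows = conn.execute(
    "SELECT key_a, key_b, payload FROM src ORDER BY key_a, key_b"
).fetchall()
```

If you do this, remember to tell DataStage the link is already sorted, otherwise it may re-sort anyway.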
If sorting large files is something you need to do a lot and the DataStage sort doesn't seem fast enough for you, you could look into leveraging a third-party "high speed" sort package like SyncSort or CoSORT. From what I recall, the latter has a DataStage module.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Adding the environment variable APT_SORT_INSERTION_CHECK_ONLY is only one of the options available; the preferred method is indeed to add what I call a "dummy" Sort stage that sets the appropriate "don't sort" option. You still need to make sure that the data is correctly partitioned when running in parallel.
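The partitioning point matters because in a parallel join, rows with the same (key_a, key_b) must land in the same partition on both inputs, or matches are silently lost. A small sketch of key-based hash partitioning, which is the idea behind DataStage's hash partitioner; the partition count and row shape here are illustrative:

```python
import zlib

def partition_of(row, n_partitions):
    """Map the join keys to a partition number in [0, n_partitions).

    Any row with the same (key_a, key_b) always gets the same partition,
    regardless of its other columns, so both join inputs can be
    partitioned independently and still line up.
    """
    key = f'{row["key_a"]}|{row["key_b"]}'.encode()
    return zlib.crc32(key) % n_partitions
```

As long as both inputs use the same keys and the same partition count, matching rows meet in the same partition and each partition can be joined independently.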