DataStage vs. Ab Initio

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

michaeld
Premium Member
Premium Member
Posts: 88
Joined: Tue Apr 04, 2006 8:42 am
Location: Toronto, Canada

DataStage vs. Ab Initio

Post by michaeld »

Everybody has been telling me for years that Ab Initio is the best ETL tool. I finally got a chance to use it and it is sweet! From a developers point of view it is much more versatile, intuitive, and stable. You have to do more scripting, but it's simple. I ran a couple of tests to see how it performs. These tests were not run on the same machine (license reasons), but on similar machines. It's true, Ab Initio performs significantly better than DataStage.

DataStage vs AbInitio

Environment:

DataStage 8.1:
OS: Unix AIX
4GB RAM
4 CPU
4 SMP Nodes each on it's own disk.

Ab Initio 2.14:
OS: Unix AIX
4GB RAM
4 CPU
4 SMP Nodes each on it's own disk.

Simple test 1:

Generate 2 data sets / MFS files with 100 million records each. In Data set 1 the key columns takes the values 1 to 100 000 000. In Data set 2 the key columns takes the values 100 000 000 to 1. Other data is random.

schema:

key: decimal 15
data1: string 5
data2: string 5
data3: string 5

Ab Initio: 14 min 01 sec
DataStage: 16 min 28 sec


Simple test 2:

Use these data sets to perform a lookup (stage/component). One data set is the input and the other is the lookup. The target is another data set.

Ab Initio: 5 min 33 sec
DataStage: 21 min 45 sec
Mike
michaeld
Premium Member
Premium Member
Posts: 88
Joined: Tue Apr 04, 2006 8:42 am
Location: Toronto, Canada

Here is the config file

Post by michaeld »

{
node "node0"
{
fastname "localhost"
pools ""
resource disk "mask/partn/00" {pools ""}
resource disk "mask/partn/01" {pools ""}
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/03" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
}
node "node1"
{
fastname "localhost"
pools ""
resource disk "mask/partn/03" {pools ""}
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/01" {pools ""}
resource disk "mask/partn/00" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
}
node "node2"
{
fastname "localhost"
pools ""
resource disk "mask/partn/01" {pools ""}
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/03" {pools ""}
resource disk "mask/partn/00" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
}
node "node3"
{
fastname "localhost"
pools ""
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/03" {pools ""}
resource disk "mask/partn/00" {pools ""}
resource disk "mask/partn/01" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
}
}
Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Care to include a price comparison?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Anything that runs in less than an hour is not a valid test.
Mamu Kim
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You used the wrong stage - a lookup for 100m rows? You've got 4G of RAM not 400G. Use the Join stage. Are both jobs doing the same thing? Is DataStage inserting a tsort automatically or have you got insert sort turned off? Is the data already sorted? How come you only have four nodes for a 4CPU box? Have you tried the same jobs on 16 nodes - you could potential run at least twice as fast. Why are the 4 nodes using the same four disks? Have you tried putting each node on a different disk?
hamzaqk
Participant
Posts: 249
Joined: Tue Apr 17, 2007 5:50 am
Location: islamabad

Post by hamzaqk »

I agree. The DS job is not optimized and Lookup stage is def not a stage to be used for so many records. No paritioning ? No sorting of the data ? tsort enforced?

And regarding Abinitio. I have done training for both and the architecture of Abinitio is not much different from DS i have seen. Abinitio is file based which generates a C code at the background before actually executing it, which is similar to the DS parallel edition. The degree of parellism can be controled in the same manner as in DS. The components which comes with both tools are almost similar, just with different names. We call it a job they call it a graph etc.

Abinitio is more of a mystery tool for which they do not publish much and it comes with too many restrictions. They also make it a must that an Abinitio consultant from Abinito is present on every site it is installed and while the project runs. The price is ridicuously high as compared to other ETL tools.

They also claim that they can provide you with the entire DWH solution without involving any third party database.
Teradata Certified Master V2R5
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

AbInitio generates shell scripts in the background. So the performance gain is from the OS itself.

It will be useful to locate and follow the scripts to see the activities it is using to make it performance efficient.

There is a claim of AbInitio being good for large volume migration and DW loads.
michaeld
Premium Member
Premium Member
Posts: 88
Joined: Tue Apr 04, 2006 8:42 am
Location: Toronto, Canada

Post by michaeld »

I know I should use the join stage in a real life scenario, but I'm trying to see which tool can handle large lookups better in a test scenario.

Switching from 4 nodes to 16 node helped a bit, but not even close to 2x.

Here are some more results:

# records: 1 million

1 node (\w disk pool) 00:31
2 node (\w disk pool) 00:18
4 node (\w disk pool) 00:17
4 node (\wo disk pool) 00:17


# records: 10 million

2 node (\w disk pool) 02:38
4 node (\wo disk pool) 02:05
16 node (\wo disk pool) 01:49

Ab Initio 01:22

# records: 100 million

16 node (\wo disk pool) 19:02


* by disk pool I mean each node has all 4 disks in its pool. The order is staggered between nodes.
** without disk pool means 1 isolated disk per node.
*** in 16 node setup 1 disk is shared between 4 nodes since the system only has 4 disks.

The memory ulimit in DataStage is set to 128MB
The memory ulimit in Ab Initio is set to 32MB
Mike
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Moving to 16 nodes wont give you a big improvement if you leave the bottleneck of the oversized lookup stage in the job. Join and Lookup do the same thing - don't get confused by semantics - Lookup is easier to use but wants the reference data in RAM memory so if you have a massive reference source you should switch to a join stage. Since you are using datasets as your sources you need to switch off repartitioning and sorting. Would be surprised if the processing time wasn't the same as Ab Initio.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I found one of my old blog posts on lookup versus join - on just 3 million rows and a 2 CPU box a join was twice as fast as a lookup. On 100 million rows and 4 CPUs and 16 nodes the gap should be much wider - as long as you avoid repartitioning and sorting your reference files.

http://it.toolbox.com/blogs/infosphere/datastage-tip-for-beginners-parallel-lookup-types-7183
Sreenivasulu
Premium Member
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Joins are much better than lookup but the disadvantage is that 'joins' do not have a reject link

Regards
Sreeni
Sreenivasulu
Premium Member
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Joins are much better than lookup but the disadvantage is that 'joins' do not have a reject link

Regards
Sreeni
michaeld
Premium Member
Premium Member
Posts: 88
Joined: Tue Apr 04, 2006 8:42 am
Location: Toronto, Canada

Post by michaeld »

I'm going to get some official specs on the hardware after the new year... I think the abinitio hardware is better. In particular the I/O speed.

Re: join

one cool thing about joins in ab initio is that you can overwrite the default comparison methods. So you can program how the comparison between two records is performed. You can redirect the output anywhere you want. Also most abinitio components allow full use of the abinitio transformation language (not just in a transformer). Ab Inito transformer language is more complete. It allows for loops, functions, reusable packages, etc.. all within the GUI.
Mike
Sreenivasulu
Premium Member
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

1.Developement time in 'datastage' is less compared to 'ab-initio'.
2.As Mentioned by 'Duke' its not proper to compare jobs finishing within 1 hr.

Regards
Sreeni
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I really believe everything is going to be on a grid in the future. That is where the 2 products should be compared. I think development time is critical to the cost of a data warehouse.
Mamu Kim
Post Reply