DataStage vs. Ab Initio

michaeld · Post by **michaeld** » Tue Dec 22, 2009 4:05 pm

Everybody has been telling me for years that Ab Initio is the best ETL tool. I finally got a chance to use it and it is sweet! From a developers point of view it is much more versatile, intuitive, and stable. You have to do more scripting, but it's simple. I ran a couple of tests to see how it performs. These tests were not run on the same machine (license reasons), but on similar machines. It's true, Ab Initio performs significantly better than DataStage.

DataStage vs AbInitio

Environment:

DataStage 8.1:
OS: Unix AIX
4GB RAM
4 CPU
4 SMP Nodes each on it's own disk.

Ab Initio 2.14:
OS: Unix AIX
4GB RAM
4 CPU
4 SMP Nodes each on it's own disk.

Simple test 1:

Generate 2 data sets / MFS files with 100 million records each. In Data set 1 the key columns takes the values 1 to 100 000 000. In Data set 2 the key columns takes the values 100 000 000 to 1. Other data is random.

schema:

key: decimal 15
data1: string 5
data2: string 5
data3: string 5

Ab Initio: 14 min 01 sec
DataStage: 16 min 28 sec

Simple test 2:

Use these data sets to perform a lookup (stage/component). One data set is the input and the other is the lookup. The target is another data set.

Ab Initio: 5 min 33 sec
DataStage: 21 min 45 sec

michaeld · Post by **michaeld** » Tue Dec 22, 2009 4:12 pm

{
node "node0"
{
fastname "localhost"
pools ""
resource disk "mask/partn/00" {pools ""}
resource disk "mask/partn/01" {pools ""}
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/03" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
}
node "node1"
{
fastname "localhost"
pools ""
resource disk "mask/partn/03" {pools ""}
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/01" {pools ""}
resource disk "mask/partn/00" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
}
node "node2"
{
fastname "localhost"
pools ""
resource disk "mask/partn/01" {pools ""}
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/03" {pools ""}
resource disk "mask/partn/00" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
}
node "node3"
{
fastname "localhost"
pools ""
resource disk "mask/partn/02" {pools ""}
resource disk "mask/partn/03" {pools ""}
resource disk "mask/partn/00" {pools ""}
resource disk "mask/partn/01" {pools ""}
resource scratchdisk "mask/scrtch/02" {pools ""}
resource scratchdisk "mask/scrtch/03" {pools ""}
resource scratchdisk "mask/scrtch/00" {pools ""}
resource scratchdisk "mask/scrtch/01" {pools ""}
}
}

ray.wurlod · Post by **ray.wurlod** » Tue Dec 22, 2009 4:32 pm

Care to include a price comparison?

kduke · Post by **kduke** » Tue Dec 22, 2009 7:30 pm

Anything that runs in less than an hour is not a valid test.

vmcburney · Post by **vmcburney** » Tue Dec 22, 2009 8:46 pm

You used the wrong stage - a lookup for 100m rows? You've got 4G of RAM not 400G. Use the Join stage. Are both jobs doing the same thing? Is DataStage inserting a tsort automatically or have you got insert sort turned off? Is the data already sorted? How come you only have four nodes for a 4CPU box? Have you tried the same jobs on 16 nodes - you could potential run at least twice as fast. Why are the 4 nodes using the same four disks? Have you tried putting each node on a different disk?

hamzaqk · Post by **hamzaqk** » Wed Dec 23, 2009 1:04 am

I agree. The DS job is not optimized and Lookup stage is def not a stage to be used for so many records. No paritioning ? No sorting of the data ? tsort enforced?

And regarding Abinitio. I have done training for both and the architecture of Abinitio is not much different from DS i have seen. Abinitio is file based which generates a C code at the background before actually executing it, which is similar to the DS parallel edition. The degree of parellism can be controled in the same manner as in DS. The components which comes with both tools are almost similar, just with different names. We call it a job they call it a graph etc.

Abinitio is more of a mystery tool for which they do not publish much and it comes with too many restrictions. They also make it a must that an Abinitio consultant from Abinito is present on every site it is installed and while the project runs. The price is ridicuously high as compared to other ETL tools.

They also claim that they can provide you with the entire DWH solution without involving any third party database.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Wed Dec 23, 2009 2:59 am

AbInitio generates shell scripts in the background. So the performance gain is from the OS itself.

It will be useful to locate and follow the scripts to see the activities it is using to make it performance efficient.

There is a claim of AbInitio being good for large volume migration and DW loads.

michaeld · Post by **michaeld** » Wed Dec 23, 2009 10:50 am

I know I should use the join stage in a real life scenario, but I'm trying to see which tool can handle large lookups better in a test scenario.

Switching from 4 nodes to 16 node helped a bit, but not even close to 2x.

Here are some more results:

# records: 1 million

1 node (\w disk pool) 00:31
2 node (\w disk pool) 00:18
4 node (\w disk pool) 00:17
4 node (\wo disk pool) 00:17

# records: 10 million

2 node (\w disk pool) 02:38
4 node (\wo disk pool) 02:05
16 node (\wo disk pool) 01:49

Ab Initio 01:22

# records: 100 million

16 node (\wo disk pool) 19:02

* by disk pool I mean each node has all 4 disks in its pool. The order is staggered between nodes.
** without disk pool means 1 isolated disk per node.
*** in 16 node setup 1 disk is shared between 4 nodes since the system only has 4 disks.

The memory ulimit in DataStage is set to 128MB
The memory ulimit in Ab Initio is set to 32MB

vmcburney · Post by **vmcburney** » Wed Dec 23, 2009 8:51 pm

Moving to 16 nodes wont give you a big improvement if you leave the bottleneck of the oversized lookup stage in the job. Join and Lookup do the same thing - don't get confused by semantics - Lookup is easier to use but wants the reference data in RAM memory so if you have a massive reference source you should switch to a join stage. Since you are using datasets as your sources you need to switch off repartitioning and sorting. Would be surprised if the processing time wasn't the same as Ab Initio.

vmcburney · Post by **vmcburney** » Wed Dec 23, 2009 10:04 pm

I found one of my old blog posts on lookup versus join - on just 3 million rows and a 2 CPU box a join was twice as fast as a lookup. On 100 million rows and 4 CPUs and 16 nodes the gap should be much wider - as long as you avoid repartitioning and sorting your reference files.

http://it.toolbox.com/blogs/infosphere/datastage-tip-for-beginners-parallel-lookup-types-7183

Sreenivasulu · Post by **Sreenivasulu** » Sat Dec 26, 2009 10:43 am

Joins are much better than lookup but the disadvantage is that 'joins' do not have a reject link

Regards
Sreeni

Sreenivasulu · Post by **Sreenivasulu** » Sat Dec 26, 2009 10:44 am

Joins are much better than lookup but the disadvantage is that 'joins' do not have a reject link

Regards
Sreeni

michaeld · Post by **michaeld** » Sat Dec 26, 2009 4:35 pm

I'm going to get some official specs on the hardware after the new year... I think the abinitio hardware is better. In particular the I/O speed.

Re: join

one cool thing about joins in ab initio is that you can overwrite the default comparison methods. So you can program how the comparison between two records is performed. You can redirect the output anywhere you want. Also most abinitio components allow full use of the abinitio transformation language (not just in a transformer). Ab Inito transformer language is more complete. It allows for loops, functions, reusable packages, etc.. all within the GUI.

Sreenivasulu · Post by **Sreenivasulu** » Sun Dec 27, 2009 1:00 pm

1.Developement time in 'datastage' is less compared to 'ab-initio'.
2.As Mentioned by 'Duke' its not proper to compare jobs finishing within 1 hr.

Regards
Sreeni

kduke · Post by **kduke** » Sun Dec 27, 2009 11:12 pm

I really believe everything is going to be on a grid in the future. That is where the 2 products should be compared. I think development time is critical to the cost of a data warehouse.

DSXchange

DataStage vs. Ab Initio

DataStage vs. Ab Initio

Here is the config file