DataStage 8.5 FP1 - Bloom Filter implementation?

buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

Hi experts,

While playing with the new features of Information Server 8.5 FP1, I came across an interesting new stage on the "Processing" tab: the BloomFilter stage.

It's not documented anywhere; the only "official" trace of its existence is on Tony Curcio's blog:

https://www-304.ibm.com/connections/blo ... lang=en_us

Out of curiosity, has anyone experimented with this new functionality? The Bloom filter concept is exciting, but unfortunately I was unsuccessful in using the stage: DataStage threw lots of unclear internal errors...

It was just one attempt, so maybe I was doing something wrong. I'm wondering what your experience with it has been.

Cheers
Buzz
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

How about emailing Tony? After all, he's the product manager.

Or maybe Tony will drift by here sometime soon.

I agree about the lack of documentation for this stage type, which I too found after installing Fix Pack 1.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

Email to Tony sent, thanks for the brilliant idea, Ray :)

I will post the answer as soon as he responds.

Cheers
Buzz
buzzylee
Premium Member
Posts: 37
Joined: Thu Jul 09, 2009 6:58 am
Location: Sydney, Australia

Post by buzzylee »

I asked:
While playing with the new features of Information Server 8.5 FP1, I came across an interesting new stage on the "Processing" tab: the BloomFilter stage.

The Bloom filter concept is exciting, but unfortunately I was unsuccessful in using the stage: DataStage threw lots of unclear internal errors... It was just one attempt, so maybe I was doing something wrong. I'm wondering what your experience with it has been.

Unfortunately the new stage is not documented at all. Could you please shed some light on its "official" availability and usage guidelines?

Also, your blog has a short note saying it is meant for efficient filtering of duplicate keys. I'm very curious how that works, since a Bloom filter is a probabilistic structure that produces false positives from time to time. The other vendors I know of (such as Oracle) use it only for pre-filtering in join algorithms and for inter-process communication when executing processes in parallel.
...and Tony has spoken:
The Bloom Filter stage is officially supported, so you will be able to file a PMR against it. The documentation is coming shortly. If you have some questions on features, I can assist.

The bloom filter uses a very efficient key store that allows for very large reference sets in memory. The algorithm may create a false positive, but will not create false negatives. So, in some cases, you may want to combine this with other deduplication stages, once the bulk has been reduced with bloom.
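
To make the "false positives but no false negatives" point concrete, here is a minimal Python sketch of a generic Bloom filter used as a pre-filter for duplicate keys. It is not how the DataStage stage is implemented (that is undocumented as of this thread); the class name, the bit-array size, the number of hashes and the SHA-256 double-hashing scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch; NOT the DataStage stage's internals."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_bits and num_hashes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: pos_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely never added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def split_by_filter(reference_keys, incoming_keys, bloom=None):
    """Split incoming keys into (definitely_new, possibly_duplicate)."""
    if bloom is None:
        bloom = BloomFilter()
    for key in reference_keys:
        bloom.add(key)
    definitely_new, possibly_duplicate = [], []
    for key in incoming_keys:
        if bloom.might_contain(key):
            possibly_duplicate.append(key)  # needs an exact check downstream
        else:
            definitely_new.append(key)      # guaranteed not in the reference set
    return definitely_new, possibly_duplicate
```

Only the `possibly_duplicate` rows would need an exact check (a lookup or a Remove Duplicates step, say), which is the "combine with other deduplication stages once the bulk has been reduced" pattern Tony describes.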