Page 1 of 1

NUMERIC Comparison Takes Too Long

Posted: Thu Jan 01, 2015 11:09 am
by asyafrudin
Hi,

I have 2 match specifications, MS1 & MS2. Both match specifications use only 1 match pass and both are basically the same except for 2 match commands, MCA & MCB. MCA & MCB compares columns containing numeric codes, so I decided to use the NUMERIC comparison method for both match commands in MS1. Out of curiosity, I wanted to compare the result using NUMERIC with UNCERT so I decided to use the UNCERT comparison method for both match commands in MS2. So now I have MS1 with NUMERIC method for MCA & MCB and MS2 with UNCERT method for MCA & MCB.

I test both match spefications in an identical environment using the same data source. What's funny is that MS1 took 183 hours to finish, while MS2 only took 44 hours to finish. Does using the NUMERIC function really made it significantly longer for the matching process to finish or is it something else? Anyone care to shed some light on this matter? I've tried googling to no avail.

Posted: Thu Jan 01, 2015 9:21 pm
by qt_ky
That is a lot of hours. How many records are involved?

Were the two methods run at the same time as each other?

Were there other jobs or a varying workload running during either test?

Are there a lot of other commands or steps? If so, it would help to create a test that only does that one step using NUMERIC and another that only does the one step using UNCERT. If you can run them on an identical sample set, such as 5%, then it won't take nearly as long to run several more comparisons.

Posted: Fri Jan 02, 2015 1:58 am
by asyafrudin
Hi, qt_ky. Thanks for the reply.

I'm using 6,136,023 data records and 25,757,890 reference records. I'm running both methods separately and I made sure there are no other jobs running. I'm using a server dedicated for InfoSphere, so I don't think there are any other apps or processes with significant workload running in the background.

The job only has on two-source match stage with typical dataset input and output. I am planning to isolate the NUMERIC comparison method and test it with a smaller sample as you suggested, but couldn't find the time to do that. Nevertheless, I'm still confused to why two NUMERIC match commands could make such a huge difference in running time. Any other pointers aside from the "smaller" test?

Posted: Fri Jan 02, 2015 4:45 am
by ray.wurlod
Can you take a look at the job log to see whether any particular process takes a very long time? Perhaps enable some of the reporting environment variables (especially APT_PLAYER_TIMING)?

Posted: Fri Jan 02, 2015 10:48 am
by rjdickson
Hi,

I wonder if the time is being spent in converting strings to numerics (ref: http://www-01.ibm.com/support/knowledge ... ison_.html)

I agree that is a lot of time, though.

Is your interest academic, or do you have a specific matching problem you are trying to solve? UNCERT is typically used on strings, and CNT_DIFF is designed for numbers (ref: http://www-01.ibm.com/support/knowledge ... isons.html)

Posted: Sat Jan 03, 2015 8:25 pm
by asyafrudin
ray.wurlod wrote:Can you take a look at the job log to see whether any particular process takes a very long time?
I'll give it a shot perhaps some time next week. I'll post the results here when I have them.

Posted: Sat Jan 03, 2015 8:34 pm
by asyafrudin
rjdickson wrote:I wonder if the time is being spent in converting strings to numerics
I was also wondering about the same thing, however I don't think it made any sense that the job took so long just to convert strings to numbers. This made me really curious about the conversion process, but I can't find any information about it. Perhaps because it's proprietary?

The columns in MCA and MCB contains strings, but the strings are just numbers padded with zeros in front of them. What's annoying is that there's not standard regarding the amount of padded zeros. It's like someone can enter "001", "01", or just "1". All three of these examples are the same thing. That's why I'm using NUMERIC. I was hoping that NUMERIC would remove those padded zeros before making the comparison. I don't think CNT_DIFF is suitable for my case.

Posted: Sun Jan 04, 2015 7:09 am
by rjdickson
Hi,

Thanks for the explanation of the problem - it helps a lot!

You are correct that CNT_DIFF will not help. UNCERT is not designed for this issue either.

You are also correct in that NUMERIC will strip leading zeros. At least it did in my quick tests :)

One option is to use other stages to remove the leading zeros before the match. The two that come to mind immediately are in the Pattern Action Language (if this field is part of a Standardization), or a Transformer.

You can also open a PMR to IBM, but they will likely want to see your data to reproduce the test, so you may need to reduce the test to 'just' this numeric compare.