how to make a SOA services with deduplication

Posted: Fri Jan 30, 2009 5:55 am
by ascen
Hi,

We want to create a job that takes one address as input and tries to find associated addresses in a reference data set. The job works fine in standard batch mode. But when I developed it as a SOA service (WISD Input and WISD Output), it stopped working.
It gave me the following error:
"Frequencies must not be generated in a real time environment"
But the Reference Match stage needs four input links, including two frequency sets.
Do you have any idea how to create such a SOA service?
Thanks in advance.

Best regards

Re: how to make a SOA services with deduplication

Posted: Fri Jan 30, 2009 8:40 am
by JRodriguez
Hi,

The Match Frequency stage is not a real-time stage, and it would also be time consuming, so you should generate the frequency data set in a separate job that can be executed as often as your reference data changes.

To fulfill the match stage's requirement for two frequency data sets (one for the input and another for the reference), use the reference frequency data set generated in the separate job twice.


This will do the trick

Posted: Fri Jan 30, 2009 9:15 am
by ascen
I'm not sure you understood the problem.
My primary data comes in real time from an application that calls a DS SOA web service, so I cannot compute a frequency data set for that data before the call.
Currently we use a Transformer stage to generate the same data that the Match Frequency stage would generate, and pass it to the primary frequency input link of the Reference Match stage. It works fine now.
But it's just a workaround, not a solution.

Posted: Fri Jan 30, 2009 10:06 am
by JRodriguez
Ascen,

What I mean is that because you can't generate frequency data in a real time job, then:

1) Standardize and generate frequency data for your reference data in a separate job, not a real-time one. Execute this job as part of a weekly batch (it really depends on how frequently your reference data changes)

2) In your real time job, do not generate frequency data at all

3) In your real time job, use the reference frequency data from step 1 twice: once to represent the frequency data for your input, and once to represent the frequency data for the reference

This way you avoid generating frequency for your input every time the web service is invoked
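
The three steps above can be sketched outside DataStage as a rough analogy. The Python below is not QualityStage code: the sample rows, the `street` column, the function names, and the toy weighting are all assumptions made for illustration. It only shows the shape of the idea, where frequencies are built once in a batch step and the same result is passed to the real-time path for both frequency inputs.

```python
from collections import Counter

# Hypothetical reference data; the column names are invented for illustration.
REFERENCE = [
    {"id": 1, "street": "MAIN ST"},
    {"id": 2, "street": "MAIN ST"},
    {"id": 3, "street": "ELM AVE"},
]

def build_frequencies(rows, column):
    """Batch step (for example, weekly): count how often each value occurs."""
    return Counter(row[column] for row in rows)

def handle_request(request, reference, input_freq, reference_freq, column="street"):
    """Real-time step: uses precomputed frequencies and never generates them."""
    value = request[column]
    matches = [row for row in reference if row[column] == value]
    # Toy weight (an assumption, not the real match algorithm): rarer values
    # count as stronger evidence. Both frequency inputs are consulted,
    # mirroring the two frequency links on the match stage.
    seen = max(input_freq[value], reference_freq[value], 1)
    weight = 1.0 / seen
    return [(row["id"], weight) for row in matches]

# Step 1: built once, outside the real-time path.
ref_freq = build_frequencies(REFERENCE, "street")

# Steps 2 and 3: the same frequency data is passed for both inputs.
result = handle_request({"street": "ELM AVE"}, REFERENCE, ref_freq, ref_freq)
```

A single incoming record contributes almost nothing to the value counts, which is why reusing the reference frequencies for the input side is a reasonable substitute here.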

Posted: Mon Feb 02, 2009 2:45 am
by ascen
JRodriguez wrote: "1) Standardize and generate frequency data for your reference data in a separate job, not a real-time one. Execute this job as part of a weekly batch (it really depends on how frequently your reference data changes)"

This was already done.

JRodriguez wrote: "3) In your real time job, use the reference frequency data from step 1 twice: once to represent the frequency data for your input, and once to represent the frequency data for the reference. This way you avoid generating frequency for your input every time the web service is invoked"

I'll check whether that works and produces good results.
Do you mean that for the input I can pass "any" frequency data set?

Posted: Mon Feb 02, 2009 5:26 am
by eostic
No, he means pass the "same" one twice. It applies to the same kind of data as your incoming request. For what you are doing, it will be fine. There isn't enough data on the incoming side to skew anything out of the ordinary.

Ernie

Posted: Mon Feb 02, 2009 5:56 am
by ascen
What if the data on the input and reference links has a different structure? Do I have to make the structures the same?

Posted: Thu Feb 05, 2009 4:16 am
by ascen
OK. The solution with the same frequency DS on the input and reference links works fine.
But I still have a problem with this job deployed as a Web service.
On input I have one column, which I match against a reference DS with address data, and then I pass the matched records to the output link.
I deployed this job as a Web Service with a SOAP over HTTP binding.
Then I wrote a simple app in C# which invokes this WS. It works fine only once after the WS is deployed. After the first invocation it looks as if the WS is not even executed, and there are no results.
It is very weird, because I created almost the same WS, but with only a standardization stage, and it works perfectly fine, not only the first time but also for thousands of records.
Do you know what is wrong? I tried many configurations of the Reference Match stage, different input and output column structures, and different WS settings, and nothing changed.
Has anybody created a properly working WS with a Reference Match stage?

Posted: Thu Feb 05, 2009 12:01 pm
by ray.wurlod
Did you deploy as "always running"? How many instances did you deploy?

Posted: Thu Feb 05, 2009 12:57 pm
by JRodriguez
Ascen,

A couple of tips:

- Use a single node config file

- If you designed this job as "always on" (WISD Input and WISD Output stages present), the WISD Input must be the driving stream for the job. Just make sure that all input streams are driven by the WISD Input stage. Look for eostic's post or a best practices document that he wrote a while ago

Posted: Thu Feb 05, 2009 1:23 pm
by eostic
JRodriguez nailed it. You can't have a static input to the Reference Match Stage..... it only works once because there is nothing to trigger a 're-read' of that input. This is not unique to the Reference Match, but could occur in many situations where two independent paths come together. Drive the Reference Input by using a lookup that is based on some of your blocking factors.

Ernie
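
The "works only once" symptom has a loose analogy in ordinary code: a stream that is opened once and never re-read. The Python sketch below is only that analogy, not a model of the DataStage engine. The `zip` column, the sample rows, and the function names are assumptions for illustration.

```python
# Hypothetical reference rows; "zip" stands in for a blocking column.
REFERENCE_ROWS = [
    {"zip": "10001", "address": "12 MAIN ST"},
    {"zip": "20002", "address": "7 ELM AVE"},
]

def serve_with_static_input(requests, reference_rows):
    """Reference side is a one-shot stream: read once, then empty."""
    reference_stream = iter(reference_rows)  # opened once when the job starts
    results = []
    for request in requests:
        # The first request drains the stream; later requests see nothing.
        candidates = list(reference_stream)
        results.append([r for r in candidates if r["zip"] == request["zip"]])
    return results

def serve_with_driven_input(requests, reference_by_zip):
    """Reference side is re-derived from each request via a keyed lookup."""
    results = []
    for request in requests:
        results.append(list(reference_by_zip.get(request["zip"], [])))
    return results

# Two identical requests in a row, as a client calling the service twice.
requests = [{"zip": "10001"}, {"zip": "10001"}]

# Index the reference rows by the blocking column for the lookup variant.
reference_by_zip = {}
for row in REFERENCE_ROWS:
    reference_by_zip.setdefault(row["zip"], []).append(row)

static_results = serve_with_static_input(requests, REFERENCE_ROWS)
driven_results = serve_with_driven_input(requests, reference_by_zip)
```

In the static variant only the first request finds candidates. In the driven variant every request triggers its own read of the reference side.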

Posted: Fri Feb 06, 2009 3:22 am
by ascen
What do you mean by:
"Drive the Reference Input by using a lookup that is based on some of your blocking factors." ?

So if I want to use Reference Match in a real-time job, I should design something like this:

                ---------             ---------
               | FREQ_DS1|           | FREQ_DS2|
                ---------             ---------
                       |                     |
         --------      -----------         --------
       | WISD_IN |----| REF MATCH |-------|WISD_OUT|
         --------      -----------         --------
                             |
                        -----------
                       |WISD_REF_IN|
                        -----------
Did I understand it properly?
Currently I have a DataSet in place of WISD_REF_IN.

Posted: Fri Feb 06, 2009 9:41 am
by JRodriguez
ascen,

It's a bit different ......

"Drive the Reference Input by using a lookup that is based on some of your blocking factors." means that instead of having a static reference data set feeding the Reference Match stage, you should generate the reference data dynamically, using a Lookup stage based on one or more of your blocking factors. (This will also improve performance, because the Reference Match stage will be dealing with fewer records.)

One way to do it is to branch the stream after the WISD Input into two branches (use a Transformer or Copy stage): one to a Lookup stage and the other to the Reference Match stage. The lookup finds all associated address records based on your blocking factor. The output link from the lookup should feed the Reference Match stage, which closes the cycle and makes the WISD Input the driving stream
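
As a rough illustration of that branching layout, here is a minimal Python sketch. It is an analogy rather than DataStage code. The reference table, the choice of `zip` as the blocking factor, the function names, and the use of `difflib.SequenceMatcher` as a stand-in for match scoring are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical reference table, indexed by a blocking key (here: postal code).
REFERENCE_BY_BLOCK = {
    "10001": ["12 MAIN ST", "14 MAIN ST"],
    "20002": ["7 ELM AVE"],
}

def lookup_candidates(request):
    """Lookup branch: fetch only reference rows that share the blocking key."""
    return REFERENCE_BY_BLOCK.get(request["zip"], [])

def reference_match(request, candidates, cutoff=0.8):
    """Match branch: toy similarity score standing in for the match stage."""
    scored = [
        (address, SequenceMatcher(None, request["address"], address).ratio())
        for address in candidates
    ]
    return [(address, score) for address, score in scored if score >= cutoff]

def process_request(request):
    """Both branches start from the same request, so the request drives everything."""
    candidates = lookup_candidates(request)      # branch 1: lookup by blocking key
    return reference_match(request, candidates)  # branch 2: match against candidates

request = {"zip": "10001", "address": "12 MAIN ST"}
first_call = process_request(request)
second_call = process_request(request)
```

Because the candidate set is rebuilt from the incoming record on every call, repeated calls behave the same way, and the match step only sees the rows in one block.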

Posted: Fri Feb 06, 2009 9:48 am
by ascen
OK. I understand that this will improve performance, but I don't have any performance issues. It runs pretty fast.
I need to solve my problem :)
