How to make an SOA service with deduplication

Infosphere's Quality Product

Moderators: chulett, rschirm

ascen
Premium Member
Posts: 67
Joined: Wed Apr 12, 2006 6:47 am

How to make an SOA service with deduplication

Post by ascen »

Hi,

We want to create a job that takes one address as input and tries to find associated addresses in a reference data set. The job works fine as a standard batch job, but when I developed it as an SOA service (WISD Input and WISD Output stages) it stopped working.
It gave me the following error:
"Frequencies must not be generated in a real time environment"
But the Reference Match stage needs four input links, including two frequency data sets.
Do you have any idea how to create such an SOA service?
Thanks in advance.

Best regards
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City

Re: How to make an SOA service with deduplication

Post by JRodriguez »

Hi,

The Match Frequency stage is not a real-time stage, and it is also time consuming, so you should generate the frequency data set in a separate job that can be executed as often as your reference data changes.

To satisfy the match stage's requirement for two frequency data sets, one for the input and another for the reference, use the reference frequency data set generated in the separate job twice.

This will do the trick.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
ascen

Post by ascen »

I'm not sure you understood the problem.
My primary data arrives in real time from an application that calls a DS SOA web service, so I cannot compute a frequency data set for that data before the service is called.
Currently we generate, in a Transformer stage, the same data that the Match Frequency stage would produce, and pass it to the primary frequency input link of the Reference Match stage. This works fine for now.
But it is only a workaround, not a solution.
JRodriguez

Post by JRodriguez »

Ascen,

What I mean is that, because you can't generate frequency data in a real-time job:

1) Standardize and generate frequencies for your reference data in a separate job, not a real-time one. Execute this job as part of a weekly batch (it really depends on how frequently your reference data changes).

2) In your real-time job, do not generate frequency data at all.

3) In your real-time job, use the reference frequency data from step 1 twice: once to represent the frequencies for your input, and once to represent the frequencies for the reference.

This way you avoid generating frequencies for your input every time the web service is invoked.
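
The three steps above can be sketched in plain Python. This is only a toy model of the idea, not DataStage or QualityStage code: the function names, the `city` column, the sample records, and the rarity-based weight are all assumptions made for illustration, and the real Reference Match stage computes its weights differently.

```python
from collections import Counter


def build_frequencies(reference_records, column):
    """Batch step (run separately, e.g. weekly): count each value of `column`."""
    return Counter(rec[column] for rec in reference_records)


def match_request(incoming, reference_records, ref_freq, column="city"):
    """Real-time step: no frequencies are generated here.

    The same precomputed table is used for both frequency inputs.
    """
    input_freq = ref_freq       # stands in for the input frequency set
    reference_freq = ref_freq   # the actual reference frequency set
    total = sum(reference_freq.values())
    matches = []
    for rec in reference_records:
        if rec[column] == incoming[column]:
            # Toy weight (an assumption): agreement on a rarer value counts for more.
            weight = 1.0 - input_freq[incoming[column]] / total
            matches.append((rec["id"], round(weight, 3)))
    return matches


# Hypothetical reference data, standing in for the reference data set.
reference = [
    {"id": 1, "city": "Madrid"},
    {"id": 2, "city": "Madrid"},
    {"id": 3, "city": "Toledo"},
    {"id": 4, "city": "Madrid"},
]
ref_freq = build_frequencies(reference, "city")                   # step 1: batch job
result = match_request({"city": "Toledo"}, reference, ref_freq)   # steps 2 and 3
print(result)
```

The point of the sketch is only that the request path reads a table built earlier and never counts anything itself.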
ascen

Post by ascen »

JRodriguez wrote: 1) Standardize and generate frequencies for your reference data in a separate job, not a real-time one. Execute this job as part of a weekly batch (it really depends on how frequently your reference data changes).
This was already done.

JRodriguez wrote: 3) In your real-time job, use the reference frequency data from step 1 twice: once to represent the frequencies for your input, and once to represent the frequencies for the reference. This way you avoid generating frequencies for your input every time the web service is invoked.
I'll check whether that works, i.e. whether it produces good results.
Do you mean that for the input I can pass "any" frequency data set?
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

No, he means pass the "same" one twice. It applies to the same kind of data as your incoming request, so for what you are doing it will be fine. There isn't enough data on the incoming side to skew anything out of the ordinary.
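
A quick back-of-the-envelope check of that last point, as a Python sketch. The street names and counts are invented for illustration; the only claim is arithmetic: one extra record added to a large frequency table barely moves the relative frequencies.

```python
from collections import Counter


def relative_frequency(counts, value):
    """Share of `value` among all counted values (0.0 for an empty table)."""
    total = sum(counts.values())
    return counts[value] / total if total else 0.0


# Hypothetical reference frequencies (10,000 records in total).
reference_counts = Counter({"MAIN ST": 9000, "OAK AVE": 900, "ELM RD": 100})

# The same table with the single incoming request counted as well.
with_request = reference_counts + Counter({"ELM RD": 1})

before = relative_frequency(reference_counts, "ELM RD")
after = relative_frequency(with_request, "ELM RD")
shift = abs(after - before)
print(before, after, shift)
```

With these made-up numbers the relative frequency of the requested value changes by less than one part in ten thousand, which is why reusing the reference frequencies for the input side is a reasonable approximation here.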

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
ascen

Post by ascen »

What if the data on the input and reference links have different structures? Do I have to give them the same structure?
ascen

Post by ascen »

OK. The solution with the same frequency data set on the input and reference links works fine.
But I still have a problem with this job deployed as a web service.
On input I have one column, which I match against a reference data set of address data; I then pass the matched records to the output link.
I deployed this job as a web service with a SOAP over HTTP binding.
Then I wrote a simple C# application that invokes this web service. It works only once after the service is deployed. After the first invocation, it looks as if the service is not even executed, and there are no results.
It is very weird, because I created almost the same web service, but with only a standardization stage, and it works perfectly, not only the first time but for thousands of records.
Do you know what is wrong? I tried many configurations of the Reference Match stage, different input and output column structures, and different web service settings, and nothing changed.
Has anybody created a properly working web service with a Reference Match stage?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Did you deploy as "always running"? How many instances did you deploy?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JRodriguez

Post by JRodriguez »

Ascen,

A couple of tips:

- Use a single-node configuration file.

- If you designed this job as "always on" (WISD Input and WISD Output stages present), the WISD Input must be the driving stream for the job. Make sure that all input streams are driven by the WISD Input stage. Look for eostic's posts or the best practices document that he wrote a while ago.
eostic

Post by eostic »

JRodriguez nailed it. You can't have a static input to the Reference Match stage. It only works once because there is nothing to trigger a 're-read' of that input. This is not unique to the Reference Match stage; it can occur in many situations where two independent paths come together. Drive the Reference Input by using a lookup that is based on some of your blocking factors.
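
As a rough analogy only (this is not how the DataStage engine is implemented), the "works once" behaviour resembles a Python iterator that is consumed by the first request and never re-read. All names and rows below are hypothetical.

```python
def make_static_reference(rows):
    """A static source: an iterator that is read once and then exhausted."""
    return iter(rows)


def match_with_static(request, static_reference):
    """Joins the request with whatever the static source still has to give."""
    return [row for row in static_reference if row["zip"] == request["zip"]]


def match_with_driven_lookup(request, reference_by_zip):
    """The reference rows are fetched again for every request."""
    return list(reference_by_zip.get(request["zip"], []))


rows = [{"id": 1, "zip": "28001"}, {"id": 2, "zip": "08001"}]
request = {"zip": "28001"}

static_source = make_static_reference(rows)
first_static = match_with_static(request, static_source)    # finds the row
second_static = match_with_static(request, static_source)   # source is already drained

reference_by_zip = {"28001": [rows[0]], "08001": [rows[1]]}
first_driven = match_with_driven_lookup(request, reference_by_zip)
second_driven = match_with_driven_lookup(request, reference_by_zip)
print(first_static, second_static, first_driven, second_driven)
```

In the analogy, the second call on the static source returns nothing because nothing triggers a re-read, while the request-driven lookup returns the same rows every time.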

Ernie
ascen

Post by ascen »

What do you mean by:
"Drive the Reference Input by using a lookup that is based on some of your blocking factors." ?

So if I want to use a Reference Match stage in a real-time job, I should design something like this:

                ---------             ---------
               | FREQ_DS1|           | FREQ_DS2|
                ---------             ---------
                       |                     |
         --------      -----------         --------
       | WISD_IN |----| REF MATCH |-------|WISD_OUT|
         --------      -----------         --------
                             |
                        -----------
                       |WISD_REF_IN|
                        -----------
Did I understand it properly?
Currently I have a Data Set stage in place of WISD_REF_IN.
JRodriguez

Post by JRodriguez »

ascen,

It's a bit different...

"Drive the Reference Input by using a lookup that is based on some of your blocking factors" means that, instead of having a static reference data set feeding the Reference Match stage, you generate the reference data dynamically using a Lookup stage based on one or more of your blocking factors. (This will also improve performance, because the Reference Match stage will be dealing with fewer records.)

One way to do it is to branch the stream after the WISD Input into two branches (use a Transformer or Copy stage): one to a Lookup stage and the other to the Reference Match stage. The lookup will find all associated address records based on your blocking factor. The output link returning from the lookup should feed the Reference Match stage, which closes the cycle and makes the WISD Input the driving stream.
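
The data flow described above can be sketched in Python. This is an illustration under stated assumptions, not a DataStage job: the `zip` blocking key, the `street` column, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for the Reference Match stage's scoring are all hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_blocking_index(reference_rows, key):
    """Group the reference rows by blocking key so a lookup can fetch candidates."""
    index = defaultdict(list)
    for row in reference_rows:
        index[row[key]].append(row)
    return index


def handle_request(request, index, key="zip", threshold=0.8):
    """One web service call: the request drives both branches."""
    # Branch 1 (lookup): fetch only the reference rows sharing the blocking key.
    candidates = index.get(request[key], [])
    # Branch 2 (match): the request meets its candidates in the matching step.
    scored = [
        (SequenceMatcher(None, request["street"], cand["street"]).ratio(), cand["id"])
        for cand in candidates
    ]
    return [cand_id for score, cand_id in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical reference addresses.
reference = [
    {"id": 1, "zip": "28001", "street": "CALLE MAYOR 10"},
    {"id": 2, "zip": "28001", "street": "PASEO DEL PRADO 5"},
    {"id": 3, "zip": "08001", "street": "CALLE MAYOR 10"},
]
index = build_blocking_index(reference, "zip")

request = {"zip": "28001", "street": "CALLE MAYOR 1O"}  # note the typo: letter O for zero
first_call = handle_request(request, index)
second_call = handle_request(request, index)
print(first_call, second_call)
```

Because the candidates are fetched inside each call, every invocation sees reference rows, and the matching step only compares the request against rows in its own block.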
ascen

Post by ascen »

OK. I understand that this will improve performance, but I don't have any performance issues; it runs pretty fast.
I need to solve my problem :)