Match free format text

Posted: Sun Jun 29, 2003 7:52 am
by sdf
Hi there,

Is there a way (using Integrity) to extract and match names and addresses from a free-format text file, e.g. *.txt files or files whose lines are of type char(1000)?



Lady S.

Posted: Sun Jun 29, 2003 4:59 pm
by ray.wurlod
Basically no.
Because of its mainframe heritage, INTEGRITY (at least in the current version) can work only with fixed-width format files.
A good plan is to use DataStage to convert the data to fixed-width format, then call INTEGRITY through the INTEGRITY stage type, passing it the stream of data and receiving the result. By this means, you do not have to create any on-disk files (unless you explicitly choose to).
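
To make the "convert to fixed-width" step concrete, here is a minimal sketch in plain Python of what that reformatting amounts to. It is not DataStage or INTEGRITY code. The function names are made up for illustration, and the default width of 1000 is only an assumption taken from the char(1000) mentioned in the question.

```python
from typing import Iterable, Iterator


def to_fixed_width(line: str, width: int = 1000) -> str:
    """Pad or truncate one free-text line to exactly `width` characters."""
    text = line.rstrip("\r\n")        # drop the line terminator only
    return text[:width].ljust(width)  # truncate if too long, pad with spaces if short


def convert(lines: Iterable[str], width: int = 1000) -> Iterator[str]:
    """Yield fixed-width records one at a time, so nothing has to be written to disk."""
    for line in lines:
        yield to_fixed_width(line, width)


# Hypothetical sample input; a real job would read from the source file or stream.
records = list(convert(["12 Main St, Springfield\n", "Lady S.\n"], width=30))
```

Every record comes out the same length, which is what a fixed-width consumer expects. The text inside the record is still free format and still needs rules to make sense of it.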

Posted: Mon Jun 30, 2003 1:25 pm
by AmosR
Hi Guys,

Suppose I have a description field that can contain free text such as product names, packaging descriptions, persons' names and addresses (in other words, anything is possible).

Can I use the INTEGRITY stage to extract some sense out of it?
(assuming it's all in the same known field)

Has anyone tried it, and how good is it?

Posted: Tue Jul 01, 2003 11:44 pm
by ray.wurlod
Yes, but you still have to set up the rules within INTEGRITY, including possible redefinitions (overlays) of data format.

Posted: Mon Jul 14, 2003 1:45 pm
by timwalsh
Integrity works as well as, if not better than, its competitors at free-form data investigation, standardization and matching.

However, realize that you must write and create custom rules no matter what tool you are using. Depending on the complexities of your data, extracting value from it can sometimes be difficult.

Be prepared to investigate your source data before you can identify trends and start looking at patterns. You also need a master data list so that you can match your data against it after you standardize it.
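
As a rough illustration of what "looking at patterns" can mean, the sketch below reduces each value to a pattern of token classes and counts how often each pattern occurs. This is generic Python, not INTEGRITY's rule syntax or its investigation output. The class letters (N, A, M), the function names and the sample values are all assumptions made up for the example.

```python
import re
from collections import Counter


def token_class(token: str) -> str:
    """Classify a token: N = all digits, A = all letters, M = mixed."""
    if token.isdigit():
        return "N"
    if token.isalpha():
        return "A"
    return "M"


def pattern_of(text: str) -> str:
    """Reduce a free-text value to a pattern such as 'N A A'."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # ASCII letters and digits only
    return " ".join(token_class(t) for t in tokens)


def pattern_counts(values):
    """Count how often each pattern occurs across all values."""
    return Counter(pattern_of(v) for v in values)


# Hypothetical sample values for a free-text description field.
samples = ["12 Main St", "7 Oak Ave", "Widget 10kg box", "John Smith"]
counts = pattern_counts(samples)
```

The most frequent patterns suggest which rules to write first, and the rare ones show where the data needs a closer look by hand.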

Have you contacted someone from Ascential or a Data Cleansing expert to evaluate your situation and offer a solution? I would suggest a combination of Integrity and DataStage, if you have the luxury of having both tools!

Please let us know if you need more info!

Tim

Posted: Mon Jul 14, 2003 6:25 pm
by ray.wurlod
The "two tools" solution is excellent. DataStage (6.0 or later) can reformat the data into the fixed-width format that INTEGRITY requires (within these fields the data can be free format, but rules must have been created to make sense of them). From DataStage you can invoke INTEGRITY through a stage, which means that the data do not touch down on disk. The results are returned to become the output of that stage, again meaning that the data do not touch down on disk. Throughput is excellent. Since the Parallel Extender architecture underpins both products, the advantages of this technology can be obtained too, allowing efficient processing of huge volumes of data.