How can we handle unstructured data?

Madhavan VM · Post by **Madhavan VM** » Wed Nov 23, 2005 2:49 am

Unstructured data has no defined format at the atomic level. The metadata is not defined and the data is unformatted. Data which resides in mails, PDF documents, Microsoft Excel spreadsheets and Word documents can be regarded as unstructured data.

Structured data has a known format like char, integer and so on and can be queried to get the desired result.

To quote Bill Inmon, who is considered the father of data warehousing: "The challenge of integrating critical knowledge coordinates buried in volumes of unstructured data may become the single largest issue for IT organizations in coming years."

It is also said that 80% of data is unstructured. This means that the data we populate from structured sources amounts to only 20% of the data in the warehouse.

This leaves us with the question of how we handle unstructured data. How do we extract it? If somebody has good insights into this topic, could they throw some light on the points below?

I want to know how unstructured data is handled in:
1. Mails. How do we take care of attachments, if any?
2. Word documents, Microsoft Excel spreadsheets and PDF documents.

Apart from the areas mentioned, is there any other area where unstructured data resides?

Madhavan VM · Post by **Madhavan VM** » Wed Nov 23, 2005 2:52 am

To add to the above topic: can we handle data which resides in pictures? What about audio and video files?

ArndW · Post by **ArndW** » Wed Nov 23, 2005 3:04 am

The concept of "structured" data is always relative to the setting. If I look at a table with 2 columns containing a name and a date of birth, I consider this to be structured; if this table is in an Excel sheet then it is structured for Excel; if it is a digital picture of a screenshot of the same document it is generally considered to be non-structured.

Much of the work done in any database and ETL process is the conversion of data from an "unstructured" (for the target system) form into one that is structured.

Most databases have constructs or datatypes that allow large amounts of data to be shoveled in without regard to their content (look at BLOBs). This can be used to store data that has no structure for the DB.
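As a minimal sketch of the BLOB idea, the following uses Python's standard `sqlite3` module to store arbitrary bytes without the database knowing anything about their content. The table name `documents`, its columns, and the helper names are assumptions made for illustration only; other databases have their own BLOB types and drivers.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database and create a table with one BLOB column (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, content BLOB NOT NULL)"
    )
    return conn


def store_blob(conn, name, payload):
    """Insert raw bytes as-is and return the id of the new row."""
    cur = conn.execute(
        "INSERT INTO documents (name, content) VALUES (?, ?)",
        (name, payload),
    )
    conn.commit()
    return cur.lastrowid


def load_blob(conn, doc_id):
    """Return the stored bytes, or None if the id does not exist."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return None if row is None else bytes(row[0])


if __name__ == "__main__":
    conn = open_store()
    doc_id = store_blob(conn, "scan.bin", b"\x00\x01\x02 any bytes at all")
    print(len(load_blob(conn, doc_id)), "bytes read back")
```

The database can return the bytes unchanged, but it cannot query inside them, which is exactly what makes the content unstructured from its point of view.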

I think you might need to refine your question - there are too many possibilities to really answer easily.

kduke · Post by **kduke** » Wed Nov 23, 2005 5:06 am

I think the issue is what information you want from these documents. Google can index anything in these documents. Until you know what structured data you want then you have no starting point. There has to be something that needs to end up in a data warehouse table that someone thinks is valuable before you will ever need to tackle this problem.

Knowing that gives you a target table with target columns.
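To make the "decide the target columns first" point concrete, here is a small sketch using Python's standard `email` package. It assumes the target columns are sender, subject, sent date and attachment file names; those column names, the function names and the sample message are hypothetical and chosen only for illustration.

```python
from email import message_from_string
from email.message import EmailMessage
from email.policy import default


def message_to_row(raw_message):
    """Reduce one raw mail to the (assumed) target columns."""
    msg = message_from_string(raw_message, policy=default)
    return {
        "sender": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "sent_date": str(msg.get("Date", "")),
        # Only the file names are kept here; the attachment bytes stay in the mail.
        "attachment_names": [
            part.get_filename() for part in msg.iter_attachments()
        ],
    }


def build_sample_message():
    """Build a small mail with one attachment, used only to exercise the sketch."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["Subject"] = "Monthly invoice"
    msg.set_content("Please find the invoice attached.")
    msg.add_attachment(
        b"placeholder bytes",
        maintype="application",
        subtype="octet-stream",
        filename="invoice.pdf",
    )
    return msg.as_string()


if __name__ == "__main__":
    print(message_to_row(build_sample_message()))
```

Everything not named as a target column (the body text, the attachment content) is simply ignored, which is the point: the extraction only becomes well defined once someone has said which fields are valuable.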

vmcburney · Post by **vmcburney** » Wed Nov 23, 2005 10:54 am

There was a session on unstructured data at IILive2005. IBM have various projects going ahead using a unified unstructured information management architecture (UIMA). They have software research efforts looking at multi-media, taxonomy generation, translation, search, text analysis, applications and semantics. The question isn't how unstructured data is handled but what you want to do with it.

A UIMA SDK was released on Alphaworks last year.

WebSphere Information Integrator Content Edition is aimed at unstructured data and can retrieve information from image and document management systems, report management systems, web content and network file systems, as well as structured database sources.

ray.wurlod · Post by **ray.wurlod** » Thu Nov 24, 2005 2:20 am

With the new common connectors, DataStage EE (in Hawk) has been extended to handle BLOBs. This support allows BLOBs to be moved from source to target without paying a huge performance penalty. Only a reference to the BLOB goes through DataStage; the BLOB is not actually moved until the target is written.

There will never be support for BLOBs in server edition.
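The general idea of passing a reference rather than the BLOB itself can be sketched in plain Python. This is not how DataStage implements it; `BlobRef`, `rename_stage` and `write_target` are hypothetical names, and a file path stands in for whatever handle a real connector would use.

```python
import shutil
import tempfile
from dataclasses import dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class BlobRef:
    """Lightweight record that travels through the pipeline instead of the bytes."""
    source_path: Path
    target_name: str


def rename_stage(ref):
    """An intermediate stage: it changes metadata only and never reads the file."""
    return replace(ref, target_name=ref.target_name.lower())


def write_target(ref, target_dir):
    """The final stage: the only place where the bytes are actually copied."""
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / ref.target_name
    shutil.copyfile(ref.source_path, destination)
    return destination


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "SCAN.TIF"
        source.write_bytes(b"\x00" * 1024)
        ref = rename_stage(BlobRef(source, "SCAN.TIF"))
        print(write_target(ref, Path(tmp) / "out"))
```

Because the intermediate stages handle only a small record, their cost does not grow with the size of the BLOB.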

vmcburney · Post by **vmcburney** » Thu Nov 24, 2005 5:17 pm

IBM see Information Integration accessing federated and unstructured data and the Ascential suite processing those parts of it that it can handle. Adding BLOB support to DataStage is part of this. There are some highly specialised tools out there for identifying text in pictures or converting legacy documents (e.g. WordPerfect) to PDF. You can find them with web searches.

trokosz · Post by **trokosz** » Tue Nov 29, 2005 4:41 pm

You may want to check out DataStage TX as a possibility.