How can we handle unstructured data?

Madhavan VM
Participant
Posts: 33
Joined: Sat Jul 02, 2005 2:27 am
Location: Bangalore

How can we handle unstructured data?

Post by Madhavan VM »

Unstructured data has no defined format at the atomic level. The metadata is not defined and the data is unformatted. Data that resides in mails, PDF documents, Microsoft Excel spreadsheets and Word documents can be considered unstructured data.

Structured data has a known format like char, integer and so on and can be queried to get the desired result.

To quote Bill Inmon, who is considered the father of data warehousing: "The challenge of integrating critical knowledge coordinates buried in volumes of unstructured data may become the single largest issue for IT organizations in coming years."

It is also said that 80% of data is unstructured, which means the data we load into the warehouse from structured sources amounts to only 20% of the data.

This leads us to the question of how we handle unstructured data. How do we extract it? If somebody has good insights into this topic, could they throw some light on the points below?

I want to know how unstructured data is handled in:
1. Mails. How do we take care of attachments in mails, if any?
2. Word documents, Microsoft Excel spreadsheets and PDF documents.

Apart from the areas mentioned, is there any other area where unstructured data resides?
warm regards,
Ajith GK
Madhavan VM
Participant
Posts: 33
Joined: Sat Jul 02, 2005 2:27 am
Location: Bangalore

Post by Madhavan VM »

To add to the above topic, can we handle data that resides in pictures? What about audio and video files?
warm regards,
Ajith GK
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The concept of "structured" data is always relative to the setting. If I look at a table with 2 columns containing a name and a date of birth, I consider this to be structured; if this table is in an Excel sheet then it is structured for Excel; if it is a digital picture of a screen shot of the same document it is generally considered to be non-structured.

Much of the work done in any database and ETL process is the conversion of data from an "unstructured" (for the target system) form into one that is structured.

Most databases have constructs or datatypes that allow large amounts of data to be shoveled in without regard to their content (look at BLOBs). This can be used to store data that has no structure for the DB.
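As a small illustration of the BLOB idea, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and helper functions are made up for the example; the point is only that the database stores the bytes without knowing or caring what they contain.

```python
import sqlite3


def store_document(conn, name, payload):
    """Insert raw bytes into a BLOB column without interpreting them."""
    conn.execute(
        "INSERT INTO documents (name, content) VALUES (?, ?)",
        (name, payload),
    )


def load_document(conn, name):
    """Return the stored bytes for a name, or None if there is no such row."""
    row = conn.execute(
        "SELECT content FROM documents WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


# Hypothetical table: the database only sees a name and an opaque byte string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT PRIMARY KEY, content BLOB)")

payload = b"%PDF-1.4 \x00\x01\x02 arbitrary bytes"
store_document(conn, "report.pdf", payload)
restored = load_document(conn, "report.pdf")
```

The bytes come back exactly as they went in, but nothing inside them can be queried, which is why such content counts as unstructured for the database.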

I think you might need to refine your question - there are too many possibilities to really answer easily.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I think the issue is what information you want from these documents. Google can index anything in these documents. Until you know what structured data you want, you have no starting point. There has to be something that needs to end up in a data warehouse table that someone thinks is valuable before you will ever need to tackle this problem.

Knowing that gives you a target table with target columns.
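
To make that concrete for the mail case raised earlier in the thread, here is a rough sketch using Python's standard email package. The target columns (sender, subject, body_text, attachment_names) and the function name are assumptions picked for illustration, not anything DataStage provides; once the columns are decided, the extraction is fairly mechanical.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def email_to_row(raw):
    """Map one raw mail message onto a set of assumed target columns."""
    msg = message_from_bytes(raw, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "sender": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "body_text": body.get_content().strip() if body is not None else "",
        # Only the names are captured here; the attachment bytes stay in the mail.
        "attachment_names": [part.get_filename() for part in msg.iter_attachments()],
    }


# Build a sample message with one attachment so the sketch is self-contained.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Q3 figures"
sample.set_content("Figures attached.")
sample.add_attachment(
    b"a,b\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="q3.csv",
)

row = email_to_row(sample.as_bytes())
```

The attachment contents themselves would still need their own extraction step, depending on what is wanted from them.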
Mamu Kim
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

There was a session on unstructured data at IILive2005. IBM have various projects going ahead using a unified unstructured information management architecture (UIMA). They have software research efforts looking at multi-media, taxonomy generation, translation, search, text analysis, applications and semantics. The question isn't how unstructured data is handled but what you want to do with it.

A UIMA SDK was released on alphaWorks last year.

WebSphere Information Integrator Content Edition is aimed at unstructured data and can retrieve information from image and document management systems, report management systems, web content and network file systems, as well as structured database sources.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

With the new common connectors, DataStage EE (in Hawk) has been extended to handle BLOBs. This support allows BLOBs to be moved from source to target without paying a huge performance penalty. Only a reference to the BLOB goes through DataStage; the BLOB is not actually moved until the target is written.
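
The pass-by-reference idea can be pictured with a generic sketch. This is not how the DataStage connectors are implemented; BlobRef, extract, transform and load are hypothetical names, and plain files stand in for the source and target. Only a small handle travels through the stages, and the bytes are read once, when the target is written.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobRef:
    """Lightweight handle that travels through the pipeline instead of the bytes."""
    path: str
    size: int


def extract(paths):
    # Source stage: emit rows that carry a reference, not the content.
    for row_id, path in enumerate(paths, start=1):
        yield {"id": row_id, "blob": BlobRef(path, os.path.getsize(path))}


def transform(rows):
    # Middle stage: works on ordinary columns and never opens the blob.
    for row in rows:
        yield {**row, "is_large": row["blob"].size > 1024}


def load(rows, target_dir):
    # Target stage: the only place where the blob bytes are actually moved.
    written = []
    for row in rows:
        target = os.path.join(target_dir, f"{row['id']}.bin")
        with open(row["blob"].path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    source = os.path.join(src_dir, "scan.bin")
    with open(source, "wb") as fh:
        fh.write(b"\x00" * 2048)
    written = load(transform(extract([source])), dst_dir)
    copied_size = os.path.getsize(written[0])
```

Because the middle stages only see the handle, their cost does not grow with the size of the blob.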

There will never be support for BLOBs in server edition.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

IBM see Information Integration accessing federated and unstructured data and the Ascential suite processing those parts of it that it can handle. Adding BLOB support to DataStage is part of this. There are some highly specialised tools out there for identifying text in pictures or converting legacy documents (e.g. WordPerfect) to PDF. You can find them with web searches.
trokosz
Premium Member
Posts: 188
Joined: Thu Sep 16, 2004 6:38 pm

Post by trokosz »

You may want to check out DataStage TX as a possibility.