Unstructured data

pillip · Post by **pillip** » Fri Nov 07, 2014 8:06 pm

Hi,

Can DataStage read unstructured data? By unstructured I mean tweets or Facebook messages. My understanding is that we can read them but will not be able to process them. Can you confirm my understanding?

Thank you

eostic · Post by **eostic** » Fri Nov 07, 2014 10:12 pm

The words "unstructured data" mean too many things to too many people, so it becomes critical that such discussions be very specific. Twitter data, for example, can be received in JSON format, which is fully readable by DataStage. Formal Excel files are fully readable by DataStage too. Image data has had various strategies over the years, at least for "moving it" or "pointing to it".

What exactly is the format of the data you need to consume?

Ernie

chulett · Post by **chulett** » Sat Nov 08, 2014 8:07 am

I'd also be curious what you mean by "process them". What kind of fate did you have in mind for this data once you've read it? Or is this just an academic question?

pillip · Post by **pillip** » Sat Nov 08, 2014 8:02 pm

It's kind of an academic question, which goes this way: can DataStage process all kinds of unstructured data available today? Can it be a replacement for Hadoop?

Thank you

ray.wurlod · Post by **ray.wurlod** » Mon Nov 10, 2014 12:56 am

Basically yes - there is an Unstructured Data stage (most people are using this to read directly from Excel). There is also a Big Data File stage (which connects to the Hadoop distributed file system), and various other mechanisms as well. Why not research this on the IBM web site and/or with your favourite search engine?

ray.wurlod · Post by **ray.wurlod** » Tue Nov 11, 2014 6:38 pm

Big Data File stage generates MapReduce under the covers.

vmcburney · Post by **vmcburney** » Wed Nov 12, 2014 10:47 pm

Information Server 11.3 added some additional support for this type of data by giving you the ability in the XML input stage to read from an API layer. So for Twitter you would have an API to read Tweets in a specific XML format. Typically you can buy from Twitter a subset of Tweet content filtered by region or topic or user type. Facebook is harder to read because content is not open (which is why IBM announced a partnership with Twitter and not Facebook). You could connect to Facebook but you can only get content from pages you are allowed to see. Again you would connect via an API and read it in as XML.

Once data is in DataStage as XML you can flatten it to relational data or output it as XML or write it to NoSQL or a Hadoop distributed file system. DataStage cannot do a lot with the content - it cannot do sentiment analysis or text analytics - you would write it out and then use SPSS or BigInsights to analyse the content.
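
For anyone wondering what "flatten it to relational data" looks like in practice, here is a minimal sketch in plain Python, outside DataStage. The XML layout, the element names (`tweets`, `tweet`, `user`, `text`, `hashtags`, `tag`) and the `flatten_tweets` helper are all invented for illustration; a real Twitter feed or a DataStage job would look different. It only shows the general idea of turning one nested document into a parent table and a child table joined on a key.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: element and attribute names are assumptions,
# not the format any real Twitter API returns.
SAMPLE_XML = """
<tweets>
  <tweet id="1">
    <user>alice</user>
    <text>Loving the new release</text>
    <hashtags><tag>etl</tag><tag>datastage</tag></hashtags>
  </tweet>
  <tweet id="2">
    <user>bob</user>
    <text>No hashtags here</text>
    <hashtags/>
  </tweet>
</tweets>
"""


def flatten_tweets(xml_text):
    """Split nested tweet XML into two flat row sets.

    Returns (tweet_rows, hashtag_rows):
      tweet_rows   -> (tweet_id, user, text), one row per tweet
      hashtag_rows -> (tweet_id, tag), one row per hashtag
    """
    root = ET.fromstring(xml_text.strip())
    tweet_rows, hashtag_rows = [], []
    for tweet in root.findall("tweet"):
        tweet_id = int(tweet.get("id"))
        tweet_rows.append(
            (tweet_id, tweet.findtext("user"), tweet.findtext("text"))
        )
        # The repeating child element becomes its own table,
        # carrying the parent key so the two can be joined later.
        for tag in tweet.findall("hashtags/tag"):
            hashtag_rows.append((tweet_id, tag.text))
    return tweet_rows, hashtag_rows


tweet_rows, hashtag_rows = flatten_tweets(SAMPLE_XML)
```

The two row lists could then be loaded into ordinary database tables. Anything like sentiment scoring on the `text` column would still have to happen in a separate tool, as noted above.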