extract data from Emails, PDF files ?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

chandra_babu_999
Participant
Posts: 13
Joined: Thu Jul 17, 2008 4:11 am

extract data from Emails, PDF files ?

Post by chandra_babu_999 »

How can we extract data from emails and PDF files?
Chandra Sekhar
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Short answer? No. You might want to expand on what 'extract data' means, however. Any specifics you'd care to share?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

When's the interview?

IBM has lots of information it wants to share with you about accessing unstructured data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's that. There was also a ClickPack available that would allow you to access emails and weblogs from what I recall, but I say "was" because AFAIK support for it has been dropped and it's not a part of the 8.x release. It was kind of interesting as it brought Perl support into DataStage.

I don't think you could even do much with a pdf. Unless, perhaps, you had your hands on a third-party tool / bridge / something that could read them. [shrug]
-craig

"You can never have too many knives" -- Logan Nine Fingers
John Smith
Charter Member
Posts: 193
Joined: Tue Sep 05, 2006 8:01 pm
Location: Australia

Post by John Smith »

Hi,
PDF files are usually hard to extract data from, but it is possible. There is software (not DataStage, though) that can read a PDF and convert it into Excel or CSV files. One that comes to mind is Able2Extract. Just Google it.
Once the data is in CSV format, you can use DataStage to process it.

As for emails, it depends on the email system. The underlying email database may be queryable.
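
The same "convert to CSV first, then load" idea can be sketched for emails. This is only an illustration under stated assumptions, not something DataStage provides: it assumes the messages are available as raw RFC 822 text (for example exported .eml files), and the function names email_to_row and rows_to_csv, the column list, and the sample message are all made up for the example. It uses only the Python standard library.

```python
import csv
import io
from email import message_from_string
from email.policy import default

# Hypothetical column layout for the CSV that an ETL job would read.
FIELDS = ["from", "to", "subject", "date", "body"]


def email_to_row(raw_message: str) -> dict:
    """Parse one raw RFC 822 message into a flat dict of header and body fields."""
    msg = message_from_string(raw_message, policy=default)
    # Prefer the plain-text part; attachments and HTML are ignored in this sketch.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content().strip() if body_part is not None else ""
    return {
        "from": str(msg["From"] or ""),
        "to": str(msg["To"] or ""),
        "subject": str(msg["Subject"] or ""),
        "date": str(msg["Date"] or ""),
        "body": body,
    }


def rows_to_csv(rows: list) -> str:
    """Serialize parsed messages as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample message, standing in for an exported .eml file.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Order 1001\n"
    "Date: Thu, 17 Jul 2008 04:11:00 +0000\n"
    "\n"
    "Quantity: 3\n"
)

row = email_to_row(SAMPLE)
csv_text = rows_to_csv([row])
print(csv_text)
```

The resulting CSV could then be read like any other delimited flat file. Getting the raw messages out of the mail system in the first place still depends on which email system is in use.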

Cheers,
JS
chandra_babu_999
Participant
Posts: 13
Joined: Thu Jul 17, 2008 4:11 am

Post by chandra_babu_999 »

ray.wurlod wrote: When's the interview?

IBM has lots of information it wants to share with you about accessing unstructured data. ...
Hi ray,

Thanks for the reply.

Actually, we are evaluating different ETL tools to suggest to my client. My client has a requirement to extract data from MS Excel, XML, emails and PDFs.
As far as I know, DataStage satisfies all the requirements, but I need to clarify whether it can extract email and PDF data.

I am not sure at this point what kind of data the email system holds.
Chandra Sekhar
chandra_babu_999
Participant
Posts: 13
Joined: Thu Jul 17, 2008 4:11 am

Post by chandra_babu_999 »

John Smith,

Thanks for the reply.
Chandra Sekhar