Reading PDF document

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

das_nirmalya
Participant
Posts: 59
Joined: Thu Mar 20, 2008 12:11 am

Post by das_nirmalya »

We have a requirement to read a PDF that is embedded in an XML document.
We need to parse the XML document and read the PDF content using DataStage 9.1.

Please let me know how this can be done.
nsd
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Okay... while you should be able to parse the XML using DataStage and perhaps even retrieve the pdf document in some fashion, we'd need more details to provide cogent help. How is it stored in the XML? And what specifically do you mean by "read the PDF content" in an ETL context?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ozgurgul
Premium Member
Posts: 9
Joined: Tue Jan 31, 2006 9:07 am

Post by ozgurgul »

Hi - After pulling the PDF document out of the XML, you can convert the file to text using one of the PDF2TEXT-style converters, then read the text and filter the content as you like.
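As a rough illustration of the convert-then-filter idea, here is a minimal Python sketch. It assumes the converter is Poppler's `pdftotext` command-line tool and that it is installed on the PATH; the thread does not say which converter is meant, so treat that choice as an assumption. The function names `build_pdftotext_command`, `pdf_to_text`, and `filter_lines` are hypothetical helpers made up for this example.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build the argument list for pdftotext (assumed converter).

    Passing "-" as the output file makes pdftotext write the text to stdout.
    """
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the page layout in the text
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path: str) -> str:
    """Run the converter and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise if the converter exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")


def filter_lines(text: str, keyword: str) -> List[str]:
    """Keep only the lines that mention the keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [line.strip() for line in text.splitlines() if wanted in line.lower()]
```

The filtered lines could then be written to a flat file and picked up by a Sequential File stage, which keeps the PDF handling outside the job itself.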

Hope this helps at a high level.

Regards,
Ozgur
Ozgur GUL
Assumption is the mother of all mistakes!
ozgurgul
Premium Member
Posts: 9
Joined: Tue Jan 31, 2006 9:07 am

Post by ozgurgul »

Alternatively, here is the approach from the Oracle community thread "Re: Extract an embedded pdf file from xml":

1) Load the XML files into the database.
2) Extract the CLOB data from the XML and store it in a CLOB.
3) Convert the CLOB data from base64 to binary and store the result in a BLOB.
4) Write the BLOB data to an OS file using UTL_FILE in raw mode.
5) Use Java, Python, pdf2txt, etc. to convert the PDF file to text and filter for what you're seeking.
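
If a database round trip is not required, steps 2 to 4 above could also be sketched outside the database. The Python below is only an illustration under assumptions: it assumes the PDF is stored as base64 text inside a single element, and the element name `Attachment` is hypothetical since the thread never shows the actual XML. If the XML uses namespaces, the element name would need to be namespace-qualified.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_pdf_bytes(xml_text: str, element_name: str = "Attachment") -> bytes:
    """Return the decoded PDF bytes held as base64 text in one XML element.

    `element_name` is a placeholder; substitute the real tag from your XML.
    """
    root = ET.fromstring(xml_text)
    # find() only searches descendants, so check the root element separately.
    node = root if root.tag == element_name else root.find(f".//{element_name}")
    if node is None or not (node.text or "").strip():
        raise ValueError(f"no non-empty <{element_name}> element found")
    # Base64 payloads are often wrapped across lines; drop all whitespace first.
    payload = "".join(node.text.split())
    data = base64.b64decode(payload, validate=True)
    # Every PDF file starts with the "%PDF-" signature.
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded payload does not look like a PDF")
    return data


def write_pdf(data: bytes, out_path: str) -> int:
    """Write the bytes in binary mode and return the number of bytes written."""
    return Path(out_path).write_bytes(data)
```

The resulting file can then be handed to whatever converter is chosen for step 5.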

https://community.oracle.com/thread/1114383

Regards,
Ozgur
Ozgur GUL
Assumption is the mother of all mistakes!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Why must you do this "using DataStage"?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.