Reading PDF document

Posted: Wed Jun 03, 2015 12:05 am
by das_nirmalya
We have a requirement to read a PDF that is embedded in an XML document.
We need to parse the XML document and read the PDF content using DataStage 9.1.

Please let me know how this can be done.

Posted: Wed Jun 03, 2015 10:12 am
by chulett
Okay... while you should be able to parse the XML using DataStage and perhaps even retrieve the pdf document in some fashion, we'd need more details to provide cogent help. How is it stored in the XML? And what specifically do you mean by "read the PDF content" in an ETL context?

Posted: Thu Jun 25, 2015 9:36 am
by ozgurgul
Hi - After pulling the PDF document out of the XML, you can convert the file to text using a PDF-to-text tool such as PDF2TEXT, then read the text and filter the content as you like.
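
As a rough sketch of the conversion step, assuming the PDF has already been written to disk and that the Poppler/Xpdf `pdftotext` command-line tool is the converter being used (that choice is an assumption; other PDF-to-text tools take different arguments). The function names and file paths are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext call (does not run anything)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    return cmd + [pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text.

    Assumes pdftotext is installed and on PATH; raises otherwise.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as fh:
        return fh.read()
```

Something like `pdf_to_text("invoice.pdf", "invoice.txt")` would then give you a string to filter. The same command line could also be called from outside Python, for example from a job's before/after routine.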

Hope this helps at a high level.

Regards,
Ozgur

Posted: Thu Jun 25, 2015 9:50 am
by ozgurgul
Alternatively,

Re: Extract an embedded pdf file from xml

1) Load the XML files into the database
2) Extract the base64-encoded PDF data from the XML and store it in a CLOB
3) Convert the CLOB data from base64 to binary and store the result in a BLOB
4) Write the BLOB data to an O/S file using UTL_FILE, writing in raw mode
5) Use Java, Python, pdf2txt, etc. to convert the PDF file to text and filter out what you're looking for
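
For comparison, the decode-and-write part (steps 2 to 4) can also be sketched outside the database. This is only a minimal illustration in Python, assuming the PDF sits as base64 text inside a single element; the element name `PdfData`, the function name, and the output path are hypothetical, since the thread does not say how the PDF is stored in the XML.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_embedded_pdf(xml_text, tag, out_path):
    """Decode the base64 payload of the first <tag> element and write it to out_path.

    Returns the number of bytes written. `tag` is whatever element holds
    the PDF in your XML (an assumption here).
    """
    root = ET.fromstring(xml_text)
    # root.iter() walks the whole tree, including the root element itself.
    node = next(root.iter(tag), None)
    if node is None or not (node.text or "").strip():
        raise ValueError("no non-empty <%s> element found" % tag)
    # With the default validate=False, b64decode ignores embedded
    # whitespace and line breaks in the payload.
    pdf_bytes = base64.b64decode(node.text)
    # Sanity check: PDF files start with the "%PDF-" header.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("decoded payload does not start with a PDF header")
    Path(out_path).write_bytes(pdf_bytes)
    return len(pdf_bytes)
```

A call such as `extract_embedded_pdf(xml_text, "PdfData", "out.pdf")` leaves a binary PDF on disk for step 5. If the XML uses namespaces, ElementTree expects the tag in `{namespace-uri}PdfData` form. If the PDF is stored some other way (hex, an attachment reference), this sketch does not apply as written.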

https://community.oracle.com/thread/1114383

Regards,
Ozgur

Posted: Thu Jun 25, 2015 4:02 pm
by ray.wurlod
Why must you do this "using DataStage"?