Page 1 of 1

Importing PDF's into SQL Database

Posted: Thu Feb 12, 2004 7:05 am
by ewartpm
:? Hi guys

Is it possible, using DataStage, to store a PDF document in a SQL Server database. If it is possible, what stage would I use to do this?

Many Thanks.

Posted: Thu Feb 12, 2004 8:35 am
by kcbland
You DO NOT want to EVER pass blobs, clobs, whatever thru an ETL tool. I hope I'm clear enough on this matter.

You should be passing tokens such as a filename in its place. This means that you don't have to juggle megabytes of data and pass it around. Think about the underlying mechanics of what's going on. You cannot use bulk loaders, you cannot use sequential (text files remember) files. Every byte of data has to pass thru the ETL pipeline. You do not have, and IMHO, should have, the ability to clobs/blobs in the tool.

You're going to have to deal with manually moving this type of data into the database. I would suggest preserving the token (filename) as a distinct column, and then after-load updating the data to move clobs/blobs into the database table.

Posted: Thu Feb 12, 2004 3:20 pm
by ray.wurlod
I second Ken's remarks. :idea:

If you must do it, pre-encode it with something that converts it to a text-compatible encoding, use Sequential File stages, don't attempt to transform it, and decode the resulting text file prior to bulk loading into SQL Server. Yuk.

Posted: Thu Feb 12, 2004 6:34 pm
by vmcburney
SQL Server has the ntext field which can accept text with up to 2 ^ 30 - 1 characters, which is just over 1 billion for those of you who, like me, can't compute those numbers in our heads. This should be sufficient if you want to encode your PDF files into a text stream. Something you can do quite easily in .Net. It's a pretty common technique when you want to stick a binary file, such as a PDF, into an XML object.

You have also got the BLOB data type but again you need something external to the ETL tool to read and write the BLOB. Have a look at this Microsoft support article on using .Net to do this:
http://support.microsoft.com/?id=326502

You can also leave your PDFs on the filesystem and put a link to them in the SQL Server database. This is the easiest approach but you've got issues with security and access.

Posted: Fri Feb 13, 2004 6:45 am
by ewartpm
Thanks guys, for the comments, it's much appreciated. I think I will go with the 'link' option and then take it from there.