Compress datasets

Post by devidotcom »

Hi All,

We create a huge dataset, close to 25GB in size, which is later used for processing.

Along with this, many small datasets and flat files are created as well. The estimated total size is really huge, as the source files are close to 100GB too.

We are looking at compressing the 25GB dataset using a Compress stage and then using it in later jobs by uncompressing it with an Expand stage.

I would like to know if there will be any performance hit: we save the space, but whenever we read this compressed dataset we spend time uncompressing it.

Will the time taken really be significant?

Thanks
Devi

Post by ArndW »

The Compress stage uses a UNIX command to compress data in the stream. Normally you would output to a sequential file, but I think you can pass data downstream as well. The Compress stage as a "pass-through" would only help if you had few but very wide columns.
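
In osh terms it looks something like this (a rough sketch from memory, not a tested job; pcompress is the operator behind the Compress/Expand stages, and option spellings can vary by release; the file and dataset names are made up):

    # job 1: import the big flat file, compress the stream, land it as a dataset
    osh "import -file big_extract.txt -schema record(...) | pcompress -compress > big_compressed.ds"

    # later job: expand the stream again before using the records
    osh "< big_compressed.ds pcompress -expand | peek"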

You can use sequential files with compression instead of datasets to save disk space. You could also look at using unbounded strings in your dataset definitions to save space.
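
At the shell level that can be as simple as the following (filenames are hypothetical; whether you hook this in via the Sequential File stage or an after-job command is up to you):

    # land the extract as a flat file, then compress it on disk
    gzip big_extract.txt                  # leaves big_extract.txt.gz in its place
    # later, stream it back out without writing an uncompressed copy to disk
    gzip -dc big_extract.txt.gz | wc -l   # wc -l just stands in for the real reader

Since gzip -dc writes to stdout, the downstream reader never needs a full-size uncompressed copy on disk.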

I've worked on systems with slow disk I/O transfer rates but with lots of spare CPU capacity, and in that case we sped up performance significantly by using compressed files.

Post by devidotcom »

Whenever we use a Compress stage, on the Output tab we need to mention the column name and the datatype, and I am not sure what we need to define there. Is there a way to calculate and define the output columns based on the number of columns input to the Compress stage?

Thanks
Devi

Post by devidotcom »

Thank you, ArndW, for your reply.

We will certainly take your inputs. But our dataset has just 8 columns, while the number of records is huge, so I may have to compress the dataset along with the sequential files.

Post by ArndW »

My point is that if compression is necessary, then use sequential files instead of data sets to keep your data.

Post by devidotcom »

Why is that, ArndW? Why not store the compressed file in a dataset?

Post by ArndW »

A dataset isn't a single UNIX object; it is a descriptor file that points to the actual data files, and those can't be compressed automagically. So compressing the descriptor file doesn't do anything at all.
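
You can see that at the shell level (the paths below are invented, and the segment file naming varies by configuration):

    # the descriptor is a small file in your project area...
    ls -l /proj/ds/big.ds
    # ...whose partly readable contents name the real data files
    strings /proj/ds/big.ds | head
    # the data itself sits on the resource disks, one or more files per partition
    ls -l /data/resource*/big.ds*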

The columns could be compressed, but the gain from compressing a string of 100 bytes is going to be much less than from compressing a file with many thousands of 100-byte records (the compression algorithm does Huffman-like encoding that gets better the more data it sees).
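
You can convince yourself of that with a couple of throwaway files (identical repeated records compress unrealistically well, but the trend is the same for real data):

    # one 100-byte "record" compressed alone: the format overhead dominates
    head -c 100 /etc/services > rec.txt
    gzip -c rec.txt | wc -c      # often comes out larger than the 100 input bytes
    # the same record repeated ten thousand times compresses very well
    for i in $(seq 1 10000); do cat rec.txt; done > recs.txt
    gzip -c recs.txt | wc -c     # a tiny fraction of the 1,000,000 input bytes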