Compress datasets

Post by devidotcom »

Hi All,

We create a huge dataset, close to 25GB in size, which is later used for processing.

Along with this, many small datasets and flat files are created as well. The estimated total size is really huge, as the source files are close to 100GB too.

We are looking at compressing the 25GB dataset using a Compress stage and then using it in later jobs by uncompressing it with an Expand stage.

I would like to know if there will be any performance hit: we save the space, but whenever we read this compressed dataset we spend time uncompressing it.

Will the time taken really be significant?

Thanks
Devi

Post by ArndW »

The Compress stage uses a UNIX command to compress data in the stream. Normally you would output to a sequential file, but I think you can pass data downstream as well. The Compress stage as a "pass-through" would only help if you had few but very wide columns.
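
In osh terms it looks something like this (a rough sketch from memory, not a tested job; pcompress is the operator behind the Compress/Expand stages, and option spellings can vary by release; the file and dataset names are made up):

    # job 1: import the big flat file, compress the stream, land it as a dataset
    osh "import -file big_extract.txt -schema record(...) | pcompress -compress > big_compressed.ds"

    # later job: expand the stream again before using the records
    osh "< big_compressed.ds pcompress -expand | peek"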

You can use sequential files with compression instead of datasets to save disk space. You could also look at using unbounded strings in your dataset definitions to save space.
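
At the shell level that can be as simple as the following (filenames are hypothetical; whether you hook this in via the Sequential File stage or an after-job command is up to you):

    # land the extract as a flat file, then compress it on disk
    gzip big_extract.txt                  # leaves big_extract.txt.gz in its place
    # later, stream it back out without writing an uncompressed copy to disk
    gzip -dc big_extract.txt.gz | wc -l   # wc -l just stands in for the real reader

Since gzip -dc writes to stdout, the downstream reader never needs a full-size uncompressed copy on disk.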

I've worked on systems with slow disk I/O transfer rates but with lots of spare CPU capacity, and in that case we sped up performance significantly by using compressed files.

Post by devidotcom »

Whenever we use a Compress stage, on the Output tab we need to mention the column name and the datatype, and I am not sure what we need to define there. Is there a way to calculate and define the output columns based on the number of columns input to the Compress stage?

Thanks
Devi

Post by devidotcom »

Thank you, ArndW, for your reply.

We will certainly take your inputs. But our dataset has just 8 columns, while the number of records is huge, so I may have to compress the dataset along with the sequential files.

Post by ArndW »

My point is that if compression is necessary, then use sequential files instead of data sets to keep your data.

Post by devidotcom »

Why is that, ArndW? Why not store the compressed file in a dataset?

Post by ArndW »

A dataset isn't a single UNIX object; it is a descriptor file that points to the actual data files, and those can't be compressed automagically. So compressing the descriptor file doesn't do anything at all.
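
You can see that at the shell level (the paths below are invented, and the segment file naming varies by configuration):

    # the descriptor is a small file in your project area...
    ls -l /proj/ds/big.ds
    # ...whose partly readable contents name the real data files
    strings /proj/ds/big.ds | head
    # the data itself sits on the resource disks, one or more files per partition
    ls -l /data/resource*/big.ds*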

The columns could be compressed, but the gain from compressing a string of 100 bytes is going to be much less than from compressing a file with many thousands of 100-byte records (the compression algorithm does Huffman-like encoding that gets better the more data it sees).
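
You can convince yourself of that with a couple of throwaway files (identical repeated records compress unrealistically well, but the trend is the same for real data):

    # one 100-byte "record" compressed alone: the format overhead dominates
    head -c 100 /etc/services > rec.txt
    gzip -c rec.txt | wc -c      # often comes out larger than the 100 input bytes
    # the same record repeated ten thousand times compresses very well
    for i in $(seq 1 10000); do cat rec.txt; done > recs.txt
    gzip -c recs.txt | wc -c     # a tiny fraction of the 1,000,000 input bytes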