Hi All,
We create a very large dataset, close to 25 GB in size, which is used for processing in later jobs.
Alongside it, many small datasets and flat files are also created, so the total footprint is huge, since the source files are close to 100 GB as well.
We are looking at compressing the 25 GB dataset with a Compress stage and then using it in later jobs by uncompressing it with an Expand stage.
I would like to know whether there will be a performance penalty: we may save the space, but every time we read this compressed dataset we will spend time uncompressing it.
Will that take a really significant amount of time?
Thanks
Devi
Compress datasets
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 247
- Joined: Thu Apr 27, 2006 6:38 am
- Location: Hyderabad
The Compress stage uses a UNIX command to compress the data in the stream. Normally you would output to a sequential file, but I think you can pass the data downstream as well. As a "pass-through", the Compress stage would only help if you had few but very wide columns.
You can use sequential files with compression instead of datasets to save disk space. You could also look at using unbounded strings in your dataset definitions to save space.
I've worked on systems with slow disk I/O transfer rates but with lots of spare CPU capacity, and in that case we sped up performance significantly by using compressed files.
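The trade-off described above can be sketched in a few lines. This is a minimal illustration using Python's gzip module, not the actual Compress/Expand stages (which invoke a UNIX compression command under the covers): writing pays a CPU cost up front, reading pays one on every pass, and in exchange the bytes hitting the disk shrink dramatically for repetitive row data.

```python
# Sketch of the Compress-on-write / Expand-on-read trade-off, using
# Python's gzip module purely for illustration.
import gzip
import os

# Fabricate 10,000 repetitive 100-byte records, like a wide flat file.
records = [f"row_{i:06d}," + "x" * 90 + "\n" for i in range(10_000)]
raw = "".join(records).encode()

# Write side: spend CPU compressing, save disk I/O (the Compress stage role).
with gzip.open("staged.gz", "wb") as f:
    f.write(raw)

# Read side: every downstream read pays the decompression cost
# (the Expand stage role), but pulls far fewer bytes off disk.
with gzip.open("staged.gz", "rb") as f:
    restored = f.read()

assert restored == raw  # lossless round trip
print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {os.path.getsize('staged.gz')}")
```

Whether the CPU spent expanding outweighs the I/O saved depends entirely on the box: on an I/O-bound system with spare CPU, compression usually wins.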
My point is that if compression is necessary, then use sequential files rather than data sets to store your data.
A DataSet isn't a single UNIX object; it is a descriptor file that points to the actual data files, and those can't be compressed automagically. So compressing the descriptor file achieves nothing at all.
The columns could be compressed, but the gain from compressing a single 100-byte string is going to be much smaller than from compressing a file with many thousands of 100-byte records (the compression algorithm does Huffman-like encoding that gets better the more data it sees).
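That scaling effect is easy to demonstrate. The sketch below uses zlib as a stand-in codec (an assumption for illustration; it is not what DataStage uses internally): one 100-byte field barely shrinks, while a file of 10,000 similar records compresses to a tiny fraction of its size, because the codec's model improves as it sees more data.

```python
# Illustration: dictionary/Huffman-style compression gains little on one
# small field, but a lot on many similar records. zlib is used here only
# as a convenient example codec.
import zlib

one_field = b"customer_name_padded_to_width" + b" " * 71  # one 100-byte column value
many_rows = one_field * 10_000                            # a file of 10,000 such records

small = zlib.compress(one_field)
large = zlib.compress(many_rows)

print(f"one field:  {len(one_field):>9} -> {len(small)} bytes")
print(f"whole file: {len(many_rows):>9} -> {len(large)} bytes")
# The whole file's per-record cost ends up far below the single-field
# compressed size: compressing column-by-column forfeits most of the gain.
```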