How to Unzip files using datastage?

Posted: Mon Mar 25, 2013 9:34 am
by pkll
Hi,

I have a requirement where I need to unzip files using Unix, either through the DataStage Execute Command activity or through before/after-job subroutines.

For example my file name is test.gz
The zipped file is 271,339 KB. I have tried unzipping it manually; after the unzip, the file size is 4.5 GB.

But I need to unzip that file using DataStage, not manually, and read it using the Sequential File stage (which is not supporting reads of more than 3 GB). So I also need to split that 4.5 GB into two files (file1 at 2.25 GB, file2 at 2.25 GB), again using DataStage.

Could you please help me ?

Posted: Mon Mar 25, 2013 10:40 am
by PaulVL
Are you on a stand-alone server or arranged in a cluster environment where you have a conductor and some compute nodes?

If you are on a cluster, farm that unzip off to the compute nodes; as an admin I would slap you if you did it on the conductor.

Easy way to farm that off is to open an external source stage in a job and limit it to one node.

Understand your data before you break it up into two files.

The way you manipulate the data will help you determine how your split will impact your ETL logic.

Example: If you split the file in two and run the ETL job twice, how will your remove-duplicates logic work? You'd have to push that to the DB side if you are interacting with one.

Sorting the data? trouble.

etc...

Posted: Mon Mar 25, 2013 11:44 am
by chulett
I'd suggest taking DataStage out of the picture for now. How would you do this in general, what approach would you take? Once you figure that out then it would be easy to implement those steps in the tool, meaning it executes and monitors them. Unless you're thinking you want a "pure DataStage" solution? Not sure that should be the path here...

Re: How to Unzip files using datastage?

Posted: Mon Mar 25, 2013 12:34 pm
by priyadarshikunal
pkll wrote:But I need to unzip that file using DataStage, not manually, and read it using the Sequential File stage (which is not supporting reads of more than 3 GB).
And what is this 3 GB limit for sequential file stage?

Also, understand your data: on what basis do you need to split? Is the file fixed-width or delimited? Will you require any sorting, aggregation, or other key-based operation?

Once you have answers to these points, set DataStage aside and just think through the steps of what needs to be done; worry about the implementation after that.

Re: How to Unzip files using datastage?

Posted: Mon Mar 25, 2013 3:41 pm
by ray.wurlod
priyadarshikunal wrote:And what is this 3 GB limit for sequential file stage?
Probably a limit on file size in 32-bit Windows operating system.

Have you (pkll) investigated the Expand stage?

Re: How to Unzip files using datastage?

Posted: Mon Mar 25, 2013 5:40 pm
by spoilt
Split the file into multiple files using a UNIX command, based on a particular line count.

For example: File1.txt, File2.txt , File3.txt etc.

Use <File pattern> in sequential file stage => File*.txt

It will read all the data from the multiple files as if they were a single one.
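A minimal sketch of the split step described above, runnable in any UNIX shell. The file name `test.txt`, the chunk size of 4 lines, and the `File` prefix are placeholders for the demo, not values from the original post (GNU split also offers `-d` and `--additional-suffix=.txt` to produce numeric names directly):

```shell
# Small sample standing in for the real unzipped 4.5 GB file.
seq 1 10 > test.txt

# Split into pieces of 4 lines each; the "File" prefix yields Fileaa, Fileab, ...
split -l 4 test.txt File

# Rename the pieces to the File1.txt, File2.txt ... style the post describes.
n=1
for f in Filea?; do
  mv "$f" "File$n.txt"
  n=$((n + 1))
done

# The Sequential File stage's File Pattern "File*.txt" then reads them as one
# stream; cat over the same names demonstrates the reassembly.
cat File1.txt File2.txt File3.txt
```

The pieces concatenate back to the original content, which is what lets the File Pattern read treat them as a single source.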

Re: How to Unzip files using datastage?

Posted: Mon Mar 25, 2013 8:18 pm
by pkll
Hi priyadarshikunal,Ray,

Yes, Ray is correct I am trying to expand data and extract that data from source.
I am using 32 bit windows operating system and datastage 8.5. I am getting file from Client in zip format like (test.gz). After unzip The file the size is 4.5GB. But, as per my requirment i don't want to unzip data manually, i need to unzip data by using datastage (like execute command activity,etc...). After i have to use that data (4.5GB) as a source.

The source Sequential File stage is not supporting reads of data above 3 GB. If the source supported 4.5 GB there would be no need to split the data. The problem is that I am on 32-bit and the source does not support more than 3 GB. The actual file size is TEST.GZ (271,339 KB).

Please help me understand how to expand the file and how to read that data from the source.

Posted: Mon Mar 25, 2013 10:07 pm
by chulett
Look into the gunzip -c option, which unzips the file to stdout, which the Sequential File stage can leverage via the Filter option. If you really need to unzip to files, pipe it to the split -b command, which lets you control the size in bytes of each chunk.


-b n: Split a file into pieces n bytes in size.
-b nk: Split a file into pieces n*1024 bytes in size.
-b nm: Split a file into pieces n*1048576 bytes in size.
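A hedged sketch of the gunzip-to-split pipeline described above. The file name `big.txt`, the 2 KB chunk size, and the `chunk_` prefix are demo placeholders (the real job would size chunks well under the 3 GB limit); in the Sequential File stage itself, the Filter property would carry the `gunzip -c` part — check your version's docs for the exact property name:

```shell
# Placeholder data standing in for the real test.gz.
seq 1 1000 > big.txt
gzip -f big.txt                      # produces big.txt.gz, removes big.txt

# Decompress straight to stdout and split the stream into 2 KB pieces
# named chunk_aa, chunk_ab, ... ("-" tells split to read stdin).
gunzip -c big.txt.gz | split -b 2k - chunk_

# Concatenating the chunks reproduces the original content byte for byte.
cat chunk_* > rebuilt.txt
```

Because the unzip never lands as one 4.5 GB file, this also sidesteps the single-file size limit the original poster hit.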

Posted: Mon Mar 25, 2013 10:47 pm
by ray.wurlod
Expand stage.

Posted: Tue Mar 26, 2013 6:59 am
by chulett
Of course, there's always the Expand stage. :wink:

A clarification, Mr. Wurlod. The docs say "The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set." - is that a generic usage of the term "data set"? For some reason I thought it literally needed to be a compressed data set as in the DataStage object.

Posted: Tue Mar 26, 2013 1:37 pm
by ray.wurlod
Every link in a parallel job consists of a virtual Data Set.

You can see this in the generated OSH and in the score.

Posted: Mon Dec 30, 2013 12:11 am
by chandra.shekhar@tcs.com
If the file name remains the same each run, then you can run the gunzip command in a before-job subroutine. This will automatically take care of the requirement.
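A minimal sketch of why this works, assuming the placeholder path `/tmp/ds_in` and file name `test.gz` (not values from the thread): gunzip replaces `test.gz` with `test` in place, so a before-job command leaves the downstream stage pointing at a stable file name every run.

```shell
# Stand-in for the client's delivered file.
mkdir -p /tmp/ds_in
printf 'a,b,c\n1,2,3\n' > /tmp/ds_in/test
gzip -f /tmp/ds_in/test              # leaves /tmp/ds_in/test.gz

# The command the before-job subroutine would execute:
# gunzip removes test.gz and recreates test with the expanded data.
gunzip -f /tmp/ds_in/test.gz
```

After the command runs, only `/tmp/ds_in/test` exists, ready for the Sequential File stage to read under its fixed name.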