How to Unzip files using datastage?

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

pkll
Participant
Posts: 73
Joined: Thu Oct 25, 2012 9:45 pm

How to Unzip files using datastage?

Post by pkll »

Hi,

I have a requirement to unzip files from DataStage, either with a Unix command through an Execute Command activity or with a before/after subroutine.

For example, my file is named test.gz and holds 271,339 KB of zipped data. When I unzip it manually, the resulting file is 4.5 GB.

But I need to unzip that file from DataStage, not manually, and then read it with a Sequential File stage. The stage does not support reading more than 3 GB, so I also need to split the 4.5 GB file into two files (file1 at 2.25 GB and file2 at 2.25 GB), again from DataStage.

Could you please help me?
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Are you on a stand-alone server or arranged in a cluster environment where you have a conductor and some compute nodes?

If you are in a cluster, farm that unzip off to the compute nodes; as an admin I would slap you if you did it on the conductor.

An easy way to farm that off is to use an External Source stage in a job and limit it to one node.

Understand your data before you break it up into two files.

The way you manipulate the data will help you determine how your split will impact your ETL logic.

Example: if you split the file in two and run the ETL job twice, how will your remove-duplicates logic work? You'd have to push that to the DB side if you are interacting with one.

Sorting the data? Trouble.

etc...
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd suggest taking DataStage out of the picture for now. How would you do this in general, what approach would you take? Once you figure that out then it would be easy to implement those steps in the tool, meaning it executes and monitors them. Unless you're thinking you want a "pure DataStage" solution? Not sure that should be the path here...
-craig

"You can never have too many knives" -- Logan Nine Fingers
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

pkll wrote: But I need to unzip that file from DataStage, not manually, and read it with a Sequential File stage (it does not support reading more than 3 GB).
And what is this 3 GB limit for the Sequential File stage?

Also, understand your data: on what basis do you need to split? Is the file fixed-width or delimited? Will you require any sorting, aggregation, or other key-based operation?

Once you have answers to these points, set DataStage aside and just think through the steps that need to be done; worry about the implementation after that.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

priyadarshikunal wrote: And what is this 3 GB limit for sequential file stage?
Probably a limit on file size in a 32-bit Windows operating system.

Have you (pkll) investigated the Expand stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
spoilt
Participant
Posts: 7
Joined: Mon Mar 25, 2013 7:17 am

Post by spoilt »

Split the file into multiple files with a UNIX command such as split, using a fixed line count per file.

For example: File1.txt, File2.txt, File3.txt, etc.

Use the File Pattern read method in the Sequential File stage with a pattern such as File*.txt.

It will read the data from all the files as if they were one.
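
A minimal sketch of that split, assuming a small stand-in file: source.txt, the File_ prefix and the 4-line piece size are all made up for the demo. Note that plain split names its pieces File_aa, File_ab and so on rather than File1.txt, so the pattern here would be File_* instead of File*.txt.

```shell
#!/bin/sh
# Demo only: every file name below is hypothetical.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a 10-line stand-in for the unzipped source file.
i=1
while [ "$i" -le 10 ]; do
    echo "record $i"
    i=$((i + 1))
done > source.txt

# Cut into pieces of 4 lines each: File_aa, File_ab, File_ac.
# Splitting by line count never cuts a record in half.
split -l 4 source.txt File_

ls File_*

# The glob expands in sorted order, so concatenating the pieces
# reproduces the original file.
cat File_* > rejoined.txt
cmp source.txt rejoined.txt && echo "pieces match the original"
```

For the real 4.5 GB file the line count per piece would have to be chosen from the average record length, so that each piece stays under the size the stage can read.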
pkll
Participant
Posts: 73
Joined: Thu Oct 25, 2012 9:45 pm

Post by pkll »

Hi priyadarshikunal, Ray,

Yes, Ray is correct: I am trying to expand the data and then extract it as a source.
I am using a 32-bit Windows operating system and DataStage 8.5. I get the file from the client in zipped format (test.gz). After unzipping, the file size is 4.5 GB. But as per my requirement I don't want to unzip the data manually; I need to unzip it from DataStage (with an Execute Command activity, for example). After that I have to use that data (4.5 GB) as a source.

The source Sequential File stage does not support reading data above 3 GB. If the source supported 4.5 GB, there would be no need to split the data. The problem is that I am on 32-bit and the source does not support more than 3 GB. The actual file is test.gz (271,339 KB).

Please help me with how to expand the data and how to read it from the source.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Look into the gunzip -c option, which unzips the file to standard output; the Sequential File stage can leverage that via the Filter option. If you really need to unzip to files, pipe it to the split -b command, which lets you control the size in bytes of each chunk.


-b n: Split a file into pieces n bytes in size.
-b nk: Split a file into pieces n*1024 bytes in size.
-b nm: Split a file into pieces n*1048576 bytes in size.
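
As a rough sketch of both routes, assuming a small generated test.gz in a scratch directory (the chunk_ prefix and the 10k piece size are picked only so the demo produces several pieces; something like 2000m would be closer to the real case):

```shell
#!/bin/sh
# Demo only: file names and sizes are hypothetical stand-ins.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in for test.gz (the real one expands to about 4.5 GB).
i=1
while [ "$i" -le 1000 ]; do
    echo "row $i,some,sample,data"
    i=$((i + 1))
done > test
gzip test                      # creates test.gz and removes test

# Route 1: decompress to standard output. A filter command on the
# stage would consume a stream like this without an unzipped file
# ever landing on disk.
gunzip -c test.gz | wc -l

# Route 2: decompress and split into byte-sized pieces in one pass.
# "-" tells split to read standard input; pieces are chunk_aa, chunk_ab, ...
gunzip -c test.gz | split -b 10k - chunk_

ls chunk_*
```

One caveat with -b: it cuts at an exact byte offset, so the last record in a piece will usually be broken across two files. If each piece is read on its own, splitting by line count with split -l is the safer choice.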
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Expand stage.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Of course, there's always the Expand stage. :wink:

A clarification, Mr. Wurlod. The docs say "The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set." - is that a generic usage of the term "data set"? For some reason I thought it literally needed to be a compressed data set as in the DataStage object.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Every link in a parallel job consists of a virtual Data Set.

You can see this in the generated OSH and in the score.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

If the file name remains the same, then you can use the gunzip command in a before-job subroutine. This will automatically take care of the requirement.
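
A small sketch of the kind of command that could go there, assuming the archive is always called test.gz and that test.txt is the (made-up) name the job reads. Plain gunzip replaces test.gz with test, so this variant uses -c and a redirect to keep the archive in place for reruns:

```shell
#!/bin/sh
# Demo only: test.gz and test.txt are hypothetical fixed names.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny stand-in archive.
printf 'a,1\nb,2\n' > test
gzip test                      # creates test.gz and removes test

# Decompress to a fixed output name and leave test.gz untouched,
# so the same command can be run again on the next load.
gunzip -c test.gz > test.txt

ls
```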
Thanx and Regards,
ETL User