Split - Output XML Compressed File


ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Most compression utilities have options that allow the output file to be split on size.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
joycerecacho
Participant
Posts: 298
Joined: Tue Aug 26, 2008 12:17 pm

Post by joycerecacho »

Hi!
Did you mean a utility from the operating system, for example?
Like in a shell script on UNIX?
Joyce A. Recacho
São Paulo/SP
Brazil
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

There is one available on UNIX simply called zip, also known as Info-ZIP. The one I found was bundled with a database client, not with UNIX itself. That one has a split-on-size option. The command would be something like "zip -s 50m archive.zip file" to compress the file into multiple compressed pieces no larger than 50 MB each. It would not alter the content of the files by adding sequential numbers. In other words, it does all the splitting for you without all that extra effort.
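
A minimal sketch of how that call might be scripted, assuming an Info-ZIP build that supports the -s option is on the PATH. The function names and file names are hypothetical and only illustrate the shape of the command.

```python
import shutil
import subprocess


def build_split_zip_command(archive, source, max_size="50m"):
    """Return the argument list for an Info-ZIP split-archive call."""
    # -s sets the maximum size of each piece; "50m" means 50 MB.
    return ["zip", "-s", max_size, archive, source]


def split_zip(archive, source, max_size="50m"):
    """Run zip if it is installed; raise if it is missing or fails."""
    if shutil.which("zip") is None:
        raise FileNotFoundError("zip executable not found on PATH")
    subprocess.run(build_split_zip_command(archive, source, max_size), check=True)
```

Note that zip chooses the names of the pieces itself, and the pieces are only usable once they are all brought back together, so this fits the case where the customer reassembles the archive before reading it.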
Choose a job you love, and you will never have to work a day in your life. - Confucius

Post by ray.wurlod »

I meant an option on the compression utility, such as the one Eric gave as an example.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Will these pieces be put back together before being processed? If they'll be processed individually then you can't simply "split" them.
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by joycerecacho »

Imagine, guys, that I have a really big XML file. A huge one.
After zipping it and finding that it is more than 50 MB compressed, I have to split it into files of 50 MB each, still compressed.
The thing is that I will return these files to the customer, and he wants a sequential number inside each part, like an ID.
The file names must carry this ID as well.

Got it?

In short, you are suggesting that I use a UNIX utility because DataStage can't help me, right?

Chulett, I didn't get what you mean.

Thank you guys,

Post by chulett »

Not really sure why you felt the need to repeat all that nor how DataStage would have any clue what size the file will be once it has been zipped.

And all I meant was the fact that an XML file cannot simply be chopped up into pieces and still have those pieces usable as individual files. So I was asking if your customer will put the unzipped pieces back together before they process one big XML file or are they expecting to process the smaller files individually? I imagine the 'headers' you need to add indicate the former but I'd like to see that spelled out.

ps. Those headers mean you need to do far more than just simply split it.

Post by joycerecacho »

Ah, ok.
Actually the split files need to have correct syntax, of course, respecting the opening and closing tags.
I can't split the big file carelessly so that, for example, the opening tag ends up in the first file and the rest of the record in the second file.
I'll have to pay attention to the syntax, include the header at the beginning of each file, and close its header tag at the end of each file.
The files will not be merged back into one; they are supposed to be processed individually.
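
As a rough illustration of that requirement, here is a minimal sketch that builds each piece as its own well-formed document, with a header carrying the sequential ID and only whole records inside. The tag names (export, header, sequence, record) and the function names are assumptions made for the example, not the customer's actual layout.

```python
import xml.etree.ElementTree as ET


def build_piece(records, seq_id, root_tag="export", header_tag="header"):
    """Build one self-contained XML document holding whole records only."""
    root = ET.Element(root_tag)
    header = ET.SubElement(root, header_tag)
    ET.SubElement(header, "sequence").text = str(seq_id)
    for fields in records:
        rec = ET.SubElement(root, "record")
        for name, value in fields.items():
            ET.SubElement(rec, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def build_pieces(records, max_records):
    """Group records into pieces of at most max_records, numbered from 1."""
    pieces = []
    for start in range(0, len(records), max_records):
        seq_id = start // max_records + 1
        pieces.append(build_piece(records[start:start + max_records], seq_id))
    return pieces
```

Because every piece is generated from complete records, no record can straddle two files, and each file can be parsed on its own.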

Post by qt_ky »

That certainly is an interesting requirement. I am not a big fan of XML. It carries a huge amount of overhead, so it is incredibly inefficient. Using it causes heavier loads on processors and networks. Many people do not realize the labor costs involved with all the extra effort in dealing with it as input and once again as output. Anyhow, enough ranting...

I am not saying that DataStage cannot be used to do it; it depends on your XML. I guess I am not understanding yet how you said the split files will be processed individually, yet the open tag can be in file1 and the rest may fall into file2. Generally if you split XML in the middle then it will make it invalid, and simply appending a closing tag mid-stream does not seem wise. The downstream consumer of file1 may start to wonder where the rest of their data went...
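
A small sketch of why a mid-stream split breaks things, using a made-up three-line document and Python's standard XML parser. The sample data and the helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

document = "<rows><row><id>1</id></row><row><id>2</id></row></rows>"


def is_well_formed(text):
    """Return True if text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Cut in the middle of the second record, as a size-based split might.
cut = document.index("2")
first_half = document[:cut]
second_half = document[cut:]
# Appending only the closing root tag leaves <row> and <id> unclosed.
patched = first_half + "</rows>"
```

Here the original document parses, while first_half, second_half and patched all fail to parse, which is the situation a downstream consumer of file1 would run into.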

The other part I am not sure about is how to take a counter number and apply it into a zip file name from within DataStage. These are the reasons I had exemplified the compression tool splitting that Ray suggested. It seems like a much better solution because it would not compromise the integrity of the data.

Post by chulett »

'Processed individually' confirms my fear: they cannot be split up as a post activity; they have to be built in those smaller pieces so they are viable XML files. I had to do something like this "back in the day" when producing search results for Google. They imposed a limit on the file size, but with no zipping or any other shenanigans. With a maximum of X we made no attempt to create files of exactly X but rather made sure that each was less than that. Neither of us cared how much less.

We calculated a maximum record count that would keep us under our limit and then enforced that with the Trigger Column functionality. You could even include that column as your "sequence number" and include the header in your design so it's all automatic. Start it at 1, bump it to 2 when the record count limit is hit, etc etc.
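
The arithmetic behind that can be sketched as follows. This only illustrates the idea of a value that changes every N rows; it is not DataStage code, and the function names and the size estimates passed in are assumptions.

```python
def trigger_value(row_number, max_rows_per_file):
    """Return the 1-based file sequence number for a 1-based row number."""
    if row_number < 1 or max_rows_per_file < 1:
        raise ValueError("row_number and max_rows_per_file must be >= 1")
    return (row_number - 1) // max_rows_per_file + 1


def max_rows_for_limit(size_limit_bytes, worst_case_row_bytes, header_bytes=0):
    """Estimate how many rows fit under a size limit, rounding down."""
    return (size_limit_bytes - header_bytes) // worst_case_row_bytes
```

With a limit of 1000 rows per file, rows 1 to 1000 get the value 1, row 1001 gets 2, and so on. When the limit applies to the compressed size, the per-row estimate would have to come from compressing sample data, so it is worth leaving a generous margin.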

We also had some specific naming conventions we had to follow and couldn't use the file names as generated, so we had to build something to take the names DataStage generated and rename them to Google's flavor as a post process. You'll need to do the zipping as a post process and perhaps some of the renaming as well.
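
A minimal sketch of such a post process, assuming each generated XML piece is zipped on its own and named after its sequence ID. The naming pattern (base name plus a four-digit ID) and the function name are hypothetical; the size check simply reports any piece that ends up over the limit.

```python
import os
import zipfile


def zip_piece(xml_path, out_dir, base_name, seq_id, limit_bytes):
    """Zip one XML piece under a name carrying its ID; return the zip path."""
    # Hypothetical naming convention: <base_name>_<4-digit id>.xml / .zip
    stem = f"{base_name}_{seq_id:04d}"
    zip_path = os.path.join(out_dir, stem + ".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname renames the member inside the archive as well.
        zf.write(xml_path, arcname=stem + ".xml")
    size = os.path.getsize(zip_path)
    if size > limit_bytes:
        raise ValueError(f"{zip_path} is {size} bytes, over the {limit_bytes} limit")
    return zip_path
```

If a piece does come out over the limit, the record count per file would need to be lowered and the job rerun, since the zipped size is only known after the fact.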

Food for thought at the very least.

And totally agree with the rant, Eric. :wink:

Post by qt_ky »

I feel better now :)

Post by chulett »

8)

Post by joycerecacho »

Chulett,

Thank you for your answer.

When you say "the Trigger Column functionality", what are you talking about?

Post by chulett »

It's a documented option in the old 'XML Output' stage: when the value of the trigger column changes, it switches to a new filename. If you are using the new XML stage then Ernie has spelled out how that works here.

Post by joycerecacho »

Hmmm ... actually the job design is like:


DB2 => Transformer => XMLOutputPX
.............................. => XMLOutputPX

These two XMLOutputPX stages write to the same file, with the same name, but the first one generates the header and the other one generates the content.

Is the functionality you mentioned above located at:
... => XMLOutputPX Stage / Transformation Settings / Output Mode / 'Use Trigger Column'?

If so, which column should I choose?

Thank you so much guys,