Split - Output XML Compressed File

ray.wurlod · Post by **ray.wurlod** » Fri Sep 12, 2014 3:17 pm

Most compression utilities have options that allow the output file to be split on size.

joycerecacho · Post by **joycerecacho** » Fri Sep 12, 2014 6:32 pm

Hi!
Did you mean a utility from the operating system, for example?
Like in a shell script on UNIX?

qt_ky · Post by **qt_ky** » Fri Sep 12, 2014 6:51 pm

There is one available on UNIX simply called zip, also known as Info-ZIP. The one I found was bundled with a database client, not bundled with UNIX. That one has a split on size option. The command would be something like "zip -s 50m archive.zip file" to compress the file into multiple compressed files no larger than 50 MB each. It would not alter the content of the files with sequential numbers. In other words, it does all the splitting for you without all that extra effort.

ray.wurlod · Post by **ray.wurlod** » Fri Sep 12, 2014 9:31 pm

I meant an option on the compression utility, such as Eric exemplified.

chulett · Post by **chulett** » Fri Sep 12, 2014 10:17 pm

Will these pieces be put back together before being processed? If they'll be processed individually then you can't simply "split" them.

joycerecacho · Post by **joycerecacho** » Fri Sep 12, 2014 10:47 pm

Guys, imagine I have a really big XML file. A huge one.
If, after zipping it, I find that it is more than 50 MB compressed, I have to split it into files of 50 MB each, but compressed.
The thing is that I will return these files to the customer, and he wants a sequential number inside the parts, like an ID.
And the file names must include this ID.

Got it?

In short, you are suggesting I use a UNIX utility because DataStage can't help me, right?

Chulett, I didn't get what you mean.

Thank you guys,

chulett · Post by **chulett** » Fri Sep 12, 2014 11:16 pm

Not really sure why you felt the need to repeat all that nor how DataStage would have any clue what size the file will be once it has been zipped.

And all I meant was the fact that an XML file cannot simply be chopped up into pieces and still have those pieces usable as individual files. So I was asking if your customer will put the unzipped pieces back together before they process one big XML file or are they expecting to process the smaller files individually? I imagine the 'headers' you need to add indicate the former but I'd like to see that spelled out.

ps. Those headers mean you need to do far more than just simply split it.
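
To illustrate the point above, here is a minimal Python sketch. The sample document, its element names, and the cut position are all made up for illustration. It shows that cutting an XML file at an arbitrary byte offset, the way a size-based splitter would, leaves pieces that do not parse on their own, even though rejoining them restores a valid document:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are invented for illustration.
doc = b"<orders><order id='1'>a</order><order id='2'>b</order></orders>"

def is_well_formed(data: bytes) -> bool:
    """Return True if data parses as a complete XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# Cut at an arbitrary byte offset, the way a size-based splitter would.
cut = len(doc) // 2
first, second = doc[:cut], doc[cut:]

whole_ok = is_well_formed(doc)
first_ok = is_well_formed(first)      # root element is never closed
second_ok = is_well_formed(second)    # closing tag has no matching open
rejoined_ok = is_well_formed(first + second)
print(whole_ok, first_ok, second_ok, rejoined_ok)
```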

joycerecacho · Post by **joycerecacho** » Fri Sep 12, 2014 11:50 pm

Ah, ok.
Actually the split files need to have correct syntax, of course, respecting the opening and closing tags.
I can't split the big one carelessly, so that, for example, the opening tag ends up in the first file and the rest of the record in the second file.
I'll have to pay attention to the syntax and include the header at the beginning of each file and close its header tag at the end.
The files will not become one again; they are supposed to be processed individually.
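
As a rough sketch of that requirement (not DataStage code, and with invented names: `batch`, `header`, `partId`, `record`, and the `output_NNN.xml` naming pattern are all assumptions), the idea is to cut only on record boundaries and wrap every part in its own root element and header, so each file parses by itself and carries its sequential ID both inside and in its name:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def build_part(seq_id, records):
    """Build one complete XML document holding a header and a slice of records."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<batch>"]
    # Header carrying the sequential ID (hypothetical element names).
    lines.append(f"  <header><partId>{seq_id}</partId></header>")
    for rec in records:
        lines.append(f"  <record>{escape(rec)}</record>")
    lines.append("</batch>")  # close the root so the part parses on its own
    return "\n".join(lines)

def split_records(records, per_file):
    """Yield (seq_id, file_name, xml_text), never cutting a record in two."""
    for start in range(0, len(records), per_file):
        seq_id = start // per_file + 1
        chunk = records[start:start + per_file]
        yield seq_id, f"output_{seq_id:03d}.xml", build_part(seq_id, chunk)

parts = list(split_records(["a", "b & c", "d", "e", "f"], per_file=2))
for seq_id, name, xml_text in parts:
    # Raises ParseError if a part is not well-formed on its own.
    ET.fromstring(xml_text.encode("utf-8"))
    print(name, len(xml_text))
```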

qt_ky · Post by **qt_ky** » Sat Sep 13, 2014 6:58 am

That certainly is an interesting requirement. I am not a big fan of XML. It carries a huge amount of overhead, so it is incredibly inefficient. Using it causes heavier loads on processors and networks. Many people do not realize the labor costs involved with all the extra effort in dealing with it as input and once again as output. Anyhow, enough ranting...

I am not saying that DataStage cannot be used to do it; it depends on your XML. I guess I am not understanding yet how you said the split files will be processed individually, yet the open tag can be in file1 and the rest may fall into file2. Generally if you split XML in the middle then it will make it invalid, and simply appending a closing tag mid-stream does not seem wise. The downstream consumer of file1 may start to wonder where the rest of their data went...

The other part I am not sure about is how to take a counter number and apply it into a zip file name from within DataStage. These are the reasons I had exemplified the compression tool splitting that Ray suggested. It seems like a much better solution because it would not compromise the integrity of the data.

chulett · Post by **chulett** » Sat Sep 13, 2014 7:31 am

'Processed individually' confirms my fear - they cannot be split up as a post activity - they have to be built in those smaller pieces so they are viable XML files. I had to do something like this "back in the day" when producing search results for Google; they imposed a limit on the file size, but not zipped or any other shenanigans. With a maximum of X we made no attempt to create files of exactly X but rather made sure that each was less than that. Neither of us cared how much less.

We calculated a maximum record count that would keep us under our limit and then enforced that with the Trigger Column functionality. You could even include that column as your "sequence number" and include the header in your design so it's all automatic. Start it at 1, bump it to 2 when the record count limit is hit, etc etc.
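
The record-count approach described above can be sketched in Python as follows. This is only an illustration of the arithmetic, not DataStage syntax: the function names, the average record size, the overhead figure, and the safety margin are all assumptions. Since the 50 MB limit in this thread applies to the zipped file, the real numbers would have to come from measuring how well the data compresses.

```python
import math

def max_records_per_file(size_limit_bytes, avg_record_bytes,
                         overhead_bytes=0, safety=0.8):
    """Estimate a record count that keeps each file safely under the limit."""
    usable = size_limit_bytes * safety - overhead_bytes
    return max(1, math.floor(usable / avg_record_bytes))

def trigger_value(row_number, records_per_file):
    """Trigger value for a 1-based row number: 1 for the first file, then 2, ..."""
    return (row_number - 1) // records_per_file + 1

# Assumed figures: 50 MB limit, 2 KB average record, 512 bytes of header.
limit = max_records_per_file(50 * 1024 * 1024, avg_record_bytes=2048,
                             overhead_bytes=512)

# With 3 records per file, rows 1..7 fall into files 1, 1, 1, 2, 2, 2, 3.
triggers = [trigger_value(n, 3) for n in range(1, 8)]
print(limit, triggers)
```

The value that changes every N rows plays the role of the trigger column, and the same value can double as the sequence number written into each file's header.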

We also had some specific naming conventions we had to use and couldn't use the names as generated, so we had to build something to take the names DataStage generated and rename them to Google's flavor as a post process. You'll need to do the zipping post and perhaps some of that as well.
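
A post-process of that kind might look like the sketch below, written in Python only as an illustration. The `customer_NNN` naming pattern and the `generated_N.xml` input names are hypothetical, and the caller is assumed to pass the files in sequence order. Each XML part is zipped into its own archive, named with its sequence ID, and checked against the size limit:

```python
import os
import tempfile
import zipfile

LIMIT = 50 * 1024 * 1024  # 50 MB limit on each compressed file, from the thread

def zip_and_rename(xml_paths, out_dir, prefix="customer"):
    """Zip each XML part into its own archive named with a sequence ID.

    xml_paths must already be in sequence order.
    """
    results = []
    for seq_id, path in enumerate(xml_paths, start=1):
        zip_path = os.path.join(out_dir, f"{prefix}_{seq_id:03d}.zip")
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            # Store the part under the customer-facing name inside the archive.
            zf.write(path, arcname=f"{prefix}_{seq_id:03d}.xml")
        size = os.path.getsize(zip_path)
        if size > LIMIT:
            raise ValueError(f"{zip_path} is {size} bytes, over the limit")
        results.append((zip_path, size))
    return results

# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in (1, 2):
        p = os.path.join(tmp, f"generated_{i}.xml")
        with open(p, "w", encoding="utf-8") as fh:
            fh.write(f"<batch><header><partId>{i}</partId></header></batch>")
        paths.append(p)
    results = zip_and_rename(paths, tmp)
    names = [os.path.basename(z) for z, _ in results]
    print(names)
```

If a part does come out over the limit, the record count per file has to be lowered and the parts regenerated; the archive itself cannot be trimmed after the fact.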

Food for thought at the very least.

And totally agree with the rant, Eric.

qt_ky · Post by **qt_ky** » Sat Sep 13, 2014 6:29 pm

I feel better now

chulett · Post by **chulett** » Sun Sep 14, 2014 9:30 am

joycerecacho · Post by **joycerecacho** » Mon Sep 15, 2014 9:17 am

Chulett,

Thank you for your answer.

When you say "the Trigger Column functionality", what are you talking about?

chulett · Post by **chulett** » Mon Sep 15, 2014 9:38 am

It's a documented option in the old 'XML Output' stage; when the value changes, it switches to a new filename. If you are using the new XML stage then Ernie has spelled out how that works here.

joycerecacho · Post by **joycerecacho** » Mon Sep 15, 2014 11:02 am

Hmmm ... actually the job design is like:

DB2 => Transformer => XMLOutputPX
.............................. => XMLOutputPX

These two 'XMLOutputPX' stages write to the same file (same name), but the first one generates the header and the other one generates the content.

This functionality you meant above is located at:
... => XMLOutputPX Stage/ Transformation Settings / Output Mode / 'Use Trigger Column' ?

If yes, which column should I choose?

Thank you so much guys,