How to concatenate multiple XML files into one file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

knowledge
Participant
Posts: 101
Joined: Mon Oct 17, 2005 8:14 am

How to concatenate multiple XML files into one file

Post by knowledge »

Hi all,

I have a job:

folder stage ------> XML input stage ---------> seq file stage.

I have one XML file for every patient report, and the job works fine. I am creating many sequential files, keyed off this job, to populate crosswalk tables.

But it is very, very slow: it takes 4 hrs 15 mins to process 1,700 XML records. I have to start my production load next week and need to load around 2 million records, which would take weeks at this rate; I don't have enough time.

Is there any way I can combine all the XML files (or at least a few thousand of them) into one big file? I think the job would run faster that way than it does now.

Can anyone please suggest how to combine multiple files into a single XML file?

I suspect a lot of the time goes into initializing every file, processing it, and then moving on to the next; with a single file it should take much less time.

Please suggest.

Thanks.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hmm... tough call. I'd hate to see you try to merge the files together before trying out a lot of other things. For starters, XML processing takes a long time --- it's a good bet that the time is being eaten there and not in the Folder stage reading the files from disk. Further, if you have a lot of files and manage to combine them by other physical means (it's not that simple, because you'd have to remove headers and the like), you might end up running into document size issues.
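[Editor's note: if merging were attempted anyway, the header problem mentioned above is exactly why naive byte-level concatenation fails — each file carries its own `<?xml ...?>` declaration, and a well-formed document allows only one root element. A minimal sketch outside DataStage, in Python, with a made-up `patients` wrapper element (nothing here is from the thread), could look like this:]

```python
# Sketch: merge many single-record XML files under one synthetic root.
# Assumptions (not from the thread): the files are well-formed, share
# one schema, and "patients" is an invented wrapper element name.
import glob
import xml.etree.ElementTree as ET

def merge_xml_files(pattern, out_path, root_tag="patients"):
    merged = ET.Element(root_tag)
    for path in sorted(glob.glob(pattern)):
        # Parsing and re-appending each root drops the per-file
        # <?xml ...?> declaration, which is what makes naive
        # concatenation produce an invalid document.
        merged.append(ET.parse(path).getroot())
    ET.ElementTree(merged).write(out_path, encoding="utf-8",
                                 xml_declaration=True)

# merge_xml_files("reports/*.xml", "merged.xml")
```

Note that this builds the whole merged tree in memory, so it runs straight into the document-size concern raised above if thousands of files are combined at once.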

How many columns are on your XMLInput Stage output link? Are you using ALL of them downstream? If you are dropping them later, or ignoring them in a Transformer later in the job, then remove them here. You only need the ones that you use later. Too many and you'll be doing a ton of extra, unnecessary work.

Also, you might want to experiment with parallelizing this flow. This is NOT to be taken lightly, but a job like this may benefit from interprocess row buffering, either forced via the IPC stages, or by using the setting in the performance tab of Job Properties. We haven't talked about your job, and there are a LOT of caveats here (ordering of your rows and documents, the machine you are on, etc.), but it's something to consider. You have a lot of flexibility not only in separating the XMLInput Stage into its own process, but even further by experimenting with manual splitting of the flow into more than one XMLInput Stage. I can't emphasize enough how you have to be careful here, and I'm sure it's discussed in many places in the forum, because the IPC stages and settings can impact things in the job that you didn't expect, but you need to gain some experience with it and thoroughly test whatever you come up with.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
knowledge
Participant
Posts: 101
Joined: Mon Oct 17, 2005 8:14 am

Post by knowledge »

Hi Ernie,

Thanks a lot for the reply.

I am dropping all columns and processing only the columns required for each particular file.

I mean:

folder stage ---> XML input stage --------> 40 seq file stages attached to one XML input stage. On the output of the XML input stage I am processing only the columns required for each particular file (for example, for file one I have only the key columns and 2 or 3 required columns for that file). I have 40 files attached to one XML input stage.

As you suggested, I will try to modularize this job and design it with only 5 or 10 seq file stages per job:

folder stage -------> XML input ---------> 10 seq file stages (in one job)

So I will duplicate four more jobs, just to see how fast it will run.

I did not understand the other option you suggested.

Please tell me, will the way I am planning to do this save time?

Thanks in advance.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

...let me be sure I understand... I was under the impression that you have a large number of XML document instances in a subdirectory, and are reading those in a job that is "folder --- XMLInput ---- more Stages"... and I made the assumption (apologies if it was incorrect) that these were all of the exact same schema, with a single link coming from the XMLInput Stage. In that case, if you have, say, 600 columns on the XMLInput Stage output link and only need 20 of them, then you are spending a lot of wasted energy in the plugin... better to delete them there than to drop them later via a Transformer.

...and of course, my other concerns were about the I/O of the 40 files versus better ways of processing all the rows. This would be especially true if, say, you had 10,000 XML documents sitting in the subdirectory with several million rows overall...

If what you are saying is that you have 40 separate "links", then this is different altogether. You are processing these links, in Server, entirely serially. Do they "need" to be processed serially (in order)? Are the "40" documents predictable? Do they have the same higher-level elements but differ only in lower nodes, or are they entirely separate structures? Are they unique, or are some identical in structure and just multiple instances of one another?

Some of the same techniques proposed above may apply, but we need to know more about the circumstances.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
knowledge
Participant
Posts: 101
Joined: Mon Oct 17, 2005 8:14 am

Post by knowledge »

Hi Ernie,

You understood right, but let me explain.

Suppose I have 100 XML files in the subdirectory (a separate XML file for each patient). Each file has multiple instances of data (loading into crosswalks), for example the names of the crew members who attended the patient. In this case I made crew member ID a key on the output link of the XML input stage, so it processes the multiple instances of crew member from one XML file and collects them into a sequential file stage.

folder stage --> XML input stage --> crew member sequential file.

Like the crew member sequential file, I am creating many files at the same time for different crosswalk tables, and on the output link of the XML stage I have only the columns I need.

For example, in the crew member seq file I have the key, member ID, and name, and not all columns from the XML file.

I am not wasting resources processing all columns, but even so it is very slow :(
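[Editor's note: the per-file extraction described here — one row per repeating instance, carrying only the key plus a couple of columns — can be sketched outside DataStage. All element names below (Report, ReportID, CrewMember, ID, Name) are invented for illustration; the thread does not give the real schema.]

```python
# Sketch of the extraction the XMLInput Stage is doing: from one patient
# document, emit one row per repeating CrewMember instance, keeping only
# the key and the two needed columns. Element names are hypothetical.
import xml.etree.ElementTree as ET

def crew_rows(xml_text):
    root = ET.fromstring(xml_text)
    report_id = root.findtext("ReportID")  # the per-patient key
    rows = []
    for member in root.iter("CrewMember"):
        # Only the columns the crosswalk file needs; everything else
        # in the document is never touched.
        rows.append((report_id,
                     member.findtext("ID"),
                     member.findtext("Name")))
    return rows

doc = """<Report><ReportID>R100</ReportID>
<Crew><CrewMember><ID>7</ID><Name>Lee</Name></CrewMember>
<CrewMember><ID>9</ID><Name>Kim</Name></CrewMember></Crew></Report>"""
# crew_rows(doc) -> [('R100', '7', 'Lee'), ('R100', '9', 'Kim')]
```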


Please suggest.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

OK... then, reviewing my first response above, consider parallelizing the flow. Do some reading on, and then playing with, the IPC stages. How large is your machine? Using IPC and the interprocess row buffering functions will let you run things concurrently... like running multiple jobs, but with more flexibility.

Using multiple jobs is fine too, and it is not a bad way to determine whether this is the right direction to move in.

You should also get familiar with multi-instancing, if you aren't already. Of course, using multiple jobs or multiple instances of the same job will require that you look into different ways of managing the collection of source documents, but it will also help illustrate whether processing the XML documents concurrently is going to help.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
knowledge
Participant
Posts: 101
Joined: Mon Oct 17, 2005 8:14 am

Post by knowledge »

thanks Ernie,
I will look into it ,
I have to look into multi instances of the same job.

if I create all(40) sequencial files at one place and then create many instances of the same job, running all instances at the same time and appending into those files , will it do .?

I will take care of source as I have patient report number as xml file name , so I can run for ex 11*.xml collecting all files starts with 11, then same for 12 etc , runnng these job parallelly .
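[Editor's note: the prefix-based split described above amounts to a small grouping step that hands each job instance one bucket of files. A Python sketch follows; the two-character prefix length is an assumption taken from the 11*/12* example in the post.]

```python
# Sketch: group XML file names by a leading report-number prefix so each
# multi-instance job run can be handed one group (e.g. all 11*.xml).
# The prefix length of 2 is an assumption from the 11*/12* example.
import glob
import os
from collections import defaultdict

def partition_by_prefix(pattern, prefix_len=2):
    groups = defaultdict(list)
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        groups[name[:prefix_len]].append(path)
    return dict(groups)

# Each returned group would be fed to one instance of the job, e.g.
# groups["11"] -> instance 1, groups["12"] -> instance 2, and so on.
```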

That's the only solution I have right now; I will look into the IPC stage too.

Thanks a lot for your help.