Re: Splitting huge XML files into fixed-size, well-formed parts



On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@xxxxxxxxxxx>
wrote:
Malapha wrote:
Hi,

I am kind of depressed :-) I want to split XML files larger than 2 GB
into smaller chunks. As I don't want to end up with billions of files,
I want the split files to have a configurable size, e.g. 250 MB. Each
file should be well-formed, with an exact copy of the header (and of
the footer, i.e. the closing of the header) from the original file.
Furthermore, a table should be generated where I can see that File X
was split into Part N, with a timestamp:

A nice and well described little homework with clear requirements.

I'd refrain from splitting the file according to file size in MB
and suggest a simpler measure for splitting instead, like the number
of XML blocks or the number of lines.


I totally agree with you. Using the number of XML blocks as an
approximation for file size is good enough.
The problem I see is that line counts only work when the XML document
actually contains EOLs. If the input file has no EOLs, I run into
problems. So I decided to use the xgawk framework in order to make use
of its "node hopping" technique. This lets me count the Offers without
having to solve the problem mentioned above.
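
To illustrate counting by element instead of by line, here is a minimal
sketch. It is written in Python with the standard library's streaming
parser, not in xgawk, so it only shows the idea and is not the xgawk
solution itself. The element name Offer comes from this thread; the
<Offers> root and the sample data are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag="Offer"):
    """Count closed <tag> elements in a stream without loading it whole."""
    count = 0
    # "end" events fire when an element's closing tag has been parsed,
    # regardless of where (or whether) line breaks occur in the input.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children to limit memory use
    return count


# Hypothetical one-line document: no EOL anywhere in the input.
sample = b"<Offers><Offer><Id>1</Id></Offer><Offer><Id>2</Id></Offer></Offers>"
print(count_elements(io.BytesIO(sample)))
```

Because the parser reports element boundaries, the count does not depend
on how the file is broken into lines.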


All in all I ended up reading the docs on XML processing with gawk,
but it seems I am lacking some deeper programming skills...

Given your data above you can solve that all with basic awk pattern
matching capabilities, no deeper skills required. What have you tried
so far?

As I come from the VBA world, I tried to get familiar with awk. What
I do have is a theoretical solution in the form of a structured process
diagram :-)

Copy Header and Footer from OriginalXMLFile to Var
Set Start_Offer = First Offer (from <Offer> to </Offer>)
Set End_Transaction = 0
Set Part = 0
Set FileSize = 0
Set MaxFileSize = 250    ' in MB
While Start_Offer < EOF(OriginalXMLFile)
    Part = Part + 1
    Open NewFile OriginalXMLFileName + Part + ".xml"
    Paste Header from Var to NewFile
    ' Stop filling this part when it is full or the input is exhausted
    While FileSize(NewFile) < MaxFileSize And Start_Offer < EOF(OriginalXMLFile)
        Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
        Start_Offer = Start_Offer + 1
    Wend
    Paste Footer from Var to NewFile
    Close NewFile
Wend
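
For comparison, the process diagram above could look roughly like this as
running code. This is only a sketch in Python, not the awk translation;
the function name write_parts, the "catalog" base name and the tiny sample
data are invented for the example, and the offers are assumed to be
available as separate strings already.

```python
import os
import tempfile


def write_parts(header, offers, footer, base_name, max_bytes, out_dir):
    """Write offers into numbered files, each wrapped in header and footer.

    A part is closed once it has reached max_bytes, so a part can overshoot
    the limit by up to one offer (the size check runs after each copy).
    """
    paths = []
    out = None
    size = 0
    for offer in offers:
        if out is None:  # start the next part and paste the header
            path = os.path.join(out_dir, f"{base_name}{len(paths) + 1}.xml")
            out = open(path, "w", encoding="utf-8")
            out.write(header)
            size = len(header.encode("utf-8"))
            paths.append(path)
        out.write(offer)
        size += len(offer.encode("utf-8"))
        if size >= max_bytes:  # part is full: paste the footer and close
            out.write(footer)
            out.close()
            out = None
    if out is not None:  # close the last, partially filled part
        out.write(footer)
        out.close()
    return paths


# Hypothetical miniature input; a real limit would be e.g. 250 * 1024**2.
offers = [f"<Offer>{i}</Offer>" for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_parts("<Offers>", offers, "</Offers>", "catalog", 40, tmp)
    print([os.path.basename(p) for p in paths])
```

Tracking the size in a variable avoids asking the file system for the
file size after every offer.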

Right now I am trying to translate this into awk... Please don't ask me
how far I am, it's frustrating :-)


Save everything in a variable until you match the /Headerelement/.
Write that header to a file whose name contains a counter variable.
Write everything up to the end of the block /<\/OfferInfo>/ to that
file while counting lines. If the number of lines exceeds some constant
value, write the constant trailer, close() the file, and increment the
variable that counts the files. To create a separate table, just write
the information you already have to a file with a fixed name (use awk's
date functions or, if unavailable, an external date program and getline).
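
The steps above can be sketched as follows. This is Python rather than awk
and works on a list of lines in memory, so it is only an illustration of
the logic. The names split_lines and manifest, the <Catalog> element and
the sample lines are assumptions for the example; </OfferInfo> is the block
end named above. A part is closed at the first block end once the line
count has reached the limit.

```python
import re
from datetime import datetime


def split_lines(lines, header_end, block_end, trailer, max_lines):
    """Split lines into parts that each repeat the header and end with trailer.

    header_end and block_end are regular expressions. A part is only closed
    on a line matching block_end, so no block is cut in half.
    """
    header, parts, current = [], [], []
    in_header = True
    for line in lines:
        if in_header:  # save everything up to the header element
            header.append(line)
            if re.search(header_end, line):
                in_header = False
            continue
        if line.strip() == trailer.strip():
            continue  # the original trailer is re-added to every part
        current.append(line)
        if re.search(block_end, line) and len(current) >= max_lines:
            parts.append(header + current + [trailer])
            current = []
    if current:  # remaining blocks form the last part
        parts.append(header + current + [trailer])
    return parts


def manifest(source_name, parts):
    """One table row per part: source file, part number, timestamp."""
    stamp = datetime.now().isoformat(timespec="seconds")
    return [f"{source_name}\t{n}\t{stamp}" for n, _ in enumerate(parts, 1)]


# Hypothetical input with one element per line.
doc = ["<Catalog>", "<OfferInfo>", "<Id>1</Id>", "</OfferInfo>",
       "<OfferInfo>", "<Id>2</Id>", "</OfferInfo>", "</Catalog>"]
parts = split_lines(doc, r"<Catalog>", r"</OfferInfo>", "</Catalog>", 3)
for row in manifest("catalog.xml", parts):
    print(row)
```

Note that this relies on the input having line breaks, so it does not
help with files that consist of a single long line.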

This looks very much like my approach - so I am quite happy that I am
not that wrong...

