Re: Three broken compressors




Matt Mahoney wrote:
cr88192 wrote:
"Jim Leonard" <MobyGamer@xxxxxxxxx> wrote in message
news:1148053977.407041.279760@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
cr88192 wrote:
why are the contents of wikipedia a single huge xml file rather than a
huge number of small files?...

It's a database, actually, a very large distributed database across
hundreds of servers. It's not one single XML file -- that's just the
EXPORT you get if you ask for the data.

ok, this makes sense...

thought:
a database can be easily enough exported to xml. a problem would be a
database itself based on generalized xml. I would suspect that this would
need at least some manner of simplistic schema system, or embedding special
attributes in the tags, or similar...
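
To illustrate the first half of that thought, here is a minimal sketch of exporting a database table to xml, using Python's sqlite3 and xml.etree. The table name, column names, and the export_table helper are all made up for the example; this has nothing to do with how Wikipedia actually produces its export.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_table(conn, table):
    """Dump every row of `table` as <row> elements under one root element.

    `table` is interpolated into the SQL directly, so it must be a trusted name.
    """
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]  # column names become tag names
    root = ET.Element(table)
    for row in cur:
        rec = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            # Every value is flattened to text here; the column type is lost.
            ET.SubElement(rec, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical two-row table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")])
xml_out = export_table(conn, "articles")
print(xml_out)
```

Going this direction is easy. Going back is where the schema question comes in: the integer id comes out as plain text, so a reader of the xml needs a schema or type attributes to know it was ever a number.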

dunno, misc really...

it is a thought. I have done binary xml formats before, but have not used
them that heavily since then. the realization eventually became that xml
can't effectively represent some things. it is easier to represent xml in
other data than other data in xml (even if xml can be used for nearly all
the data in a format, what little it can not represent maps over very poorly).

luckily, now, I have better solutions (eg, xml as chunks within a more
generalized container format).

or such...

Wikipedia actually makes very little use of XML. It is mainly used for
titles, timestamps, and user IDs on articles. The article is just one
long string in a <text> tag. All of the structure such as headings,
links, tables, lists, etc. use special characters embedded in the text.
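
A rough sketch of what that looks like in practice. The SAMPLE record below is a hypothetical, heavily simplified page (the real export carries more fields, and the values are invented), but it shows the split: a handful of XML elements for metadata, and all of the article structure sitting as markup characters inside the text string.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page record; all values are invented for the example.
SAMPLE = """<page>
  <title>Example</title>
  <revision>
    <timestamp>2006-05-19T00:00:00Z</timestamp>
    <contributor><username>SomeUser</username></contributor>
    <text>== Heading ==
A [[link]] and a list:
* item one
* item two</text>
  </revision>
</page>"""

def summarize(page_xml):
    """Pull out the XML-level metadata and the raw article string."""
    page = ET.fromstring(page_xml)
    return {
        "title": page.findtext("title"),
        "timestamp": page.findtext("revision/timestamp"),
        "user": page.findtext("revision/contributor/username"),
        # Headings, links and lists are not XML; they are characters in here.
        "text": page.findtext("revision/text"),
        # Count of all XML elements in the record, including <page> itself.
        "xml_elements": sum(1 for _ in page.iter()),
    }

info = summarize(SAMPLE)
print(info["xml_elements"], "XML elements")
print(info["text"])
```

An XML parser sees the heading, the link, and the list items only as one opaque string, so any XML-specific modeling has little to work with beyond the metadata fields.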

I am not sure if using an XML file for a text compression comparison is a
good idea.

What about XML aware compressors? They can probably beat generic
compressors in your benchmark. Would you include XML parsing
compressors (like xmlppm) in your benchmark?

Sachin Garg [India]
http://www.sachingarg.com
