Re: a good TeX parser for use by software that needs to read TeX?
- From: <corff@xxxxxxxxxxxxxxxxxx>
- Date: 13 May 2006 20:35:53 GMT
Ben Crowell <"crowell06 at lightSPAMandISmatterEVIL.com"> wrote:
: Does anyone know of a good TeX parser for use by software that needs to
: read TeX? In particular, has anyone used the perl Text::TeX parser,
: http://search.cpan.org/~ilyaz/etext.1.6.3/eText/utils/Text/TeX.pm ?
It looks pretty old; I think CPAN dates it to 1997, though I may be mistaken.
: I have a 1500-page set of latex documents that I want to be able to
: generate html versions of, and the latex->html converters I've been
: able to find don't seem flexible enough, so I'm planning to write my
: own code in perl. Having a good parser already written for me would
This is a recurring problem. As long as the only parser that really parses
TeX/LaTeX is TeX itself with the LaTeX macro package, there are two
strategies:
1. The practical one, based on a number of assumptions. It may well be
that, despite the volume of 1500-odd pages, all your documents
pertain to a given domain of limited scope. Make a domain analysis
and gather some simple statistics in Perl on which TeX/LaTeX commands
actually occur in your text, and how often.
Write a regex which expresses the typical TeX command and catch every
occurrence (all in Perl) by looping with while (/\\([A-Za-z]+)/g). On
each hit, $1 contains the command name, so do simple bookkeeping like
$ccount{$1}++. Afterwards the keys of %ccount are the distinct
commands, so that you can say
for $command (sort keys %ccount) {print "$command: $ccount{$command}\n"}
This is a very untested idea expressed in Perl. You now have an
idea which commands appear most often, and as long as no catcode changes
munge everything up, you have a straightforward measure of whether you can
write your own ad-hoc "system" or tune an existing solution.
Admittedly, this solution works best as long as you do not want to
parse math or any other 2-dimensional material, and if the main thing
you are interested in is text flow in paragraphs. For reading text in
paragraph spoonfuls, set the input record separator $/ in Perl to "", like
local $/="";
while ($para=<TEXFILE>) {
# $para contains one full paragraph of TeX material,
# with its internal line breaks preserved, though.
}
2. The potentially not very usable one, and most likely a totally way-off
distraction: In analogy to dvips, write your own dvi "driver" which
takes TeX/LaTeX output and translates it into a combination of
text, positional mark-up and font information. The skeleton is in the
web2c sources, the dvi format is documented, and there are tools on
CTAN like undvi (spelling?) which attempt to extract the naked text
from dvi files. Maybe this is a way to go?
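To make strategy 1 a bit more concrete, here is a rough sketch that combines the command statistics with paragraph-wise reading. The subroutine names count_commands and count_commands_in_file are made up for illustration. The sketch assumes ordinary catcodes, does not strip % comments, and only counts control words (backslash plus letters); control symbols such as \\ or \% are skipped so that the second backslash of \\ is not mistaken for the start of a command.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count control words (a backslash followed by letters) in a string.
# Control symbols such as \\ or \% are consumed but not counted.
sub count_commands {
    my ($text, $counts) = @_;
    while ($text =~ /\\(?:([A-Za-z]+)|.)/gs) {
        $counts->{$1}++ if defined $1;
    }
    return $counts;
}

# Read a TeX file paragraph by paragraph and tally the commands.
sub count_commands_in_file {
    my ($path) = @_;
    my %counts;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/ = "";    # paragraph mode: records end at blank lines
    while (my $para = <$fh>) {
        count_commands($para, \%counts);
    }
    close($fh);
    return \%counts;
}

# When given a file name, print the most frequent commands first.
if (@ARGV) {
    my $counts = count_commands_in_file($ARGV[0]);
    for my $command (sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                     keys %$counts) {
        print "$command: $counts->{$command}\n";
    }
}
```

Run it with the name of one .tex file as its only argument. If a handful of commands turn out to account for nearly all hits, the ad-hoc route is probably feasible.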
: things. The version number is 0.01, which makes me wonder whether
: it's really just a project that was never completed successfully.
See above.
Depending on the type of material you want to process, a one-time Q&D
solution may be a useful approach.
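As an illustration of what such a one-time solution might look like, here is a minimal sketch that turns one paragraph of simple LaTeX into HTML. Everything in it is an assumption for the sake of the example: the table %tag_for, the choice of tags, and the names escape_html, wrap_in_tag and para_to_html are hypothetical. It only handles commands with a single braced argument, drops commands it does not know (keeping their argument), and does nothing sensible with math, environments or catcode tricks.

```perl
use strict;
use warnings;

# Hypothetical mapping from a few LaTeX commands to HTML tags.
my %tag_for = (
    emph       => 'em',
    textbf     => 'strong',
    texttt     => 'code',
    section    => 'h2',
    subsection => 'h3',
);

# Escape the characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Wrap an argument in the tag for a known command; for an unknown
# command keep only the argument text.
sub wrap_in_tag {
    my ($cmd, $arg) = @_;
    my $tag = $tag_for{$cmd};
    return defined $tag ? "<$tag>$arg</$tag>" : $arg;
}

# Convert one paragraph. Innermost \cmd{...} groups are replaced
# first, and the loop repeats until none are left, so simple
# nesting such as \emph{a \textbf{b}} works.
sub para_to_html {
    my ($para) = @_;
    my $html = escape_html($para);
    1 while $html =~ s/\\([A-Za-z]+)\{([^{}]*)\}/wrap_in_tag($1, $2)/ge;
    $html =~ s/\s+\z//;
    return $html =~ /\A<h\d>/ ? $html : "<p>$html</p>";
}
```

Feeding each paragraph read in paragraph mode through para_to_html gives a first rough HTML version; how far this gets you depends entirely on how few distinct commands your documents really use.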
Oliver.
--
Dr. Oliver Corff e-mail: corff@xxxxxxxxxxxxxxxxxx