Re: a good TeX parser for use by software that needs to read TeX?



Ben Crowell <"crowell06 at lightSPAMandISmatterEVIL.com"> wrote:
: Does anyone know of a good TeX parser for use by software that needs to
: read TeX? In particular, has anyone used the perl Text::TeX parser,
: http://search.cpan.org/~ilyaz/etext.1.6.3/eText/utils/Text/TeX.pm ?

It looks pretty old; I think CPAN dates it to 1997, but I may be mistaken.

: I have a 1500-page set of latex documents that I want to be able to
: generate html versions of, and the latex->html converters I've been
: able to find don't seem flexible enough, so I'm planning to write my
: own code in perl. Having a good parser already written for me would

This is a recurring problem. As long as the only parser around that really
parses TeX/LaTeX is TeX itself with the LaTeX macro package, there are two
strategies:

1. The practical one, based on a number of assumptions. It may well be
that, despite the volume of 1500-odd pages, all your documents
pertain to a given domain of limited scope. Make a domain analysis
and gather some simple statistics in Perl on which TeX/LaTeX commands
actually occur in your text, and how often.

Make a regex which expresses the typical TeX command and catch every
occurrence (all in Perl) with a loop like while (/\\([A-Za-z]+)/g).
On each hit, $1 contains the command name, say $found=$1, and you do
simple bookkeeping like $ccount{$found}++. The keys of %ccount are
then the distinct commands, so that you can say

for my $command (sort keys %ccount) {print "$command: $ccount{$command}\n"}

This is a very untested idea expressed in Perl. You now have an
idea of which commands appear most often, and as long as no catcodes munge
everything up, you have a straightforward measure of whether you can
write your own ad-hoc "system" or tune an existing solution.

Admittedly, this solution works best as long as you do not want to
parse math or any other 2-dimensional material, and if the main thing
you are interested in is text flow in paragraphs. For reading text in
paragraph spoonfuls, set the record separator in Perl to "", like

local $/="";
while ($para=<TEXFILE>) {
# $para contains full paragraphs of TeX
# material preserving line breaks, though.
}

2. The potentially not very usable one, and most likely a totally way-off
distraction: In analogy to dvips, write your own dvi "driver" which
takes TeX/LaTeX output and translates it into a combination of
text, positional mark-up and font information. The skeleton is in the
web2c sources, the dvi format is documented, and there are tools on
CTAN like undvi (spelling?) which attempt to extract the naked text
from dvi files. Maybe this is a way to go?
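
To give a feeling for what strategy 2 involves, here is a minimal sketch in
Perl that decodes only the dvi preamble: opcode 247, a format id byte, three
4-byte big-endian integers (num, den, mag), a length byte and a comment. The
function name parse_dvi_preamble and the synthetic sample bytes are my own
assumptions for illustration. A real driver would go on to interpret every
opcode after the preamble, which is where the actual work is.

```perl
use strict;
use warnings;

# Decode the preamble at the start of a dvi file image held in a string.
# Hypothetical helper name; layout as in the dvi format description:
# pre (247), id, num, den, mag, k, then k bytes of comment.
sub parse_dvi_preamble {
    my ($bytes) = @_;
    die "too short for a dvi preamble\n" if length($bytes) < 15;
    my ($op, $id, $num, $den, $mag, $k) = unpack('C C N N N C', $bytes);
    die "not a dvi preamble (opcode $op)\n" unless $op == 247;
    die "comment truncated\n" if length($bytes) < 15 + $k;
    return {
        id      => $id,
        num     => $num,
        den     => $den,
        mag     => $mag,
        comment => substr($bytes, 15, $k),
    };
}

# Synthetic preamble for demonstration only. With a real file you would
# open it, call binmode on the handle, and read the leading bytes.
my $comment = ' TeX output';
my $sample  = pack('C C N N N C a*',
                   247, 2, 25400000, 473628672, 1000,
                   length($comment), $comment);

my $pre = parse_dvi_preamble($sample);
printf "id=%d mag=%d comment=%s\n", $pre->{id}, $pre->{mag}, $pre->{comment};
```

Everything beyond this (fonts, positioning, characters) is what an existing
dvi-to-text tool already does, so look at those before writing your own.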

: things. The version number is 0.01, which makes me wonder whether
: it's really just a project that was never completed successfully.

See above.

Depending on the type of material you want to process, a one-time Q&D
solution may be a useful approach.
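
Such a quick-and-dirty pass, combining the command statistics and the
paragraph-wise reading from strategy 1, might look like the sketch below. It
assumes default catcodes, does not strip % comments, and will miscount a \\
line break that is directly followed by letters. The sub name count_commands
and the in-memory sample text are made up for illustration; for real use,
open your .tex files instead.

```perl
use strict;
use warnings;

# Count control words (a backslash followed by letters) in a chunk of
# TeX source. Control symbols such as \% or \, are not counted.
sub count_commands {
    my ($text, $count) = @_;
    while ($text =~ /\\([A-Za-z]+)/g) {
        $count->{$1}++;
    }
    return $count;
}

# Made-up sample standing in for a real TeX file.
my $sample = <<'TEX';
\section{Intro}
Some \emph{text} with
\emph{emphasis}.

A second paragraph with \textbf{bold}.
TEX

my %ccount;
my $paragraphs = 0;
{
    open my $fh, '<', \$sample or die "cannot open sample: $!";
    local $/ = "";    # paragraph mode: records end at blank lines
    while (my $para = <$fh>) {
        $paragraphs++;
        count_commands($para, \%ccount);
    }
    close $fh;
}

# Most frequent commands first, ties in alphabetical order.
for my $command (sort { $ccount{$b} <=> $ccount{$a} || $a cmp $b } keys %ccount) {
    print "$command: $ccount{$command}\n";
}
print "paragraphs: $paragraphs\n";
```

If the resulting list is short and dominated by a handful of commands, an
ad-hoc converter is probably feasible; a long tail of home-made macros
suggests otherwise.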

Oliver.
--
Dr. Oliver Corff e-mail: corff@xxxxxxxxxxxxxxxxxx