Re: conversion of MSWord files to (la)tex.



Alan Ristow wrote:
wexfordpress wrote:
On Jul 26, 11:48 am, anon k <nos...@xxxxxxx> wrote:
wexfordpress wrote:
Currently I save MSword doc files as rtf from a word processor like
Open Office Writer and then convert them to LaTeX with rtf2latex2e.
Unfortunately this process does not preserve page dimensions. Other
programs which purport to save doc files as LaTeX (e.g., Abiword,
Kword) either don't save the page dimensions at all or else save them
ineffectively. When processed through pdflatex they take the default
page size, e.g., 6 x 9 becomes 8.5 x 11.
Taking the rtf2latex2e method, is it a defect in the rtf2latex2e
program that fails to keep the page size or is it missing from the rtf
file to start with?
All replies appreciated.
John Culleton
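
One way to narrow down the question above is to look in the RTF file itself before blaming the converter. RTF records the paper size in the \paperw and \paperh control words, measured in twips (1440 to the inch). The following is only a rough Python sketch: the function name and the sample string are made up for illustration, and it says nothing about what rtf2latex2e actually does with those values.

```python
import re

TWIPS_PER_INCH = 1440  # RTF page dimensions are given in twips (1/1440 inch)

def rtf_page_size(rtf_text):
    """Return (width, height) in inches from the paperw/paperh control words.

    Returns None if either control word is missing from the RTF text.
    """
    width = re.search(r"\\paperw(\d+)", rtf_text)
    height = re.search(r"\\paperh(\d+)", rtf_text)
    if width is None or height is None:
        return None
    return (int(width.group(1)) / TWIPS_PER_INCH,
            int(height.group(1)) / TWIPS_PER_INCH)

# Hypothetical RTF header for a 6 x 9 inch page.
sample = r"{\rtf1\ansi\paperw8640\paperh12960\margl1440 Hello}"
print(rtf_page_size(sample))  # (6.0, 9.0)
```

If this returns None for a real file, the size was never written to the RTF; if it returns the expected size, the information is being lost later in the chain.
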
If your Word documents are not particularly complicated, it may be
faster to write a macro to generate the LaTeX than to search for a tool
that does just the right job. I haven't programmed in Word for a while
but am fairly sure that you could extract the page dimensions and use
them to construct parameters for \usepackage{geometry} or for
\documentclass{memoir}.
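
As an illustration of the last step only, once the page dimensions are known they can be turned into a preamble line for geometry. This is a small Python sketch rather than a Word macro; the function name is hypothetical, and it assumes the dimensions are already available in inches. The paperwidth and paperheight keys are geometry's own options.

```python
def geometry_line(width_in, height_in):
    """Build a geometry preamble line for a page of the given size in inches."""
    # %g drops a trailing ".0", so 6.0 is written as "6" and 8.5 stays "8.5".
    return "\\usepackage[paperwidth=%gin,paperheight=%gin]{geometry}" % (
        width_in, height_in)

print(geometry_line(6, 9))
# \usepackage[paperwidth=6in,paperheight=9in]{geometry}
```
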

The rest is not much more than replacing the structure styles with
\chapter{}, \section{} and so on, and italics with \emph{}. A lot of
that can be done with wildcard calls to search-and-replace.
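
To make the idea of replacing structure styles concrete, here is a minimal Python sketch. It assumes the paragraphs have already been pulled out of the document as (style name, text) pairs, which is the part a Word macro would have to do. The mapping from Word's built-in "Heading 1", "Heading 2" and "Heading 3" styles to \chapter, \section and \subsection is an assumption chosen for illustration, and all names in the code are hypothetical.

```python
# Assumed mapping from Word style names to LaTeX sectioning commands.
STYLE_TO_COMMAND = {
    "Heading 1": "chapter",
    "Heading 2": "section",
    "Heading 3": "subsection",
}

def paragraph_to_latex(style, text):
    """Wrap text in the command mapped to its style; pass other text through."""
    command = STYLE_TO_COMMAND.get(style)
    if command is None:
        return text
    return "\\%s{%s}" % (command, text)

def document_to_latex(paragraphs):
    """Join converted paragraphs with blank lines, as LaTeX expects."""
    return "\n\n".join(paragraph_to_latex(s, t) for s, t in paragraphs)

paragraphs = [
    ("Heading 1", "Introduction"),
    ("Normal", "Some body text."),
    ("Heading 2", "Background"),
]
print(document_to_latex(paragraphs))
```

Anything in an unmapped style is passed through unchanged, so manually formatted headings would come out as ordinary body text.
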

If you have to deal with tables, lists and master documents, however,
the time investment might suddenly increase.

It wouldn't surprise me if there's already a macro or two out there that
you could extend with some page-formatting code.

I take it you think that the MSWord code would be easier to decode
than RTF. Having looked at both I am not sure I agree. RTF code is
very complicated but MSWord code is expressed in non-printing binary
coding in many places.

I think he was referring to using Word's built-in Visual Basic interpreter to parse the Word document and dump it to a .tex file (or maybe not, but that's the first thing I thought of). In that case, you don't need to interpret the non-printing binary characters because Word will do it for you. Of course, it means learning Visual Basic. If you use Word's style sheets then you could write your Visual Basic code to associate certain styles with certain LaTeX commands (e.g., \section) and that sort of thing. I imagine it would be a bit messy, though, especially if your Word documents aren't structured (to the extent that Word can structure them, anyway) ...

I should have mentioned it is not "my" Word documents I am concerned
with but rather Word documents from customers etc.

... and that just about guarantees lack of structure, or at least lack of *consistent* structure.

Yes, I was thinking of writing Word macros, just as I thought I'd said. No need to decode binaries; Word does that when you load the document. Visual Basic macros are generally very easy to write and decode if you approach them in word-processing terms (e.g. use the native search-and-replace method, not a string-comparison algorithm of your own). Word is able to search by style or by formatting, if I remember rightly, and to replace the found text in clever ways. You can easily have it look for things like <em>*</em> in HTML, for example, strip off the tags, and italicize the remnant. And you can have it do the opposite, too.
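
The same tag-stripping idea can be sketched outside Word. The following Python uses regular expressions as a stand-in for Word's wildcard search-and-replace (it is not Word's own facility), and it produces \emph{} instead of applying italic formatting. The function names are hypothetical, and the reverse direction assumes the \emph argument contains no nested braces.

```python
import re

def em_to_emph(text):
    """Replace <em>...</em> with \\emph{...}, matching each pair non-greedily."""
    return re.sub(r"<em>(.*?)</em>", r"\\emph{\1}", text, flags=re.DOTALL)

def emph_to_em(text):
    """Do the opposite: turn \\emph{...} back into <em>...</em>.

    Only handles arguments without nested braces.
    """
    return re.sub(r"\\emph\{([^{}]*)\}", r"<em>\1</em>", text)

print(em_to_emph("a <em>very</em> big page"))
print(emph_to_em(r"a \emph{very} big page"))
```
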

It would be very frustrating, however, to write such a macro for an amateurish document in which the headings are all manually formatted, perhaps inconsistently so. And even worse if Word's automatic formatting system has assigned to them a plethora of inappropriate structure-related styles, in accord with Microsoft's policy of helpfulness.
