Re: Word + win32ole - how to find formatting of a word?



Hi! I'm trying to use Ruby and win32ole to parse a Word document. So
far, I'm able to extract the style and text of each paragraph. That
works great for converting each paragraph into an individual div (in the
HTML/CSS sense).

Now, inside the paragraphs, certain words have special formatting
(e.g. the name of a command, which is in monospace). I'm trying to find
out how to extract those special cases. Does anyone know how to achieve
that?


Dear Mohit,

You could save the Word file as HTML and then extract the relevant information.
I did that using OpenOffice and got a file containing the font information in the following form:


<BODY LANG="en-US" DIR="LTR">
<P STYLE="margin-bottom: 0in">A command in <FONT FACE="Linux Libertine">Linux
Libertine</FONT></P>
<P STYLE="margin-bottom: 0in">A text in <FONT FACE="Bitstream Charter, serif">Bitstream
Charter</FONT></P>
</BODY>

If you read in the text of that file as a String, you can then find the relevant bits using regexps.
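
To make that a bit more concrete, here is a rough sketch of the regexp approach in Ruby. It assumes the export looks like the snippet above (upper-case FONT tags with a FACE attribute, no FONT tags nested inside each other). The names FONT_SPAN and font_spans are made up for this example, and the HTML is inlined as a string; in practice you would read it from the exported file with File.read.

```ruby
# Sample HTML shaped like the OpenOffice export shown above. In real use
# this string would come from File.read on the exported file.
html = <<~HTML
  <BODY LANG="en-US" DIR="LTR">
  <P STYLE="margin-bottom: 0in">A command in <FONT FACE="Linux Libertine">Linux
  Libertine</FONT></P>
  <P STYLE="margin-bottom: 0in">A text in <FONT FACE="Bitstream Charter, serif">Bitstream
  Charter</FONT></P>
  </BODY>
HTML

# Matches one <FONT FACE="...">...</FONT> span (hypothetical helper constant).
# /m lets "." cross line breaks, /i tolerates lower-case tags, and the
# non-greedy ".*?" stops at the first closing tag. Nested FONT tags are
# NOT handled; that is an assumption of this sketch.
FONT_SPAN = /<FONT\s+FACE="([^"]*)"[^>]*>(.*?)<\/FONT>/mi

# Returns an array of { face:, text: } hashes, one per FONT span.
def font_spans(html)
  html.scan(FONT_SPAN).map do |face, text|
    # Collapse the line breaks the exporter wrapped into the span.
    { face: face, text: text.gsub(/\s+/, " ").strip }
  end
end

font_spans(html).each { |span| puts "#{span[:face]}: #{span[:text]}" }
```

If the exported HTML turns out to be messier than this (nested tags, SPAN elements with inline styles instead of FONT), a real HTML parser will probably be more robust than regexps.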

Best regards,

Axel

