Re: Splitting a text file into sentences



Ryan Leavengood wrote:
On 11/29/05, basi <basi_lio@xxxxxxxxxxx> wrote:

Yes, I learned this convention when I took a keyboarding (i.e.,
typing) lesson in high school. Some time ago, a style manual for
word processing appeared, and one piece of its advice was to use
only one space to separate sentences. The reason given is that in a
justified format, those two spaces can become four spaces, or even
more. Anyway, a lot of text now has one or two spaces between
sentences, so the spacing wouldn't be a reliable indicator of a
sentence boundary.


I too learned the two-spaces-after-a-period convention years ago, and
also recently learned that with modern fonts and word processors it
is not necessary. It was tricky to retrain myself, but I did, and
have been using just one space ever since.


So like you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false positives (possibly even replacing them with temporary markers, Mr. becomes $MISTER$ or similar), then splitting on punctuation. If you then test on various sample texts you should be able to find more false positives that you might have missed.
Which will not help you at all with foreign languages. And don't forget to put i.e., e.g., and etc. in the list.

This is an ongoing problem (think of the auto-correction 'feature' in OpenOffice or Word that capitalizes the first letter of every sentence, something I always turn off because it is so insistent when it's wrong).
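
For what it's worth, the filter-then-split idea quoted above could be sketched in Ruby roughly as follows. This is only a minimal sketch, not anyone's actual code from this thread: the abbreviation list, the $ABBRn$ marker format, and the method name split_sentences are all made up for illustration, and it assumes English text where a sentence ends in ., ! or ? followed by one or more whitespace characters.

```ruby
# Hypothetical abbreviation list; extend it as testing turns up more
# false positives.
ABBREVIATIONS = ["Mr.", "Mrs.", "Dr.", "i.e.", "e.g.", "etc."]

# Split text into sentences in three steps:
#   1. replace known abbreviations with temporary markers,
#   2. split after ., ! or ? when followed by whitespace (one space or more),
#   3. put the abbreviations back.
def split_sentences(text)
  masked = text.dup
  ABBREVIATIONS.each_with_index do |abbr, i|
    # String pattern, so the dots are matched literally.
    masked.gsub!(abbr, "$ABBR#{i}$")
  end

  # The lookbehind keeps the punctuation attached to its sentence.
  sentences = masked.split(/(?<=[.!?])\s+/)

  sentences.map do |s|
    s.gsub(/\$ABBR(\d+)\$/) { ABBREVIATIONS[$1.to_i] }
  end
end

text = "Mr. Smith arrived.  He brought fruit, e.g. apples. Was it enough?"
split_sentences(text).each { |s| puts s }
```

Because the split is on any run of whitespace, it does not matter whether the writer used one space or two. A known weakness of this sketch: an abbreviation that really does end a sentence ("... apples, pears, etc. Then he left.") is masked too, so that boundary is missed.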
Cheers,
V.-
--
http://www.braveworld.net/riva




