Re: RegExp split for Spell Check



Dr J R Stockton said the following on 11/27/2007 5:06 PM:
In comp.lang.javascript message <9M2dnRdB7ulq7dbaRVn_vwA@xxxxxxxxxxxx>,
Mon, 26 Nov 2007 20:26:36, Randy Webb <HikksNotAtHome@xxxxxxx> posted:
Dr J R Stockton said the following on 11/26/2007 11:27 AM:

I have wanted a customized personal dictionary of my own for a while
now. The biggest problem I have had was trying to find a text file that
had a word list in it that I could trust.

Try Google for a combination of two or three entirely unrelated unusual
words, and you'll start finding possible lists. Taghairm Octothorpe
seems a bit too obscure a pair; but maybe you don't do taghairm in the
USA (it's not in my Websters).

Not technically speaking, but we do have a practice that is close. We call them Political Elections.

But the following code generates an Object containing about 100,000
arbitrarily-named properties, and looks up one of them. The lookup
^^^^^^^^^^^
That bit applies to an earlier version of the code :-(.
appears to take no time at all, meaning under about 15 ms. That's good
enough (caveat - building the dictionary is slower!). P4/3G, XP.
Dic = {}
for (J=0 ; J<1e5; J++) Dic[String(Math.sin(J))] = J
T = new Date()
J=1e4 ; while (J--) X = Dic[String(Math.sin(98765-3*J))]
Y = [new Date()-T, X]
Y becomes [172, 98765] or [156, 98765]; each lookup takes about 16
microseconds in a dictionary of 100,000- or is the code not testing it
correctly?
It isn't doing a comparison. It creates the dictionary then it sets the
var X to the value of a possible entry 1000 times. It is looking up
1,000 entries which is 999 more than it has to look up.

The code is designed to do lookups for timing, without bothering with
the trivial matter of reporting success in an appropriate form.

The "looks up one of them" is what lead me to believe that your testing was flawed because that isn't what it did.

I'd not thought it necessary to explain that doing 1000 different
lookups was in order to take a measurable total time. X is the value of
the last lookup, as a check. BTW, changing the 1e4 did change the time
proportionately (as was confidently expected) and changing the 1e5
changed it much less (as was less confidently expected).

As you know, the 1e5 change would alter the creation time. The lookup time is pretty standard and testing with a 215,000 entry dictionary bears that out for me.

In the timed part, words that should be present are found. For those
who don't make too many errors, the time for a failed lookup is less
important. New test : insert +0.5 after 3*J . Virtually all lookups
now fail. Time taken is unchanged.

That is what I would expect. The only way I would expect a time change would be if the dictionary got so large that it was consuming a large portion of the available RAM to the point it was slowing the entire PC down. Or, if you were multi-tasking heavily and slowing the entire PC down.

The flaw in the test is what made me realize how to do a dictionary
and make it simple and fast. Instead of setting the Dic entry to J, set
it to 1. Then, to find out if a word exists in the dictionary or not
you simply test for it:

if(Dic['word here'])

I merely found it more convincing for the lookup to find the position.
Your code fragment only does the lookup, and does it in the same way as
my code. One can set the entries to true, and return either true or
undefined.

All that is needed for spell checking is success or failure, without knowing where it is in the dictionary listing.

If the entry is there, it will return 1, convert it to true. If the
entry doesn't exist, then it returns undefined and converts it to
false. Let the browser do the lookup.

That suggests that neither dictionary size nor lookup time need be a
major problem, if the dictionary is not vast.
To correct myself, and admit I was thinking about it wrong, I don't
think the lookup is a problem. A 214,000 word dictionary is roughly 4.5
mbs so a 25,000 word dictionary should, guessing, be around 500kb or
so. Not bad on a broadband connection but murder on a dialup
connection.

A dictionary should compress automatically over modern dial-up, if in
alphabetical order; and one can write algorithms to compress this
special case better.

500kb is still a huge file on dialup, even with compression.

For example, if the first N letters of a word are the same as the first N of the previous word (including the 26 instances
of N=0), replace them by N encoded in base-36. So I think 500kB, if you
mean that, is an over-estimate.

I guessed at the 500kb based on 215,000 entries being 4.5Mb. Creating a test file with 25,000 entries in it where each entry is 6 characters long - to create an "average" word length - the file was 439Kb so I wasn't far off. Of course, the actual size would depend on the 25,000 words you used.

It's still a lot for an arbitrary Web page; but not unreasonable even on dial-up if fetched on knowing demand and cached.

How bad the download time is on something is directly related to how bad the user wants it :)

Any idea where to find a reliable 25,000 word list?

If you are prepared to consider the spell-checker in a word processor
reliable, then just grab large quantities of plain text off the Net
(Project Gutenberg should have largely correct spellings, as should the
reports of your legislature), sort, deduplicate, and edit in the word
processor. If you take only lower-case words, you'll miss most proper
names.

I ordered an electronic Scrabble dictionary two days ago, should have it in a week or so. Now, it is more of a mental exercise than anything else. The major problem with word processor spell checkers is in trying to share the word list. That is what prompted me to order the Scrabble dictionary and be done with it.

--
Randy
Chance Favors The Prepared Mind
comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/
.



Relevant Pages

  • Re: [BUG] Something goes wrong with timer statistics.
    ... into the list to avoid a race with the fastpath lookup. ... can still point into the entries array. ... initialization of a new entry is finished upon insertion of that entry. ...
    (Linux-Kernel)
  • Re: [BUG] Something goes wrong with timer statistics.
    ... into the list to avoid a race with the fastpath lookup. ... can still point into the entries array. ... initialization of a new entry is finished upon insertion of that entry. ...
    (Linux-Kernel)
  • On-disk indexing for "Project Ideas" page
    ... The kernel usually does the lookup by starting at the beginning ... of the directory and going through, comparing each entry in turn. ... name cache described in Section 6.6. ...
    (freebsd-current)
  • Re: mailc entry in /etc/hosts
    ... I have following entries in a customer /etc/hosts file: ... mailc aaa.com hostname ... DW entry in /etc/sendmail.cf references the hostname given in the ... Lookup of the address for 'mailc' will always give 10.8.197.6. ...
    (comp.mail.sendmail)
  • Re: is vlookup with an inverted start point possible?
    ... Valko" wrote: ... The way that LOOKUP works is if the lookup_value is greater than all the ... Microsoft Excel MVP ... most recent entry. ...
    (microsoft.public.excel.worksheet.functions)