Re: How to make Forth interesting?



William James wrote:

Count the occurrences of distinct sequences of letters ("words") in
a text file, sorting the results primarily by the counts and
secondarily by the "words".

Ruby:

h = Hash.new(0)
IO.read("Bible--kjv10.txt").scan(/[a-z]+/i){|w| h[w] += 1}
puts h.map{|k,v| [v,k]}.sort.map{|a| a.join " "}

Standard Forth doesn't give you all the tools to do that. My natural
thought here is to set up some new wordlists whose hash function is the
first four characters. So they'll be mostly sorted. For each new word,
you look it up in each wordlist and if it isn't found anywhere, put it
in the first one.
If it's found, remove it from that wordlist and add it to the next one.
If it's found in the last wordlist, make a new wordlist and add it to
that one.
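
To make the bookkeeping concrete, here is a rough sketch of that
promotion scheme, in Ruby rather than Forth. Plain sets stand in for the
wordlists, so it ignores the four-character hash and the name-length
limit. The names levels, promote and count_by_promotion are invented for
the illustration and are not anybody's existing API.

```ruby
require "set"

# levels[i] holds the words seen exactly i + 1 times so far.
# It plays the role of the chain of wordlists; `levels` and `promote`
# are hypothetical names made up for this sketch.
def promote(levels, word)
  i = levels.index { |set| set.include?(word) }
  if i.nil?
    levels << Set.new if levels.empty?
    levels[0] << word                          # first sighting: first list
  else
    levels[i].delete(word)                     # found: remove it here ...
    levels << Set.new if i + 1 == levels.size  # ... grow the chain if it was in the last list
    levels[i + 1] << word                      # ... and add it to the next one
  end
  levels
end

def count_by_promotion(text)
  levels = []
  text.scan(/[a-z]+/i) { |w| promote(levels, w) }
  levels
end

levels = count_by_promotion("the cat and the dog and the bird")
levels.each_with_index do |set, i|
  set.sort.each { |w| puts "#{i + 1} #{w}" }
end
```

Reading the sets from first to last gives the words in order of count,
and sorting inside each set gives the secondary order by word.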

There's a limitation on word length, maybe 31 characters or maybe 63.
There's the extra work of sorting the words past the first four
characters. And you have to know a lot about how the particular Forth
dictionary works, and use various nonstandard functions.

OK, you can put each word in one wordlist and leave it there with a
count. You do a variant of WORDS once for each number of occurrences up
to the last, checking each word to see whether it has the right number
of occurrences and printing its name if so. Before printing, you check
whether the next word also matches the first four characters and, if so,
compare to see which goes first, recursively.

Well, you can leave the hash function alone and just give each name a
count and an entry into a linked list. After all the words are entered
you can make an array of number-of-occurrences and make a linked list of
words for each of them. The only nonstandard part is getting the name
from the body address. You could store the names again too, and make it
standard.
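
A sketch of that last layout, again in Ruby and only as an illustration:
count every word first, then build an array indexed by number of
occurrences, with a list of words hanging off each slot. Ruby arrays
stand in for the hand-built linked lists, and buckets_by_count is a name
invented here.

```ruby
# Count first, then build an array indexed by number of occurrences,
# each slot holding the words that occurred that many times.
# `buckets_by_count` is a hypothetical name used only for this sketch.
def buckets_by_count(text)
  counts = Hash.new(0)
  text.scan(/[a-z]+/i) { |w| counts[w] += 1 }

  max = counts.values.max || 0
  buckets = Array.new(max + 1) { [] }   # buckets[n] = words seen n times
  counts.each { |word, n| buckets[n] << word }
  buckets.map(&:sort)                   # secondary order: by the word itself
end

buckets_by_count("the cat and the dog and the bird").each_with_index do |words, n|
  words.each { |w| puts "#{n} #{w}" }
end
```

Slot 0 is always empty. It is kept so that the index is the count itself
and no offset arithmetic is needed.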

So, you have to make the linked lists yourself. Not nearly as compact as
the Ruby version. John Passaniti says if you have a language that's
particularly good for your problem, then use it. Forth as it's usually
provided is not particularly good for this problem. Better to use Ruby
unless it takes too long to figure out how to make that scan function
do just what you want and test it.

But if you want to get a solution really quick, it's better to use a
library. I put

Count the occurrences of distinct sequences of letters ("words") in a
text file

into Google and got some leads.

http://mycplus.com/source-code/java/count-the-words-in-a-file-java/
claims to do pretty much what you want, but at first glance the code
appears to be missing.

http://www.hermetic.ch/wfca/wfca.php
advanced version that does much more than you want with a full GUI
interface and many bells and whistles for $42.25

http://www.faqs.org/docs/javap/source/WordCount.java
The missing code from the first reference.

http://crl.nmsu.edu/cgi-bin/Tools/CLR/clrcat#L0
WLIST does part of what you want but I didn't check whether it did it
all.

http://www.scribd.com/doc/5557479/Extremely-Fast-Text-Feature-Extraction-for-Classification-and-Indexing#document_metadata
They claim they have an efficient algorithm but probably don't actually
output what you want.

http://www.allbusiness.com/accommodation-food-services/498941-1.html
They attempt to sell an advanced version of this program to the hotel
industry to analyse customer comments.

http://www.cs.utah.edu/dept/old/texinfo/gawk/gawk_19.html
They describe how to do this in Unix using tr, awk, and sort.

My natural thought here is to use the Java version and if necessary
massage it slightly. Somebody else has already written the code and
tested it -- why write a regexp (which is an inherently buggy and slow
operation) in Ruby when you can use the regexp somebody else has already
banged their head against in whatever language?

On the other hand, if you don't remember Java well enough to modify it
easily, the awk version comes in easy-to-manipulate pieces.

Could this functionality be reproduced in Forth? Sure. But why bother?
There's nothing special about it. Why go to the trouble to reproduce in
Forth what you can easily do some other way? Wil Baden did a lot of this
sort of thing in Forth and as far as I know nobody joined him much, and
eventually he switched to doing it in a language where he didn't have to
build all his own tools for himself, where there was a community of
others doing similar things.

Well, but if you want to do part of the work in Forth and part in
another language then you need them to communicate well. In Unix it
isn't hard to do that with pipes. Marcel Hendrix has proposed a
reasonably simple set of Forth words that can do a whole lot of pipe
stuff without having to consider the difference between Windows pipes
and Unix pipes.

A number of Forths communicate well with C, but there isn't yet
agreement about a standard syntax so the code doesn't port well among
them. If you can talk to C programs you can probably talk to other
languages that can talk to C programs.

Meanwhile, it might be useful to have more string stuff in Forth. The
problem has been that working with strings tends to be complicated,
often unnecessarily so. Lots of little detail stuff to keep track of.
Do it yourself and it's complex. Let the system do it and it's a lot of
stuff going on behind your back that breeds inefficiency of various
sorts. If the complicated string handling is something you're doing to
make your work easier, maybe it would be easier not to do it. But when
complicated string handling is the point in itself....

If there were going to be an extended string package beyond the string
words that are already there, what should be in it?