Re: Why R6RS is controversial
- From: Ray Dillinger <bear@xxxxxxxxx>
- Date: Tue, 29 May 2007 11:01:28 -0700
Alex Shinn wrote:
> Below, for the record, I summarize some of the more controversial
> issues people have with R6RS (based on the r5.93rs draft). I
> include both complaints I agree with and those I don't, but have
> undoubtedly missed many and misstated others, so if you have an
> issue not mentioned please reply with it (preferably in summarized,
> non-ranting form if you can restrain yourself).
Wow. This is excellent work you've done here, collecting all
this stuff in one place and explaining why it raises cause
for concern. I agree with most of these criticisms, actually.
> * IDENTIFIER-SYNTAX
> Identifier syntax means that the macro system can expand a single
> identifier even when not the first symbol in an expression. Thus
> when you see an identifier, it may not actually be a real variable
> reference, which can be confusing both for humans and other macros
> which want to analyze code. It is a substantial complication to
> the semantics of the language, with arguable benefits.
This is one of the things that gave me misgivings, but I
wasn't able to form a cogent argument against it. It is a
powerful weapon in the "obfuscated scheme" programming
contestant's arsenal, but it's not clear to me that most
programmers will use it that badly.
> * Exceptions
> Many things specified simply as "errors" in R5RS (with unspecified
> behavior) are now required to signal exceptions. The exceptions
> themselves fit within a complex hierarchy.
A complex and highly overspecified hierarchy. I am strongly
of the opinion that a very different and much simpler method
for handling such things is better. The one expressed in the
R6RS candidate appears to have semantics mostly copied from
other languages, and does not suit most of the other programming
paradigms that Scheme otherwise supports.
> * Module system
> - The enforced phase separation disallows some implementation
>   strategies and extensions.
> - The versioning system is complex and not obviously necessary.
>   Something like it could always be added later.
> - Libraries need to be wrapped in an extra layer of parentheses,
>   as opposed to a single definition at the top of the file.
Valid points all. An additional point is that the module
system becomes an additional barrier to the use of scheme
as a pedagogic language, because it's something that beginners
have to deal with before much of anything else works, and
long before it is possible to explain to them why.
> * Unicode
> The standard makes Scheme (non-optionally) Unicode-specific, and
> defines the character data-type as Unicode scalar values. This
> prevents small implementations which only want to deal with ASCII
> (e.g. in embedded systems), implementations which want to support
> Unicode but want a different meaning of character (e.g. grapheme
> clusters), and implementations which want to support a different
> character altogether. There are a number of alternative proposals
> to Unicode including Mojikyo, eKanji, TRON, and UTF-2000. Scheme
> has been around for a long time, Lisp even longer, and many are
> hesitant to wed themselves to a single character set forever.
I think that largely covers it. I do want to point out that the
behavior of grapheme-cluster characters under most linguistic
operations is *far* more reasonable, consistent, and logical,
from the POV of actual linguistics and what a student of those
natural languages would expect, than the codepoint characters
selected by the committee. Further, I strongly feel that
behavior which is more reasonable, consistent and logical to
users of natural languages written in those characters is much
more likely to be implementable in other representations of those
characters.
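
To make the codepoint-versus-grapheme-cluster point concrete, here is a minimal sketch in Python (used only because it is easy to try; nothing here is R6RS). The helper `clusters` is hypothetical and deliberately simplified: it glues each combining mark onto the preceding base character, which approximates but does not implement the full Unicode segmentation rules.

```python
import unicodedata

def clusters(s):
    """Split s into approximate grapheme clusters (base char + combining marks).

    Simplifying assumption: a cluster is one base character followed by any
    number of characters with a non-zero canonical combining class.
    """
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch          # attach the mark to the previous base
        else:
            out.append(ch)         # start a new cluster
    return out

# "resume" with two accented letters, spelled with combining acute accents
word = "re\u0301sume\u0301"

# Reversing code points strands the accents on the wrong letters ...
codepoint_reversed = word[::-1]

# ... while reversing clusters keeps each accent with its letter.
cluster_reversed = "".join(reversed(clusters(word)))

print(len(word), len(clusters(word)))
print(repr(codepoint_reversed), repr(cluster_reversed))
```

The word has eight code points but six clusters, and only the cluster-wise reversal yields text a reader of the language would recognize.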
The standard should specify binary I/O and primitives for
using binary I/O to build character ports, and then have unicode
I/O as a standard library - which need not be loaded for a
particular implementation or application. Unicode case operations
and other semantics should be another standard library, probably
a superset of the unicode I/O library.
> * STRING-REF is recommended to be constant time
> This discourages a number of implementation strategies that use
> variable width character encodings or alternate string
> representations such as ropes or trees. It is easy to provide a
> string API that is convenient to use and efficient for both
> traditional and alternative string representations.
Agree, again. Ropes with copy-on-write nodes are more efficient
as the strings grow longer. Once you're doing corpus linguistics,
there really is no alternative. This guarantees all atomic string
operations in either constant or logarithmic time with respect to
the length of the string, *and* automatically enables shared storage
for the actual character sequences when new strings are created by
minor modifications from old strings. Array strings, as implied
by this wording in the R6RS candidate, are more efficient only if
your strings are mostly under three kilobytes long.
The standard should not forbid either of these implementation
strategies; it should presume that the implementors (or the
users, if the implementor gives them a choice) know what they're
using the language for and can make a considered choice. It
should specify an API for strings, period.
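
As a rough illustration of the rope idea, the following Python sketch shows indexing by descent and an insert that shares untouched subtrees with the original string. All names (`Leaf`, `Concat`, `split`, `insert`, `to_string`) are hypothetical, and the tree is not rebalanced, so this toy version does not by itself guarantee the logarithmic bounds mentioned above; a real rope would rebalance.

```python
class Leaf:
    """A flat run of characters."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)

    def index(self, i):
        return self.text[i]


class Concat:
    """An internal node: the concatenation of two ropes."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length

    def index(self, i):
        # Descend into whichever child holds position i.
        if i < self.left.length:
            return self.left.index(i)
        return self.right.index(i - self.left.length)


def split(rope, pos):
    """Return (first pos characters, the rest), reusing whole subtrees."""
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:pos]), Leaf(rope.text[pos:])
    if pos <= rope.left.length:
        a, b = split(rope.left, pos)
        return a, Concat(b, rope.right)      # rope.right is shared, not copied
    a, b = split(rope.right, pos - rope.left.length)
    return Concat(rope.left, a), b           # rope.left is shared, not copied


def insert(rope, pos, text):
    """Return a new rope with text inserted at pos; the old rope is untouched."""
    left, right = split(rope, pos)
    return Concat(Concat(left, Leaf(text)), right)


def to_string(rope):
    if isinstance(rope, Leaf):
        return rope.text
    return to_string(rope.left) + to_string(rope.right)


base = Concat(Leaf("hello, "), Leaf("world"))
edited = insert(base, 7, "big ")
print(to_string(base))
print(to_string(edited))
```

Both strings remain usable after the edit, and the `"world"` leaf is stored once and referenced by both.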
> * Safety
> R6RS makes a (possibly too) strong claim about safety, and
> introduces an exception type for implementation restrictions.
Exceptions again. Highly overspecified again.
> * Square brackets
> Making [] identical to (), apart from all the arguments about
> which looks better, breaks the entire axiomatic spirit and
> prevents alternative extensions from using []. It also introduces
> a trivial stylistic distinction where none existed before, and
> puts Scheme among the ranks of languages where programmers need to
> agree on a style guideline (there are many variations already)
> before collaborating on a project.
Agree, again. I don't like them unless they mean something.
Given my druthers, they'd mean a simple vector instead of a list
in data and a syntax call instead of a procedure call in code.
But that would be a very fundamental change indeed and I don't
know if the resulting language would really be the same language.
> * CALL/CC
> A commonly enough used abbreviation for
> call-with-current-continuation used in talking about the operator,
> and already supported by some systems, some argue the operator is
> used rarely enough (and is supposed to be so) that the
> abbreviation isn't needed. At any rate, no other procedure in the
> language has two names.
I strongly suspect that the longer name will be disappearing
with R7RS or R8RS. Moreover, both names are now incorrect:
what the routine actually does could more accurately be
expressed by call/wc or call-with-winding-continuation.
> * Comment Syntax
> #; expression comments and #| ... |# block comments have been
> added to the language, though are not needed.
The "need" for expression comments, as far as I'm concerned,
just points out a(nother) limitation of our macrology, i.e.,
that one macro call can expand only into a single expression.
What the expression comment does is expand to zero expressions.
We ought to be able to define a macro that does that, or
expands to multiple expressions, easily.
The "need" for block comments, on the other hand, is not
really addressable by the language. You don't need them
if you have an editor that understands comment prefixes,
and you do if you don't.
> * Bytevector Syntax
> #vu8(...) reads as a bytevector. Bytevectors themselves are not
> so controversial, though people disagree on the names and any
> external representation.
Actually I object to these on the grounds that they
introduce de facto static typing to scheme. I think that
type should be an annotation or assertion added to an
otherwise correct procedure rather than something which
changes or specifies semantics.
> * STRING-NORMALIZE-*
> Normalization is hideously complicated, and may require many
> manual conversions back and forth and after any operation that may
> not have preserved normalization. A huge simplification very much
> worth consideration is a system that maintains all internal
> strings with a single consistent normalization, but explicitly
> allowing conversion to any of a number of specific normalization
> forms prevents this approach.
> A simpler API could just provide a single STRING-NORMALIZE
> procedure, which would normalize to a preferred internal
> normalization form, and in the case of an automatically
> normalizing implementation would just be the identity function.
Absolutely. It hugely overcomplicates things if your internal
strings are other than "a sequence of characters," full stop.
By overspecifying this, the R6RS candidate is setting up users
and implementors for endless hair and bugs. I had not considered
a string-normalize! procedure; my thought was simply that
normalization ought to have no semantics anywhere except in the
code implementing character I/O ports or converting strings
to/from bitvectors. Seriously: a string is just a sequence
of characters. Normalization doesn't mean anything on characters.
Normalization only means something on a particular representation
of characters, and nothing outside your I/O port code or conversion
to/from binary code ought to have to deal with the idiosyncrasies
of that particular representation. If for any reason you want to
write invalid data (a non-normalized string) to a character stream,
you are clearly not using them as "characters" - you are doing
something that would make more sense as binary I/O. Conversely,
if you read something and want the exact binary sequence, as
opposed to the sequence of characters in a normalized string,
you are clearly not reading "characters." Once again, you are
doing something that would make more sense as binary I/O.
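
A small Python sketch may help show why normalization is a property of a representation and not of the characters themselves. It uses the standard `unicodedata` module; `write_text` is a hypothetical stand-in for the port-boundary code described above, and the choice of NFC and UTF-8 is an assumption made only for the example.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# As raw code-point sequences the two spellings differ ...
same_codepoints = composed == decomposed

# ... but they are the same text once both are put in one normalization form.
same_text = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

def write_text(s, encoding="utf-8"):
    """Hypothetical port-boundary helper: normalize once, on the way out."""
    return unicodedata.normalize("NFC", s).encode(encoding)

print(same_codepoints, same_text)
print(write_text(composed) == write_text(decomposed))
```

With normalization confined to `write_text`, code that handles the strings never needs to know which spelling it was given.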
> As a separate extension it would be possible to provide utilities
> to normalize to bytevectors with specific normalization forms, for
> interaction with external tools.
Inside the code that implements character I/O ports and
binary-to-string and string-to-binary conversions. Never in
anything the users ought to be expected to write.
> * CHAR-*CASE
> Case mapping is an incompletely defined operation on characters
> when they are defined as Unicode scalar values, so it is likely
> that any algorithm using individual character case mappings
> instead of string case mappings is broken.
But drastically less broken if you are using grapheme-clusters
as characters rather than codepoints as characters. There is only
one extant case in unicode in which case mapping does not work
as a one-to-one mapping on grapheme-cluster characters.
If the standard requires codepoint characters only, then it would
be best to remove these procedures altogether. If the standard
permits representations on which the case relationships are less
broken, it would be better to keep them.
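
The difference can be seen with Python's string-level case mapping, sketched below. The `clusters` helper is hypothetical and simplified (base character plus following combining marks), and the two sample characters are chosen for illustration; whether the sharp s is the single exception referred to above is an assumption.

```python
import unicodedata

def clusters(s):
    """Approximate grapheme clusters: a base char plus its combining marks."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out

# U+01F0 (j with caron) has no single-code-point uppercase form:
# its uppercase is 'J' followed by a combining caron.
j_caron_upper = "\u01f0".upper()

# One code point in, two code points out, but still exactly one cluster,
# so the mapping is one-to-one on clusters though not on code points.
j_codepoints = len(j_caron_upper)
j_clusters = len(clusters(j_caron_upper))

# U+00DF (sharp s) uppercases to "SS": two clusters, so here the mapping
# is not one-to-one even at the cluster level.
sharp_s_upper = "\u00df".upper()
sharp_s_clusters = len(clusters(sharp_s_upper))

print(j_codepoints, j_clusters, sharp_s_upper, sharp_s_clusters)
```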
> * FILE-OPTIONS, BUFFER-MODE, EOL-STYLE, ERROR-HANDLING-MODE
> These optional arguments to opening file ports are also defined as
> syntax without reason. Four optional positional arguments is also
> unwieldy to some. Others would rather have EOL-STYLE managed by
> operations on a port (e.g. READ-LINE) rather than the port itself.
> The EOL-style itself arguably shouldn't bother with support for
> NEL or LS, and possibly should allow automatic detection. The
> ERROR-HANDLING-MODE is complex.
What these really amount to is dynamic-environment variables.
If we're going to keep introducing dynamic-environment variables
then clearly what we need is a reasonable semantics for dynamic
environments. After that, all of this stuff is just libraries
and more stuff like it, if desired, can be user-implemented.
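
One way to picture "dynamic-environment variables as a library" is the following Python sketch. The `Parameter` class, the `eol_style` variable, and `write_line` are all hypothetical names invented for this example; the sketch ignores threads and continuations, which a real design would have to address.

```python
import contextlib

class Parameter:
    """A dynamically scoped variable: rebinding lasts for a dynamic extent."""
    def __init__(self, default):
        self._stack = [default]

    def __call__(self):
        # The current value is the innermost active binding.
        return self._stack[-1]

    @contextlib.contextmanager
    def bind(self, value):
        self._stack.append(value)
        try:
            yield
        finally:
            self._stack.pop()      # restore the previous binding on exit


# A user-level setting, defined without any special syntax in the language.
eol_style = Parameter("lf")

def write_line(text):
    """Append the line terminator selected by the current eol_style."""
    terminator = {"lf": "\n", "crlf": "\r\n"}[eol_style()]
    return text + terminator

before = write_line("a")
with eol_style.bind("crlf"):
    during = write_line("a")
after = write_line("a")
print(repr(before), repr(during), repr(after))
```

Callers of `write_line` pass no extra positional arguments, and further settings of the same kind can be added by users in the same way.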
> * Binary vs. Text port distinction
> Some want it, some don't. The primary argument in favor of the
> distinction is efficient buffering of transcoded ports. The
> compromise would seem to be to make the distinction but allow
> implementations to optionally allow mixing procedures on both.
> The current draft makes a distinction, but does not specify what
> happens when binary procedures are applied to text ports and vice
> versa.
I think the standard did the right thing, here. You've got to
have text ports distinct from (or built by layering code on top
of) binary ports in order to support more than one way of
reading and writing characters. Since Unicode has three
normalization forms in two endiannesses and (at least) four
character encodings, there are at least 24 different ways to
interpret binary data as characters just in Unicode alone! If
you want an entity someone can call just to say "read a
character" it's got to be a closure over the encoding
information as well as whatever buffering is necessary.
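
The "closure over the encoding information" can be sketched in Python as a character reader layered on a binary port. `make_char_reader` is a hypothetical name; the sketch relies on the standard `codecs.getincrementaldecoder` and reads one byte at a time for simplicity, where a practical version would buffer larger chunks.

```python
import codecs
import io

def make_char_reader(binary_port, encoding):
    """Return a closure that reads one character at a time from binary_port.

    The closure captures the decoder state and a small buffer of characters
    that have been decoded but not yet handed out.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = []

    def read_char():
        while not pending:
            chunk = binary_port.read(1)
            if not chunk:
                # End of input: flush the decoder, then report EOF as None.
                pending.extend(decoder.decode(b"", final=True))
                return pending.pop(0) if pending else None
            pending.extend(decoder.decode(chunk))
        return pending.pop(0)

    return read_char


text = "a\u00e9"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

read_utf8 = make_char_reader(io.BytesIO(utf8_bytes), "utf-8")
read_utf16 = make_char_reader(io.BytesIO(utf16_bytes), "utf-16-le")

from_utf8 = [read_utf8(), read_utf8(), read_utf8()]
from_utf16 = [read_utf16(), read_utf16(), read_utf16()]
print(utf8_bytes != utf16_bytes, from_utf8 == from_utf16)
```

The two byte streams differ, yet both readers deliver the same characters, because each closure carries its own encoding and buffering state.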
> * Pair and string mutation moved to separate libraries
> SET-CAR!, SET-CDR! and STRING-SET! have been moved to separate
> libraries. Pairs and strings are still mutable, so this does
> nothing to change the semantics of the language or even to help
> optimizations (it would require a global compiler to detect that
> these modules were never imported, but at that point it's trivial
> for the compiler to simply detect that these individual procedures
> aren't used). It is thus simply a gesture of moving towards a
> more functional Scheme. Some people disagree, others think the
> gesture is silly.
I think the gesture is silly. Oh, maybe there's a rationale
in that if you want guarantees that code is purely functional
you can just forbid the use of this library (and vectors, and
several other things). But it's silly. If you want a functional
lisp, you can do that. But that's not what scheme is for.
Scheme is for "any paradigm you've got, you can use scheme to
program in it."
Bear