Re: Sets and portability (was) Re: Is ISO Pascal compatible with J&W (original) Pascal ?



On 2005-06-30, Jason Burgon <gvision@xxxxxxxxxxxx> wrote:
>> > what Delphi (and FPC, and GPC?) had to do with strings. The principle
>> > is just the same.
>>
>> The difference is that strings become large only if the user
>> explicitly puts long data in them, which doesn't normally happen
>> accidentally, whereas, say, a set of all Unicode letters implicitly
>> requires more space (probably in any representation, as it's rather
>> irregular) than in a 7/8 bit charset.

I agree with Scott that a set of <16-bit> may still be doable as a plain
bitmap (65536 bits, so 8 KB per set). A set over a 32-bit base type must
of course be sparse.

> (1) The vast majority of code in any complex program is library code (be it
> your own or someone else's), and that needs to be as flexible as practical.
> So library code would be better if it could handle huge sets.

Agreed. Call that the "new code" perspective.

> (2) The computer world is more complex than it's ever been (eg Unicode)
> and will just get more so. So why make life even more difficult for Pascal
> programmers by obsoleting their (eg: character) set library code? Again
> the Delphi WideString type is a good example of providing familiar
> mechanisms for dealing as seamlessly as is possible with the added
> complexity of the 21st century.

Clean code will indeed keep working, since it does not depend on char
tricks. A reference counted system like ansistring ensures reasonable
performance even for existing code that is less than optimal.

The problem with widestring is that it differs between the Windows and Linux
editions of Delphi. On Windows it is a COM BSTR, in Kylix it is more like an
ansistring (but with 16-bit chars). So I wouldn't use it as an example, unless
you mean the Kylix version.

> (3) Like your ~average~ string, a clever huge set implementation (like mine
> ;-) of an ~average~ huge set is likely to be quite sparse or have large
> areas of contiguous members, and wouldn't therefore use up huge amounts of
> memory.

True. And refcounting (copy-on-write) would ensure that existing code that
only reads a set, but for some reason passes it by value, will still perform
not too shabbily.
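
As a rough illustration of the copy-on-write point, here is a minimal Free
Pascal sketch that stores a bitmap set in an ansistring, so that it inherits
the string's reference counting. TBigSet, NewSet, SetBit, HasBit and
CountMembers are names made up for this example; this is not an existing RTL
or compiler feature, just a way to show that a by-value, read-only routine
does not pay for a copy.

```pascal
program cowset;
{$mode objfpc}{$H+}

type
  { Hypothetical: a bitmap set kept in an ansistring, 8 elements per char. }
  TBigSet = ansistring;

function NewSet(MaxElem: longint): TBigSet;
begin
  Result := StringOfChar(#0, MaxElem div 8 + 1);
end;

procedure SetBit(var s: TBigSet; e: longint);
begin
  { Writing through the index makes the string unique first (copy-on-write). }
  s[e div 8 + 1] := chr(ord(s[e div 8 + 1]) or (1 shl (e mod 8)));
end;

function HasBit(const s: TBigSet; e: longint): boolean;
begin
  Result := (ord(s[e div 8 + 1]) and (1 shl (e mod 8))) <> 0;
end;

{ Legacy-style routine: takes the set by value but only reads it,
  so only the reference count changes and the 8 KB buffer is shared. }
function CountMembers(s: TBigSet; MaxElem: longint): longint;
var
  i: longint;
begin
  Result := 0;
  for i := 0 to MaxElem do
    if HasBit(s, i) then
      inc(Result);
end;

var
  a, b: TBigSet;
begin
  a := NewSet(65535);
  SetBit(a, 65);
  SetBit(a, 1000);
  b := a;             { shares the buffer, no copy yet }
  SetBit(b, 2000);    { first write to b makes b a private copy }
  writeln(CountMembers(a, 65535));   { prints 2 }
  writeln(CountMembers(b, 65535));   { prints 3 }
end.
```

The set a is unchanged by the write to b, which is the behaviour old
value-semantics code expects.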

> (4) Sure, a typical set of Unicode chars (say all uppercase characters) will
> likely use more memory than a set of 7/8bit char (but in my case, not that
> much more).

Unicode needs 32 bits per codepoint in a fixed-width encoding (the code space
itself ends at U+10FFFF), though only on the order of 100000 codepoints are
assigned.

While the 16-bit chars (mainly the first 40-48k) are the most used, I wouldn't
rule out the codepoints of the 32-bit space in a new set/string design.
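
To make the sparse idea from point (3) concrete, here is a minimal Free Pascal
sketch of a set kept as a sorted list of inclusive ranges, with membership
tested by binary search. It works the same for codepoints above the 16-bit
range. TRange, TSparseSet, AddRange and InSet are hypothetical names, and the
sketch assumes ranges are added in ascending order without overlap. A real
implementation would have to merge and split ranges.

```pascal
program sparseset;
{$mode objfpc}

type
  TRange = record
    First, Last: longword;   { inclusive bounds }
  end;
  { Assumption: kept sorted by First, ranges never overlap. }
  TSparseSet = array of TRange;

{ Appends a range; the caller must add ranges in ascending order. }
procedure AddRange(var s: TSparseSet; aFirst, aLast: longword);
begin
  SetLength(s, Length(s) + 1);
  s[High(s)].First := aFirst;
  s[High(s)].Last := aLast;
end;

{ Binary search over the ranges. }
function InSet(const s: TSparseSet; e: longword): boolean;
var
  l, r, m: longint;
begin
  Result := false;
  l := 0;
  r := High(s);
  while l <= r do
  begin
    m := (l + r) div 2;
    if e < s[m].First then
      r := m - 1
    else if e > s[m].Last then
      l := m + 1
    else
    begin
      Result := true;
      exit;
    end;
  end;
end;

var
  caps: TSparseSet;
begin
  AddRange(caps, $41, $5A);         { Latin A..Z }
  AddRange(caps, $410, $42F);       { Cyrillic capitals }
  AddRange(caps, $10400, $10427);   { Deseret capitals, beyond 16 bits }
  writeln(InSet(caps, $42));        { TRUE }
  writeln(InSet(caps, $61));        { FALSE: lowercase a }
  writeln(InSet(caps, $10400));     { TRUE }
  writeln(Length(caps) * SizeOf(TRange));   { 24 bytes for three ranges }
end.
```

The memory cost follows the number of contiguous runs, not the size of the
base type, which is why an irregular set costs more than a regular one.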

Such problems have been discussed in FPC circles before, and I think it will
eventually come to _three_ stringtypes UTF8/UTF16/UTF32 with autoconversions
between them and directives to alias one to default identifiers (now:
widestring, in the future maybe also string).

{$stringtype short/ansi/utf8/utf16/utf32/comstr}

Note that ansi->wide conversion is codepage sensitive. I haven't reached a
conclusion on whether the codepage must be set at run time (from now on,
assume all ansi->wide conversions are cp857 or some windows convention) or at
compile time (a directive, and the compiler links in the correct conversion
code or table).

The good part of doing this at run time is that you can let your program's
user specify which encoding he uses for all plain text. The bad part is bloat:
a few tens of KBs (even 100s) of conversion tables _IF_ they cannot be
obtained from the OS or shared libs.
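
A minimal Free Pascal sketch of the run-time variant, assuming single-byte
codepages only: each codepage is a 256-entry table and a pointer selects the
active one. TCodepageTable, ActiveTable and AnsiToWide are hypothetical names.
Only two facts about real encodings are used: Latin-1 bytes map to the same
codepoints, and byte $80 is the euro sign (U+20AC) in Windows-1252. The second
table is deliberately incomplete, so this is not a usable cp1252 converter.

```pascal
program cpconv;
{$mode objfpc}{$H+}

type
  TCodepageTable = array[byte] of widechar;
  PCodepageTable = ^TCodepageTable;

var
  Latin1, Partial1252: TCodepageTable;
  ActiveTable: PCodepageTable;   { chosen at run time, e.g. by the user }

procedure InitTables;
var
  i: integer;
begin
  for i := 0 to 255 do
    Latin1[i] := widechar(i);          { Latin-1 is the identity mapping }
  Partial1252 := Latin1;
  Partial1252[$80] := widechar($20AC); { one real cp1252 entry, rest omitted }
end;

{ Table-driven conversion: one lookup per input byte. }
function AnsiToWide(const s: ansistring): widestring;
var
  i: integer;
begin
  SetLength(Result, Length(s));
  for i := 1 to Length(s) do
    Result[i] := ActiveTable^[ord(s[i])];
end;

var
  src: ansistring;
  w: widestring;
begin
  InitTables;
  src := chr($80);

  ActiveTable := @Latin1;
  w := AnsiToWide(src);
  writeln(ord(w[1]));    { 128 }

  ActiveTable := @Partial1252;
  w := AnsiToWide(src);
  writeln(ord(w[1]));    { 8364, i.e. $20AC }
end.
```

Each such table costs 512 bytes, so the bloat comes from the number of
codepages linked in, and from multi-byte encodings that need far bigger
tables.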

Microsoft had valid reasons at the time to go for 16-bit, but that doesn't
mean we should repeat that.

> (5) Compilers would still be free to represent small sets in a
> speed-efficient (eg: linear bitmap) way. A really clever one would even
> allow the programmer to decide for speed vs size.

Yes, there are 4 types:

1. registers
2. static sets
3. ref counted, dynamically allocated sets.
4. dynamically allocated sparse sets, possibly ref counted.

The order 1 -> 4 is also roughly how you would change the type as the
number of elements gets higher.

One could specify the transitions from 2->3 and from 3->4 on the command
line, e.g. to mimic the behaviour of a legacy Pascal compiler. Changing them
probably requires RTL recompilation though.

Conversions between these representations are not necessary, since set of x
and set of y with x<>y are not compatible anyway.
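
The selection among the four types could be sketched as below, in Free
Pascal. TSetRepr, ChooseRepr and both limits are hypothetical; the values are
assumptions picked for illustration (256 being the classic Turbo Pascal set
limit), not settings of any existing compiler.

```pascal
program setrepr;
{$mode objfpc}{$H+}

type
  TSetRepr = (srRegister, srStatic, srRefCounted, srSparse);

const
  RegisterBits = 32;   { assumption: sets up to one 32-bit register }
  ReprName: array[TSetRepr] of string =
    ('register', 'static', 'refcounted', 'sparse');

var
  { Hypothetical tunables for the 2->3 and 3->4 transitions, in elements. }
  StaticLimit: int64 = 256;
  DenseLimit: int64 = 65536;

{ Picks a representation from the number of elements in the base type. }
function ChooseRepr(elements: int64): TSetRepr;
begin
  if elements <= RegisterBits then
    Result := srRegister
  else if elements <= StaticLimit then
    Result := srStatic
  else if elements <= DenseLimit then
    Result := srRefCounted
  else
    Result := srSparse;
end;

begin
  writeln(ReprName[ChooseRepr(8)]);         { register }
  writeln(ReprName[ChooseRepr(256)]);       { static: set of char }
  writeln(ReprName[ChooseRepr(65536)]);     { refcounted: set of 16-bit char }
  writeln(ReprName[ChooseRepr(1114112)]);   { sparse: whole Unicode code space }
end.
```

Since the choice depends only on the base type, it is fixed per set type at
compile time, which fits the remark that no conversions are needed.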

> (6) If my code is typical, then the number of [character] set instances is
> at least 2-3 orders of magnitude less than the number of string instances
> and other variables I have. IOW, the (huge) sets I do have might be larger,
> but still only constitute a tiny fraction of the total memory requirement of
> my programs.

True. And you could always recode the worst library routines.