Re: Sets and portability (was) Re: Is ISO Pascal compatible with J&W (original) Pascal ?



Marco van de Voort <marcov@xxxxxxxx> wrote:
> On 2005-06-30, Jason Burgon <gvision@xxxxxxxxxxxx> wrote:
>>> > what Delphi (and FPC, and GPC?) had to do with strings. The principle
>>> > is just the same.
>>>
>>> The difference is that strings become large only if the user
>>> explicitly puts long data in them which doesn't normally happen
>>> accidentally, whereas, say, a set of all Unicode letters implicitly
>>> requires more space (probably in any representation as it's rather
>>> irregular) than in a 7/8 bit charset.
>>
>> (1) The vast majority of code in any complex program is library code (be it
>> your own or someone else's), and that needs to be as flexible as practical.
>> So library code would be better if it could handle huge sets.

I didn't say it wasn't needed -- quite the opposite actually. For
strings, the user can control the length by the data they process;
with sets the size explosion happens automatically when switching
charsets, even when processing the same data.

>> (2) The computer world is more complex than it's ever been (eg Unicode)
>> and will just get more so. So why make life even more difficult for Pascal
>> programmers by obsoleting their (eg: character) set library code? Again
>> the Delphi WideString type is a good example of providing familiar
>> mechanisms for dealing as seamlessly as is possible with the added
>> complexity of the 21st century.
>
> Clean code will indeed keep working, since it operates on the basic char
> tricks. A reference counted system like ansistring ensures some performance
> even with existing code that is not that optimal.
>
> Widestring has the problem that it differs between the Windows and Linux
> editions of Delphi. In one it is a COM bstr, in kylix more like an
> ansistring (but then 16-bit). So I wouldn't use it as an example, unless you
> mean the Kylix version.

I'm not too familiar with those Borland/Windows particulars. Anyway,
longer strings and "wider" (16 or 32 bit) chars have never been a real
problem. The Pascal `Char' type can be this size (unlike C, it isn't
required to be 1 byte). Both standard Pascal fixed-strings and
Extended Pascal strings can be as large as the integer range. (Only
the UCSD/BP short strings were problematic, being limited to 255
chars.)

>> (3) Like your ~average~ string, a clever huge set implementation (like mine
>> ;-) of an ~average~ huge set is likely to be quite sparse or have large
>> areas of contiguous members, and wouldn't therefore use up huge amounts of
>> memory.

Not necessarily. AFAIK, Unicode letters already are rather
fragmented. Of course, and that's the good thing, a typical program
probably won't use many different sets of such kind (letters,
upper/lower case, digits, punctuation, etc.). This might save the
day.
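
To illustrate why fragmentation matters for a range-based sparse set,
here is a minimal sketch (in Python rather than Pascal, purely for
brevity). It is not Jason's implementation; `to_ranges` and `contains`
are hypothetical names, and the "fragmented" data is artificial. A set
with long contiguous runs collapses to a few ranges, while an irregular
one needs roughly one range per member.

```python
import bisect

def to_ranges(members):
    """Collapse a collection of ordinals into sorted (low, high) ranges."""
    ranges = []
    for m in sorted(set(members)):
        if ranges and m == ranges[-1][1] + 1:
            # Extends the current run by one element.
            ranges[-1] = (ranges[-1][0], m)
        else:
            # Starts a new run.
            ranges.append((m, m))
    return ranges

def contains(ranges, x):
    """Membership test by binary search over the range list."""
    i = bisect.bisect_right(ranges, (x, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= x <= ranges[i][1]

# Contiguous case: the ASCII letters 'A'..'Z' and 'a'..'z'.
ascii_letters = to_ranges(list(range(65, 91)) + list(range(97, 123)))

# Artificially fragmented case: every second ordinal below 100.
fragmented = to_ranges(range(0, 100, 2))

print(len(ascii_letters), "ranges for 52 members")
print(len(fragmented), "ranges for 50 members")
```

The storage cost follows the number of runs, not the number of members,
so how well this works for Unicode letters depends on how fragmented
they really are.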

> True. And refcounting (copy on write) would ensure that original code that
> is read-only, but passes somehow by value will still work not too shabby.

Probably. Of course, most code shouldn't even need this as it
probably won't pass such sets by value. So even a dumb implementation
might work to some extent with 16 bit charsets.

>> (4) Sure, a typical set of Unicode chars (say all uppercase characters) will
>> likely use more memory than a set of 7/8bit char (but in my case, not that
>> much more).
>
> Unicode is 32-bit, though only on the order of 100000 code points are assigned.

AFAIK, strictly speaking Unicode is 16 bit, and UCS is 32 bit, but
that's nitpicking. According to Wikipedia, UCS has over 1.1 million
"code points" already. But even this would be livable (135 KB per
set) today. A UCS `Char' could be a 32 bit type, but with a suitable
range (instead of the full 4 billion), so even a dumb `set of Char'
could just barely work in practice.
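
A quick check of that estimate, sketched in Python for brevity. It
assumes the dumbest representation, one bit per possible element;
`bitmap_bytes` is an illustrative name, and the limit 0x10FFFF is the
top of the Unicode/UCS code space as I understand it, not something
taken from a particular compiler.

```python
def bitmap_bytes(cardinality: int) -> int:
    """Bytes needed for a plain bitmap set with one bit per element."""
    return (cardinality + 7) // 8

# Assumption: Char restricted to the range 0..0x10FFFF.
UCS_CODE_POINTS = 0x10FFFF + 1

sizes = {
    "8 bit Char": bitmap_bytes(2 ** 8),
    "16 bit Char": bitmap_bytes(2 ** 16),
    "UCS range": bitmap_bytes(UCS_CODE_POINTS),
    "full 32 bit": bitmap_bytes(2 ** 32),
}

for name, size in sizes.items():
    print(f"{name:12} {size:>12} bytes")
```

With the restricted range a set takes 136 KiB, in line with the rough
figure above; with the full 32 bit range it would take 512 MiB, which
is why the suitable range matters.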

> Note that ansi->wide conversion is codepage sensitive. I haven't reached a
> conclusion if this must be set runtime (from now on, assume all ansi->wide
> conversions are cp857 or some windows convention) or compiletime (directive,
> compiler links in correct conversion code or table).

If this means roughly the same in Windows as iso-8859-1 AKA
latin1 etc. mean elsewhere, I think it should be runtime.

> The good part of doing this at runtime is that you can let your program's user
> specify what encoding he uses for all plain text. The bad part is bloat of a few
> tens of kbs (even 100s) for conversion tables _IF_ they cannot be gotten from
> the OS or shared libs.

Yes. At least on modern Unix systems, both the tables and readily
available conversion functions exist. You might not have to
reinvent the wheel.
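
A minimal sketch of what choosing the encoding at runtime looks like,
in Python for brevity. Python's built-in codec tables stand in here for
whatever the OS or a shared library provides; `to_wide` is a
hypothetical helper, not an existing Pascal or Delphi routine.

```python
import codecs

def to_wide(raw: bytes, encoding: str) -> str:
    """Convert 8 bit text to wide text using an encoding named at runtime."""
    # Raises LookupError early if no table exists for this name.
    codecs.lookup(encoding)
    return raw.decode(encoding)

# The same byte gives different wide characters depending on the charset.
raw = b"\xa4"
as_latin1 = to_wide(raw, "iso-8859-1")    # U+00A4, currency sign
as_latin9 = to_wide(raw, "iso-8859-15")   # U+20AC, euro sign

print(hex(ord(as_latin1)), hex(ord(as_latin9)))
```

Since the same input data maps to different code points, the choice
cannot be fixed safely at compile time unless the program only ever
sees one charset.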

> Yes, there are 4 types:
>
> 1. registers
> 2. static sets
> 3. ref counted, dynamically allocated sets.
> 4. dynamically allocated sparse sets, possibly ref counted.
>
> The order 1 -> 4 is also roughly how you would change the type as the
> number of elements gets higher.
>
> One could specify the transitions from 2->3 and from 3->4 on the cmdline,
> e.g. to mimic behaviour of a legacy pascal compiler.

I don't understand this point. I think if several set models are
provided, then automatic conversion between all of them should be
done wherever necessary. This may indeed be the hardest part.
(That's independent of charsets, of course.)

> Conversions are not necessary, since only set of x; and set of y with x<>y
> are not compatible anyway.

They are compatible if x and y are compatible. So you need the
conversions unless for sets of subranges you choose the
representation applicable to the base type; i.e., set of 1..10 would
need the same representation as set of Integer then, which is just
what you usually want to avoid by providing several
representations. So you probably will need the conversions.
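
A rough sketch of the kind of conversion I mean, in Python for brevity.
Everything here is hypothetical: `SmallSet` stands for a bitmask
representation of a subrange such as 1..10, `SparseSet` for whatever
representation a set of Integer would get, and the two functions are
what a compiler would have to insert on assignment between them.

```python
class SmallSet:
    """Bitmask representation for a set of a subrange low..high."""

    def __init__(self, low, high, members=()):
        self.low, self.high, self.bits = low, high, 0
        for m in members:
            self.include(m)

    def include(self, m):
        if not self.low <= m <= self.high:
            raise ValueError(f"{m} out of range {self.low}..{self.high}")
        self.bits |= 1 << (m - self.low)

    def members(self):
        width = self.high - self.low + 1
        return [self.low + i for i in range(width) if self.bits >> i & 1]


class SparseSet:
    """Sorted list of members, standing in for a large-base-type set."""

    def __init__(self, members=()):
        self.items = sorted(set(members))


def small_to_sparse(s):
    """Widening conversion: always succeeds."""
    return SparseSet(s.members())


def sparse_to_small(s, low, high):
    """Narrowing conversion: raises ValueError if a member is out of range."""
    return SmallSet(low, high, s.items)


small = SmallSet(1, 10, [2, 3, 5])
big = small_to_sparse(small)
back = sparse_to_small(big, 1, 10)
print(big.items, bin(back.bits))
```

The narrowing direction needs a range check, just like assigning an
Integer to a subrange variable does.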

Frank

--
Frank Heckenbach, frank@xxxxxxxx, http://fjf.gnu.de/
GnuPG and PGP keys: http://fjf.gnu.de/plan (7977168E)
Pascal code, BP CRT bugfix: http://fjf.gnu.de/programs.html
Free GNU Pascal Compiler: http://www.gnu-pascal.de/