Re: A taxonomy of types




"Rod Pemberton" <do_not_have@xxxxxxxxxxxxx> wrote in message
news:gmanbg$7ad$1@xxxxxxxxxxx
"cr88192" <cr88192@xxxxxxxxxxx> wrote in message
news:gmaaub$835$1@xxxxxxxxxxxxxxxxxxxx
"Rod Pemberton" <do_not_have@xxxxxxxxxxxxx> wrote in message
news:gm773r$hr4$1@xxxxxxxxxxxxxxxxxxxxx
"James Harris" <james.harris.1@xxxxxxxxxxxxxx> wrote in message

news:3644adc7-cc3e-4b4e-9961-64a5588b6ef3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On 30 Jan, 08:54, "Rod Pemberton" <do_not_h...@xxxxxxxxxxxxx> wrote:

Take a 16-bit integer. I need a way to state: "the next 2 bytes make
up a 16-bit signed integer," or, "the next four bytes are an IEEE 32-
bit float," for example.

I'd use unsigned integers with bits as flags... :-)


problem:
using flags and packing info into the bits of an integer does not scale
well, and very quickly it comes to a point where one has to set up a
special bit pattern that is like:
this type integer is actually an index into an array of type-definition
structures.

1) How is this any different from using utf-8 (as you were wanting to do
in
your assembler)?


UTF-8 is still text...

but, alas, the UTF-8 support is not so much for type definitions, but more
so that string literals can contain extended characters...

so, now it is possible to be like:
db "foo\u2345bar\u0021", 0

'\xHH' may also be used, but it inserts a raw byte.

dw "foo\u2345bar\u0021", 0

now will also work, but is implemented by internally converting the parsed
string from UTF-8 to UTF-16. as a natural result, using '\xHH' with dw is
not safe outside the ASCII range (0x00-0x7F), but '\XHHHH' has been added,
which serves a similar purpose but inserts raw 16-bit values...

'\uHHHH' and '\UHHHHHHHH' are used, but are intended for proper characters.

as-is, now the whole U+0000 to U+10FFFF range is supported...

note that with 'db' at present '\u0000' will come out in the "Modified
UTF-8" sequence (0xC0 0x80), but in the case of 'dw', it will come out as
0x0000.

however, elsewhere in my project (off in the dynamic typesystem, ...), I
ended up mapping the NULL sequence to a different codepoint for UTF-16
(since I am using NULL-terminated strings), and so I have used U+10FFFF
(encoded as 0xDBFF 0xDFFF).

another considered extension consists of using the PUA space as a means of
somewhat extending the code space (sort of like with surrogate pairs, but
allowing up to 64-bit values). but, alas, there is not so much need for this
at present...


but, alas, a lot of this code is a bit of a mix between ASCII, UTF-8, M-UTF,
CESU-8, UTF-16, ... and often with little discrimination made between them
(a lot of the code has to deal with whatever it is given... although I
usually distinguish between 8-bit and 16-bit encodings though...).


2) Are you saying he'll need more than 8 flags? Okay, eight types is
border-line useable. Let's say he uses the 8-bits more effectively for
255
values. Is that too few types? How many types do you expect a compiler
to
handle? I.e., according to the Gold Parser, an LALR(1) grammar for C only
has about 200 rules. Are you saying a language, like C, should or could
have more types than rules needed to implement the entire language?


well, I will just say, it gets more than a little "fun" trying to cram more
than non-trivial types into a 32-bit integer...

remember, C also has things like arrays, funtion pointers, nestable
pointers, ... and all of these cases would need to be handled. one ends up
using lots of bits just to say what is in the other bits...

a string-based representation handles all of this much easier and much
cleaner (I am saying this since I have used both approaches in my
experience...). not like we need a full tokenizing syntax or support for
whitespace or anything (these would eat up performance).


more so, when it comes to a C compiler, the parser is the EASY part...


it also tends to end up requiring some centralized piece of code for
managing the typesystem, ... (basically, the bit twiddling patterns
quickly become way more complicated than what one would likely
want the logic for being spread all over the app...).

Well, you're good at C, post some what you think a "centralized" piece of
code for managing the "typesystem" would look like...


well, from past experience, mostly it consists of a whole bunch of predicate
functions:
int RIL_TypeSmallIntP(int type);
int RIL_TypeQuatP(int type);
....

as well as some arrays of structs for holding info about types (usually
reflecting the "overflow" case, for when the type to be stored goes beyond
what can be packed into the integer).


but, the main issue, is that with packed integers of this sort, if one does
not keep the code for all this in a central location (such as all in the
same library), then one quickly ends up with a mess...

likewise, trivial-seeming changes are prone to require non-trivial
alterations to the representation, ...


or, if you really want some example code:
<--
RIL_Block *RIL_LookupStructType(RIL_Context *ctx, int ty)
{
RIL_Block *tmp;

if(ty<0)return(NULL);

tmp=ctx->top->typ[ty&RIL_T_EXT_MASK];
if(tmp)return(tmp);

tmp=RIL_LookupStruct(ctx, ctx->top->tyn[ty&RIL_T_EXT_MASK]);
ctx->top->typ[ty&RIL_T_EXT_MASK]=tmp;

return(tmp);
}

void RIL_SizeTypeCount(RIL_Context *ctx, int ty, int asz, int *algn, int
*rsz)
{
RIL_Block *st;
int i;

if(ty<0)
{
*algn=0;
*rsz=0;
return;
}

if(asz<=0)asz=RIL_GetTypeItemCount(ctx, ty);

//by default, assume pointer
*algn=ctx->szptr;
*rsz=asz*ctx->szptr;

if(!asz)
{
// RIL_Error("RIL_SizeType: asz==0\n");
// asz=1;
// *(int *)-1=-1;
}

if(ty&RIL_T_PTR_MASK)
{
if(RIL_TypeWPtrP(ctx, ty))
{
*algn=16;
*rsz=asz*16;
return;
}

*algn=ctx->szptr;
*rsz=asz*ctx->szptr;
return;
}

if(ty&RIL_T_TY_EXT)
{
*algn=1; *rsz=0;

st=RIL_LookupStruct(ctx, ctx->top->tyn[ty&RIL_T_EXT_MASK]);
if(st)
{
if(st->rty>=0)
{
//errm: function type...
*algn=-1;
*rsz=-1;
return;
}

*algn=st->algn;
i=st->sz;
if(i%st->algn)i+=st->algn-(i%st->algn);

*rsz=asz*i; //by default, assume int/pointer
}else
{
printf("RIL_SizeType: Absent Struct/Union Def "
"'%s'(%X)\n",
ctx->top->tyn[ty&RIL_T_EXT_MASK], ty);
*(int *)-1=-1;
}
return;
}

if(ty&RIL_T_LIT_MASK)
{
return;
}

switch(ty&RIL_T_TY_MASK)
{
case RIL_T_INT:
case RIL_T_UINT:
case RIL_T_FLOAT:
case RIL_T_FIMAG:
case RIL_T_XI32:
case RIL_T_UXI32:
*algn=4;
*rsz=asz*4;
break;
case RIL_T_LONG:
case RIL_T_ULONG:
case RIL_T_DOUBLE:
case RIL_T_M64:
case RIL_T_VEC2:
case RIL_T_FCOMPLEX:
case RIL_T_DIMAG:
case RIL_T_XI64:
case RIL_T_UXI64:
*algn=8;
*rsz=asz*8;
break;

case RIL_T_POINTER:
case RIL_T_VARIANT:
*algn=ctx->szptr;
*rsz=asz*ctx->szptr;
break;

case RIL_T_FLOAT80:
*algn=4;
*rsz=asz*12;
break;

case RIL_T_CHAR:
case RIL_T_BYTE:
*algn=1;
*rsz=asz*1;
break;
case RIL_T_SHORT:
case RIL_T_WORD:
case RIL_T_FLOAT16:
*algn=2;
*rsz=asz*2;
break;

case RIL_T_INT128:
case RIL_T_UINT128:
case RIL_T_FLOAT128:
case RIL_T_M128:
case RIL_T_VEC3:
case RIL_T_VEC4:
case RIL_T_QUAT:
case RIL_T_M2F:
case RIL_T_DCOMPLEX:
*algn=16;
*rsz=asz*16;
break;

case RIL_T_M3F:
*algn=16;
*rsz=asz*48;
break;
case RIL_T_M4F:
*algn=16;
*rsz=asz*64;
break;

default: break;
}
return;
}

void RIL_SizeType(RIL_Context *ctx, int ty, int *algn, int *rsz)
{
RIL_SizeTypeCount(ctx, ty, 0, algn, rsz);
}

int RIL_GetTypeItemCount(RIL_Context *ctx, int ty)
{
RIL_TypeBody *ovf;
int al, sz, ty1;
int i, j, k;

if(ty<0)return(0);

if(RIL_TypeArrP(ctx, ty))
{
i=(ty&RIL_T_ARR_MASK)>>RIL_T_ARR_SHL;
k=(ty&RIL_T_IDX_MASK)>>RIL_T_IDX_SHL;

j=-1;
switch(i)
{
case 0: j=1; break;
case 1: j=k; break;
case 2: j=1; break;
default:
ovf=ctx->top->tovf[k];
j=1; k=ovf->adn-(i-2);
for(; k<ovf->adn; k++)
j*=ovf->adsz[k];
break;
}
return(j);
}

return(1);
}

-->


I'd think types in a type system might also well suited to a DFA... Isn't
it? A Moore type FSM's state is represented by a few integers...

But, if I needed eight or fewer types, I'd use some define's and if's
where
needed. What could be simpler?

#define TYPE1 0x01
#define TYPE2 0x02
#define TYPE3 0x04
#define TYPE4 0x08
#define TYPE5 0x10
#define TYPE6 0x20
#define TYPE7 0x40
#define TYPE8 0x80

if(type&TYPE1) /* zero always fails typecheck */
{
do_type1();
}
if(type&TYPE2)
{
do_type1();
}


If I needed 255 or fewer:

#define TYPE1 0x01
#define TYPE2 0x02
#define TYPE3 0x03
#define TYPE4 0x04
// ...
#define TYPE255 0xff

switch(type)
{
case TYPE1:
{
do_type1();
break;
}
case TYPE2:
{
do_type2();
break;
}
// ...
default: /* zero always fails typecheck, no TYPE0 */
break;
}


do you expect this to really scale?...


more so, what stops you from being like:
switch(*s++)
{
case 'a':
...
case 'b':
...
...
case 'C':
switch(*s++)
{
case 'f':
...
...
}
}

this is what usually happens in the string case...



IMO, some kind of "signature" representation is much more general
purpose,
be it either a string (like in the JVM or my framework), or a kind of
packed value array (more like what .NET uses).

Space vs. speed vs. complexity... What is required?


in the non-trivial case, strings are about the most dense representation
(typically far more dense than structs). for the trivial case,
strings-merging will usually keep the number of strings small, and there is
no real added overhead for having multiple pointers to the same string.

for the nested-switch case, the overhead of processing strings is
performance competitive with processing bit-packed integers...

strings also keep the complexity lower as well...


now, what of predicates?...
well, these are still fairly easy, as again they typically amount to a
nested switch...


and, how do I know all this:
I have used both approaches in practice...


however, in my recommendation, maybe one should refrain from exactly
representing the C typesystem.

Exactly. All types in C map onto "C bytes". I.e., you don't need types.
You only need to be able to 1) represent "C bytes", usually for modern
computing platforms as 8-bit byte in assembler, and 2) record or store the
size of the type in terms of "C bytes".


it is not quite so simple in practice...

in order to be actually useful for much, a good majority of the C
typesystems' semantics (and, very likely, a good deal more) need to be
representable...


instead, one should represent the physical types, rather than the C-level
semantics types.

You don't even have to do that. The size of (i.e., sizeof()...) the type
should be sufficient to represent the type, possibly with some exceptions
say float or complex.


things don't work this way...

a type is not so useful if it can't tell your integer from your float from
your pointer...
almost may as well just give a raw chunk of memory and a size...


for example, my signature strings were defined roughly in terms of the C
typesystem, and this creates an ugly problem: are 'l' and 'm' (signed and
unsigned long) 4 bytes or 8 bytes?

I think C compilers work best when sizes are multiples of 2 of each
other...
That is "unsigned char" is 8-bits, "unsigned short" is 16-bits, "unsigned
long" is 32-bits, "unsigned long long" is 64-bits. When you don't
represent
the common and intermediate sizes, it seems to me you create code
implementation problems.


meaning here?...


are pointers 4 or 8 bytes? ...

The C standard requires that pointers be lossless, but IIRC, doesn't
explicitly define much else... Obsolete computing platforms had some
strange pointers. The sizes of C types are "standardized" for 64-bits,
with
a few choices (LP64, ILP64, LLP64):
http://www.unix.org/whitepapers/64bit.html

(Of course, these may be affected by the SysV ABI's calling convention.)

IMO, pointers should be the best format for functional addressing by the
assembly code. e.g., for x86 32-bit, 32-bit since offsets are 32-bits.
For
x86 64-bit, 32-bits since the instruction set apparently only supports
64-bit offsets on a few instructions... If you use 32-bit pointers in
64-bit mode, this of course, will limit the address range of the a
specific
C context (e.g., object or application...) to 4Gb or require extra code to
adjust for 64-bit offsets. OTH, if you use 64-bit pointers in 64-bit
mode,
your address range isn't limited, but your code has to do extra work to
convert the 64-bit pointers for use with instructions that use 32-bit
pointers. I'm not sure you can win here. This is much like the issue of
implementing 16-bit values in 32-bit mode. There, you have a choice of
address and operand overrides to 16-bits from 32-bits (OpenWatcom ?...) or
logically and-ing 32-bit values to create 16-bit values (like GCC or GAS
does.)


it, however, varries between x86 and x86-64, and code designed to work the
same on both needs to deal with this variance...


for example:
a/h: signed/unsigned 8-bit byte;

For C, don't you mean "8-bit char"?...


yes, but byte is more specific here...
for example, Java and .NET define char as 16-bits, and my system also deals
with these systems, and so byte makes it more clear that I mean 8-bits...

s/t: signed/unsigned 16-bit short;
i/j: signed/unsigned 32-bit int;
l/m: signed/unsigned 64-bit long;
n/o: signed/unsigned 128-bit int;

Hmm, I guess it's time for me to go back to the C spec., since I thought
for
sure that "int" had to be a "char", "short", "long", or "long long", not a
distinct type by itself as you've done... (Am I wrong?)


int and long may have the same size (at least on x86, or in MSVC on x64, but
not in Linux on x86-64, PPC64, ...), but in any case typically int is
regarded as an independent type...


as for my notation:
it was originally derived more or less from the SysV ABI name-mangling
scheme...
other syntactic elements come from the Java VM and others...



Rod Pemberton







.



Relevant Pages

  • RE: passing string array to C++
    ... Does this apply to all marshalling of passing arrays of pointers? ... other challenges in this area such as passing an array of pointers to class ... > private static extern int MyAdapt(int gbl, stringstrings, int count ...
    (microsoft.public.dotnet.framework.interop)
  • Re: Problems with a program
    ... pointers and strings very well. ... The following is working source code: ... int main ...
    (microsoft.public.dotnet.academic)
  • Re: String Comparision
    ... code which would compare two strings using pointers. ... int comp ...
    (comp.lang.c)
  • Re: meaning of a dialet or implementation of a programming language
    ... It has to do with strings ... >> being pointers. ... >> $ cat test.c ... >> int main{ ...
    (comp.lang.lisp)
  • Re: array of pointers
    ... All the three pointers point to literal strings, ... in memory that doesn't belong to you and that could be in read- ... If you want to change the memory the pointers are pointing ... > int size = strlen; ...
    (comp.lang.c)