Re: C-AUX / C and binary portability



On 1/16/2011 4:51 PM, Rod Pemberton wrote:
"BGB"<cr88192@xxxxxxxxxxx> wrote in message
news:igvlra$7sj$1@xxxxxxxxxxxxxxxxxxxx
basically, this idea is related to an idea I am currently considering
calling C-Aux, but is actually nothing terribly new in my case (ideas
have been floating around for a while, and I already have a "header-free
C" mode although its purpose and implementation details differ).

technically, this is likely to be the default mode for compiling C to
the BGBScript2 VM (implementation of this VM still ongoing, but I am at
least making *some* progress).

so, purpose:
C-Aux will be a C variant (or C-like-language) capable of producing
binary-portable code;
it will aim for common-case source-code compatibility with C, albeit
there will be some minor syntactic adjustments, and some semantics
changes.

a cost will be that it will not be strictly compatible with the ANSI and
ISO standards.

however, with care code should be able to work in both.


Yep, and you can #include a file with #define's which will convert any
syntax differences back to C. Simple. Clean. Effective. No one needs
know of the semantic differences if the C code functions correctly.


well, in this case, #include and #define will no longer behave exactly the same, since their operation will be delayed until link time.

the result is partly that one needs to be able to parse with none of the declarations (either macros or typedefs) known in advance.



so, likely changes:
trigraphs, digraphs, and K&R declarations will not be supported;

Good, that eliminates some of C's parsing issues. Those features are rarely
used. There are many others that aren't used much also. Look at the C
grammar. Cross off everything you don't use or understand... Nope, don't
leave those. Cross those off also. ;-)


I am not aiming for minimalism here, rather, reworking things so that they can be parsed unambiguously with incomplete info.

trigraphs, digraphs, and K&R declarations are, in many C compilers, currently at the level where they will cause compiler warnings (in case an unwary programmer stepped on them by accident).

my C compiler already dropped K&R declarations (mostly as I was too lazy to bother implementing a feature which hardly any code written in the past 20 years uses...), thus far to no ill effect.


Assignment operators are two char's, except assignment '='. Always using
two char's for assignment operators and other operators could simplify
parsing there. And, if you're using typedef's, the syntax needs a keyword
for use of a typedef, like struct's or union. Such a keyword will eliminate
the IDENTIFIER versus TYPEDEF NAME grammar conflict.


actually, typedef will remain as-is, however declaration parsing will be handled more like in Java and C#...

namely, the parser will see, say:
T N()
{
}

and parse it as a function, as there is nothing else which is syntactically valid here...

T N;
can also be inferred to be a variable declaration by similar logic.

there is still the issue:
T *N;

which is ambiguous, but in this case, it will just assume that the declaration was intended.

T *N, *M;

could be a declaration, or else a multiplication, a comma operator, and a dereference.

BGBScript2 (following after BGBScript, which used the same syntax here) had addressed this by using "*Type" instead:
var x:*int; //BGBScript
*int x; //BGBScript2

but, this can't be used for C (as this would break compatibility), so making the parser able to make a good guess will be needed instead.
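
as a rough illustration of the "good guess" idea, here is a minimal sketch that accepts statements shaped like "T N;", "T *N;" or "T *N, *M;" as declarations without knowing whether T is a typedef. the function name looks_like_declaration and its helpers are made up for this example, and it assumes keywords such as "return" have already been filtered out by the caller; it is not the actual parser.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip whitespace and return a pointer to the next non-space char. */
static const char *skip_ws(const char *s)
{
    while (*s && isspace((unsigned char)*s)) s++;
    return s;
}

/* Consume one identifier; return a pointer past it, or NULL if none. */
static const char *eat_ident(const char *s)
{
    s = skip_ws(s);
    if (!(isalpha((unsigned char)*s) || *s == '_')) return NULL;
    while (isalnum((unsigned char)*s) || *s == '_') s++;
    return s;
}

/* Consume zero or more '*' tokens. */
static const char *eat_stars(const char *s)
{
    s = skip_ws(s);
    while (*s == '*') s = skip_ws(s + 1);
    return s;
}

/* Hypothetical heuristic: 1 if stmt has the shape
 *   IDENT ('*')* IDENT (',' ('*')* IDENT)* ';'
 * i.e. it is assumed to be a declaration, 0 otherwise. */
int looks_like_declaration(const char *stmt)
{
    const char *s = eat_ident(stmt);      /* the presumed type name T */
    if (!s) return 0;
    for (;;) {
        s = eat_stars(s);
        s = eat_ident(s);                 /* a declared name N */
        if (!s) return 0;
        s = skip_ws(s);
        if (*s == ';') return *skip_ws(s + 1) == '\0';
        if (*s != ',') return 0;
        s++;                              /* another declarator follows */
    }
}
```

with this rule "a * b + c;" is rejected (so it stays an expression), while "T *N, *M;" is taken as a declaration even though it could in principle be a multiplication and a comma operator.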


"#include" will perform an "import" rather than a textual inclusion;

What does that mean?

"import" implies you're keeping each one separated from the other. Because
of the pre-processor, the current code affects the #include'd code and vice
versa, e.g., effects of #define or #undef or # or ## . The include must be
preprocessed within the context of the current code. Any code following the
include must be preprocessed within the context of the prior code and
include.


I meant what I said, and yes, they ARE being kept separate...


this is how standard C works, but I am actually going to make the most drastic divergence from C on this point:
the textual inclusion will not actually take place.

instead, all of this will be "remembered" and likely applied at the bytecode linkage stage (yes, including "#ifdef").

#if 0
....
#endif

may again be a special case, since it is both known false and also because this is often used as a means to bulk exclude unused or incomplete code (which is not necessarily correct), meaning it will still need to be handled as before.


the reason is that, if one does in fact include headers and expand macros here, then the generated code will depend on whatever macros/constants/... are present in the headers (causing a dependency on the particular OS library).

however, if this is delayed, one can compile the code right then and defer the handling of macros until dynamic-link time (during application start-up), at which point the target OS is known, and so the OS dependency is avoided.
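
a minimal sketch of what "delayed until link time" could look like for constant macros: the portable module only records the *name* of the constant, and the loader on the target looks up the value the host's own headers give it. the table host_consts and the function resolve_constant are hypothetical names for this example (a real table would presumably be generated from the per-system compiled headers), and only constants defined by ISO C are used so that it builds anywhere.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>

/* One entry per constant the host headers define (hypothetical layout). */
struct host_const {
    const char *name;
    long        value;
};

/* Built on the target system, so the values are the host's own. */
static const struct host_const host_consts[] = {
    { "SIGSEGV", SIGSEGV },
    { "SIGILL",  SIGILL  },
    { "EDOM",    EDOM    },
    { "ERANGE",  ERANGE  },
};

/* Resolve a constant referenced by name in the portable module.
 * Returns 1 and stores the value in *out on success, 0 if unknown. */
int resolve_constant(const char *name, long *out)
{
    size_t i;
    for (i = 0; i < sizeof host_consts / sizeof host_consts[0]; i++) {
        if (strcmp(host_consts[i].name, name) == 0) {
            *out = host_consts[i].value;
            return 1;
        }
    }
    return 0;
}
```

the point is just that the number for "SIGSEGV" never appears in the distributed binary, only the name does.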


granted, one does sort of have to turn C inside-out to make this work...


macro expansion will take place at link time (as a result, any used
macros need to be syntactically valid, and macros may not be allowed to
alter lexical structure);

What? Is this after you've dumped the syntax tree? How can you expand the
macro's without having the code or syntax tree? Why would you have them
available at link time? You've lost me...


during compilation, macros may be treated vaguely similarly to inline functions.

a non-destructive constant macro may be at this point handled as a constant variable (compare to Java's "static final x;" semantics). if a JIT is used, the JIT may then treat the variable as if it were a constant.

a destructive macro may then be handled similarly to a mix of an inline function and a closure (however, the exact semantics still need to be worked out, since an unknown macro could interfere with the ability to safely scrub symbols).


the macro itself would be represented as an inlinable generic function using dynamic scoping. technically, the bytecode does preserve some info related to scope-resolution, since types, macros, methods, ... all have to be resolved by the linker.


so, imagine:
#define FOO(x) (x+y)

to be more like:
static inline auto FOO(auto x) { dynamic auto y; return x+y; }
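
for reference, this is the behavior in ordinary C that the "dynamic auto y" lookup would have to reproduce: the free name y in the macro body binds to whatever y is visible where the macro is used. the two function names below are invented for the example.

```c
/* The macro from above: 'y' is not a parameter, so it is picked up
 * from the scope in which FOO is expanded. */
#define FOO(x) (x+y)

/* Here FOO sees a local y of 10. */
int use_with_small_y(void)
{
    int y = 10;
    return FOO(1);
}

/* Same macro, different y at the use site. */
int use_with_large_y(void)
{
    int y = 100;
    return FOO(1);
}
```

a link-time expansion that resolved y in some fixed scope would give the same answer for both functions, which is why the function-like model needs the caller's scope.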


the behavior of conditional compilation (#if/#ifdef/#ifndef and #endif)
is likely to differ somewhat (it may limit itself only to
syntactically-valid forms);

Ok. If you're doing the pre-processor also, do you need to nest
pre-processor conditionals? Or, can you get by with only one level? E.g.,

#if 1
/* stuff 1st section*/
#if 1
/* nested 2nd level*/
/* stuff 2nd section*/
#endif
/* other stuff */
/* stuff 3rd section*/
#endif

E.g., that can be unnested into three non-nested #if-#endif sections.


nothing prevents nested #if/#endif pairs, either for C or for C-Aux...


the main issue however, is that:
the code needs to be syntactically valid whether or not the #if/#endif is true;
it may not alter lexical program structure or basic syntax;
a complex syntactic form may not straddle such a block;
....

in this case, #if/#endif can be more regarded as an "if(...) {...}" block which is handled during linkage, and may also be used at declaration scope.

in the bytecode, declaration-scope conditionals are likely to be handled via attributes.

#ifdef FOO
void foo() { ... }
void bar() { ... }
#else
void baz() { ... }
#endif

being handled as:
[ifdef(FOO)] void foo() { ... }
[ifdef(FOO)] void bar() { ... }
[ifndef(FOO)] void baz() { ... }
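
a minimal sketch of how a linker might act on such attributes, assuming each declaration carries at most one ifdef/ifndef condition and the set of macros defined on the target is known at link time. the struct layout and the names decl, cond_kind and decl_is_live are assumptions for illustration, not the actual bytecode format.

```c
#include <stddef.h>
#include <string.h>

/* Kind of condition attached to a declaration (hypothetical encoding). */
enum cond_kind { COND_NONE, COND_IFDEF, COND_IFNDEF };

struct decl {
    const char    *name;   /* declared symbol, e.g. "foo" */
    enum cond_kind kind;   /* attribute recorded by the compiler */
    const char    *macro;  /* macro the attribute tests, or NULL */
};

/* Is 'macro' among the names defined on the target? */
static int is_defined(const char *macro,
                      const char *const *defs, size_t ndefs)
{
    size_t i;
    for (i = 0; i < ndefs; i++)
        if (strcmp(defs[i], macro) == 0) return 1;
    return 0;
}

/* Decide at link time whether a declaration should be kept. */
int decl_is_live(const struct decl *d,
                 const char *const *defs, size_t ndefs)
{
    switch (d->kind) {
    case COND_IFDEF:  return is_defined(d->macro, defs, ndefs);
    case COND_IFNDEF: return !is_defined(d->macro, defs, ndefs);
    default:          return 1;   /* unconditional declaration */
    }
}
```

with FOO defined on the target, foo and bar stay and baz is dropped; without it, the reverse.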

within a function, likely the contents of an ifdef block will be folded off into an alternate code-block which inherits the parent scope (similar to a lexical closure), but may only be called conditionally.

void foo()
{
#ifdef FOO
...
#endif
}

being handled more like:

[ifdef(FOO)] void foo_magicgarbage()
{
...
}
void foo()
{
ifdef(FOO)foo_magicgarbage();
}


note that closures are similarly split off into separate code-blocks, where both blocks may share the same local variables, ...


arbitrary restriction:
it will likely be impossible to use a 'goto' into or out of an #ifdef block if implemented this way.


many C++ keywords are likely to be reserved (C-Aux would not differ
between C and C++ mode, although full C++ support is unlikely);

Reserved, but unused... Will that help encourage C++ use? Or, is that to
reject C++ code? Or, is that to allow C++ code to pass-thru unharmed?


basically, it is so that I can freely implement some parts of C++ without having to add a different language mode, and because as I see it, any C code which uses C++ keywords as identifiers is likely erroneous anyways, even if the C compiler still accepts it.


ordering restrictions will be placed on modifiers and types (modifiers
will be required to come before the type, and only a single type is
allowed).


I think you could eliminate most of C's "modifiers". I'd eliminate all of
them. It'll simplify parsing. It'll simplify the type system by reducing
the number of types, etc. How often do you need "const" or "static" etc.?
They don't really add much to C. Do you really plan to implement "volatile"
or "register" or "restrict" or "extern"? etc.


these are needed because C code generally uses them, as do pretty much all other C family languages.

now, if my VM doesn't support "register" what do I do? I can ignore it.


the type restriction will mean, for example:
unsigned int x; //ok
int unsigned x; //invalid
int static x; //invalid


Eliminate "int". Most C code won't have the "int" anyway since it's
optional... Eliminate "static" Eliminate "unsigned". Maybe use uint8_t,
uint16_t, etc. Two space separated words describing a single type
complicates parsing, e.g., "long long" is "long" and "long" but a
"longlong"... See, "uint16_t" is a single word. It's C99 compatible also.
A simple #ifdef can insert those types on a system without them.


these are drastic changes to the language, and would break C source compatibility in a major way.

ideally, most code should not actually notice that it is using an implementation with many fundamental differences, but this does mean some level of "faking it"...


long long y; //probably ok (special case)

The space complicates parsing "long" vs. "long long"... Most code will be:

long y;
long long x;

Do you require a space between long and long?


C requires it, as it is "long long" in the C99 standard, and "longlong" would be a different token.

likewise, it is similarly needed to parse "long double", ... as well.


Does the parser resolve the two long's to a "longlong" type?


well, internally, I call it "llong" in my existing C frontend anyways, but then in the backends it gets changed around:
'long' is (often) remapped to 'int' (depending on target) and 'llong' is simply called 'long', so it is more in line with Java and C# conventions.

but, yeah, a C-Aux parser would resolve all of these special cases to single type names.


A space-delimited syntax like Forth, or mostly so, will simplify lexing and
parsing.


no, I am keeping standard C-family token and parsing behavior.

technically, since both my existing C compiler, and also my Java, C#, and BGBScript2 frontends, all share the same parser, it is really little more than adding another language mode (where most of this works by enabling and disabling language-specific parts of the parser, ...).


int long long y; //invalid
long int long y; //invalid
long long int y; //probably ok (special case)


Ditto on spaces and "int"...

special cases would be used for cases where a type-name may span
multiple tokens, in which case the whole thing will be recognized as a
unit (all other type names will be required to be a single token).
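
a minimal sketch of the special-case handling, assuming the type spelling has already been normalized to single spaces. "llong" is the internal name mentioned earlier in the thread; "ldouble", the table, and the function name canonical_type are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Multi-token spellings recognized as a unit (hypothetical table). */
struct type_alias {
    const char *spelling;
    const char *canonical;
};

static const struct type_alias aliases[] = {
    { "long long",     "llong"   },
    { "long long int", "llong"   },
    { "long double",   "ldouble" },   /* "ldouble" is an assumed name */
};

/* Map a type spelling to a single-token name.
 * Returns NULL for multi-token spellings that are not special-cased. */
const char *canonical_type(const char *spelling)
{
    size_t i;
    for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
        if (strcmp(aliases[i].spelling, spelling) == 0)
            return aliases[i].canonical;
    if (strchr(spelling, ' ') != NULL)
        return NULL;        /* e.g. "int long long": rejected */
    return spelling;        /* already a single token */
}
```

so "long long int" comes out as "llong", while "int long long" and "long int long" fall through to the rejection case.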

in this mode, the preprocessor is likely to generate different output
for directives.

for example:
#include<foo.h>
may be output as:
__declspec(include("foo.h"));


?

It's not C.
It's not C++.
What is that?

MSVC++ syntax extensions... So, it's just as bad as GCC's __attribute__().

Doesn't C specify #pragma for this? (yes). Couldn't you use a custom
pre-compiled macro, similar to __TIME__ or __LINE__ etc., for this? You
could also try a pre-processor function like "defined()".


well, it will appear as "#include" in the source, but the preprocessor would spit out a "__declspec()" for its output (instead of textually inlining the code and doing its usual thing...).


my parser supports "__declspec(...)", "__attribute__(...)", and also "[...]" syntax, all of which are treated as mostly equivalent.

my own compiler also generally takes up MSVC's "__declspec()" syntax, which is used for a number of internal tasks (declaration annotations, ...).

"[...]" syntax is not supported by default in C-mode, but is the default attribute syntax in BGBScript2, and is what MS used in C#.


Java uses "@attr" and "@attr()" for the same purpose...
this would look more like:
@include("foo.h");

more likely, though, I will stick with "__declspec()", since it is more established in C land...

the main syntax alteration then would be to allow an attribute which is not otherwise a part of a declaration.


as noted, macros may not affect structure:
<--
#define FCN(ty, name, args) ty name args
#define FBEGIN {
#define FEND }

FCN(int, main, ())
FBEGIN
...
FEND
-->

will be invalid as it uses the preprocessor in a way which alters the
lexical structure of the program.


Uhhh...

#define square(x) ((x)*(x))

That changes the lexical structure of the program... Everything that was
"square(x)" will be "((x)*(x))". What pre-processor functionality doesn't
affect lexing? Isn't that why the pre-processor is outside the scope of the
C language?


this macro is different, since one can parse the invocation as if it were a function call, and expand the macro as stated before.

the restriction would be against macros which would do more complex operations (inserting or removing tokens which would affect the underlying structure of the code they are used on).

these will not be allowed as, effectively, they can't be delayed until link time.
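
one cheap check along these lines, as a sketch only: a macro body whose brackets do not balance (like FBEGIN or FEND above) necessarily changes the structure around it, so it can be rejected early. this is a necessary test, not a sufficient one (FCN above is balanced but still structural), it ignores brackets inside string or character literals, and the function name macro_body_is_balanced is invented for the example.

```c
/* Return 1 if every (, {, [ in the macro body is closed in order
 * within the body itself, 0 otherwise. */
int macro_body_is_balanced(const char *body)
{
    char stack[64];   /* open brackets seen so far */
    int depth = 0;

    for (; *body; body++) {
        char c = *body;
        if (c == '(' || c == '{' || c == '[') {
            if (depth == (int)sizeof stack)
                return 0;             /* too deep for this sketch */
            stack[depth++] = c;
        } else if (c == ')' || c == '}' || c == ']') {
            char open = (c == ')') ? '(' : (c == '}') ? '{' : '[';
            if (depth == 0 || stack[--depth] != open)
                return 0;             /* closes something it never opened */
        }
    }
    return depth == 0;                /* nothing left open */
}
```

"((x)*(x))" passes, while "{" and "}" on their own fail, which matches the square() versus FBEGIN/FEND distinction.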


rationale:
the reason for these changes is to allow compiling the source in a way
which will not depend on the contents of system headers.


Interesting, more local scope ... ? Doesn't that require some minimum
functionality to be moved into the C compiler, e.g., internal versions of
memcpy(), malloc(), free(), exit(), etc... ?


not particularly...

the linker will be responsible for a lot of this, and in many cases will just use the functions provided by the C library.


in effect, the compiler will see.
it will not be until link time that it will actually be known what types these functions return as well (during compilation, an incomplete typesystem is used...).


this is partly because even my prior C compiler was descended from my BGBScript compiler, which internally was based around using dynamic types and type-inference...

in this case, I am still making use of type-inference, but making it a good deal more explicit in the design.


at present, even portable C source-code in a portable IL would be
rendered non-portable WRT the binary output, due to the differences
between systems regarding the contents of system headers.

Why? If all binary output uses the same small set of functions, can't they
be adjusted for the binary arrangement of the host system? Data transfer
from one to the other will be incompatible, e.g., big-endian vs.
little-endian.


well, it is because there are many other things which are OS-specific and located in headers:
various flag constants (MAP_ANONYMOUS, PROT_READ, ...);
errno error numbers (EINVAL, ...);
signal numbers (SIGSEGV, SIGPIPE, SIGILL, ...);
which defines are defined;
certain C-library functions being macros, and/or mapping to different forms;
....


all of these differ from one OS to another, and code which depends on these *can't* be binary portable.

so, unless one wants to provide all of their own runtime libraries, it is necessary to deal with all of this changing from one target to another.

even something as simple as:
#ifdef _WIN32
....
#endif
#ifdef linux
....
#endif

can't produce the correct results if "#ifdef" handling is done at compile-time, since it will see the define at the time it is being built, and not at the place where it is being used.

the only other real option would be to rebuild for every target or distribute code in source form, as is often done.


effectively, headers will be compiled separately from the source code,
and this will need to be done per-system.

Ok, apparently, that's your goal: How to isolate the libraries or code
pieces so they won't affect the compilation of the code. Step 1) eliminate
the pre-processor. That eliminates much functionality that is usually
considered by programmers to be part of the C language. Step 2) eliminate
"extern". Step 3) eliminate "static" ...


I still keep the preprocessor (in a sense), although its operation is a bit "alien", and it is "gimped" to some extent (it will mostly just be a cosmetic emulation of the C preprocessor, so sort of like the C# preprocessor, but with a little more attempt to emulate C's preprocessor...).

eliminating extern and static is not needed.


system headers would need to
be compiled in a standard-C mode, whereas application-specific headers
would be compiled in C-Aux mode.


Why do you need different behaviors or semantics?


standard C mode is needed for system headers because, well, they are already non-portable, and are far more likely to depend on having correct C preprocessor semantics (I know what sort of bizarre stuff goes on in system headers). however, pre-compilation is needed so that the VM can see them.

application headers would need C-Aux mode to not depend on system-specific stuff, since it would somewhat defeat the purpose of using a portable-mode on the source modules if not doing the same for their headers.


support for C++ features is still non-finalized


Why do you need any C++ features for a language as powerful as C? AIUI, C
can do everything C++ can. C++ adds high-level abstractions. Do you really
intend for them to work with a low-level VM?


this will be used with the BGBScript2 VM, which already provides a lot of OO-related facilities (also garbage collection, closures, ...), so it would not ask much to support them, and if one does, one may almost as well use C++-like syntax.

so, some C++ features would be used mostly as they would map against analogous BS2 features.


now, the bigger problem would be if anyone thinks they are dealing with a true C++, and then is left to ask questions like "why does my code using diamond or lattice inheritance not work?" or "why does trying to use STL or Boost make everything blow up?", leaving the only real answer being "because this isn't really C++...".


, however, likely:
only single inheritance would be allowed (in addition to interfaces);
certain abstract base classes will be assumed to be interfaces (abstract
class, all methods virtual, ...);
the use of templates may be restricted (specifics to-be-decided);
it is unclear if it would be possible to access native C++ code (this
opens up a number of awkward issues...).

or, IOW, it would be a cheap cosmetic imitation...


Eliminate the C++. That eliminates headaches...


probably a fair option as well...



