OT: Requesting C advice

Mon Jun 4 00:49:53 UTC 2007

Sorry, readers, this is getting rather long and still pretty OT.  Les, if 
you want to continue, perhaps we should take it offline?

Comments interspersed throughout.

On Sun, 3 Jun 2007, Les wrote:

> On Fri, 2007-06-01 at 19:02 -0400, Matthew Saltzman wrote:
>> On Fri, 1 Jun 2007, Les wrote:
>>
>>> On Fri, 2007-06-01 at 07:36 -0400, Matthew Saltzman wrote:
>>>>
>>>>> I know why their programs failed.  I also know that C uses a pushdown
>>>>                                                       ^some particular
>>>>                                                        implementations of
>>>>> stack for variables in subroutines.  You can check it out with a very
>>>>> simple program using pointers:
>>>>>
>>>>>    #include <sttlib.h>
>>>>>
>>>>>    int i,j,k;
>>>>>
>>>>>    main()
>>>>>    {
>>>>>        int mi,mj,mk;
>>>>>        int *x;
>>>>>        mi=4;mj=5;mk=6;
>>>>>        x=&mk;
>>>>>        printf ("%d  %d  %d\n",*x++,*X++;*X++);
>>>>>        x=&i;
>>>>>        printf ("%d  %d  %d\n",*x++,*x++,*x++);
>>>>>        i-1;j=2;k=3;
>>>>>        printf ("%d  %d  %d\n",*x++,*x++,*x++);
>>>>>  )
>>>>>
>>>>> Just an exercise, you understand.  Compile and run this with several C
>>>>> packages, or if the package you choose supports it, have it compile K&R,
>>>>> and try it.
>>>>
>>>> Of course, several constructs here are undefined, so there is no such
>>>> thing as "correct" or "incorrect" behavior.
>>>>
>>>> After correcting obvious typos and adding #include <stdio.h> so it would
>>>> compile, I got (using gcc-4.1.1-51.fc6 with no options):
>>>>
>>>>      $ ./a.out
>>>>      5  4  6
>>>>      0  0  0
>>>>      0  0  0
>>>
>>> OOPS, forgot to reset the X pointer between the last two print
>>> statements.  This bit of code is intended to show that globals are on a
>>> heap and locals are on a stack.
>>
>> Fixed that.  Now I get:
>>
>> $ ./a.out
>> 5  4  6
>> 0  0  0
>> 0  2  1
>>
>> But I confess, I don't see how this code proves your point.  It does
>> demonstrate that globals are initialized by default, though.
>>
> Actually, it doesn't.  And this is the problem.  Many people assume that

Note I said "demonstrate", not "prove".  For a math teacher, there's an 
important distinction 8^).

> because they obtained 0 one time, that the value was set in memory by
> some behind the scenes action of the compiler.  In fact the memory could
> have been set by any of a number of actions.  Some memory chips start
> with all data zero'ed (at the output, at the physical layer the
> construction is designed to minimize current drain and transitions, but
> that is another topic entirely.)  In that case, if power had been off
> all memory not explicitly set would be zero by default.  Another
> situation is when a memory checker runs, and leaves memory in a zero
> state (most do by design).  Thus if the compiler doesn't initialize
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> memory, and the memory where the code is placed has not been used in a
   ^^^^^^

But this is the key: In the absence of an explicit initializer, an 
ISO-compliant compiler *must* generate code to properly initialize static 
memory (not automatic or dynamic memory) just as if the default 
initializer had been provided explicitly.

Proper initialization means that floats and doubles must be initialized to 
0.0 and pointers must be initialized to the null pointer value, even if 
those bit patterns differ from all-bits-zero.  (calloc() must initialize 
its memory to all-bits-zero.)

If you don't believe me, how about the Usenet News comp.lang.c FAQ?
See http://c-faq.com/decl/index.html for a general discussion of 
allocation and initialization, but pay particular attention to 
http://c-faq.com/decl/initval.html:
-----------------------------------
comp.lang.c FAQ list  Question 1.30

Q: What am I allowed to assume about the initial values of variables and 
arrays which are not explicitly initialized?
If global variables start out as ``zero'', is that good enough for null 
pointers and floating-point zeroes?

A: Uninitialized variables with static duration (that is, those declared 
outside of functions, and those declared with the storage class static), 
are guaranteed to start out as zero, just as if the programmer had typed 
``= 0'' or ``= {0}''. Therefore, such variables are implicitly initialized 
to the null pointer (of the correct type; see also section 5) if they are 
pointers, and to 0.0 if they are floating-point. [1]

Variables with automatic duration (i.e. local variables without the static 
storage class) start out containing garbage, unless they are explicitly 
initialized. (Nothing useful can be predicted about the garbage.) If they 
do have initializers, they are initialized each time the function is 
called (or, for variables local to inner blocks, each time the block is 
entered at the top[2] ).

These rules do apply to arrays and structures (termed aggregates); arrays 
and structures are considered ``variables'' as far as initialization is 
concerned. When an automatic array or structure has a partial initializer, 
the remainder is initialized to 0, just as for statics. [3] See 
also question 1.31.

Finally, dynamically-allocated memory obtained with malloc and realloc is 
likely to contain garbage, and must be initialized by the calling program, 
as appropriate. Memory obtained with calloc is all-bits-0, but this is not 
necessarily useful for pointer or floating-point values (see question 
7.31, and section 5).

References: K&R1 Sec. 4.9 pp. 82-4
K&R2 Sec. 4.9 pp. 85-86
ISO Sec. 6.5.7, Sec. 7.10.3.1, Sec. 7.10.5.3
H&S Sec. 4.2.8 pp. 72-3, Sec. 4.6 pp. 92-3, Sec. 4.6.2 pp. 94-5, Sec. 
4.6.3 p. 96, Sec. 16.1 p. 386

[1] This requirement means that compilers and linkers on machines which 
use nonzero internal representations for null pointers or floating-point 
zeroes cannot necessarily make use of uninitialized, 0-filled memory, but 
must emit explicit initializers for these values (rather as if the 
programmer had).

[2] Initializers are not effective if you jump into the middle of a block, 
either with a goto or a switch. Initializers are therefore never effective 
on variables declared in the main block of a switch statement.

[3] Early printings of K&R2 incorrectly stated that partially-initialized 
automatic aggregates were filled out with garbage. 
-----------------------------------

> prior run, the variable space will be zero.  But if the program is
> deleted, and the memory filled with a nonzero pattern, and the code
> reloaded and compiled, the result may be much different, and can cause
> the program to crash.  When the program is saved to disk as an
> executable, the memory pattern that is saved is the last state of the
> code, whatever that was, and depending on how the code development
> system saves the code, the variables may or may not be set to zero at
> save time.  At load time, the memory will be initialized according to
> the data in the executable file.
>
>    So, while the compiler may initialize the variables, there are other
> issues that can impact the true state at run time, and therefore default
> state should not be relied on as the condition.

Yes, I know all this.  I've been programming since the 1960s and writing C 
since the 1980s.

>                                                  After all, you create a
> variable to store information, don't you?  Why would you not initialize
> it?

As I said, even if you are guaranteed that initialization will take place, 
it can't hurt and might help readability to do it explicitly anyway.

>      Anyway, while this has been a good discussion, I hope that you have
> begun to realize that all is not just in the compiler, but in the
> implementation, in the memory of the system, and in the methods of
> implementing and running code.

Sure.  But in this case, the compiler's guarantee trumps all that.

>                                 And by the way, Matthew, this is in no
> way criticizing you.  I have heard of you before, and will probably hear
> great things from you in the future.
>
>    Good luck, and good fortune.

And the same to you, sir.

>>>
>>>>
>>>> Was that what you were expecting?

But you still haven't answered this question, nor explained how your code 
demonstrates the difference between "the heap" and "the stack".

>>>>
>>>>
>>>>>
>>>>> I cannot vouch for every compiler, only Microsoft, Sun, and Instant C
>>>>> off the top of my head.  I have used a few other packages as well.  But
>>>>> any really good programmer NEVER relies on system initialization.  It is
>>>>> destined to fail you at bad times.
>>>>
>>>> How much effort are you willing to expend to defend against potentially
>>>> buggy compilers (as opposed to undefined or implementation-defined
>>>> behaviors)?  The Intel fdiv bug would seem to prove that you should NEVER
>>>> rely on arithmetic instructions to provide the correct answer.  There's an
>>>> economic tradeoff between protecting yourself from all conceivable errors
>>>> and actually getting work done.
>>>>
>>>
>>> There is a difference between implementation differences and hardware
>>> errors, which was the Microsoft error.  They had
>>> a bug in their silicon compiler that caused that IIRC.
>>
> I misspoke here, and said Microsoft, when I meant Intel.
>> I could just as easily reference some other obscure compiler bug or
>> implementation-defined behavior and make the same point.  The thing about
>> a standard is that there are clear requirements about what is
>> implementation-defined and what is not.  Static initialization in ISO C is
>> not one of those implementation-defined things.
>>
>> I will concede that explicit initializations--even to default
>> values--might be a useful self-documentation tool.
>>
>>>
>>>>>                                     One case is as has been pointed out
>>>>> here, that NULL is sometimes 0, sometimes 0x80000000, and sometimes
>>>>> 0xffffffff.  Even NULL as a char may be 0xFF 0xFF00 or 0x8000 depending
>>>>> on the implementation.  But strings always end in a character NULL or
>>>>> 0x00 for 8 bit ascii, if you use GNU, Microsoft, or Sun C compilers.
>>>>> They may do otherwise on some others.  It can byte (;-) you if you are
>>>>> not careful.
>>>>
>>>> In your source code, NULL is *always* written 0 (or sometimes (void *) 0
>>>> to indicate that it's intended to stand for a null pointer value, not a
>>>> NUL character value).  The string terminator character is *always* written
>>>> '\0'.  The machine's representation of that value is immaterial.  If you
>>>> type-pun to try to look at the actual machine's representation, your
>>>> program's behavior is undefined and you deserve what you get.  It's the
>>>> compiler's responsibility to ensure that things work as expected, no
>>>> matter what the machine's representation is.  (For example, '\0' == 0 must
>>>> return 1.)
>>>>
>>>
>>> '\0' is an escape forcing the 0, so of course this will be equal.
>>
>> OK.  But the main point is that it doesn't matter what bit pattern
>> represents a null pointer.  Your source code will always use the value 0
>> to represent it.  For example,
>>
>>  	int *p;
>>  	/* ...code that sets p... */
>>  	if ( p == 0 ) /* *not*  if ( p == 0x80000000 ) or
>>  				if ( p == 0xffffffff ) */
>>  	{ /* ...handle null pointer value... */ }
>>
> Actually this is one of the problem areas.  0 is an explicit, and is
> actually zero.  Only if using c++ and equality is overloaded for pointers
> will this work.  Otherwise the actual contents of p will be used to
> compare to 0 and that will fail in some systems.  Some compilers may
> deal with it as you expect, but I have not used one that did.

No, I may have been mistaken about ints and chars, but in a pointer 
context, 0 means a null pointer, whatever bit pattern represents it, and 
an ISO-compliant compiler *must* do the right thing. Again, the 
comp.lang.c FAQ covers null pointers in great detail 
(http://c-faq.com/null/index.html), but in particular, there's this 
(http://c-faq.com/null/machnon0.html):

----------------------------------
comp.lang.c FAQ list  Question 5.5

Q: How should NULL be defined on a machine which uses a nonzero bit 
pattern as the internal representation of a null pointer?

A: The same as on any other machine: as 0 (or some version of 0; see 
question 5.4).

Whenever a programmer requests a null pointer, either by writing ``0'' or 
``NULL'', it is the compiler's responsibility to generate whatever bit 
pattern the machine uses for that null pointer. (Again, the compiler can 
tell that an unadorned 0 requests a null pointer when the 0 is in a 
pointer context; see question 5.2.) Therefore, #defining NULL as 0 on a 
machine for which internal null pointers are nonzero is as valid as on any 
other: the compiler must always be able to generate the machine's correct 
null pointers in response to unadorned 0's seen in pointer contexts. A 
constant 0 is a null pointer constant; NULL is just a convenient name for 
it (see also question 5.13).

(Section 4.1.5 of the C Standard states that NULL ``expands to an 
implementation-defined null pointer constant,'' which means that the 
implementation gets to choose which form of 0 to use and whether to use a 
void * cast; see questions 5.6 and 5.7. ``Implementation-defined'' here 
does not mean that NULL might be #defined to match some 
implementation-specific nonzero internal null pointer value.)

See also questions 5.2, 5.10 and 5.17.

References: ISO Sec. 7.1.6
Rationale Sec. 4.1.5
----------------------------------

>>>
>>>>>
>>>>>    And since that is so, how are those variables initialized? and to
>>>>> what value?  What is a pointer set to when it is initialized?  Hint, on
>>>>> Cyber the supposed default for assigned pointers used to be the address
>>>>> of the pointer.  Again, system dependencies may get you.
>>>>
>>>> Pre-ANSI/ISO compilers might have initialized static memory to
>>>> all-bits-zero even when that was not the correct representation of the
>>>> default for the type being initialized.  ANSI/ISO compilers are not
>>>> allowed to do that.  The required default initializations are well
>>>> defined.  (This is the sort of thing that motivates the creation of
>>>> standards in the first place.)
>>>>
>>>>>
>>>>>    And those systems that used the first location to store the return
>>>>> address are not re-entrant, without other supporting code in the
>>>>> background.  I think I used one of those once as well.
>>>>
>>>> There's no requirement for re-entrancy in K&R or ANSI/ISO.  In fact
>>>> several standard library routines are known to not be re-entrant.
>>>>
>>>
>>> This is true, but knowing that the base code is not reentrant due to
>>> design constraints or due to hardware constraints makes the difference
>>> on modern multithreaded systems, where the same executable memory can be
>>> used for the program (if the hardware allows that).
>>
>> Sure, you need to know that you can compile re-entrant code if you need
>> it.
>>
>>>
>>>>>
>>>>>    PS.  A stack doesn't necessarily mean a processor call and return
>>>>> stack.  It is any mechanism of memory address where the data is applied
>>>>> to the current location, then the pointer incremented (or decremented
>>>>> depending on the architecture).
>>>>
>>>> But usually in the context of discussions about compiler architectures,
>>>> call stacks are exactly what is meant.
>>>>
>>>
>>> I am not sure that is true, because in some implementations, the data
>>> heap and stack are in the same segment of memory, while the runtime
>>> stack for the processor is somewhere else.  For high security systems
>>> running  this should be a requirement.  It prevents obvious means of
>>> inserting malicious code through variable initialization, and then stack
>>> manipulation.  I say should be, because it has been tossed around from
>>> time to time, but I am unsure if it has ever been formalized.
>>>
>>> One system I worked on looked like this:
>>>    init jump
>>>    heap
>>>    variable stack (push down)
>>>    program entrance
>>>    program
>>>    local libraries
>>>    relocation table
>>>    symbol table (if not removed)
>>>    machine stack
>>>
>>>    Unfortunately I no longer remember which system that was.  Just the
>>> fact that some standard libraries at that time would not run on it
>>> because they did manipulate the stack.
>>>
>>> Regards,
>>> Les H
>>>
>>
> I have said all that I know.  I hope it helps you all in the future.  C
> is wonderful, compact, close to the machine, and a good language,
> capable of expressing many many complex concepts.  I am sure there are
> other languages out there, and I have used a few, but I love C.

Hear, hear!  But like any complex language, it is not without its 
subtleties.

>
> Regards,
> Les H
>
>
>

-- 
 		Matthew Saltzman

Clemson University Math Sciences
mjs AT clemson DOT edu
http://www.math.clemson.edu/~mjs