mysterious memory corruption, very confused



ruby 1.8.7-p22, OS X 10.4.mumble, PostgreSQL 8.3.1, ruby-pg 2008-03-18.

I get random data corruption when trying to execute queries.

The data corruption comes and goes VERY unpredictably. I've narrowed it
down to a small chunk of the pg.c module, except I don't understand the
ruby interpreter well enough to say much more.

Here's what happens:

I pass a whole bunch of arguments in to a prepared statement. In pg.c,
that leads to a loop through nParams, reproduced here in case it means
anything to someone.

    for(i = 0; i < nParams; i++) {
        param = rb_ary_entry(params, i);
        if (TYPE(param) == T_HASH) {
            param_value_tmp = rb_hash_aref(param, sym_value);
            if(param_value_tmp == Qnil)
                param_value = param_value_tmp;
            else
                param_value = rb_obj_as_string(param_value_tmp);
            param_format = rb_hash_aref(param, sym_format);
        }
        else {
            if(param == Qnil)
                param_value = param;
            else
                param_value = rb_obj_as_string(param);
            param_format = INT2NUM(0);
        }
        if(param_value == Qnil) {
            paramValues[i] = NULL;
            paramLengths[i] = 0;
        }
        else {
            Check_Type(param_value, T_STRING);
            paramValues[i] = StringValuePtr(param_value);
            paramLengths[i] = RSTRING_LEN(param_value);
            fprintf(stderr, "%d: %p -> %s\n",
                    i, paramValues[i], paramValues[i]);
        }
        if(param_format == Qnil)
            paramFormats[i] = 0;
        else
            paramFormats[i] = NUM2INT(param_format);
    }

    for(i = 0; i < nParams; i++) {
        if (paramValues[i] && !strcmp(paramValues[i], "+rG")) {
            fprintf(stderr, "got a +rG %p in slot %d\n",
                    paramValues[i], i);
            abort();
        }
    }

Obviously, I added the printfs.

Running this, I get an abort:

0: 0x425af0 -> 102
2: 0x4265c0 -> true
3: 0x4265d0 -> 40.48324
4: 0x4265e0 -> -88.09905
5: 0x432820 -> 102
6: 0x422530 -> 65.2579241765071
7: 0x474ab0 -> 2008-06-17
8: 0x4258f0 -> 14:42:36
got a +rG 0x4265c0 in slot 2

So! Somewhere between the pass for slot 2 (out of 13 or so) through the
first loop, and the second loop, the string at 0x4265c0 has gotten
overwritten with garbage.

This is not specific to boolean data; I have also had it happen on strings,
but the boolean data was a bit easier to track down. This is a pure
heisenbug, which moves to new data depending on things like "the contents
of ARGV".

Can anyone give me a hint as to what I should be looking at? I tried turning
down compiler optimizations, to no noticeable effect. (It moved, but it
moves any time anything changes.) The "T_HASH" case is probably irrelevant,
as all 12 arguments are strings. About all I can think of is that, perhaps,
rb_obj_as_string is allocating strings which are getting garbage collected
before the end of the routine?

I'm afraid I can't make this bug report much more useful; I don't really
understand the code. I don't know how the garbage collector works, either.

.... But interestingly, wrapping the call to the API function this wraps
in GC.disable/GC.enable makes the bug go away. I'll annotate my rubyforge
bug, but if anyone here can tell me what I should be doing properly to
tag these things not to be collected until this function is done, I'd love
to know.
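
A minimal sketch of that workaround in Ruby, for anyone hitting the same
thing. The helper name without_gc is made up for illustration and is not
part of ruby-pg; the only real calls are GC.disable and GC.enable. The
ensure clause turns the collector back on even if the query raises, and
the helper leaves GC off if it was already off on entry.

```ruby
# Hypothetical helper (not part of ruby-pg): run a block with the
# garbage collector paused, then restore the previous GC state.
def without_gc
  # GC.disable returns true if GC was already disabled, false otherwise.
  was_disabled = GC.disable
  yield
ensure
  # Only re-enable if this call was the one that turned GC off.
  GC.enable unless was_disabled
end

# Usage: put the prepared-statement call inside the block.
# The value of the block is returned unchanged.
result = without_gc do
  # the call that passes the parameter array would go here
  :query_result_placeholder
end
```

This only hides the symptom while the call runs; it does not answer the
question of how pg.c itself should keep those strings from being
collected.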

--
Copyright 2008, all wrongs reversed. Peter Seebach / usenet-nospam@xxxxxxxxx
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!