Tuesday, March 2, 2010

Bug smashing on the stack, hehe

Ok, I've spent some time longer then expected working on my games net code, mostly because I both needed one nights sleep this week, and wanted proper IPv4/IPv6 support going on. Sunday I basically finished principal work on the net module, and completed a pair of test cases: a really simple client and server program. After that went smooth, I had planned to complete the finishing touches.

The problem: the server example segfaulted. Now those who know me, I consider it a plight upon my honour as a programmer, for any code that I've written to crash without due cause (i.e. not my fault). So I spent work on Monday taking a break: refining code quality and then porting it from unix to windows. During the nights final testing runs after that however, I had not solved the mysterious crash yet, and got the same results. I switched over to my laptop and recompiled with debugging symbols, only to find that my program worked as normal, only dying with a segmentation violation once main() had completed, and the programs shutdown now "Beyond" my control.

My first thought of course, was "Fuck, I've probably screwed my stack"[1], a quick Google suggested I was right. I also noted that turning on GCCs stack protection option prevented the crash, so did manually adding a pointer to about 5 bytes of extra character data to the stack. Everything to me, looked like the the return address from main was being clobbered, or imagine invoking a buggy function that tries to return to somewhere other then where you called it from. Before going to bed, I narrowed it down to the interface for accept(). Further testing showed that omitting the request for data and just claiming the new socket descriptor, caused the crash to end but still, some funky problems with an invalid socket. Inspection of the operation also showed that the changes were well within the buffers boundary, yet it as still causing the crash. So I finished the remaining stuff (i.e. free()ing memory) and went to bed.

Having failed to figure it out this afternoon, and starting to get quite drowsy, I played a trump card and installed Valgrind. It's one of those uber sexy tools you dream of like driving a Ferrari, but don't always find a way to get a hold of lol. For developers in general, I would say that Valgrind is the closet thing to a killer app for Linux developers, as you are ever going to get. In my problem case however, Valgrind wasn't able to reveal the source of the problem, only that the problem was indeed, writing to memory it I shouldn't be screwing with \o/.

So putting down Valgrind and GDB, and turning to my favourite debugging tool: the human mind. It was like someone threw on the lights, once I saw the problem. Man, it's wonderful what a good nights sleep can do!

Many data structures in Stargella are intentionally designed so that they can be allocated on the stack or heap as needed, in order to conserve on unnecessary dynamic memory allocation overhead, in so far as is possible. So the example server/client code, of course allocated its per socket data structures right inside main(). Because there is no real promise of source level compatibility between systems, the networking module is implemented as a header file, having function prototypes and a data structure representing a socket; which contains an opaque pointer to implementation specific details, itself defined in unix.c and windows.c, along with the actual implementations of the network functions. Because of that,the behaviour of accept() can't be emulated. Net_Accept() takes two pointers as parameters, first to a socket that has been through Net_Listen(), and secondly to another socket that will be initialised with the new connection, and Net_Accept() then returns an appropriate boolean value.

All the stuff interesting to the operating systems sockets API, is accessed through that aforementioned  pointer, e.g s->sock->whatever. What was the big all flugging problem? The mock up of Net_Accept(), was originally written to just return the file descriptor from accept(), allowing me to make sure that Net_Listen() actually worked correctly. Then I adjusted it to handle setting up the data of the new socket, in order to test the IPv4/IPv6 indifference and rewrite the client/server examples using Net_Send() and Net_Recv(), and that's when the crashes started.

I forgot to allocate memory for the sub structure before writing to the fields in it, resulting in some nasty results. When I say that I don't mind manual memory management, I forget to mention, that programming while deprived of sleep, is never a good idea, with or without garbage collection ^_^.

Now that the net code is virtually complete, I can hook it into my Raven Shield server admin tool, which will make sure to iron out any missing kinks, before it gets committed to my game. Hehehe.

No comments:

Post a Comment