Last night I was up coding until 7 AM. A lot of the time was spent fixing compiler errors, and then tracking down a bug caused by using the wrong variable in one place, which caused page table entries to get overwritten… which caused problems later.
Today, it was my turn to have a huge bug in my code, and mine was a conceptual error (although, to be fair, it was one both of us missed).
We decided to start the first two tasks for the kernel by writing a kernelland fork function, which works much like the user version. The user version copies over all of memory except for kernel memory, creating a new kernel stack for the child and putting only the necessary stuff on there (IRET foo and registers).
I did the same thing for the kernel version, except I don’t copy any memory (since there is nothing in userspace yet). I create a new kstack and copy over only the necessary values… the return address (to return to the previous calling function) and registers.
See the problem yet?
Turns out I forgot to take into account that execution in kernelland uses the kstack… and so continuing execution in kernelland (as we do by using the same return address) needs the caller's frames to be on the stack. As a result, we kept getting the strangest errors where memory was being overwritten in various ways (since the function we return to expects certain things at esp+16, and expects ebp+4 to have arguments and such). So yeah. That bug took a good 3 hours to track down. Oops.
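For the curious, here is a rough sketch of what a fix could look like, written as a plain C model rather than our actual kernel code. It assumes a classic x86-style frame layout (saved frame pointer at the frame pointer, return address one word above it) and a chain that ends with a saved value of 0. The names kstack_t, kstack_clone and KSTACK_WORDS are made up for illustration. The idea: copy the live part of the parent's kernel stack, then rebase the saved frame pointers so they point into the child's stack instead of the parent's.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KSTACK_WORDS 64  /* hypothetical kernel stack size, in machine words */

typedef struct {
    uintptr_t words[KSTACK_WORDS];  /* higher index = higher address */
} kstack_t;

/*
 * Clone the live part of the parent's kernel stack into the child's.
 *   sp_idx: index of the parent's stack pointer (lowest live word)
 *   fp_idx: index of the parent's frame pointer
 * Assumed frame layout: words[fp] holds the caller's frame pointer
 * (0 ends the chain) and words[fp + 1] holds the return address.
 * The child's sp and fp end up at the same indices in child->words.
 */
void kstack_clone(const kstack_t *parent, kstack_t *child,
                  size_t sp_idx, size_t fp_idx)
{
    uintptr_t parent_base = (uintptr_t)parent->words;
    uintptr_t child_base  = (uintptr_t)child->words;

    /* 1. Copy everything between sp and the top of the stack. */
    memcpy(&child->words[sp_idx], &parent->words[sp_idx],
           (KSTACK_WORDS - sp_idx) * sizeof(uintptr_t));

    /* 2. Saved frame pointers are absolute addresses into the parent's
     *    stack, so walk the chain and point each one at the child's copy. */
    size_t idx = fp_idx;
    while (idx < KSTACK_WORDS) {
        uintptr_t saved = child->words[idx];
        if (saved < parent_base ||
            saved >= parent_base + sizeof(parent->words)) {
            break;  /* 0, or anything outside the parent's stack, ends the walk */
        }
        uintptr_t offset = saved - parent_base;
        size_t next = offset / sizeof(uintptr_t);
        if (next <= idx) {
            break;  /* callers live at higher addresses; otherwise give up */
        }
        child->words[idx] = child_base + offset;
        idx = next;
    }
}
```

Note that this only rebases the frame pointer chain. Any other pointer into the parent's stack (say, the address of a local passed to a callee) would still point at the parent, so a real kernel fork would need to deal with that too or avoid such pointers across the fork.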
I had a Google interview today. I feel like I did well at the data structures/coding portion, but I completely bombed the algorithms part. It seriously makes me wonder how I managed to get an A in algo class. Or maybe I’ve just forgotten everything since then. Meh.
Whee… now it’s time to debug userspace exec and fork. Fun fun fun?
Edit: 5:30 AM. Looks like it’s going to be another all-nighter. We really should start working earlier in the week so we’re not stuck doing a bunch of stuff the night of the deadline.
Edit edit: 6 AM and fork works! Huzzah. Turns out the problem wasn’t in my code at all (which, surprisingly, given the complexity of mapping multiple physical pages into and out of a virtual address space, was correct the first time) but rather with the TLB keeping around “stale” entries. Oops. Now let’s get exec working (and maybe bed after that).
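To show what a “stale” TLB entry means, here is a toy software model in C. It is not our kernel code: mmu_t, mmu_map, mmu_translate and the tlb_* helpers are hypothetical names, and the tiny direct-mapped TLB is an assumption made to keep the sketch short. On real x86 hardware the equivalent of tlb_invalidate_page is the invlpg instruction, and reloading CR3 flushes the non-global entries.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPAGES    16  /* hypothetical number of virtual pages */
#define TLB_SLOTS 4   /* hypothetical TLB size (direct-mapped) */

typedef struct {
    bool     valid;
    uint32_t vpn;  /* virtual page number */
    uint32_t pfn;  /* physical frame number */
} tlb_entry_t;

typedef struct {
    uint32_t    page_table[NPAGES];  /* vpn -> pfn; callers pass vpn < NPAGES */
    tlb_entry_t tlb[TLB_SLOTS];
} mmu_t;

void mmu_init(mmu_t *m)
{
    memset(m, 0, sizeof(*m));
}

/* Update the page table only. Like real hardware, this does NOT touch the TLB. */
void mmu_map(mmu_t *m, uint32_t vpn, uint32_t pfn)
{
    m->page_table[vpn] = pfn;
}

/* Translate through the TLB, filling it from the page table on a miss. */
uint32_t mmu_translate(mmu_t *m, uint32_t vpn)
{
    tlb_entry_t *e = &m->tlb[vpn % TLB_SLOTS];
    if (e->valid && e->vpn == vpn) {
        return e->pfn;  /* hit: the page table is never consulted */
    }
    e->valid = true;
    e->vpn   = vpn;
    e->pfn   = m->page_table[vpn];
    return e->pfn;
}

/* Drop the cached translation for one page (what invlpg does on x86). */
void tlb_invalidate_page(mmu_t *m, uint32_t vpn)
{
    tlb_entry_t *e = &m->tlb[vpn % TLB_SLOTS];
    if (e->valid && e->vpn == vpn) {
        e->valid = false;
    }
}

/* Drop every cached translation. */
void tlb_flush_all(mmu_t *m)
{
    for (int i = 0; i < TLB_SLOTS; i++) {
        m->tlb[i].valid = false;
    }
}
```

In this model, mapping a page, translating it once, and then remapping it to a different frame leaves the old frame cached: mmu_translate keeps returning it until tlb_invalidate_page or tlb_flush_all is called. That is the same shape as the bug above, where the page tables were right but the translations being used were old.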
Edit edit edit: 9 AM. Mission failed. We didn’t manage to get exec working (although it now does return to userland and run for a bit, albeit incorrectly, before being context-switched away and page faulting on its next run). Hopefully we’ll have it done today.