Saturday, October 20, 2012

MUXX: Milestone 2 reached

My toy kernel slowly evolves...

This morning I have considered my little toy PDP-11 kernel has reached the second milestone. This means that right now my creature has the following super-advanced features:

  • It can open and close channels to devices (currently just the paper tape puncher and reader and the line printer).
  • It can read and write using those channels, in synchronous mode (no context switch when a task blocks for a read... it spins faithfully until the I/O operation is complete).
  • It can load tasks from a channel and execute those tasks in a separate, memory protected address space. 
The next goal is to pull some functionality out of the kernel space and move it to different processes. At that point I will be able to say without being ashamed of myself that MUXX has a microkernel architecture... :) And I really must do it, because the current version is just at the limit of the 24KB I defined as kernel code space. So, the next steps will be:

  • Define a framework of inter-processes message passing and replying (I will probably inspire myself on minix to do this).
  • Decide which kernel functionality I will move to the auxiliary processes without breaking it all. It will be probably the memory management and the error/message handling.
I have swapped milestones 3 and 4. My original plan was to get interrupt-driven asynchronous I/O first, but right now I can't stuff anything else in the kernel space. On the other hand, the asynchcronous I/O implementation could and will use separated processes and inter-task messaging, so it makes much more sense to do it in this order.

About process creation and task loading

As I wrote when I presented this project, my goal is to learn about operating system implementation as well as to known better the PDP-11 architecture. I have to say I am progressing in both goals. A lot.


Lets put an example. The load() function. This funcion loads a task image from a channel and puts it into memory. When I started I planned it to be a syscall. It IS a syscall in some operating systems (in particular, MVS or z/OS, as you want to call that thing). In UNIX there is not a load() function, but a family of exec() functions which do (aproximately) the same function.

The UNIX operating system uses a somehow curious way to create a new task. To create a new task (or process) you must use the fork() system call, which basically duplicates the executing task. Then this function returns a different result for the "parent" task (in this case it returns the ID of the created process) and the "child" or "new" task (it returns NULL). Then the parent task can go on with its own business, while the "new" one can call the exec() function to load a new executable and transfer control to it. 

In contrast, MUXX has a different model, inspired in VMS. In that wonderful operating system the parent task uses the CREPRC syscall to build a new address space and load an image into it, just in one step. There is no "forking": the new process does not begin its existance as a copy of the parent. In MUXX the CREPRC system call does create a new address space and does set up its corresponding memory pages, but it does not load an image into it. It assumes the new process will execute some code already present in the kernel image (so the new task is hard-linked into the kernel). The reason to do it this way is really simple: during the development of MUXX I needed to be able to create new tasks before I could think about loading different, separated images. So CREPRC just creates and branches, but does not load.

So here comes LOADPRC, a second syscall which does load a new image. And now comes the problem: the MUXX kernel is not preemptable at this moment. That means that the kernel code (so the syscalls) executes with the interrupts disabled. And to do a potentially long (in terms of elapsed time) operation like an  image load with interrupts (and hence task switching) disabled is really an ugly solution. That means, basically, that the actual loading of the new image has to be performed in user mode, out of the kernel space.

This brings us into another problem. We are loading an image into a new task, what means different memory configuration (different MMU setup)... from user mode code and from (probably) a non-privileged task. I could probably have written some kludge to do the MMU config switching during the LOAD, but I felt a little bit lazy about it, so I decided to use another aproximation. Enters rshell.

rshell comes from resident shell. It will evolve to be the "resident" part of the system shell (the part which will be present in memory at all the time), and is hard-linked with the kernel code, so it has to live in the low 24KB of memory. Right now rshell is really simple. It does just one thing: it looks at the task memory base (at 060000 octal) for a device name (in the future it will be a file specification) and calls load() to read that task into memory (beginning at 060000!). load() is a library function, not a syscall, and executes mostly in user mode (except for the actual I/O). When/if the load is completed, rshell jumps to the address 060000, where the linker has located the crt0() function, which in turn initializes the runtime and calls the main() function for the task we are loading. 

The nice part of this is rshell executes in the new task address space. So when a program invokes LOADPRC it does the following:
  • It prepares a new task address space, just like CREPRC does.
  • It copies the name of the image to load at the position 060000 of the new address space
  • It prepares the new task to begin its execution at the rshell() entry point
And that's it. When the scheduler selects the new task to execute it will run rshell, which will load the new image and transfer control to it... Of course, right now MUXX can use just the paper tape reader, so there are concurrency problems (if we try to run 3 tasks they will try to load at the same time...), but those are workable.

About the MMU setup and MUXX memory model

When I started loading tasks at their own address (beginning at 060000) I begun experiencing weird and hard to explain crashes. They came in all flavours: Illegal instructions, odd memory accesses, MMU exceptions... I really had no idea about what was happening, but they just occurred when I started the tasks using LOADPRC (if I linked the same tasks into the kernel and ran them using CREPRC they did fine). After some investigation, I realized the code resident at the page 6 (060000-080000) was being overwritten, so when the execution got transferred to the corrupted position all kinds of Bad Things happened. Observing the overwritten areas they seemed to be clobbered by printf()... so I spent a pair of days trying to find some bug there.

I obtained printf() from the 2.11BSD libc, with some minor changes to adapt the BSD code to MUXX ABI. I had previously found some stupid bugs in my minor changes, but I thought it was now correct. After some hours of work, I decided printf() was doing fine, so the problem had to be somewhere else. And that somewhere else happened to be precisely here. Let's take a closer look at that.

MUXX task memory map

The muxx_memsvc_svc.c module contains the basic memory management code. If you look at the linked version you'll see it is incomplete (right now the memory deallocation is not implemented yet).One of the things that module does is to set up the address space for a new task. That is what muxx_setup_taskmem() does.

A MUXX task address space has three (or four) different memory regions:
  • 000000 to 057777 is the kernel space, shared by all the tasks. It is write-protected from user mode code (except for the tasks with operprv privilege and the system tasks). 
  • 060000 to 137777 is the user space. CREPRC and LOADPRC allows to select between three task sizes (SMALL, MED and LARGE), which correspond to real user spaces up to 077777, 117777 or 137777 respectively.
  • 0150000 to 157777 is the stack space, furtherly divided at 156000 between user mode stack and kernel mode stack
The space between 137777 and 150000 is not allocated at this point. It will be used when I implement different stack sizes.

The fourth region, between 160000 and 177777 is the mapped IO area, and it is not mapped into the user mode address space unless the task has the ioprv privilege.


PDP-11 memory management

The PDP-11 program-accessible address space is, then, a 65566 position array of bytes, or a 32768 array of words. The physical memory addressing capability of the PDP11 depends on the model and goes up to 2048 Kbytes. The model I am targeting, the 11-60, has a 256KB addressable space. The MMU allows the PDP to map those 256KB so they cant be seen through the 64KB address space. When the MMU is enabled, the addresses the programs work with are virtual addresses which must be relocated to real, physical addresses. Using the MMU we can provide different relocations for different tasks, so each task gets its own 64KB address space, with or without sharing memory with other tasks.

A 16-bit virtual address is formed by two parts:
  • The highest 3 bits form a number from 0 to 7 which tells us what "page register" will we use to relocate the address.
  • The lowest 13 bits form a displacement inside a 8192 bytes "page".
The MMU has a bunch of registers to configure the memory relocation. These registers are grouped into two or three sets corresponding to the processor execution modes (user, kernel and, in some models, supervisor). The processor models with separate I and D address spaces double the number of registers. Our PDP-11/60 has just two sets for user mode and kernel mode. Each set contains eigth pairs of registers, each of those pairs corresponding to one "page" between 0 and 7. And each pair is composed by a page address register (PAR) and a page descriptor register (PDR). Each PAR contains a number of physical memory block, being each block formed by 64 bytes. The PDR contain control information, like the access permissions for the page they describe, the size of the block and some other information, particulary the growth direction, which tells the MMU if the accesses will be done upwards (like in normal code or data arrays) or downwards (like in a stack).

Lets say our program refers to the virtual address 060042. That address breaks into:
  • A page number in bits 13-15, which is 03.
  • A displacement in bits 0-12, which is 42.
That means the MMU will use the PAR and PDR for the page 03 of the current mode. We will ignore the PDR at this moment. Let's say the PAR for the page 03 contains the value 01200. Then, the physical direction will be formed this way:
  • Physical address base: 01200 * 0100 = 0120000 (remember we are using octal figures)
  • Displacement: 042
  • Physical address: 0120042
That will be the real address the CPU will reference. Changing the values of the PAR we can assign the same virtual address to different physical addresses, efectively isolating the tasks from each other. Manipulating the PDR we can establish protection for the memory pages, so the user mode code can't write into the kernel space or can't even see the IO mapped addresses.

The MMU can also detect when we try to access a memory address which is out of the mapped range. The PDR contains a field with the length (in blocks) of the current page. So, if  we confugure it with a 4KB memory page and we reference a position at that page with a displacement grearter than 4KB, the MMU forces a memory trap and invokes a handler using the usual PDP-11 trap sequence; this allows an operating system to do something about that, being it just killing the offending task, invoking some paging mechanism or panicking if the trap has occurred in kernel space.

Now you can tell me a liar. I've omitted that pesky "growth direction" bit in the PDR. If that bit is set then we are telling the MMU that the page grows downward, and then the size check works just the opposite way. That means in our exemple of a 4KB page it would trigger a trap if we referred to a virtual memory with a displacement less than 4KB! In MUXX the stack resides in a 4KB page, so if we grow the stack below that 4KB line, it will trigger the trap and will abort the task (currently, it panics the system, since MUXX does not know how to kill a task yet)

Misunderstanding the MMU and learning it the hard way

Let's look at the stack preparing code in muxx_setup_taskmem:


  /*
  ** Task stack space
  ** The stack size options should be evaluated and applied here
  ** Since the stack grows downward, the PAR contains Addr - Size 
  ** so the physical addresses are properly calculated without
  ** overlapping.
  **
  ** Example: Base virtual address: 0140000
  **          Top of stack:         0157777 (0120000 + 020000 - 1)
  **          Bottom of stack:      0150000 (0120000 + 010000 )
  **          PAR value:            0500
  **          Size:                 0100 (4K)
  **          Physical range:       060000:067777   
  */
  mcb = muxx_mem_getblock(task, 0100, MMCB_FLG_STK, 6);
  if (mcb != NULL) {
    task->mmuState.upar[6] = mcb->blockAddr-0100;
    task->mmuState.updr[6] = PDR_ACC_RW | PDR_SIZ_4K | PDR_DIR_DN;
    task->mmuState.kpar[6] = mcb->blockAddr-0100;
    task->mmuState.kpdr[6] = PDR_ACC_RW | PDR_SIZ_4K | PDR_DIR_DN;
  } else {
    return (ENOMEM);
  }

Don't you find anything unusual? Let's check the previous, bugged code:
  /*
  ** Task stack space
  ** The stack size options should be evaluated and applied here
  */
  mcb = muxx_mem_getblock(task, 0100, MMCB_FLG_STK, 6);
  if (mcb != NULL) {
    task->mmuState.upar[6] = mcb->blockAddr;
    task->mmuState.updr[6] = PDR_ACC_RW | PDR_SIZ_4K | PDR_DIR_DN;
    task->mmuState.kpar[6] = mcb->blockAddr;
    task->mmuState.kpdr[6] = PDR_ACC_RW | PDR_SIZ_4K | PDR_DIR_DN;
  } else {
    return (ENOMEM);
  }

Do you see the difference? In the corrected code I'm substracting 0100 from the PAR value for the stack page. So if I'm allocating the addresses 140000-150000 to the stack, I'm not writing 1400 in the PAR, but 1300. Why? 

The top of the stack for the kernel mode code is at 157777. That is, page 6, displacement 17777. If I put 1400 in the PAR this translates to 140000 + 17777 = 157777. Since the stack PDR is configure with the "grow downwards" bit, the MMU will not complain about this relocation and when MUXX writes something in the stack, it will clobber the 157777 physical memory address.

Of course, our stack is 4KB big, so we have just allocated 0100 blocks for it. And we have allocated the next available blocks to the next created task (since the stack is the last page we set up). So the next page will have probably the blocks 1500 to 1700 (for a 8K so 200 blocks page). Now we can see the problem. The stack of this task is sharing the same memory blocks of some page of the next created task. And that is precisely what we see... the 03 page of the next task (the one which will be allocated first!) is clobbed by our current task writing into its stack! 

The solution is to pull back the content we write in the PAR by 0100, corresponding to the 4KB size we are not allocating to the stack. Redoing the numbers, the 157777 virtual address would relocate into block 1300 plus displacement 17777 = 147777, which is just in the 1400-1477 range. If we go down 140000 the MMU will detect it and will trigger a trap, so we are safe. The weird thing is that in our memory management tables we will register the stack uses the pages  1400-1477, while the PAR will contain 1300. Weird, but correct.

Going on, and small machines...

And that is all for today. As unrelated stuff, my raspberry holds now 6 simulations: 2 VAXen, 3 PDP-11s (RSXM, RSXM+ and RSTS) and PDP-10... without a lot of load, but they work. For those of you who are part of HECNET, remember those machines are in area 7 and most of them have GUEST accounts, You will be welcome at my humble simulated, rapsberryzed digital home.