Sunday, January 21, 2018

Some assembly required - Part one

During my professional career, I've had to write assembly code for three different platforms: the VAX, the IBM PC and the IBM mainframe, also known as zSeries. The only one I have somehow managed to master is the mainframe; my experience with the other two is limited to occasional tinkering and writing the odd piece of glue code.

As a hobbyist, however, I've been lucky to be able to write assembly code for other architectures. The PDP-11 is probably the one I have been most interested in, as you can see if you read past articles (and, yes, someday I'll go back to my little MUXX project). I have also written some code for the old Commodore-64, or the MOS 6502 processor, which is the same thing. That makes a total of five platforms I've tinkered with at the assembly level, so I've had the chance to see how different some well-known, classic architectures are at the machine level.

In this post I'll present you with a working example written in assembly language. The example is quite simple, and consists of the well-known "reverser" program. That is, it will prompt the user to enter a string and will output it reversed; it will keep asking for more strings until the user enters an empty line. Then it will finish in an orderly way (if possible) and will return control to the operating system (if there is one). This program is, of course, trivial, but it's a step beyond a simple "echo" program and two steps away from a "Hello world". It allows us to see how looping works on each platform, and how to do some input/output at the basic level.

The examples are not meant to be... exemplary. I'm by no means an expert in each of the platforms. And, for sure, there are more efficient and canonical ways to do this. The idea is to give a taste of each platform and an excuse to comment a little bit about each one of them. Since I'll cover eight different architectures, I will make several posts. This is just the first one of possibly three.

All the source code is on GitHub, and can be browsed or cloned from the repository https://github.com/jguillaumes/ancientbits, which also contains code for other posts in this blog.

Without further ado, let's see our first example. And we'll begin with a well-known platform, also known as...

The IBM PC

The horror. The 8086 is probably one of the ugliest processors to program. Its segmented memory model has given headaches to thousands of programmers. Its non-orthogonal register set is, at the very least, nightmarish. Yet it somehow achieved total dominance of the industry, crushing every competitor that crossed its path, with the sole exception of the ARM architecture, which specialized in low-power computing and won the mobile market.

The 8086 architecture uses a 16 bit word length and a 20 bit address length. To be able to address all the memory, it divides it into 64K segments which are selected using some special registers, namely CS (code segment), DS (data segment), ES (extra segment) and SS (stack segment). An assembly program has to keep track of the contents of those registers, and initialize and change them accordingly. The processor contains four registers which are almost general purpose (AX to DX), two registers related to stack management and procedure call handling (SP and BP) and two more used as pointers for string-handling instructions (SI and DI). The architecture evolved to 32 bits with the 80386 processor and then to 64 bits, first introduced by AMD. Since this blog is basically about classic machines, we will stick to the original 16 bit 8086 (and its relatives, like the 8088 used in the original IBM PC).
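
To make the segment arithmetic concrete, here is a minimal sketch in Python (not assembly, just to spell out the numbers). The function name is mine; the only thing it assumes is the usual 8086 rule: the segment value is shifted four bits to the left, the offset is added, and only 20 bits of the result are kept.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16 bit segment and a 16 bit offset into a 20 bit address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must fit in 16 bits")
    # Segment * 16 + offset, truncated to the 20 address lines of the 8086.
    return ((segment << 4) + offset) & 0xFFFFF


# Two different segment:offset pairs can name the same byte of memory.
print(hex(physical_address(0x1234, 0x0010)))
print(hex(physical_address(0x1235, 0x0000)))
```

Since segments start every 16 bytes but are 64K long, they overlap heavily, which is one of the reasons this model was so easy to get wrong.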

[Figure: Intel 8086 register model]

The 8086 architecture uses a stack for interrupt handling and procedure calling. The usual ABIs also use the stack to pass the procedure parameters. The AX register is the only one which can be used for every arithmetic operation, CX is used for counting-related things (like counted loops), and BX usually contains a base address. This is a simplification: the programmer had to memorize (or have the manual at hand) which registers could be used in each operation.

Under the MS-DOS operating system the programmer used a "software interrupt" to invoke the operating system kernel. That "software interrupt" is what we call a "system trap" in other architectures. MS-DOS used interrupt number 21h (21 in hexadecimal), with the AX and DX registers passing parameters to the kernel. Some of the system services were directly taken from a previous OS, CP/M (which we will also cover in this series of posts). Gradually, MS-DOS evolved to distance itself from its ancestor.

Let's see some code. This is the start of our little program:

main    proc
        mov    ax, seg hello       ; Setup data segment (DS and ES)
        mov    ds, ax
        mov    es, ax

        lea    dx, hello           ; Show welcome message
        mov    ah, 09h
        int    21h

What have we got here? The first instructions just set up the DS and ES registers so we can address our data. This program fits in just one 64K segment, so we will just set DS and ES to the segment which contains one of our literals and forget about them. That is close in spirit to what was called the tiny memory model, where code, data and stack all share a single segment. There were other models: small, medium, large and huge, but we will ignore those in this example.

After setting the data and extra segments, this code contains a system call, which will display a welcome message. We load the address of the "hello" string into the DX register and then call the 09h function of the DOS kernel using int 21h. That service outputs text to the standard output (the screen, usually). The string has to be terminated by a "$" character, a convention inherited from CP/M.
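
As a rough illustration of the "$" convention, this Python sketch models what function 09h does with the bytes at DS:DX. The function name and the sample data are made up for the example; the real service writes to the standard output instead of returning a string.

```python
def dos_print_string(memory: bytes, offset: int) -> str:
    """Return the text function 09h would print, starting at the given offset."""
    # Everything up to (but not including) the first "$" is printed.
    end = memory.index(b"$", offset)  # raises ValueError if there is no "$"
    return memory[offset:end].decode("ascii")


data_segment = b"Hello, reverser!\r\n$this part is never printed"
print(repr(dos_print_string(data_segment, 0)))
```

The obvious drawback is that a "$" cannot appear inside the text itself, which is awkward for a program that prints prices.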

The next interesting thing is the user input. This is the related code:

        lea     dx, buffin          ; Input text line
        mov     ah, 0ch             ; Clear STDIN buffer
        mov     al, 0ah             ; Buffered read DOS function
        int     21h
        mov     al, byte ptr [numch]; AL: Number of characters
        cmp     al, 0               ; Is the buffer empty?
        je      final               ; Yes: finish

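In this case, we use the 0Ah DOS function to read a line of text (invoked through function 0Ch, which clears the keyboard buffer first). The line of text is placed in a buffer which starts with a byte holding its maximum size, followed by a counter byte that DOS fills in and then the space for the entered text. DOS will not accept more characters than that first byte announces, so everything depends on declaring the size honestly: if the program announces more room than it has really reserved, the input can overflow the buffer and overwrite whatever follows, code included. A malicious user can use that kind of error to inject his own code into our computer and do nasty things. But, hey, when this was designed there was no internet full of bad guys trying to break into your machine!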
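
The layout of that buffer is easier to see written down. The following Python sketch models what function 0Ah leaves in memory; the function name is mine, and I am assuming that the labels buffin, numch and bufchr in the listing point to byte 0, byte 1 and byte 2 of this structure respectively.

```python
def dos_buffered_input(max_chars: int, typed: str) -> bytearray:
    """Model the buffer filled in by DOS function 0Ah.

    Byte 0:   maximum number of characters, final carriage return included
              (set by the program before the call).
    Byte 1:   number of characters read, carriage return NOT included
              (set by DOS).
    Byte 2..: the characters themselves, followed by a carriage return.
    """
    if not 1 <= max_chars <= 255:
        raise ValueError("the size must fit in one byte and leave room for CR")
    buffer = bytearray(2 + max_chars)
    buffer[0] = max_chars
    # DOS refuses extra keystrokes, so one byte is always left for the CR.
    accepted = typed.encode("ascii")[: max_chars - 1]
    buffer[1] = len(accepted)
    buffer[2 : 2 + len(accepted)] = accepted
    buffer[2 + len(accepted)] = 0x0D
    return buffer


print(dos_buffered_input(5, "abcdefgh"))
```

With a declared size of 5 only four characters are kept, and the counter byte says 4. That counter is what the program compares against zero to detect an empty line.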

After reading the input, we check if the user entered an empty line and, in that case, branch to the end of the program.

Now we'll get into the proper reversing part of the program.

     lea     ax,bufchr           ; AX => Start of input buffer
     sub     cx,cx               ; CX => Zero
     mov     cl,byte ptr [numch] ; CX => Number of bytes in buffer
     add     ax,cx               ; AX => End of input text + 1
     dec     ax                  ; AX => End of input text
     mov     si,ax               ; SI => End of input text
     lea     di,bufout           ; DI => Start of output buffer

theloop:
     std                         ; change direction: decrement
     lodsb                       ; Load byte from DS:SI, decrement SI
     cld                         ; change direction: increment
     stosb                       ; Store byte at ES:DI, increment DI
     loop    theloop             ; Check CX and loop if not zero
     mov     cx,3                ; Prepare to move 3 more bytes (CR,LF,'$')
     lea     si,crlf             ; DS:SI => CR+LF+'$'
     rep movsb                   ; Move to ES:DI (append to output buffer)

The 8086 architecture implements instructions to move, load or store bytes (MOVSB, LODSB and STOSB respectively). Those instructions read from the address pointed to by DS:SI and write to the one pointed to by ES:DI, and at the same time increment or decrement the corresponding source or destination register. The instructions STD and CLD select whether the registers are decremented or incremented. So in this code, what we do is load SI with the address of the last character of the string and DI with the address of the first byte of the output buffer, and then load each byte after activating the decrement mode and store it after activating the increment mode. The LOOP instruction decrements the CX register and jumps back unless it has reached zero, so we preload it with the length of the string. And we are almost done. To complete the output string we add three more bytes: a carriage return, a line feed and the terminating "$", using the MOVSB instruction prefixed with REP, which tells the processor to repeat the instruction as many times as indicated in the CX register. And that is basically all.
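
For readers who do not speak 8086, here is the same loop sketched in Python, with one variable per register. It is only a model of the listing above under my reading of it (the function name is invented), but it shows the two pointers walking in opposite directions.

```python
def reverse_like_8086(text: bytes) -> bytes:
    """Mimic the LODSB/STOSB loop: SI walks backwards, DI walks forwards."""
    if not text:
        # The real program branches to the exit on an empty line. That check
        # matters: LOOP decrements CX *before* testing it, so entering the
        # loop with CX = 0 would make it run 65536 times.
        raise ValueError("empty input never reaches the loop")
    source = bytearray(text)
    dest = bytearray(len(text) + 3)
    si = len(source) - 1        # SI => last character of the input
    di = 0                      # DI => start of the output buffer
    cx = len(source)            # CX => number of characters
    while True:
        al = source[si]         # STD + LODSB: load the byte...
        si -= 1                 # ...and decrement SI
        dest[di] = al           # CLD + STOSB: store the byte...
        di += 1                 # ...and increment DI
        cx -= 1                 # LOOP: decrement CX,
        if cx == 0:             # fall through when it reaches zero
            break
    tail = b"\r\n$"
    si, cx = 0, 3
    while cx:                   # REP MOVSB: copy CX bytes, both pointers up
        dest[di] = tail[si]
        si += 1
        di += 1
        cx -= 1
    return bytes(dest)


print(reverse_like_8086(b"hello"))
```

Flipping the direction flag twice per character is not the fastest way to do it, but it keeps the loop body down to two string instructions.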

To run this example you will need an MS-DOS environment. You can use a real MS-DOS, an MS-DOS session under Windows (I have not tested this!) or a DOS emulator like DOSBox, which is the option I used. You can get the Microsoft Macro Assembler for free (legally), but please note that the last version capable of running under MS-DOS is 6.11. Any version higher than that needs a Windows environment to run.

The PDP-11

I have already written several entries about the PDP-11 and its architecture. I have confessed I love it. The PDP-11 architecture defines eight general purpose registers, and it is basically orthogonal. All the instructions can be applied to any register. Almost.

Two of the registers have specialised uses. R6 is used as a stack pointer. When the basic architecture was enhanced to support different privilege levels, R6 was "multiplied" by three so each execution mode (user, supervisor and kernel) has its own copy and thus can address its own stack. The R7 register is the program counter and contains the address of the next instruction to execute. As a programmer you can perform arithmetic on R7... but if you think a little bit you'll find that is the same as doing a relative (if you add or subtract from R7) or absolute (if you deposit a value into R7) branch. And, by the way, you can use any of the other general registers as a stack pointer, but you won't get the three execution mode-specific instances!



The PDP-11 is a 16 bit machine, and so its address space is just 64K bytes long. Even in its time, 64K was considered too small, so the machine was enhanced to be able to physically address up to 4 megabytes. The program (or the operating system) configures a memory management unit to establish which 64K of those 4M can be accessed at each moment. The later models could address 64K of code and 64K of data at the same time (Instruction/Data split). But that was as far as the architecture could be stretched, so DEC designed the VAX to extend and replace the PDP-11.
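
A simplified sketch of that mapping, again in Python: the 64K virtual space is cut into eight pages of 8K bytes, and each page has a Page Address Register (PAR) holding the physical base in units of 64 bytes. The function name and the sample values are mine, and the sketch deliberately ignores the length and protection checks, the per-mode register sets and the I/D split.

```python
def map_address(virtual: int, par: list) -> int:
    """Translate a 16 bit virtual address into a 22 bit physical one."""
    if not 0 <= virtual <= 0o177777:
        raise ValueError("virtual addresses are 16 bits wide")
    page = virtual >> 13                 # top 3 bits: which of the 8 pages
    displacement = virtual & 0o17777     # low 13 bits: offset inside the page
    # The PAR counts 64-byte blocks, hence the shift by 6 bits.
    return ((par[page] << 6) + displacement) & 0o17777777


# Hypothetical setup: page 0 at physical 100000 (octal), page 7 on the
# I/O page at the very top of the 22 bit space.
pars = [0o1000, 0, 0, 0, 0, 0, 0, 0o177600]
print(oct(map_address(0o000100, pars)))
print(oct(map_address(0o177514, pars)))
```

The program only ever sees 16 bit addresses; moving one of the eight windows around the 4 megabytes is just a matter of rewriting its PAR.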

There were plenty of operating systems available for the PDP-11, some of them written by DEC. The PDP-11 was the platform where UNIX reached its adolescence (it was born on the PDP-7, but grew up on the PDP-11). For this example, I've chosen one of the DEC operating systems, namely RT-11. The RT in that name means, literally, "Real Time".

RT-11 is a single user, multitasking operating system. Its user interface comes from the TOPS-10 heritage, and hence it is part of the inspiration for CP/M which, in turn, is the base for MS-DOS. An MS-DOS user can feel quite comfortable typing commands in an RT-11 system: the files have the familiar name-plus-extension structure (6+3 characters in RT-11, against the 8+3 of MS-DOS), and a lot of commands will be familiar: DIRECTORY, DELETE, TYPE, to name a few. Our example code will use RT-11 system calls to get and put strings in the terminal.

Let's go and see some code. This is the beginning of our code:

START: 
        BIS     #TTLC$,$JSW ; Allow lowercase characters

        MOV     #HELLO,R1 ; Display welcome string
        MOV     #LHELLO,R2
        JSR     PC,LINOUT
        JSR     PC,LBREAK

The first instruction just sets a flag in the Job Status Word, which is located at the address 000044 octal (in the PDP-11 world it's customary to use octal, even though it is a byte-oriented machine where hexadecimal would seem the more natural fit). Specifically, this flag allows lowercase characters in input (it prevents the OS from automatically uppercasing them). After that, we have two subroutine calls which display the welcome message and a line break. We build our own string-output routine for counted strings (RT-11 does offer a .PRINT request, but it expects a terminated string) on top of the basic "put byte" service:

;------------------------------------------------------------------------
; LINOUT: Display a text line
;------------------------------------------------------------------------
;  Subroutine to display a text line
; R1: @Text
; R2: Size
; R1 and R2 are destroyed
;
LINOUT: CMP     R2,#0  ; Exit when no more chars to display
        BLE     20$  ;
10$:    .TTYOUT (R1)+  ; Display character pointed by R1
        SOB     R2,10$  ; Decrement counter
20$:    RTS     PC 

Notice we specify the PC register both in the call (JSR) and return (RTS) instructions. In the JSR instruction, the specified register gets pushed onto the stack and the current program counter is stored in the register. On return, an indirect jump is performed to the address in the register and its old value is popped from the stack. If, like in this case, we use PC, the linkage register is the program counter itself: the net effect is that the return address is simply pushed onto the stack and popped from it on return, and no register apart from PC (R7) and the stack pointer is touched. When we take a look at the mainframe architecture we will see a similar thing (although the mainframe does not really use a stack).
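
The linkage is easier to follow in slow motion. This is a toy Python model of just those two instructions (the class and its fields are made up for the example), showing both the general case and what happens when the linkage register is the PC.

```python
class ToyPdp11:
    """Just enough of a PDP-11 to follow JSR and RTS."""

    def __init__(self):
        self.reg = [0] * 8          # R0..R7; R6 is SP, R7 is PC
        self.memory = {}            # sparse word memory, enough for a stack
        self.reg[6] = 0o1000        # arbitrary initial stack pointer

    def push(self, value):
        self.reg[6] -= 2            # the stack grows downwards, word by word
        self.memory[self.reg[6]] = value

    def pop(self):
        value = self.memory[self.reg[6]]
        self.reg[6] += 2
        return value

    def jsr(self, link, target):
        # PC is assumed to already point at the instruction after the JSR.
        self.push(self.reg[link])           # save the linkage register
        self.reg[link] = self.reg[7]        # linkage register <- return address
        self.reg[7] = target                # jump

    def rts(self, link):
        self.reg[7] = self.reg[link]        # jump back through the register
        self.reg[link] = self.pop()         # restore the linkage register


cpu = ToyPdp11()
cpu.reg[7] = 0o1010                 # return address
cpu.jsr(7, 0o2000)                  # JSR PC,target
print(oct(cpu.reg[7]), oct(cpu.memory[cpu.reg[6]]))
cpu.rts(7)                          # RTS PC
print(oct(cpu.reg[7]), oct(cpu.reg[6]))
```

With any register other than the PC, the called routine finds its return address in that register, which is handy for reading inline parameters placed right after the call.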

The subroutine invokes the "macro" .TTYOUT, which is a system call which outputs a character (pointed to by R1) to the console. It uses the (R1)+ construct, which is post-incremented indirect addressing. It means "take whatever is in the memory position pointed at by R1 and increment R1 afterwards". So we proceed through all the string characters, one by one. We use a counted string, with its size in R2. The SOB instruction ("Subtract One and Branch") subtracts one from the specified register (in this case, R2) and jumps to the target label unless the register has reached zero. This is one way to implement counted loops on the PDP-11.
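
Here is LINOUT retold in Python, one statement per instruction. It is a sketch, not a translation: the function name and the byte string standing in for memory are mine, and the characters are collected in a string instead of being sent to the console.

```python
def linout(memory: bytes, r1: int, r2: int) -> str:
    """Model of LINOUT: R1 is the text address, R2 the number of characters."""
    output = []
    if r2 <= 0:                          # CMP R2,#0 / BLE 20$
        return ""
    while True:
        output.append(chr(memory[r1]))   # .TTYOUT (R1)+ : use the byte at R1...
        r1 += 1                          # ...and increment R1 afterwards
        r2 -= 1                          # SOB R2,10$ : subtract one,
        if r2 == 0:                      # branch back unless it is now zero
            break
    return "".join(output)


print(linout(b"..HELLO..", 2, 5))
```

The initial test on R2 is not decoration: SOB subtracts before it tests, so entering the loop with a zero count would send it around 65536 times.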

We have to program the line input routine too. Since it's just used once, it has not been implemented as a subroutine, but as inline code. Let's take a look at it:

GETLIN: MOV     #PROMPT,R1  ; Display prompt
        MOV     #LPROMPT,R2
        JSR     PC,LINOUT

        MOV     #BUFFER,R1  ; R1 => @Buffer
        CLR     R2          ; R2 => Number of read characters
GETCHAR: 
        .TTYIN              ; Read character into R0
        CMPB    R0,#^X0D    ; Is it a CR?
        BEQ     GETLF       ; Yes, consume LF
        MOVB    R0,(R1)+    ; No, store character in buffer
        INC     R2          ; Increment char counter
        CMP     R2,#LBUFFER ; Full?
        BGE     FULL        ; Yes: finish
        BR      GETCHAR     ; No: get more characters
GETLF:
        .TTYIN              ; Consume LF
FULL:                       ; Full buffer, let's finish here

        CMP     R2,#0       ; Is the buffer empty?
        BEQ     BYE         ; Yes: finish

The .TTYIN macro is the one which reads a character from the terminal and leaves it in R0. What we do is check it for a carriage return (0x0D), which is what we use to detect the end of the line. If the character is not a carriage return, we store it into our buffer, pointed to by R1 (which gets autoincremented), and count the number of characters in R2. We check R2 against the buffer size to avoid buffer overruns. If the character is a carriage return, the next one will be a line feed, which we can safely ignore. If the line is empty, we branch to the end of the process.
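
The control flow of that input loop can be sketched like this. The Python function and its parameter names are invented for the example; I am assuming the size limit is the number of bytes reserved for the buffer, so the sketch stops as soon as that many characters have been stored.

```python
def get_line(keys, buffer_size: int) -> str:
    """Read characters as .TTYIN would deliver them, up to CR or a full buffer."""
    keys = iter(keys)
    buffer = []                       # R1 walks through it, R2 is len(buffer)
    for ch in keys:                   # .TTYIN: next character into R0
        if ch == "\r":                # CMPB R0,#^X0D / BEQ GETLF
            next(keys, None)          # second .TTYIN: swallow the line feed
            break
        buffer.append(ch)             # MOVB R0,(R1)+ / INC R2
        if len(buffer) >= buffer_size:
            break                     # buffer full: stop reading
    return "".join(buffer)


print(repr(get_line("abc\r\nmore", 10)))
print(repr(get_line("\r\n", 10)))
```

An empty result corresponds to R2 being zero after the loop, which is the condition the program uses to say goodbye.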

We can take a look now into the actual reversing code:

        MOV     R2,-(SP)    ; Save string length into stack
 
        MOV     #INVER,R1   ; Display "Inverted..."
        MOV     #LINVER,R2
        JSR     PC,LINOUT

        MOV     (SP)+,R2    ; Restore string length from stack

        MOV     #BUFFER,R1  ; R1 => @Buffer
        ADD     R2,R1       ; R1 => @End of string
OUTR:   .TTYOUT -(R1)       ; Display character pointed by R1
        SOB     R2,OUTR     ; Loop w/decrement pointer and counter
        JSR     PC,LBREAK   ; Add line break 
        BR      GETLIN      ; Get next line

The first instruction is a stack push. We use the SP register (R6) as a stack pointer, but since the PDP-11 is quite orthogonal, we could have set up our own stack pointed to by another register. We use the stack to temporarily save R2, since we are going to invoke our string-output routine, which overwrites it (in a small program like this one we could have used another register, but this way we can see an example of stack push and pop). After invoking the string output, we pop R2, so we have the length of the string to reverse again.

Now the rest is quite easy. We set up R1 to point just past the last character of the string and output the characters one by one, invoking the .TTYOUT macro with a predecremented R1. This works in a similar way to the postincrement, except that the register is decremented before it's used to get the character to print.
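
The detail worth noticing is where R1 starts. A small Python sketch of the OUTR loop (names are mine, and the output is collected in a string rather than printed character by character):

```python
def reverse_out(memory: bytes, buffer: int, length: int) -> str:
    """Model of the OUTR loop: print 'length' characters backwards."""
    if length <= 0:
        raise ValueError("the program exits on an empty line before this point")
    r1 = buffer + length               # ADD R2,R1: one byte PAST the last char
    r2 = length
    output = []
    while True:
        r1 -= 1                        # -(R1): decrement first...
        output.append(chr(memory[r1])) # ...then use the byte R1 now points to
        r2 -= 1                        # SOB R2,OUTR
        if r2 == 0:
            break
    return "".join(output)


print(reverse_out(b"__stressed__", 2, 8))
```

Because the decrement happens before the access, adding the length to the buffer address gives exactly the right starting point, with no extra DEC needed.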

And that's basically all. The PDP-11 instruction set and architecture are really nice and easy to learn and understand. For that reason it was used for many years to teach computer architecture and assembly programming to CS students.

To run this example you need a PDP-11 simulator like simh and an RT-11 image. RT-11 comes with the assembler, and you can use KERMIT or plain cut and paste to bring the program into the simulator.

Closing words

This is the end of the first post. I hope you've got an idea of how the 8086 and the PDP-11 "look" from the point of view of an assembly language programmer. The next post will be about small machines and will cover the 6502 as used in the Commodore-64 and the Z-80 as used in CP/M capable machines. Maybe I will add some other small machine. We will see.


Wednesday, May 11, 2016

Containerizing simh: BSD in a box

Containers (almost) everywhere

One of the buzzwords of the year is "container". Since it seems that containers and containerized systems are here to stay, I decided to take a look at that technology. Especially after good ol' IBM has blessed it, as can be read here. So it looks like Big Blue is going to push that technology to their Holy z/OS land. Or something like that.

So I went to Docker Land and started reading about the stuff. This post is not an explanation of what containers are and how Docker implements them. There are much better texts on the web to learn about that, and I am just the last guy to learn about them, so my explanations would probably be wrong. This blog is about old computers, real and simulated. So let's do some retrocomputing...

Containerizing the past

The idea of packaging an application as a service, with all the dependencies solved so it is basically a software appliance, is of course nice. And, of course, it is not new. We have been able to do that using virtual machines for some time. The main difference is that containers aim to be lightweight and package just what is needed. A VM must include the whole operating system to run, and today operating systems tend to be multi-gigabyte monsters. So the idea of putting, let's say, a PDP-11 simulator with its configuration files and probably also an operating system (of the "just megabytes" kind) in a nice, ready to run package is appealing. So I wanted to know if that was possible.

It, indeed, is.

And I can offer now the results of this little research, in the form of five docker "images", that you can download and convert into multiple "containers" as you want. The images are the following:

  • jguillaumes/simh-allsims: Contains just the simulator binaries, compiled and ready to run.
  • jguillaumes/simh-pdpbsd: Contains the pdp11 simulator plus a BSD 2.11 image.
  • jguillaumes/simh-vaxbsd: Contains the vax780 simulator plus a BSD 4.3 image.
  • jguillaumes/simh-vaxnbsd: Contains the vax simulator plus a NetBSD 6.0 image.
  • jguillaumes/simh-vax: Contains the vax, vax780 and pdp11 simulator plus a pair of example config files.

To use those images you need to download and install the Docker runtime on your machine. There are versions for Windows, Mac OS X and Linux. Oh, you must have a 64 bit system or it won't work at all.

Using the images

To use the images you can "pull" them from the public Docker repository where I have uploaded them. You can do this explicitly or you can simply try to run the images. Docker will automatically pull what you need.

Self-contained images

Let's take the PDP-11 BSD 2.11 for a ride:

docker run --name pdpbsd -p 2323:2323 -it jguillaumes/simh-pdpbsd
Unable to find image 'jguillaumes/simh-pdpbsd:latest' locally
latest: Pulling from jguillaumes/simh-pdpbsd

d0ca440e8637: Already exists 
a1e3125132f8: Already exists 
5fc723fb3b91: Already exists 
22f14ee4456b: Already exists 
fe58eb150210: Already exists 
3cacf15d9073: Already exists 
d95addc21743: Pull complete 
a3770888ba4a: Pull complete 
7e7599d06256: Pull complete 
705a1c43d99d: Pull complete 
Digest: sha256:51d3b466eda25296668302bd1ce6f599d2897a077b04d4d289ee27e07f7f4ef5
Status: Downloaded newer image for jguillaumes/simh-pdpbsd:latest
Uncompressing OS image file...

PDP-11 simulator V4.0-0 Beta        git commit id: 7bd58c6d
Logging to file "console.log"
Disabling RK
Disabling HK
Disabling TM
Listening on port 2323
LPT: creating new file
Eth: opened OS device eth0

73Boot from ra(0,0,0) at 0172150

At this point just press RETURN to proceed to boot BSD


: ra(0,0,0)unix
Boot: bootdev=02400 bootcsr=0172150

2.11 BSD UNIX #7: Thu Jun 8 21:53:04 PDT 1995
    root@:/usr/src/sys/DOKBSD

ra0: Ver 3 mod 3
ra0: RA81  size=891072
attaching qe0 csr 174440
qe0: DEC DEQNA addr 08:00:2b:aa:bb:cc
attaching sl
attaching lo0

phys mem  = 4186112
avail mem = 3730240
user mem  = 307200

May  9 05:49:32 init: configure system

dhv 0 csr 160440 vector 300 attached
lp ? csr 177514 vector 200 skipped:  No autoconfig routines.
ra 0 csr 172150 vector 154 vectorset attached
rl 0 csr 174400 vector 160 attached
tms 0 csr 174500 vector 260 vectorset attached
erase, kill ^U, intr ^C
#

Now you are in single user mode. It is a good moment to edit some /etc files, or simply press CTRL-D to exit single user mode and proceed to full multiuser...

 Fast boot ... skipping disk checks                                                        
checking quotas: done. 
Assuming NETWORKING system ...
add host dokbsd: gateway localhost
add net default: gateway 172.17.0.1
starting system logger
May  9 05:49:39 dokbsd vmunix: ra0: Ver 3 mod 3
May  9 05:49:39 dokbsd vmunix: ra0: RA81  size=891072
checking for core dump... 
preserving editor files
clearing /tmp
standard daemons: update cron accounting.
starting network daemons: inetd rwhod printer.
starting local daemons: sendmail.
May  9 05:49:39 dokbsd ntpd[98]: init_ntp: bad drift compensation value
Mon May  9 05:49:39 PDT 2016
May  9 05:49:39 dokbsd init: kernel security level changed from 0 to 1


2.11 BSD UNIX (dokbsd) (console)

login: root
erase, kill ^U, intr ^C

Login as root with no password... and voilà! You are logged into a BSD 2.11 system running on a PDP-11 simulator. The kernel has network support, which is functional. Just edit /etc/resolv.conf to use a working DNS server and you will be in the internets:

# echo "nameserver 8.8.8.8" >> /etc/resolv.conf 
# ping www.google.com
PING www.google.com (216.58.214.164): 56 data bytes
64 bytes from 216.58.214.164: icmp_seq=0 ttl=37 time=33.334 ms
64 bytes from 216.58.214.164: icmp_seq=1 ttl=37 time=16.667 ms
64 bytes from 216.58.214.164: icmp_seq=2 ttl=37 time=16.667 ms
64 bytes from 216.58.214.164: icmp_seq=3 ttl=37 time=16.667 ms
^C
--- www.google.com ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 16.667/20.833/33.334 ms

(Of course I'm supposing your Docker installation is properly configured and your machine has internet access...).

The pdp11 simulator is configured with a DZ terminal multiplexer attached to the port 2323 in the container. This port is mapped to the same 2323 port in the host so you can TELNET into it. If you start several containers you will have to use a different local port for each one (specified as the first number in the -p parameter: if you want to map the 2323 port in the container to the 9999 port in the host, specify -p 9999:2323). If you are using Linux, you can TELNET to localhost; if you are using OSX or Windows you must use the IP address of the Docker host, which can be obtained with the docker-machine ip command. 
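
If you plan to start several containers, the only things that change between them are the container name and the host side of the port mapping. As a small sketch, this Python helper just builds the command lines; it uses only the docker run options shown earlier in this post, and the container names (pdpbsd0, pdpbsd1...) and port numbers are made up for the example.

```python
def docker_run_command(name, host_port,
                       image="jguillaumes/simh-pdpbsd", container_port=2323):
    """Build a 'docker run' command line mapping host_port to the simulator."""
    return [
        "docker", "run",
        "--name", name,
        # First number: port on the host. Second: port inside the container.
        "-p", f"{host_port}:{container_port}",
        "-it", image,
    ]


# Three containers, reachable through ports 2323, 2324 and 2325 on the host.
for index in range(3):
    print(" ".join(docker_run_command(f"pdpbsd{index}", 2323 + index)))
```

Each printed line can be pasted into a different terminal, since every container takes over the console it was started from.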

You should shut down the system properly (those old unices have filesystems that break easily if you simply stop the simulator) and you will be back at your host prompt:

# shutdown -h now
Shutdown at 06:11 (in 0 minutes) [pid 118]

        *** FINAL System shutdown message from root@dokbsd ***

System going down IMMEDIATELY

System shutdown time has arrived
May  9 06:11:01 dokbsd shutdown: halt by root: 
May  9 06:11:04 dokbsd syslogd: going down on signal 15
CAUTION: some process(es) wouldn't die
syncing disks... done
halting

HALT instruction, PC: 000014 (MOV #1,11616)
Goodbye
Eth: closed eth0
Log file closed
macjordie:~ jguillaumes$ 

Now you have a container in your Docker host. To run it again, do not use the docker run command, which would create a new container. Use the docker start command, followed by a docker attach.

macjordie:~ jguillaumes$ docker start pdpbsd
pdpbsd
macjordie:~ jguillaumes$ docker attach pdpbsd

: ra(0,0,0)unix
Boot: bootdev=02400 bootcsr=0172150

2.11 BSD UNIX #7: Thu Jun 8 21:53:04 PDT 1995

If you are not very fast, you will not see the ":" boot prompt (because at the attach time it will already be in the "console"). Just press enter and the boot will continue.

The simh-vaxbsd and simh-vaxnbsd images work just like this one. The BSD 4.3 boots straight into BSD, and the root is also passwordless. The NetBSD one asks for the boot device (just the first time it boots), and its root account has the password "manager". The other two images are a little bit different.

Non self-contained images

The other two images do not contain an OS image for the simulators, so you have to provide one. This can be done in two ways: 

  • Copy the image from your host to the container using docker cp
  • Mount the directory that contains your image into the /machines volume in the container, using the -v parameter of the docker run command as this: 
-v your_image_directory:/machines 

In any case, when you start the machine you will find yourself in a shell in the /machines subdirectory, where you can start the simulator you want, providing your own configuration file (the simh-vax image has some samples). The images contain the nano editor so you don't need to fight vi.

There is an annoying bug, probably caused by the Linux distribution I've based the images on (Alpine Linux), which does not use the regular C runtime library and somehow messes up the console input in simh. The simulator does not display any prompt, but you can type in commands and they work.

Building your own images

The Dockerfiles and additional material needed to build those images are available on GitHub. Feel free to clone/fork/reuse whatever you find there. The compressed images for BSD 2.11 and BSD 4.3 are in the repository. The one for NetBSD is not. Check the README.txt file in the vaxnbsd subdirectory for a link to download it (it weighs about 260MB). The README.md file contains details and instructions to build the Docker images, and suggestions to add your own OS images to build containerized systems. Please remember it is not allowed to distribute VMS under the hobbyist license, so keep any VMS image to yourself.

Further information and updates

The images mentioned in this post are available at Docker Hub. Just search for "jguillaumes" and you will find them. The descriptions contain information about their usage. Of course, feel free to contact me with comments, suggestions or bug reports!

Enjoy your containerized PDP-11 and VAXen!