Sunday, January 21, 2018

Some assembly required - Part one

During my professional career, I've had to write assembly code for three different platforms: the VAX, the IBM PC and the IBM mainframe, also known as zSeries. The only one I have come close to mastering is the mainframe; my experience with the other two is limited to occasional tinkering and writing the odd piece of glue code.

As a hobbyist, however, I've been lucky enough to write assembly code for other architectures. The PDP-11 is probably the one I have been most interested in, as you can see if you read past articles (and yes, someday I'll go back to my little MUXX project). I have also written some code for the old Commodore 64 or, which amounts to the same, the MOS 6502 processor. That makes a total of five platforms I've tinkered with at the assembly level, so I've had the chance to see how different some well-known, classic architectures are at the machine level.

In this post I'll present a working example written in assembly language. The example is quite simple: the well-known "reverser" program. It prompts the user to enter a string and outputs it reversed, and it keeps asking for more strings until an empty line is entered. Then it exits cleanly (if possible) and returns control to the operating system (if there is one). This program is, of course, trivial, but it's one step beyond a simple "echo" program and two steps beyond a "Hello world". It allows us to see how looping works on each platform, and how to do some basic input/output.

The examples are not meant to be... exemplary. I'm by no means an expert in each of the platforms, and there are surely more efficient and canonical ways to do this. The idea is to give a taste of each platform and an excuse to comment a little bit about each one of them. Since I'll cover eight different architectures, I will split this into several posts. This is just the first one of possibly three.

All the source code is on GitHub, and can be browsed or cloned from the repository https://github.com/jguillaumes/ancientbits, which also contains code for other posts in this blog.

Without further ado, let's see our first example. We'll begin with a well-known platform, also known as...

The IBM PC

The horror. The 8086 is probably one of the ugliest processors to program. Its segmented memory model has given headaches to thousands of programmers, and its non-orthogonal register set is nightmarish, to say the least. Yet it somehow achieved total dominance of the industry, crushing every competitor that crossed its path, with the sole exception of the ARM architecture, which specialized in low-power computing and won the mobile market.

The 8086 architecture uses a 16 bit word length and a 20 bit address length. To be able to address all the memory, it divides it into 64K segments which are selected using some special registers, namely CS (code segment), DS (data segment), ES (extra segment) and SS (stack segment). An assembly program has to keep track of the content of those registers, and initialize and change them accordingly. The processor contains four registers which are almost general purpose (AX to DX), two registers related to stack management and procedure call handling (SP and BP) and two more used as pointers for string-handling instructions (SI and DI). The architecture evolved to 32 bits with the 80386 processor and then to 64 bits, first introduced by AMD. Since this blog is basically about classical machines, we will stick to the original 16 bit 8086 (and its siblings like the 8088 used in the original IBM PC).
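
To make the segment arithmetic concrete, here is a minimal sketch in Python (used only as a neutral notation; it is not part of the reverser program, and the function name physical_address is invented for this illustration). The processor shifts the 16 bit segment value four bits to the left and adds the 16 bit offset, which gives a 20 bit address:

```python
def physical_address(segment: int, offset: int) -> int:
    """Model of 8086 real-mode address formation: segment * 16 + offset."""
    segment &= 0xFFFF          # segment registers are 16 bits wide
    offset &= 0xFFFF           # and so are offsets
    # The 8086 has only 20 address lines, so the sum wraps around at 1 MB.
    return ((segment << 4) + offset) & 0xFFFFF


# Two different segment:offset pairs can name the very same byte.
print(hex(physical_address(0x1234, 0x0010)))
print(hex(physical_address(0x1235, 0x0000)))
```

Both calls print the same address, which is one of the reasons segmented pointers were so awkward to compare and to do arithmetic on.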

[Figure: Intel 8086 register model]

The 8086 architecture uses a stack for interrupt handling and procedure calling. The usual ABIs also use the stack to pass the procedure parameters. Roughly speaking, AX is the only register which can be used in every arithmetic operation, CX is used for counting (as in counted loops) and BX holds base addresses. That description is not exact, and the programmer had to memorize (or keep the manual at hand to look up) which registers could be used with each instruction.

Under MS-DOS, the programmer used a "software interrupt" to invoke the operating system kernel. That "software interrupt" is what we call a "system trap" on other architectures. MS-DOS used interrupt number 21h (21 in hexadecimal), with the function number in AH and registers such as AL and DX carrying the parameters. Some of the system services were taken directly from a previous OS, CP/M (which we will also cover in this series of posts). Gradually, MS-DOS evolved to distance itself from its ancestor.

Let's see some code. This is the start of our little program:

main    proc
        mov    ax, seg hello       ; Setup data segment (DS and ES)
        mov    ds, ax
        mov    es, ax

        lea    dx, hello           ; Show welcome message
        mov    ah, 09h
        int    21h

What have we got here? The first instructions just set up the DS and ES registers so we can address the different parts of our program. This program fits in just one 64K segment, so we will just set DS and ES to the segment which contains one of our literals and forget about them. That is what was called the tiny memory model. There were other models: small, medium, large and huge, but we will ignore those in this example.

After setting the data and extra segments, this code contains a system call, which will display a welcome message. We load the address of the "hello" string into the DX register and then call the 09h function of the DOS kernel using int 21h. That service outputs text to the standard output (the screen, usually). The string has to be terminated by a "$" character (although the terminator can be changed using another system call).

The next interesting thing is the user input. This is the related code:

        lea     dx, buffin          ; Input text line
        mov     ah, 0ch             ; Clear STDIN buffer
        mov     al, 0ah             ; Buffered read DOS function
        int     21h
        mov     al, byte ptr [numch]; AL: Number of characters
        cmp     al, 0               ; Is the buffer empty?
        je      final               ; Yes: finish

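In this case, we use the 0ah DOS function to read a line of text (function 0ch first clears the keyboard buffer and then runs the function named in AL). The line is placed in a buffer which starts with two bytes: the maximum number of characters the buffer can hold, set by the program, and the number of characters actually read, filled in by DOS (numch in our code). The space for the text follows. DOS stops accepting characters when the declared maximum is reached, so the buffer is only as safe as that first byte: if it claims more room than has really been reserved, the input can overflow the buffer and overwrite whatever comes next, code included. A malicious user can use that kind of error to inject his own code into our computer and do nasty things. But, hey, when this was designed there was no internet full of bad guys trying to break into your machine!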
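
The buffer layout can be pictured with a small Python model. This is a sketch under stated assumptions, not DOS code: it assumes the usual layout for function 0ah (a capacity byte written by the program, a count byte written by DOS, then the text followed by a carriage return), and the name buffered_read is made up for the example.

```python
CR = 0x0D


def buffered_read(buffer: bytearray, typed: bytes) -> None:
    """Fill `buffer` the way DOS function 0Ah is assumed to do it.

    buffer[0]  : capacity, including the final CR (set by the caller)
    buffer[1]  : number of characters read, excluding the CR (set here)
    buffer[2:] : the characters themselves, followed by a CR
    """
    capacity = buffer[0]
    if capacity < 1 or len(buffer) < capacity + 2:
        raise ValueError("capacity byte does not match the reserved space")
    accepted = typed[:capacity - 1]          # extra keystrokes are refused
    buffer[1] = len(accepted)
    buffer[2:2 + len(accepted)] = accepted
    buffer[2 + len(accepted)] = CR


# An 8-byte text area: at most 7 characters plus the CR fit in it.
buffin = bytearray([8, 0]) + bytearray(8)
buffered_read(buffin, b"hello world")
print(buffin[1], bytes(buffin[2:2 + buffin[1]]))
```

The ValueError branch marks the dangerous case: a capacity byte larger than the space really reserved is what would let the input run past the end of the buffer.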

After reading the input, we check if the user entered an empty line and, in that case, branch to the end of the program.

Now we'll get into the proper reversing part of the program.

     lea     ax,bufchr           ; AX => Start of input buffer
     sub     cx,cx               ; CX => Zero
     mov     cl,byte ptr [numch] ; CX => Number of bytes in buffer
     add     ax,cx               ; AX => End of input text + 1
     dec     ax                  ; AX => End of input text
     mov     si,ax               ; SI => End of input text
     lea     di,bufout           ; DI => Start of output buffer

theloop:
     std                         ; change direction: decrement
     lodsb                       ; Load byte from DS:SI, decrement SI
     cld                         ; change direction: increment
     stosb                       ; Store byte at ES:DI, increment DI
     loop    theloop             ; Check CX and loop if not zero
     mov     cx,3                ; Prepare to move 3 more bytes (CR,LF,'$')
     lea     si,crlf             ; DS:SI => CR+LF+'$'
     rep movsb                   ; Move to ES:DI (append to output buffer)

The 8086 architecture implements instructions to move, load or store bytes (MOVSB, LODSB and STOSB respectively). LODSB loads the byte pointed to by DS:SI into AL, STOSB stores AL at ES:DI and MOVSB copies a byte from DS:SI to ES:DI; at the same time, each one increments or decrements the source or destination register it uses. The instructions STD and CLD select decrement or increment mode. So in this code, what we do is load SI with the address of the last character of the string and DI with the address of the first byte of the output buffer, and then load each byte after activating the decrement mode and store it after activating the increment mode. The LOOP instruction decrements the CX register and loops unless it is zero, so we also preload CX with the length of the string. And we are almost done. To complete the output string we add three more bytes (a carriage return, a line feed and the terminating "$") using the MOVSB instruction prefixed with REP, which tells the processor to repeat the instruction as many times as indicated in the CX register. And that is basically all.
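
For readers who do not speak 8086, the loop can be mimicked step by step in Python. This is only an illustrative model: the variable names si, di, cx and al stand in for the registers, the function name is invented, and memory is a single bytearray holding the input followed by the output area.

```python
def reverse_like_8086(text: bytes) -> bytes:
    """Mimic the STD/LODSB/CLD/STOSB/LOOP sequence on a flat byte array."""
    if not text:
        # The real program never gets here with an empty line; with CX = 0
        # the LOOP instruction would wrap around and run 65536 times.
        raise ValueError("empty input must be handled before the loop")
    crlf = b"\r\n$"
    memory = bytearray(text) + bytearray(len(text) + len(crlf))
    si = len(text) - 1            # SI => last character of the input
    di = len(text)                # DI => start of the output area
    cx = len(text)                # CX => number of characters
    while True:
        al = memory[si]           # std + lodsb: load, then SI goes down
        si -= 1
        memory[di] = al           # cld + stosb: store, then DI goes up
        di += 1
        cx = (cx - 1) & 0xFFFF    # loop: decrement CX ...
        if cx == 0:               # ... and fall through when it hits zero
            break
    memory[di:di + len(crlf)] = crlf   # rep movsb with CX = 3
    return bytes(memory[len(text):])


print(reverse_like_8086(b"hello"))
```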

To run this example you will need an MS-DOS environment. You can use a real MS-DOS, an MS-DOS session under Windows (I have not tested this!) or a DOS emulator like DOSBox, which is the option I used. You can get the Microsoft Macro Assembler for free (legally), but please note that the last version capable of running under MS-DOS is 6.11. Any version higher than that needs a Windows environment to run.

The PDP-11

I have already written several entries about the PDP-11 and its architecture, and I have confessed I love it. The PDP-11 architecture defines eight general purpose registers, and it is basically orthogonal. All the instructions can be applied to any register. Almost.

Two of the registers have specialised uses. R6 is used as a stack pointer. When the basic architecture was enhanced to support different privilege levels, R6 was "multiplied" by three so each execution mode (user, supervisor and kernel) has its own copy and thus can address its own stack. The R7 register is the program counter and contains the address of the next instruction to execute. As a programmer you can perform arithmetic on R7... but if you think a little bit you'll find that is the same as doing a relative (if you add or subtract from R7) or absolute (if you deposit a value into R7) branch. And, by the way, you can use any register (except R0) as a stack pointer, but you won't get the three execution mode-specific instances!



The PDP-11 is a 16 bit machine, and so its address space is just 64K bytes long. Even in its time, 64K was considered too small, so the machine was enhanced to be able to physically address up to 4 megabytes. The program (or the operating system) configures a memory management unit to establish which 64K of those 4M can be accessed at each moment. The latest versions allowed a program to address 64K of code and 64K of data (the instruction/data space split). But that was as far as the architecture could be stretched, so DEC designed the VAX to extend and eventually replace the PDP-11.
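
As a rough picture of what the memory management unit does, here is a simplified Python sketch. It models only the address relocation of a 22 bit machine: the 64K virtual space is split into eight 8K pages, and each page has a Page Address Register (PAR) holding its physical base in units of 64 bytes. Page length checks, access protection, the separate instruction/data spaces and the per-mode register sets are all left out, and the names translate and par are invented for the example.

```python
PAGE_OFFSET_BITS = 13          # each of the 8 pages spans 8 KB
BLOCK_SIZE = 64                # PAR values count 64-byte blocks


def translate(virtual: int, par: list) -> int:
    """Map a 16-bit virtual address to a 22-bit physical address."""
    virtual &= 0xFFFF
    page = virtual >> PAGE_OFFSET_BITS                      # top 3 bits: 0..7
    displacement = virtual & ((1 << PAGE_OFFSET_BITS) - 1)  # low 13 bits
    return (par[page] * BLOCK_SIZE + displacement) & 0x3FFFFF


# Hypothetical setup: page 1 is mapped at physical address 0o100000.
par = [0] * 8
par[1] = 0o1000
print(oct(translate(0o020100, par)))

# Remapping the page makes the same virtual address reach other memory.
par[1] = 0o2000
print(oct(translate(0o020100, par)))
```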

There were plenty of operating systems available for the PDP-11, some of them written by DEC. The PDP-11 was the platform where UNIX got into its adolescence (it was born on the PDP-7, but grew up on the PDP-11). For this example, I've chosen one of the DEC operating systems, namely RT-11. The RT in that name means, literally, "Real Time".

RT-11 is a single user, multitasking operating system. Its user interface comes from the TOPS-10 heritage, and hence it is part of the inspiration for CP/M which, in turn, is the base for MS-DOS. An MS-DOS user can feel quite comfortable typing commands in an RT-11 system: file names follow the familiar name-plus-extension pattern (six characters plus three in RT-11, rather than the 8+3 of MS-DOS), and a lot of commands will be familiar: DIRECTORY, DELETE, TYPE, to name a few. Our example code will use RT-11 system calls to get and put strings on the terminal.

Let's go and see some code. This is the beginning of our code:

START: 
        BIS     #TTLC$,$JSW ; Allow lowercase characters

        MOV     #HELLO,R1 ; Display welcome string
        MOV     #LHELLO,R2
        JSR     PC,LINOUT
        JSR     PC,LBREAK

The first instruction just sets a flag in the Job Status Word, which is located at the address 000044 octal (in the PDP-11 world it's customary to use octal, even though hexadecimal would fit its 8 bit bytes and 16 bit words more naturally). Specifically, this flag allows lowercase characters in the input (it prevents the OS from automatically uppercasing them). After that, we have two subroutine calls which display the welcome message and a line break. We build our own string-output routine for counted strings (RT-11 also offers .PRINT, which handles terminated strings) on top of the basic "put byte" service:

;------------------------------------------------------------------------
; LINOUT: Display a text line
;------------------------------------------------------------------------
;  Subroutine to display a text line
; R1: @Text
; R2: Size
; R1 and R2 are destroyed
;
LINOUT: CMP     R2,#0  ; Exit when no more chars to display
        BLE     20$  ;
10$:    .TTYOUT (R1)+  ; Display character pointed by R1
        SOB     R2,10$  ; Decrement counter
20$:    RTS     PC 

Notice we specify the PC register both in the call (JSR) and return (RTS) instructions. In a JSR, the specified linkage register is pushed onto the stack, the return address (the current program counter) is stored in that register and control jumps to the subroutine. On return, RTS jumps to the address held in the register and then pops the register's old value from the stack. If, as in this case, the register is PC itself, the net effect is simply that JSR pushes the return address onto the stack and RTS pops it back into PC; no register other than PC (R7) and the stack pointer is modified. When we take a look at the mainframe architecture we will see a similar linkage mechanism (although the mainframe does not really use a stack).
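
The linkage mechanism is easier to follow when written out as a toy model. The Python sketch below is an illustration only: the class name is invented, a plain list stands in for the stack memory, and the addresses are made up. It assumes, as on the real machine, that R7 already points to the instruction after the JSR when the call is executed.

```python
class LinkageModel:
    """Toy model of PDP-11 JSR/RTS; registers are list slots, R7 is the PC."""

    def __init__(self) -> None:
        self.r = [0] * 8
        self.stack = []                  # stands in for the memory below SP

    def jsr(self, reg: int, target: int) -> None:
        self.stack.append(self.r[reg])   # push the linkage register
        self.r[reg] = self.r[7]          # linkage register <- return address
        self.r[7] = target               # jump to the subroutine

    def rts(self, reg: int) -> None:
        self.r[7] = self.r[reg]          # jump back through the register
        self.r[reg] = self.stack.pop()   # restore its previous contents


cpu = LinkageModel()
cpu.r[7] = 0o1004            # address of the instruction after the JSR
cpu.jsr(7, 0o2000)           # JSR PC,LINOUT (0o2000 is a made-up address)
print([oct(word) for word in cpu.stack])   # only the return address
cpu.rts(7)                   # RTS PC
print(oct(cpu.r[7]))
```

Calling jsr(5, ...) instead shows the general case: the old R5 goes to the stack and R5 carries the return address, which is handy when the subroutine wants to read inline arguments placed right after the call.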

The subroutine invokes the "macro" .TTYOUT, a system call which outputs a character (pointed to by R1) to the console. It uses the (R1)+ construct, which is post-incremented indirect addressing. It means "take whatever is in the memory position pointed at by R1 and increment R1 afterwards". So we proceed through all the string characters, one by one. We use a counted string, with its size in R2. The SOB instruction ("Subtract One and Branch") subtracts one from the specified register (in this case, R2) and jumps to the target label unless the register has reached zero. This is one way to implement counted loops on the PDP-11.
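
Here is the same routine spelled out in Python, as a sketch only: r1 and r2 play the role of the registers, memory is a bytes object indexed by address, and the returned bytes stand in for what .TTYOUT would send to the console.

```python
def linout(memory: bytes, r1: int, r2: int) -> bytes:
    """Model of LINOUT: emit r2 bytes starting at address r1."""
    console = bytearray()
    if r2 <= 0:                      # CMP R2,#0 / BLE 20$
        return bytes(console)
    while True:
        console.append(memory[r1])   # .TTYOUT (R1)+ : use R1 first ...
        r1 += 1                      # ... and increment it afterwards
        r2 -= 1                      # SOB: subtract one ...
        if r2 == 0:                  # ... and branch back unless zero
            break
    return bytes(console)


print(linout(b"Hello, world", 0, 5))
```

The initial test on R2 matters because SOB decrements before it checks: entering the loop with a zero count would not stop at the first pass.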

We have to program the line input routine too. Since it's just used once, it has not been implemented as a subroutine, but as inline code. Let's take a look at it:

GETLIN: MOV     #PROMPT,R1  ; Display prompt
        MOV     #LPROMPT,R2
        JSR     PC,LINOUT

        MOV     #BUFFER,R1  ; R1 => @Buffer
        CLR     R2          ; R2 => Number of read characters
GETCHAR: 
        .TTYIN              ; Read character into R0
        CMPB    R0,#^X0D    ; Is it a CR?
        BEQ     GETLF       ; Yes, consume LF
        MOVB    R0,(R1)+    ; No, store character in buffer
        INC     R2          ; Increment char counter
        CMP     R2,#LBUFFER ; Full?
        BGE     FULL        ; Yes: finish (BGE stops at LBUFFER chars)
        BR      GETCHAR     ; No: get more characters
GETLF:
        .TTYIN              ; Consume LF
FULL:                       ; Full buffer, let's finish here

        CMP     R2,#0       ; Is the buffer empty?
        BEQ     BYE         ; Yes: finish

The .TTYIN macro reads a character from the terminal and returns it in R0. We compare it with carriage return (0x0D), which is our end of line marker. If the character is not a carriage return, we store it into our buffer, pointed to by R1 (which is autoincremented), and count it in R2. We check R2 against the buffer size to avoid buffer overruns. If the character is a carriage return, the next one will be a line feed, which we can safely ignore. If the line is empty, we branch to the end of the program.
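
The control flow of this input loop, modeled in Python. It is a sketch with assumptions: keys stands in for the stream of characters .TTYIN would deliver, capacity plays the role of LBUFFER, and the loop stops as soon as the count reaches the capacity, which is one reading of how LBUFFER is meant to be used.

```python
CR = 0x0D


def get_line(keys: bytes, capacity: int) -> bytes:
    """Collect characters until a CR arrives or the buffer is full."""
    source = iter(keys)
    buffer = bytearray()
    for ch in source:                 # .TTYIN: next character into R0
        if ch == CR:                  # CMPB R0,#^X0D / BEQ GETLF
            next(source, None)        # second .TTYIN: swallow the LF
            break
        buffer.append(ch)             # MOVB R0,(R1)+ and INC R2
        if len(buffer) >= capacity:   # buffer full: stop reading
            break
    return bytes(buffer)


print(get_line(b"abc\r\nignored", 80))
print(get_line(b"abcdef\r\n", 4))
```

An empty result corresponds to the case where the program branches to BYE.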

We can take a look now into the actual reversing code:

        MOV     R2,-(SP)    ; Save string length into stack
 
        MOV     #INVER,R1   ; Display "Inverted..."
        MOV     #LINVER,R2
        JSR     PC,LINOUT

        MOV     (SP)+,R2    ; Restore string length from stack

        MOV     #BUFFER,R1  ; R1 => @Buffer
        ADD     R2,R1       ; R1 => @End of string
OUTR:   .TTYOUT -(R1)       ; Display character pointed by R1
        SOB     R2,OUTR     ; Loop w/decrement pointer and counter
        JSR     PC,LBREAK   ; Add line break 
        BR      GETLIN      ; Get next line

The first instruction is a stack push. We use the SP register (R6) as a stack pointer, but since the PDP-11 is quite orthogonal, we could have set up our own stack pointed to by any other register. We use the stack to temporarily save R2, since we are going to invoke our string-output routine, which overwrites it (in a small program like this one we could have used another register, but this way we can see an example of stack push and pop). After printing the string, we pop R2, so we have the length of the string to reverse again.

Now the rest is quite easy. We set up R1 to point just past the last character of the string and output the characters one by one, invoking the .TTYOUT macro with a predecremented R1. This works in a similar way to the postincrement, except that here the register is decremented before it's used to get the character to print.
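
A Python rendering of the output loop, again as a sketch only: r1 and r2 stand in for the registers, buffer is the address where the text starts, and the returned bytes represent what reaches the console.

```python
def reverse_out(memory: bytes, buffer: int, length: int) -> bytes:
    """Model of the OUTR loop: print `length` bytes backwards."""
    if length <= 0:
        raise ValueError("the program exits before this loop on empty input")
    console = bytearray()
    r1 = buffer + length            # ADD R2,R1: one past the last character
    r2 = length
    while True:
        r1 -= 1                     # -(R1): decrement first ...
        console.append(memory[r1])  # ... then use the pointer
        r2 -= 1                     # SOB R2,OUTR
        if r2 == 0:
            break
    return bytes(console)


print(reverse_out(b"..hello..", 2, 5))
```

Because the decrement happens before the access, starting one byte past the end is exactly right: the first character fetched is the last one of the string.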

And that's basically all. The PDP-11 instruction set and architecture are really nice and easy to learn and understand. For that reason it was used for many years to teach computer architecture and assembly programming to CS students.

To run this example you need a PDP-11 simulator like SIMH and an RT-11 image. RT-11 comes with the assembler, and you can use KERMIT or plain cut and paste to bring the program into the simulator.

Closing words

This is the end of the first post. I hope you've got an idea of how the 8086 and the PDP-11 "look" from the point of view of an assembly language programmer. The next post will be about small machines and will cover the 6502 as used in the Commodore 64 and the Z80 as used in CP/M capable machines. Maybe I will add some other small machine. We will see.