One way to think about assemblers is to imagine that they compute the value of a series of expressions assigned to sequentially increasing memory locations. The expressions may conventionally consist of the value of a symbol, some arithmetic done on symbols, constants, and special variables such as the “current location counter” (often written with a funny name such as “$”), or truly peculiar expressions whose syntax is that of machine instructions.
Note that an expression may produce a value that fills several sequential memory locations; machine instructions tend to do this, but it is useful to have expressions for string literals, multiprecision numbers, initialized structs, etc. This only affects bookkeeping details but doesn’t change what assemblers do in the abstract.
To compute the final value of each expression, the assembler must know the value of any symbols that might be involved. It discovers symbol values in only a few ways. First, the symbol value might be defined as the result of an expression. Second, the symbol value might be assigned the value of the current location counter; typically assemblers do this when a symbol is written in “label” position. On such discovery, the assembler records the symbol name and its value in the symbol table to use in evaluating expressions.
A key problem an assembler faces is producing the value of an expression, having not yet encountered all the definitions of symbols. The assumption is that if a symbol is not defined in some particular line, it will be defined in some later line that the assembler will eventually process.
A two pass assembler tries to compute the value of each expression as it encounters it, in two passes called “first” and “second” passes. During first pass, if there are undefined symbols (presumed to be forward references) in the expression, the assembler simply substitutes a dummy value (often zero); in any case, it computes a value for the expression. If a machine instruction or data constant is being processed, the results are ignored, but the size is used to advance the location counter to enable label value assignment. If a label is encountered, its value is set to the current location counter. If a symbol assignment “A EQU ” is encountered, the symbol value is set to the result of the expression; if the expression contained an undefined symbol, the assembler will emit an error. If an origin statement is found “ORG “, it is treated as if one wrote “$ EQU “. At the end of the first pass, all labels have been assigned values; any symbols that do not have values are marked as “undefined” in the symbol table. A second pass repeats the expression evaluation of the first but does not (re)define any symbols; since all symbols are (expected to be) defined, the expression values are correct and emitted to the output stream. Any undefined symbols found in an expression cause an “undefined symbol” complaint.
A one-pass assembler tries to compute the value of each expression as it encounters the expression. If the expression contains only defined symbols, the assembler can evaluate it and produce the final value, and write that information to its output stream. (Another answer here suggested that some one pass assemblers write their answer to memory. That’s just a special case). If the expression contains an undefined symbol, the assembler stores a pair (location,expression) to be reprocessed later, either when the symbol becomes defined, or at end of assembly. Some expressions such as those that set the location counter can’t have undefined symbols; the assembler will complain in that case.
So the tricky part is storing the unresolved expression, and deciding when to re-evaluate it. One way to store the expression is to simply keep the text; another is to build what amounts to a (reverse) Polish notation for the expression. To determine when the expression needs to be re-evaluated, one can associate it with the undefined symbols it contains; then when a symbol gets defined, the corresponding unresolved expressions are re-evaluated, with completed ones being emitted, and unresolved ones left again for reprocessing. Alternatively the assembler could simply save all the expressions until it encounters the end of input; at this point, all symbols should be defined and so it should be able to determine final values for each expression. One chooses between these two techniques based on how much memory one can afford to store forward reference expressions.
In a previous century, I built a one-pass assembler that ran on an 8k byte computer, that used the Polish representation of expressions. As symbols were defined, the Polish expression was evaluated and any subexpressions that were computable were computed, simplifying the resulting Polish either to a final value or a smaller Polish expression involving only the operators on undefined symbols. Symbol table entries for undefined values had a linked list of all the Polish expression slots corresponding to undefined symbols; as symbol definitions were encountered, all elements of the linked-list were updated and Polish expressions were re-evaluated as that occurred. This keeps the Polish expression sizes as small as possible and gets rid of them the moment all of their symbols are defined. This assembler processed hundred thousand line programs just fine in small machine. The reason for doing a one pass assembler in such a small machine is the source code came from paper tape (a Teletype, for those of you old enough to remember) and reading that paper tape even once is pretty painful and slow; a second time was not a good idea, so a two pass assembler was not an appropriate choice.
One of my cohorts much later built an interesting two pass assembler. Rather than processing the text twice, he tokenized the text (storing it in memory) on the first pass as well as collecting symbol values. Pass two processed the tokenized text. This was a very fast assembler for two passes. He had a lot more memory available.
Originally posted 2013-11-09 23:22:29.