Detokenizers for Microsoft BASICs

In the 1970s and 1980s, computer memory was expensive. The 8-bit microcomputers came with between 1 kilobyte of RAM (the Sinclair ZX81) and about 64 kilobytes (the Commodore 64, the Atari 800 XL or the Amstrad CPC 464), which was the maximum addressable by an 8-bit microprocessor with a 16-bit address bus. The 16-bit microcomputers had from about 512 kilobytes (the Atari 520 ST or the Amstrad PC1512) to 640 kilobytes (most IBM PC compatibles). A typical double-density 5¼-inch floppy disk stored about 360 kilobytes of data. With so little memory, storing BASIC programs as plain text in RAM or on floppies would have exhausted the available space very quickly. The solution was to replace the BASIC keywords in the program with single-byte codes called tokens (or sometimes opcodes).

Nowadays, having about a gigabyte of memory on a laptop computer is common, and tokenizing interpreted languages to save memory is no longer necessary. However, old BASIC programs saved on floppies are generally in the tokenized format, which makes them unreadable without the BASIC interpreter that created them in the first place. One solution is to run the original BASIC interpreter in an emulator, but this is often impossible without the original ROMs of the emulated machine. So, even if the BASIC program files on the floppies can be copied to a Unix, MacOS or Windows filesystem, listing them requires a detokenizer that turns the tokens back into the original BASIC keywords.

Unfortunately, there are many dialects of BASIC, and there is no single way to tokenize BASIC programs. As a result, there is a wide variety of tokenized BASIC file formats (usually at least one per line of 8-bit computers from each computer maker of the 70s and 80s). Somewhat fortunately, many computer makers used some version of Microsoft BASIC on their computers. Although the different Microsoft BASICs can use different codes for the same keywords, they have some codes in common, and they often share the same internal representation for BASIC program lines and numerical constants. The changes needed to turn a detokenizer for one flavor of Microsoft BASIC into a detokenizer for a different flavor are often minimal.

The package msbasicascii is a set of seven detokenizers for GW-Basic, MSX-Basic, MBASIC, Dragon 32 BASIC and the TRS-80 Basics (Level II, Model 4, and CoCo), based on the original detokenizer for GW-Basic written by Christian A. Ratcliff. It has been compiled on Linux with gcc and GNU make and should be portable to Unix/*BSD/MacOSX. On Windows, it may be necessary to find a replacement for the curses or termcap libraries. In any case, the program is a text-mode application that runs in a terminal emulator window. It has been tested with all of these BASICs except Dragon 32 BASIC, and has given satisfactory results. Some untested programs for Ohio Scientific, Nascom, Compucolor and Exidy Sorcerer are also included.

The various Microsoft BASICs
BASIC interpreter                        Double opcodes   Arithmetic
TRS-80 Basic Level II                    No               Floating Point
Exidy Sorcerer ROM-PAC Basic             No               Floating Point
Nascom Basic                             No               Floating Point
Ohio Scientific Basic                    No               Floating Point
Microsoft BASIC (MBASIC) for CP/M 80     Yes              Floating Point
Dragon 32/64                             Yes              Floating Point (?)
TRS-80 Color Computer                    Yes              Floating Point (?)
MSX-Basic                                Yes              Binary Coded Decimal
GW-Basic/BASICA                          Yes (3 sorts)    Floating Point

In all the Microsoft BASICs, the tokenized program file begins with 0xff. Each line begins with two bytes that indicate the address of the next line, and two other bytes that contain the line number. Tokens of BASIC keywords, numerical constants and variable names follow. The line is terminated by a byte set to zero.
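The layout just described can be walked with a few lines of Python. This is only a sketch under stated assumptions, not the code of msbasicascii: the function name iter_lines is ours, we assume that a next-line address of zero marks the end of the program, and we assume that no zero byte occurs inside a line body. That last assumption fails for lines containing binary-encoded constants with a 0x00 byte, which a real detokenizer has to skip explicitly.

```python
import struct

def iter_lines(data: bytes):
    """Yield (line_number, body) pairs from a tokenized program image."""
    if not data or data[0] != 0xFF:
        raise ValueError("first byte is not 0xff")
    pos = 1
    while pos + 4 <= len(data):
        # Two little-endian 16-bit words: next-line address, then line number.
        next_addr, line_no = struct.unpack_from("<HH", data, pos)
        if next_addr == 0:
            break  # assumption: a zero link ends the program
        # Simplification: take the first zero byte as the line terminator.
        end = data.find(0, pos + 4)
        if end < 0:
            break  # truncated file
        yield line_no, data[pos + 4:end]
        pos = end + 1

# One line (number 125) followed by a zero link, prefixed by the 0xff marker.
example = bytes.fromhex("ff cb 62 7d 00 82 20 49 f0 12 20 ce 20 15 00 00 00")
for number, body in iter_lines(example):
    print(number, body.hex())
```

The next-line address is read but otherwise ignored, since it refers to a memory location on the original machine rather than to an offset in the file.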

Let us give one example (for MBASIC on CP/M-80). The first line below is a BASIC line; the second is its tokenized form in hexadecimal notation (obtained with hexdump -C).

    125 FOR I = 1 TO 4
    cb 62 7d 00 82 20 49 f0 12 20 ce 20 15 00

The first two bytes contain the address of the next line, so we can ignore them. The next two bytes are 0x7d, which is 7*16+13=125 in decimal, and 0x00. In general, the line number is equal to (second byte)*256 + (first byte). It is followed by the token 0x82, which corresponds to the BASIC keyword FOR; a space (ASCII code 0x20); the variable name I (ASCII code 0x49); the token 0xf0, which corresponds to the BASIC keyword =; 0x12, which encodes the single-byte constant 1; a space; 0xce, which corresponds to the BASIC keyword TO; another space; and 0x15, which encodes the single-byte constant 4. The final 0x00 terminates the line.
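The decoding of this example can be reproduced with a short Python sketch. The token table below contains only the three tokens named in the text, so it is far from a complete MBASIC table. The rule that bytes 0x11 to 0x1a encode the constants 0 to 9 is our extrapolation from the two values given above (0x12 for 1 and 0x15 for 4), and the names TOKENS and detokenize_line are hypothetical.

```python
# Only the three tokens mentioned in the text; a real table is much larger.
TOKENS = {0x82: "FOR", 0xCE: "TO", 0xF0: "="}

def detokenize_line(raw: bytes) -> str:
    """Turn one tokenized line (link, line number, body, 0x00) into text."""
    line_no = raw[2] + 256 * raw[3]  # (first byte) + (second byte)*256
    out = []
    for b in raw[4:]:
        if b == 0x00:                # end of line
            break
        if b in TOKENS:
            out.append(TOKENS[b])
        elif 0x11 <= b <= 0x1A:      # assumption: single-byte constants 0..9
            out.append(str(b - 0x11))
        elif 0x20 <= b < 0x7F:       # printable ASCII is stored as is
            out.append(chr(b))
        else:
            out.append(f"<{b:02x}>") # unknown byte, shown in hexadecimal
    return f"{line_no} " + "".join(out)

raw = bytes.fromhex("cb 62 7d 00 82 20 49 f0 12 20 ce 20 15 00")
print(detokenize_line(raw))  # prints: 125 FOR I=1 TO 4
```

The output has no spaces around the = sign because the byte sequence contains no 0x20 on either side of the token 0xf0; the sketch prints exactly what is stored.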

BASICA/GWBasic

BASICA was the Basic interpreter of the IBM PC. GW-Basic was used on IBM PC compatibles and MS-DOS compatible computers. As explained on the page GW-BASIC tokenised program format, it uses double tokens (opcodes) to represent some BASIC keywords. This means that, for instance, the keyword CVI is represented by the sequence of two bytes 0xfd 0x81 (in hexadecimal). It has the most complicated double opcode sequences, since 0xfd, 0xfe, and 0xff can all appear as the first byte.
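Handling double opcodes amounts to looking one byte ahead whenever a prefix byte is met. The following Python sketch illustrates the idea; the table holds only the CVI pair quoted above, the names PREFIXES, DOUBLE_TOKENS and decode_keywords are hypothetical, and single-byte tokens and numeric constants are deliberately left out.

```python
PREFIXES = (0xFD, 0xFE, 0xFF)

# Only the pair quoted in the text; the real tables are much larger.
DOUBLE_TOKENS = {(0xFD, 0x81): "CVI"}

def decode_keywords(body: bytes) -> list:
    """Split a line body into keywords and characters, pairing prefix bytes."""
    out = []
    i = 0
    while i < len(body):
        b = body[i]
        if b in PREFIXES and i + 1 < len(body):
            # A prefix byte and the byte after it form one double opcode.
            key = (b, body[i + 1])
            out.append(DOUBLE_TOKENS.get(key, f"<{b:02x} {body[i + 1]:02x}>"))
            i += 2
        else:
            out.append(chr(b) if 0x20 <= b < 0x7F else f"<{b:02x}>")
            i += 1
    return out

print(decode_keywords(bytes([0x41, 0xFD, 0x81, 0x28])))
```

Unknown pairs are kept as a hexadecimal placeholder rather than dropped, which makes gaps in the token table easy to spot in the listing.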

GW-Basic contains all the different representations of numeric constants used in Microsoft BASICs. It uses the floating point representation for real numbers in both single precision (4 bytes) and double precision (8 bytes).
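The byte layout of these constants is not described on this page. As a sketch, and assuming the single precision constants follow the Microsoft Binary Format as it is commonly documented (three little-endian mantissa bytes with the sign in the top bit of the third one, followed by an exponent byte with a bias of 128 and an implied leading 1 bit), four bytes can be converted in Python as follows. The function name is ours, and the result should be checked against real program files.

```python
def mbf_single_to_float(raw: bytes) -> float:
    """Convert 4 bytes, assumed to be in Microsoft Binary Format, to a float."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    m0, m1, m2, exponent = raw
    if exponent == 0:
        return 0.0                   # a zero exponent byte encodes zero
    sign = -1.0 if m2 & 0x80 else 1.0
    # Replace the sign bit by the implied leading 1 of the 24-bit mantissa.
    mantissa = ((m2 | 0x80) << 16) | (m1 << 8) | m0
    # The mantissa is a fraction in [0.5, 1): divide by 2**24, bias is 128.
    return sign * mantissa * 2.0 ** (exponent - 128 - 24)

print(mbf_single_to_float(bytes.fromhex("00 00 00 81")))
print(mbf_single_to_float(bytes.fromhex("00 00 20 84")))
```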

MSX-Basic

MSX was a standard for home computers based on the Zilog Z80 8-bit microprocessor, originally developed in Japan by ASCII Corporation in the 1980s and adopted by Japanese and South Korean home computer makers, as well as by Philips in the Netherlands. A Basic interpreter was developed for MSX computers by Microsoft. Unlike GW-BASIC, MSX-Basic uses a Binary Coded Decimal representation for real numerical constants. In Binary Coded Decimal, two digits of a decimal number are stored in a single byte, so that, for instance, the two bytes 0x12 0x34 (1*16+2=18 and 3*16+4=52 when read as ordinary binary values) represent the decimal digits 12 34. The advantage over a binary floating point representation is that quantities such as 1/5=0.2 can be represented exactly in Binary Coded Decimal, while in binary floating point the value has to be rounded. This can be useful in financial computations. However, this representation is less economical in memory and less efficient in terms of speed of calculation. MSX-Basic uses both single and double opcodes; however, double opcodes always start with 0xff. The list of opcodes is available from the MSX2 Technical manual.
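The digit packing itself is easy to undo. The Python sketch below only unpacks the two digits stored in each byte; it does not cover the sign or exponent of an MSX-Basic constant, whose layout is not described here, and the function name bcd_digits is ours.

```python
def bcd_digits(raw: bytes) -> str:
    """Return the decimal digits packed two per byte in `raw`."""
    digits = []
    for b in raw:
        high, low = b >> 4, b & 0x0F   # one decimal digit per nibble
        if high > 9 or low > 9:
            raise ValueError(f"0x{b:02x} is not a valid BCD byte")
        digits.append(f"{high}{low}")
    return "".join(digits)

print(bcd_digits(bytes([0x12, 0x34])))  # prints: 1234
```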

Microsoft BASIC for CP/M

CP/M was an operating system for 8-bit microcomputers based on the Intel 8080, Zilog Z80 and Intel 8085 chips. It was developed by Digital Research in the 1970s, and was used mainly on professional computers. Microsoft developed a BASIC interpreter for CP/M, MBASIC. Like GW-BASIC, this Basic uses a binary floating point representation for real numbers, but its double opcodes always start with 0xff.

TRS-80 Basic Level II

The TRS-80 was a popular computer manufactured by Tandy Radio Shack in the late 1970s and the 1980s. The Basic Level II is a Microsoft BASIC that does not use double opcodes, and represents real numbers in binary floating point. The opcodes are available from Ira Goldklang's site. See Tokenized BASIC.

TRS-80 Model 4 Basic

The TRS-80 Model 4 was a successor of the Model I. The Model 4 Basic differs from the Level II Basic in its use of double opcodes. The opcodes are also available from Ira Goldklang's site. See Tokenized BASIC.

Miscellaneous

The Ohio Scientific 8K Basic in ROM, NASCOM Basic and the Exidy Sorcerer ROM-PAC Basic are similar to the TRS-80 Basic Level II. The tokens for the Ohio Scientific have been published in MICRO n°15, p. 20 (1979) and Compute n°2, p. 121 (1980). In the Sorcerer's Apprentice vol. 3, n° 4, p. 77 (1981) a table of correspondence between TRS-80 tokens and Sorcerer tokens is available. However, this table contains some errors and is incomplete (LOG is missing). In Micropower vol. 1 n°3 p. 18 (1981) a similar table appears for NASCOM Basic and Crystal Basic.