In the 1970s and the 1980s, computer memory was expensive. The 8-bit microcomputers came with RAM ranging from 1 kilobyte (the Sinclair ZX-81) to about 64 kilobytes (the Commodore 64, Atari 800 XL or the Amstrad CPC 464), which was the maximum addressable by an 8-bit microprocessor with a 16-bit address range. On the 16-bit microcomputers the memory ranged from about 512 kilobytes (Atari 520 ST or the Amstrad PC1512) to 640 kilobytes (most IBM PC compatibles). A typical double-density 5¼-inch floppy disk would store about 360 kilobytes of data. With so little memory, storing BASIC programs as plain text in the computer RAM or on floppies would have exhausted the available space very quickly. The solution to that problem was to replace the BASIC keywords in the program with single-byte codes called tokens (or sometimes opcodes). Nowadays, having about a gigabyte of memory on a laptop computer is common, and tokenizing interpreted languages to save memory is not necessary. However, old BASIC programs saved on floppies are generally in the tokenized format, which makes them unreadable without the BASIC interpreter that created them in the first place. One solution is to run the original BASIC interpreter in an emulator, but this is often impossible without the original ROMs of the emulated machine. So, even if the BASIC program files on the floppies can be copied to a Unix, MacOS or Windows filesystem, listing them requires a detokenizer that turns the tokens back into the original BASIC keywords.
There are unfortunately many dialects of BASIC, and there is no single way to tokenize BASIC programs. As a result there is a wide variety of tokenized BASIC file formats (usually at least one per line of 8-bit computers from each computer maker of the 1970s and 1980s). Somewhat fortunately, many computer makers used some version of Microsoft BASIC on their computers. Although the different Microsoft BASICs can use different codes for the same keywords, they have some codes in common, and often share the same internal representation for BASIC program lines and numerical constants. The changes needed to turn a detokenizer for one flavor of Microsoft BASIC into a detokenizer for a different flavor are often minimal.
The package msbasicascii is a set of seven detokenizers for GW-Basic, MSX-Basic, MBASIC, Dragon 32 BASIC and TRS-80 Basics (Level II, Model 4, and CoCo), based on the original detokenizer for GW-Basic written by Christian A. Ratcliff. It has been compiled on Linux with gcc and GNU make and should be portable to Unix/*BSD/MacOSX. On Windows, it may be necessary to find a replacement for the curses or termcap libraries. In any case, the program is a text-mode application that runs in a terminal emulator window. It has been tested with all of these BASICs except Dragon 32 BASIC, and has given satisfactory results. Some untested programs for Ohio Scientific, Nascom, Compucolor and Exidy Sorcerer are also included.
BASIC interpreter | Double opcodes | Arithmetic |
---|---|---|
TRS-80 Basic Level II | No | Floating Point |
Exidy Sorcerer ROM-PAC Basic | No | Floating Point |
Nascom Basic | No | Floating Point |
Ohio Scientific Basic | No | Floating Point |
Microsoft BASIC (MBASIC) for CP/M 80 | Yes | Floating Point |
Dragon 32/64 | Yes | Floating Point (?) |
TRS-80 Color Computer | Yes | Floating Point (?) |
MSX-Basic | Yes | Binary Coded Decimal |
GW-Basic/BASICA | Yes (3 sorts) | Floating Point |
In all the Microsoft BASICs, the tokenized program file begins with the byte 0xff. Each line begins with two bytes that give the address of the next line, followed by two bytes that contain the line number. The tokens of the BASIC keywords, the numerical constants and the variable names follow. The line is terminated by a byte set to zero.
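The layout described above can be sketched in Python as follows. This is only a minimal illustration, not the code of the package: the function name `split_lines` is hypothetical, the two-byte fields are assumed to be little-endian (as on Z80 and 8086 machines), and a next-line address of zero is assumed to mark the end of the program. The sketch also scans naively for the terminating zero byte, so it would split a line too early if a binary numerical constant inside it contained a zero byte.

```python
import struct


def split_lines(data: bytes):
    """Yield (line_number, body_bytes) for each line of a tokenized program.

    Assumptions (not confirmed by the text): little-endian 16-bit fields,
    and a next-line address of 0 marking the end of the program.
    """
    if not data or data[0] != 0xFF:
        raise ValueError("file does not begin with 0xff")
    pos = 1
    while pos + 4 <= len(data):
        # Two bytes: address of the next line; two bytes: line number.
        next_addr, line_no = struct.unpack_from("<HH", data, pos)
        if next_addr == 0:  # assumed end-of-program marker
            break
        pos += 4
        # Naive scan: the body runs up to the first zero byte.
        end = data.find(b"\x00", pos)
        if end == -1:
            raise ValueError("line is not terminated by a zero byte")
        yield line_no, data[pos:end]
        pos = end + 1
```

For instance, `list(split_lines(data))` on a small file gives one `(line_number, body)` pair per BASIC line; the body still has to be detokenized byte by byte.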
Let us give one example (for MBASIC on CP/M-80). The first line is a BASIC line, the second line its tokenized form in hexadecimal notation (obtained with hexdump -C).
BASICA was the BASIC interpreter of the IBM PC. GW-Basic was used on IBM PC compatible and MS-DOS compatible computers. As explained on the page GW-BASIC tokenised program format, it uses double tokens (opcodes) to represent some BASIC keywords. This means, for instance, that the keyword CVI is represented by the two-byte sequence 0xfd 0x81 (in hexadecimal). GW-Basic has the most complicated double-opcode sequences, since 0xfd, 0xfe and 0xff can each appear as the first byte.
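To illustrate how double opcodes can be handled, here is a minimal Python sketch. It is not taken from the package: the names `PREFIXES`, `DOUBLE_TOKENS` and `next_keyword` are hypothetical, the prefix bytes are assumed to be 0xfd, 0xfe and 0xff, and the table contains only the CVI entry quoted above. A real detokenizer needs the complete token tables of the dialect, as well as handling for numerical constants and quoted strings.

```python
# Assumed first bytes of the two-byte (double) GW-Basic tokens.
PREFIXES = (0xFD, 0xFE, 0xFF)

# Only the entry quoted in the text; a real table is far longer.
DOUBLE_TOKENS = {(0xFD, 0x81): "CVI"}


def next_keyword(body: bytes, pos: int):
    """Return (text, new_pos) for the item starting at body[pos]."""
    first = body[pos]
    if first in PREFIXES and pos + 1 < len(body):
        key = (first, body[pos + 1])
        # Unknown double tokens are shown as a hexadecimal placeholder.
        text = DOUBLE_TOKENS.get(key, "<%02x %02x>" % key)
        return text, pos + 2
    if 0x20 <= first < 0x7F:
        # Printable ASCII (variable names, punctuation) is kept as is.
        return chr(first), pos + 1
    # Single-byte tokens are not decoded in this sketch.
    return "<%02x>" % first, pos + 1
```

Calling `next_keyword(bytes([0xFD, 0x81, 0x28]), 0)` consumes two bytes and returns the keyword CVI together with the position of the following byte.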
GW-Basic contains all the different representations of numeric constants used in the Microsoft BASICs. It uses the floating point representation for real numbers in both single precision (4 bytes) and double precision (8 bytes).
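As an illustration of the single-precision case, the following Python sketch decodes a 4-byte constant. It assumes that the bytes are stored in the Microsoft Binary Format: three mantissa bytes with the least significant first, the sign in the top bit of the third byte, an implicit leading mantissa bit, and an exponent byte biased by 128, where a zero exponent means the value zero. The function name `decode_mbf_single` is hypothetical and the code is not taken from the package.

```python
import math


def decode_mbf_single(raw: bytes) -> float:
    """Decode 4 bytes, assumed to be a Microsoft Binary Format single."""
    if len(raw) != 4:
        raise ValueError("a single precision constant is 4 bytes long")
    b0, b1, b2, exponent = raw
    if exponent == 0:
        return 0.0  # a zero exponent byte encodes the value 0
    negative = bool(b2 & 0x80)
    # Restore the implicit leading 1 in place of the sign bit.
    mantissa = ((b2 & 0x7F) | 0x80) << 16 | b1 << 8 | b0
    # The 24-bit mantissa is a fraction in [0.5, 1); the bias is 128.
    value = math.ldexp(mantissa, exponent - 128 - 24)
    return -value if negative else value
```

Under these assumptions, the bytes 00 00 00 81 decode to 1.0 and the bytes 00 00 20 84 decode to 10.0.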