One such bad decision has been the tendency of various font designers to modify the ASCII character 39 (or HEX 27 if you prefer) which is a forward single quote ', such that the single quote stands straight, rather than having a forward slant to match up with the backward single quote ASCII character 96 which is `.
This page is a list of historical precedents for the slanting slope and a bunch of good practical reasons why the standard was correct to begin with, and why messing with a good thing should be sternly discouraged. In addition, historical revisionism is bad: it creates a memory hole and manipulates people's behaviour. This page hopes to set the record straight and encourage people to fix up their fonts to set the trusty old ASCII 39 at a slant once more. Also, try to match the forward/backward nature of ASCII 39 and ASCII 96.
If anything in the history of communications can define a successful standard, ASCII would have to be the immediate choice. Almost 50 years old and still (almost) unchanged, it is now the base encoding of nearly all electronic documents worldwide and the only likely future contender is Unicode which is supposed to be an extension of ASCII anyhow.
The only other outstanding character encodings of human history have been Morse Code (started approx 1844, officially retired for most uses approx 2000, original Morse only supported upper case letters and numbers, later some punctuation was added, never was any attempt made to support foreign characters), and Baudot Code (started approx 1875 and used for telegrams for 70 years, still used in a few special applications, also upper case only with only a single quote).
Any standard attempting to follow on from ASCII must fully accept both the literal details of the standard itself, and the common usage that represents this standard in practice. Otherwise the replacement is an attempt to "embrace, extend and extinguish" by introducing incompatibilities into the original system to break existing implementations.
It should be noted that ASCII went through a number of revisions, including 1963 ASCII and 1965 ASCII. This is a good report of the various characters with reference back to committee notes, etc. The idea of using the ASCII 39 (or APOSTROPHE) as a substitute for an acute accent goes back to 1963 (L. L. Griffin, report of ISO TC 97/SC 2 meeting, October 29-31, 1963, NMAH 310, box 4).
It's very normal for standards to go through a few iterations and a bit of settling down as the "designed by committee" document gets adopted by the "real world". Thus, a standard consists of two things: the official conclusion of what the standard should be, and the actual result of what happens in the implementation of that standard. In the case of ASCII, the two are very close, and the widespread public response was to accept ASCII 39 as a general-purpose character -- an apostrophe, a quote and an accent. With the space restrictions imposed by 7-bit encoding, it made a lot of sense.
Early versions of ASCII had things like an up-arrow which were removed in later versions. The 1963 ASCII only standardized upper case characters (the lower case alphabet was added later). No one is suggesting that we now go back and review this, because the established practice has been to accept that the ASCII up-arrow is forever gone, and the lower case alphabet has become well established. Similarly, we have to accept that the established standard is to support one forward slanting single quote character and one backward slanting single quote character.
This page has a big history of Baudot, ASCII and other codes. It even explains the diacritical marks and the "open quote" and "close quote" of ASCII's 1967 standard.
One of the important features of the Apple I (which carried over into later Apples and into the PC) was the concept of an expansion slot. This led to the idea that computers should have an "open architecture" where third party vendors could add circuitry and capabilities. In later years, Apple got variously hotter and colder with this concept, their iPhone is anything but open architecture and is mostly designed to keep third party vendors out. The principle of the thing goes deeper than any particular company.
There's a screen shot of the original Apple 2 screen font here. It has a single quote (which is straight upright) and only supports ASCII 32 (hex 20) up to ASCII 95 (thus missing the lower case and the backward quote). Strangely, the alphabet has been placed first, where it would normally be after the numbers (that may be just this screen shot; it seems unlikely that the actual encoding was backwards). Presumably, later revisions became more ASCII compatible.
FIXME: So far no details on keyboards from early apples.
FIXME: I have no photographs of a PET keyboard. Would be very grateful to get hold of some.
Every modern TTY or system console is fundamentally designed around the DEC VT100. The Linux text console was designed from the start to be VT100 compatible. The X11 xterm is also a VT100 clone at the heart of it.
Commodore had the advantage of owning the chip design and fabrication house "MOS", and they used this advantage to create custom chips (for some reason, the chips got people-sounding names like "Sid", later the Amiga custom chips all got people names like "Agnus"). The C64 keyboard was a somewhat non-standard layout by today's standard and the character set encoding (known as PETSCII) was based off the 1963 ASCII. Wikipedia has an article on PETSCII including screen dumps of the character sets.
PETSCII is based on the 1963 version of ASCII (rather than the 1967 version, which most if not all other character sets based on ASCII use). As such, PETSCII has only uppercase letters (in its unshifted mode, that is; see below), an up-arrow (↑) instead of a caret (^) in position $5E and a left-arrow (←) instead of an underscore ( _ ) in position $5F.
Here I show a photograph of the C64 keyboard with the clearly slanting single quote character.
For what it's worth, the C64 offered a pound sign on the keyboard which possibly helped sales in England. Here's a photo of the pound key.
The C64 supported uppercase and lowercase fonts but they did it in a strange way so that in "unshifted" mode, the uppercase ASCII alphabet was available in the normal position but the lowercase alphabet was replaced by graphics characters (designed mostly for games). In "shifted" mode you had access to ASCII uppercase and lowercase but they were swapped over from the currently accepted positions.
Old C64 fonts are available to download, and alternative fontsets were used around the world to provide special characters for region-specific purposes. There are even amateur hacks where alternative fontsets were loaded into EPROM by home users. Aside from all this curiosity, the ASCII 39 used by the C64 screen font was always a right sloping quote. Indeed, here is the quote character as extracted from a download of the original C64 font binary:
The QL was a big jump in microcomputing: lowest price 32 bit microcomputer of its day, pre packaged for the small business market with office software and including many radical ideas such as micro tape mass storage (which was obliterated in the market by Sony's superior 3.5" floppy disk). When you consider that it hit the market only about two years after the C64 (above) but offered a 32 bit chip instead of an 8 bit chip, built in magnetic storage as standard (with a directory structure) and more memory and higher screen resolution, the QL had a host of stunning features for its day. The QL was smart in recognising ASCII as the dominant standard at the time and was a fully ASCII based system with the traditional British modification of a pound sign (on the keyboard under the tilde) displacing the backward quote on ASCII 96. The keyboard included a forward sloping single quote under the double quote and the documentation regarded this as an "apostrophe". Here's a scan from the original user guide:
In addition, the QL sported a keyboard layout very close to the modern PC QWERTY keyboard that is the solid standard in Western countries. Here's a photo of the left-hand portion of a QL keyboard (a bit dusty from sitting in storage for a long time). Note that under the tilde is a British Pound character which was dropped in, replacing the backward quote.
Sure enough, a traditional single/double quote key with a clear slant on the single quote. This was not a Sinclair special but a nod to what was the widely accepted layout at the time, including the slanting quote.
FIXME: Would like to see the screen font of the IIe, especially as a comparison to the original II font linked above. Also, a photo of the keyboard.
Modern PCL offers a wide range of font selection escape codes, including a selection of symbol set and typeface. The default symbol set is known as "PC-8" which is to say: 8-bit font and PC oriented. The most basic symbol set available is 7-bit ASCII. Of course, we can select this symbol set on any PCL printer and have a look at what comes out.
The result is the Laserjet definition of the ASCII symbols. Naturally, the DEL comes out looking a bit strange, but the slanting forward and back quotes are unmistakable. This PCL and Laserjet standard has broadened amongst printers and is now widely supported. The above font forms the basis of this standard and has been accepted by the market.
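For anyone who wants to repeat the experiment, here is a rough sketch of the byte stream involved. It assumes the PCL "primary symbol set" escape sequence has the form ESC ( followed by an ID, with `0U` selecting 7-bit ASCII and `10U` selecting PC-8; check those IDs against your printer's technical reference before relying on them. The function names are made up for this illustration.

```python
ESC = b"\x1b"

def pcl_select_symbol_set(symbol_set_id: str) -> bytes:
    """Build a PCL primary symbol set selection: ESC ( <id>."""
    return ESC + b"(" + symbol_set_id.encode("ascii")

# Assumed IDs (verify against the printer manual): 0U = ASCII, 10U = PC-8.
ASCII_SET = pcl_select_symbol_set("0U")
PC8_SET = pcl_select_symbol_set("10U")

def ascii_chart_job() -> bytes:
    """Select the ASCII symbol set, then emit codes 32..127 in rows of 16."""
    rows = [bytes(range(start, start + 16)) for start in range(32, 128, 16)]
    # CR LF between rows, form feed at the end to eject the page.
    return ASCII_SET + b"\r\n".join(rows) + b"\r\n\f"

job = ascii_chart_job()
```

The resulting bytes would be sent raw to the printer (how you do that depends on your spooler). Both ASCII 39 and ASCII 96 are in the job, along with DEL at the very end of the last row.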
That DEL character isn't as silly as a first glance might suggest. ASCII goes back to the days of punched tape data storage, also known as "Eight hole tape" (seven bits plus parity). We used to play with it when we were kids, as kite tails and similar; it was already nearly obsolete back then. When you start out, the tape has no holes, that's blank tape. A hole counts as a 1, undamaged paper counts as 0, so if you read a blank tape you get NUL characters (because all the bits are zero). Now, NUL has no printing glyph, it is a non-printing character and it represents nothing, just what you would expect when reading blank tape.
A normal punching machine would take a keystroke, and punch out the bit pattern appropriate for that keystroke, then slide the tape along one notch to the right. The special character DEL would punch out all the holes (i.e. all 1's) which is guaranteed to obliterate any other character that might have been in that position. Thus, if you type the wrong thing, you go back some number of steps, then you type DEL a number of times and obliterate the bogus characters.
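The rub-out trick is easy to model. This is only a toy sketch: it treats each tape position as a 7-bit integer, ignores the parity hole, and the helper names are invented for the example. The key idea is that a punch can only add holes, so punching over an existing character is a bitwise OR.

```python
NUL = 0x00  # blank tape: no holes punched
DEL = 0x7F  # all seven data holes punched

def punch(existing: int, new: int) -> int:
    """Punching can only add holes, never remove them: bitwise OR."""
    return existing | new

def rub_out(tape, start, count):
    """Overpunch `count` positions from `start` with DEL."""
    result = list(tape)
    for i in range(start, start + count):
        result[i] = punch(result[i], DEL)
    return result

def read_tape(tape):
    """A reader skips blank (NUL) and rubbed-out (DEL) positions."""
    return "".join(chr(c) for c in tape if c not in (NUL, DEL))

# Typed HELXLO by mistake, followed by some blank tape.
tape = [ord(c) for c in "HELXLO"] + [NUL, NUL]
fixed = rub_out(tape, 3, 1)  # go back and DEL the stray X
```

Because OR-ing any 7-bit value with 0x7F always gives 0x7F, DEL is the one code that can overwrite anything already on the tape, which is why it sits at the very top of the table.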
If you enter a sequence into a PCL printer with backspace and DEL characters, you see something like this:
The use of backspace and another character over the top is known as doublestrike and it also has a rich tradition in making the most out of a limited fontset. Note the underline built from underscore characters: this design is still supported by some text formatters and even recognised by terminal emulators as an ASCII standard way of producing an underline font.
You can use the ASCII diacritical marks in much the same way, this short menu demonstrates French and Spanish words printed in a plain ASCII Laserjet font. It isn't perfect, but the doublestrike feature turns a simple 7-bit font into what is effectively a much larger characterset. There's a few sneaky details in the PCL case -- you need an escape code to get the vertical spacing right for the tilde and the acute and grave accents (circumflex works OK without adjustment), without the PCL escape codes it looks much uglier. Thus, although you can unambiguously represent those accented characters in plain ASCII, some post-processing is required to trim up the print output. Yes, the cedilla (using a comma) is a bit rough, but it is clear and readable which is the general idea of communication.
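The backspace convention described above can be sketched in a few lines. This is an illustration only: it uses one common ordering for underlines (underscore, backspace, character), and the function names are mine, not part of any standard tool.

```python
BS = "\b"

def underline(text: str) -> str:
    """Encode underlining as underscore, backspace, character."""
    return "".join(f"_{BS}{ch}" for ch in text)

def overstrike_cells(text: str):
    """Group characters by print column; a backspace moves one column left."""
    cells = []
    col = 0
    for ch in text:
        if ch == BS:
            col = max(0, col - 1)
            continue
        if col == len(cells):
            cells.append([])
        cells[col].append(ch)
        col += 1
    return cells

heading = underline("Menu")
accented = "cafe" + BS + "'"  # e overstruck with ASCII 39 as an acute accent
```

Here `overstrike_cells(accented)` gives four columns, with the last one holding both `e` and `'`. A printer simply strikes both in the same spot; a terminal or formatter has to decide what a column with two characters in it means.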
The real point I'm trying to make is that those simple ASCII characters really are powerful and versatile (for languages with a primarily Roman alphabet).
By the way, Unicode has continued the overstrike trend (sort of) by providing a bunch of extra accent characters including a nicer cedilla and yet another tilde (the Unicode extra tilde sits higher so the vertical spacing might be better for doublestrike printing). But Unicode also provides all the accented characters as new characters, on the basis of providing all things for all people.
Hint: if you want to be hard to find in government databases, get a heavily accented name and wait for them to start implementing Unicode in the civil service (even if they don't deliberately implement Unicode, the software upgrades will automatically creep the features in).
Unicode actually covers these issues and has given a bit of thought to them. What they have come up with is a normalisation methodology where combining sequences can be merged into single characters or, where that is not possible, a canonical form can be identified. This is really important for search, sort and matching operations. What people don't seem to realise is that stepping from ASCII to Unicode is not just a matter of choosing an encoding and having a bigger character set. It is a computationally much larger problem to deal with all of the interesting ways humans have invented for written language. Part of the genius of ASCII was what it threw away as much as what it kept.
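To see why normalisation matters for matching, here is a small sketch using Python's standard `unicodedata` module. The word chosen and the helper name `same_text` are just for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = composed == decomposed             # False: different code points
normalised_equal = same_text(composed, decomposed)  # True after NFC
```

Both strings display the same on screen, yet a naive comparison says they differ. Every search, sort or database lookup has to remember to normalise first, which is exactly the sort of extra work plain ASCII never asked for.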
Knuth needed a high quality document formatting system for his own publication and he wasn't happy with the existing state of the art, so he made his own and many other authors (particularly in academia) were attracted to Knuth's design. Knuth also published a set of books about computers and typesetting so when it comes to the subject of document formatting and character encoding, this man literally wrote the book on it!
Here's the scan from his 1984 work "The TeXbook":
Knuth accepts that some computers will have an upright single quote but his method of using single quotes in a balanced manner has become the standard in the Unix world, and particularly in the GNU world. For a long time gcc produced error messages using these sorts of quotes (before they got butchered by locale variations).
The FX80 was part of an earlier age where peripheral manufacturers actually delivered documentation for what they sold, including extensive technical reference and a user guide. The FX80 user guide has been considered a canonical reference ever since. This user guide still exists today as a PDF download from Epson (and as various weatherbeaten and dog-eared second hand copies handed from hacker to hacker, sold at computer fairs). The user guide included a full breakdown of the character set encoding and the exact font that came with the printer, right down to the pin strike pattern grids. Here is a small snapshot taken from the PDF version as distributed by Epson:
Hopefully the slope on character 39 is crystal clear. Indeed, the matching character 96 is also in the fontset:
The important thing to note here is the nice mirror-image matching of the left and right quote characters -- set down for all of history in the original FX80 font that has become an iconic standard throughout the microcomputer world.
Here's the critical slanting quote key (sorry for the ever so slightly fuzzy photo):
This font is non-proportional, 80 columns by 25 rows, and is basically a copy of the original PC console running MSDOS back before VGA was even heard of. Needless to say, slanting single quote as per ASCII standard.
I will point out that other Microsoft fonts have a straight quote glyph for exactly the same character (and no particular explanation for the inconsistency).
Sure, the idea of locales is nice, but not everyone wants it, so they should at the very least be able to keep using what they always had. The idea of ASCII extensions is also nice, but still not everyone wants it so it should be an option. Back-compatibility is essential.
Unicode is perfectly justified in creating single-use characters to provide alternatives to what was historically a multi-use character. The original multi-use character remains exactly the same as it always was.
Encouraging historical revisionism is also encouraging people to make irrational judgements and is often used as a tool of political manipulation.
I'm going to make a few hard criticisms about this particular page where the author takes the opposite view (i.e. ASCII 39 should be straight). From the summary at the top:
Only old X Window System fonts and some old video terminals show ASCII 0x60/0x27 as left and right quotation marks, while most modern systems follow the ISO and Unicode standards instead.
This is completely wrong, and I hope I can collect enough genuine examples to bury this so deep that it never even starts to climb out of the hole again. Indeed, a great many fonts had slanting ASCII 39, right round the computing world.
Most European keyboards have keycap labels for the apostrophe and both accents.
This one I can't prove wrong easily, but I'm working on it. Certainly, the majority of Western keyboards have some sort of slanting single quote even to this day. I would very much like some photos of older keyboards from machines around the world to provide comparison (particularly from machines that were somehow significant at their time, because of popularity or representing a computing milestone).
PostScript provides several predefined 8-bit encoding vectors. Authors of printer drivers can easily add their own. As the above table shows, the original PostScript standard encoding followed a practice similar to the old X fonts, with all its problems, namely it mapped the ASCII bytes 0x60 and 0x27 to curly opening and closing quotation marks (“quoteleft” and “quoteright” in PostScript glyph-name terminology, or U+2018 and U+2019 in Unicode).
There never was a "problem" with this ASCII mapping, it was the standard mapping of the day and it remains the standard ASCII mapping. Unicode was the problem because it pretended to extend ASCII while quietly changing the encoding of the original 128 slots.
The discussion on TeX in the above document is particularly bad, Markus Kuhn has obviously got some understanding of TeX but then is happy to completely ignore the rich and beneficial contribution that TeX and LaTeX made to our computing environment.
Donald Knuth’s TeXbook (chapter 2, page 3, end of second paragraph) has actually warned TeX users already since 1986 that the apostrophe and grave accent shapes can show up as required by ISO and Unicode and not as used in the rest of the TeXbook.
Markus is making reference to the same chunk that I copied above. It isn't a "warning"; it is an explanation of the convention used, why it was used and how to work with it. Pretending that Knuth "warned TeX users" is wrongly claiming that TeX somehow supports the claim for a straight quote when, in fact, quite the opposite is the case.
As much as anything, I am writing this page to document the facts and make sure that regardless of how fonts might get rewritten, history does not get rewritten.
Thus, pushing Unicode onto computers has not only provided new, exciting characters and the ability to support languages with characters totally unrelated to the English alphabet (e.g. Chinese), it also left ASCII crippled with a silly straight quote. The result is that people trying to get balanced quotes will be encouraged to use a non-ASCII character instead, which in turn will push other people into upgrading their systems to be able to actually see these strange non-ASCII characters. What it actually does is lead to massive breakage caused by an incompatibility that never needed to happen (see below).
If you think I'm paranoid about this, see it in action right here before your eyes. A recommendation for a right quotation mark to also be used for an apostrophe (i.e. contractions and possessives) and the Unicode 8217 being suggested for this purpose. Hey! What about ASCII 39, it was already doing this job! Essentially we have decided that ASCII 39 should no longer be used for any purpose... but why? The above author does bring up an important point, which is that the various *ML family of markup languages (HTML, SGML, XML, etc.) are not consistent with their handling of special characters, so converting between them is likely to result in subtle scrambling.
Another author provides a similar opinion:
ASCII Apostrophe This is the character that you type on a standard (US layout) keyboard with the key that's beside the semicolon. It shouldn't really ever be used in proper typography, but is often used because it's easy to type and well supported. It is superseded ...
The same source declares that ASCII 96 should also never be used for any purpose. So some people seem to believe that ASCII is officially dead thanks to Unicode.
The result is that by encouraging people to find ASCII unusable, they will be forced to change everything they do in order to get back what they already had.
Of course I have no problem with Unicode representing an extension to the ASCII character set, but it must not be allowed to destroy the heritage that ASCII has already built up.
You see it on web pages all the time, both in Firefox and in IE, special fontsets with weirdo quote characters that don't display anything even vaguely like a quote. This gets worse with syndication, content management and other systems where blocks of text are copied in and out of various storage mechanisms. Ensuring compatible character encoding seems to be an afterthought.
Here's an example from a common Linux website. You will notice that there's a few special characters in the HTML source which you can find if you wget the HTML file and then use:
/usr/bin/od -tx1 -tc < NS5456703154.html | grep -B2 -A3 'ef bf bd'
to get this result:
    0020060  73  20  69  73  20  61  6e  6f  74  68  65  72  20  6f  6c  64
              s       i   s       a   n   o   t   h   e   r       o   l   d
    0020100  65  72  ef  bf  bd  63  69  72  63  61  20  32  30  30  35  ef
              e   r 357 277 275   c   i   r   c   a       2   0   0   5 357
    0020120  bf  bd  73  79  73  74  65  6d  2e  20  49  74  20  68  61  73
            277 275   s   y   s   t   e   m   .       I   t       h   a   s
Using "man utf-8" I can figure out what those characters are supposed to be:
binary UTF-8 chars: 11101111 10111111 10111101
decoded binary: 1111111111111101
decoded HEX: 0xFFFD
Hmmm, I don't know what that is.
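The hand decoding above can be double-checked with a short sketch. It follows the usual 3-byte UTF-8 layout (`1110xxxx 10xxxxxx 10xxxxxx`); the function name is made up for this example and it deliberately handles only the 3-byte case.

```python
def decode_utf8_3byte(b0: int, b1: int, b2: int) -> int:
    """Extract the code point from a 3-byte UTF-8 sequence."""
    if (b0 & 0xF0) != 0xE0:
        raise ValueError("first byte is not a 3-byte lead (1110xxxx)")
    if (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("continuation bytes must look like 10xxxxxx")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

code_point = decode_utf8_3byte(0xEF, 0xBF, 0xBD)
as_hex = f"0x{code_point:04X}"  # 0xFFFD

# Cross-check against Python's own decoder.
builtin = ord(bytes([0xEF, 0xBF, 0xBD]).decode("utf-8"))
```

Both routes agree, so at least the arithmetic was right.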
I hoped that emacs might know what it is, so I discovered just how difficult it is to load up a page while telling emacs that you want the page loaded as UTF-8; emacs will make up its own mind, thank you very much. I finally managed to trick it by inserting a line at the very top of the file:
;; -*- coding: utf-8 -*-
... but it didn't help because emacs didn't know what those characters were either. What I did find was this excellent Web CGI decoder for Unicode UTF-8. Just drop the hex in (or a variety of input formats) and you get:
Decoder output:

    Byte number 1 is decimal 239, hex 0xEF, octal \357, binary 11101111
        This is the first byte of a 3 byte sequence.
    Byte number 2 is decimal 191, hex 0xBF, octal \277, binary 10111111
        This is continuation byte 1, expecting 1 more.
    Byte number 3 is decimal 189, hex 0xBD, octal \275, binary 10111101
        This is continuation byte 2, expecting 0 more.

    U+FFFD REPLACEMENT CHARACTER
        * used to replace an incoming character whose value is unknown or unrepresentable in Unicode
        * compare the use of 001A as a control character to indicate the substitute function
So to puzzle out what happened here, somewhere in the guts an encoding mismatch causes some part of the system to generate a complex 3-byte rejection code which in turn is wrongly interpreted by both Firefox and emacs to look like a bunch of garbage characters. I believe that the character should render as "�" which draws a little inverse question mark, indicating a broken character. Google won't let you search for this character (at least, not when I paste it into Firefox).
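Here is one way those exact bytes could end up in a page. This is a guess at the mechanism, not a reconstruction of what that website actually did: I am assuming the original text had an em dash stored in a Windows code page, and that some later stage decoded it as UTF-8 with "replace on error" behaviour.

```python
# Hypothetical original text: the em dash (U+2014) is an assumption.
original = "older\u2014circa 2005\u2014system"

# Stage 1: saved in a legacy single-byte encoding (em dash becomes 0x97).
legacy_bytes = original.encode("cp1252")

# Stage 2: wrongly decoded as UTF-8. A lone 0x97 is not valid UTF-8,
# so the decoder substitutes U+FFFD for it.
mangled = legacy_bytes.decode("utf-8", errors="replace")

# Stage 3: the damaged text is stored or served as UTF-8.
stored = mangled.encode("utf-8")
```

After stage 3 the stored bytes contain `ef bf bd` where each dash used to be, and the original character is gone for good. No amount of cleverness in the browser can recover it, because the information was thrown away at stage 2.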
Now maybe some readers might see why I'm happy with basic ASCII.
Sure enough, Wordpress support have come across a similar problem... and sure enough, it's quote characters that are screwing up. They don't seem to have a real answer for it.
These people seem to have a problem with the same thing. Although, it doesn't explain where the problem came from.
Yet another thread with broken quotes on a comment page, somehow this one has not only multiple of those "unknown character" symbols but an HTML escape for a Euro in the middle too! I'm a little stumped at the process that must have created this.
Here's a perfectly regular example of generated content, with a broken heading, presumably from some backend database with encoding compatibility issues.
And a blog article so badly damaged by character set substitution that it is almost unreadable.
An electronic magazine with a review and a totally broken headline... doesn't anyone actually read this stuff?
Local council is using a template page with inserted content in the middle. The inserted content seems to be in some Microsoft character encoding which uses character hex 92 (= octal 222) as an apostrophe. Firefox shows these as "U+FFFD" but strangely if you wget the page then display the file in Firefox you do get apostrophes (maybe some magic header metadata is happening).
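A small sketch shows how the same byte can come out either way. I am assuming the Microsoft encoding in question is Windows-1252 (Python calls it `cp1252`), and the sample text is invented.

```python
raw = b"the council\x92s plan"  # hypothetical page fragment with byte 0x92

# Read as Windows-1252: byte 0x92 is the right single quotation mark U+2019.
as_cp1252 = raw.decode("cp1252")

# Read as UTF-8: a lone 0x92 is invalid, so it becomes U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

quote_code_point = f"U+{ord(as_cp1252[11]):04X}"  # U+2019
```

So the browser's result depends entirely on which character set it believes the page is in. One plausible explanation for the wget difference is that the web server announces one charset in its HTTP headers while the saved file, with no headers attached, gets guessed differently; I have not verified that for this particular page.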
Lovely graphic, shame that the quotations are all broken.
This last one is kind of interesting, not only is it providing secrets of success, but not many people can actually puzzle out what she is saying, thanks to the mangled charset. The HTML source code shows every single quote has been expanded out into six bytes! Multiple phases of character expansion have converted single characters into UTF-8. How is the web browser supposed to figure out that it is looking at just a plain simple quote mark? Well, it can't.
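One way to arrive at six bytes per quote is to UTF-8 encode the text twice with a Latin-1 misreading in between. The sketch below assumes that is what happened (I have not confirmed the exact bytes on that page), using the curly right quote U+2019 as the starting character.

```python
quote = "\u2019"  # right single quotation mark

# First pass: correct UTF-8, three bytes (e2 80 99).
once = quote.encode("utf-8")

# Something reads those three bytes as three Latin-1 characters...
misread = once.decode("latin-1")

# ...and encodes each of them as UTF-8 again: six bytes.
twice = misread.encode("utf-8")

# The damage is reversible only if you know exactly what went wrong.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
```

The browser sees six perfectly valid UTF-8 bytes that spell out three unrelated characters, so from its point of view there is nothing to fix. Plain ASCII 39 would have gone through both stages as the single byte 0x27, untouched.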
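| Font Size | Regular HTML Special (single) | Regular HTML Special (double) | Bold HTML Special (single) | Bold HTML Special (double) | Regular Unicode (single) | Regular Unicode (double) | Bold Unicode (single) | Bold Unicode (double) |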
|---|---|---|---|---|---|---|---|---|
| +0 | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” |
| +1 | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” |
| +2 | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” |
| +3 | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” | ‘single’ | “double” |
The exact result is a property of browser font handling and other things. Small, non-bold fonts in Firefox fall back to the straight quotes, presumably because they don't want slanting quotes and there's not enough space to draw a full "6" or "9" style quote. Various versions of lynx will handle it by either converting back to the TeX standard ASCII, or by attempting to handle the Unicode (and I've not yet seen a lynx that handles the Unicode correctly). In IE it always curls the quotes, regardless of the font size.