Unicode

Multi-Lingual International Collaboration:

Typing English, Russian and Other Characters Together, or

The Unicode Solution to HTML Display of

Microsoft Word Documents on the World Wide Web

John V. Richardson Jr., UCLA GSE&IS DIS

As you well know if you have tried to collaborate with non-English speakers, the typical American computer keyboard is limited when it comes to languages other than English (i.e., those not based on the Latin alphabet).

Nonetheless, it is easy to envision the following scenario (label it A):

1. Open Word2002 on your computer and start a new document.

2. Type some text on the QWERTY keyboard.[1] Note that in the lower right hand corner an abbreviation appears “En.”[2] The text appearing on the screen is in English.

3. However, if you press “Shift+Alt” at the same time, the “En” changes to “Ru.”

4. Type some more characters. These appear in Cyrillic characters on your screen.

5. Now, save your file as a Word document.

6. Next, start to save your document as an HTML file by going to “File—Save as Web Page…”

7. The file saved successfully in HTML format.

But, what actually happened along the way?

The answer depends on the default font that you have previously selected as part of Word’s default normal.dot template. Note that Word allows the user to select from many different fonts.

In general, fonts are of three types: PostScript Type 1 (Type1); TrueType (TT)[3], which came along about six years later; and OpenType (OT), a blending of the two previous formats.[4]

In any case, these fonts have a lookup table of values and a set of characters, called glyphs, which accompany them to tell the computer how to store and display the information about the particular font.

Let’s start, for example, with the English capital letter A. In one system, this glyph could be stored as hex 41 or decimal 65.[5] Consider the Cyrillic capital letter А. Note that the two characters may look almost the same on the display. However, the Cyrillic А could have the hex value of 0410 or decimal 1040. In either case, this information is part of the table of values. One can look up these values on a code page. A code page is composed of a character set, numbers, punctuation marks, etc.
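The lookup described above can be tried out in a few lines of Python. This is only a sketch: Python's built-in ord() stands in for the code page lookup, and the variable names are chosen here for illustration.

```python
# Look up the integer value behind two look-alike capital letters.
latin_a = "A"          # English (Latin) capital letter A
cyrillic_a = "\u0410"  # Cyrillic capital letter A

for ch in (latin_a, cyrillic_a):
    value = ord(ch)  # integer value of the character
    print(f"{ch!r}: decimal {value}, hex {value:04X}")

# The two letters may look alike on the display, but they are different characters.
print(latin_a == cyrillic_a)  # False
```

The same built-in functions convert between the two number systems: hex(65) gives '0x41', and int("0410", 16) gives 1040.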

To be more specific about this table of values, the maximum range of values once was limited to 256. More recently, it became common to use up to 65,536 values (0 through 65,535), although that meant much more work to encode a font with all the possible glyphs and values needed to represent more languages.

So, now, you can envision fonts that are encoded with a modest table or a much more complete table. The more modest scheme is usually called Latin1. Latin1 is also known as ISO 8859-1 and stores each character in eight bits (or, one byte). Latin1 uses the values 0 to 255, which can only provide a maximum of 256 glyphs. A maximum of 256 glyphs was probably satisfactory in many cases of monolingual collaboration, or if one was dealing with the more common Western European languages. True international collaboration requires more glyphs and thus, Unicode, which in its two-byte form can support up to 65,536 characters, becomes more desirable and is recognized as an international standard that can represent a much broader set of characters. Word is a Unicode-enabled[6] word processor. Word’s encoding[7] scheme is two bytes (16 bits), sometimes called a wide character set[8], and is truly multilingual.

As mentioned above, each glyph is assigned a numeric or, more specifically, an integer value. As used in the example, an English capital letter A could be stored as Unicode 0041 (i.e., as a Latin1 character, the capital A simply got a leading zero added in front to stretch it into two bytes) whereas the Cyrillic А is hex 0410, or decimal 1040.
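To see the one-byte and two-byte storage side by side, here is a minimal Python sketch. It assumes that Python's "latin-1" codec is a fair stand-in for Latin1 and that "utf-16-be" is a fair stand-in for a two-byte Unicode scheme; it does not claim to reproduce what Word does internally.

```python
latin_a = "A"
cyrillic_a = "\u0410"  # Cyrillic capital letter A

# Latin1 stores the English letter in a single byte.
print(latin_a.encode("latin-1").hex())       # 41

# A two-byte scheme pads the same letter with a leading zero byte.
print(latin_a.encode("utf-16-be").hex())     # 0041
print(cyrillic_a.encode("utf-16-be").hex())  # 0410

# Latin1 has no value at all for the Cyrillic letter.
try:
    cyrillic_a.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)  # False
```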

Unicode encoded fonts are not common, however, due to the effort required to map all the glyphs and values into a table. And, to make it even more complicated, TrueType fonts can be either Unicode encoded or non-Unicode encoded. In the latter case, it means the glyphs could be mapped with a table of values that might be correct for Latin1, or not, as well as containing randomly placed values for other glyphs.

In short, if you wish to use a Unicode font inside Microsoft Word, look for Arial Unicode MS[9] or Lucida Sans Unicode[10]; these are the only two Unicode-enabled fonts available to you without going to a third party. To make matters worse, HTML uses Latin1 as its default encoding scheme, so that a non-Unicode font may not survive the conversion with all the characters displaying correctly.

Hence, this situation of Unicode versus non-Unicode encoding of fonts explains why you don’t necessarily get the same characters from the original document when you switch fonts. It is also why you may encounter display problems when you don’t have the original font and Word chooses an alternate one for you.

Generally, an encoding problem (i.e., different integer values) only arises when converting to other formats. When Word tries to convert the saved document to HTML, it may encounter non-Latin characters; in the example, the Cyrillic characters may be problematic and can’t always be mapped to new characters.

On occasion, when you are using US-ASCII, Word may recommend UTF-8 in the conversion process. If you click OK, then Word will place any characters outside the range of US-ASCII (7-bit) into the file as UTF-8 encoded values, each of which might occupy two, three, or even four 8-bit bytes. Hence, the resulting HTML files could be as much as seven times as large as they might be with a more efficient encoding scheme.

But, you can now display both languages. When you click OK and change the properties, Word seems to add a metadata tag to the file. When this situation occurs, Word is forcing non-US-ASCII Unicode characters into HTML numeric escape sequences, which consist of an ampersand, pound sign, four decimal digits and a semicolon. However, Word is smart, too smart sometimes: if the Unicode character has an HTML named escape sequence, then Word will output it as that named escape sequence, which consists of an ampersand, name, and semicolon. As you can see, this situation can be problematic for later processing of the HTML files.
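The two kinds of escape sequences can be imitated with a short Python sketch. The function name to_html_escapes is hypothetical, and the HTML 4 name table in Python's html.entities module is assumed to be a reasonable stand-in for whatever table Word consults; this is not Word's actual conversion routine, and it leaves characters such as & and < untouched.

```python
from html.entities import codepoint2name

def to_html_escapes(text):
    """Replace every non-US-ASCII character with an HTML escape sequence."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                           # US-ASCII passes through
        elif code in codepoint2name:
            parts.append(f"&{codepoint2name[code]};")  # named escape sequence
        else:
            parts.append(f"&#{code};")                 # numeric escape sequence
    return "".join(parts)

print(to_html_escapes("A\u0410"))    # A&#1040;
print(to_html_escapes("caf\u00e9"))  # caf&eacute;
print(len("&#1040;"))                # 7
```

Note that the single Cyrillic letter turns into seven US-ASCII characters, which is presumably where the factor of seven in file size comes from. Any later program that reads the file has to be prepared for both the numeric and the named forms.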

B. Imagine another scenario on your computer:

You receive a document file from a Russian friend.
When you try to open it in Word2002, you may not be able to open it at all. An error message about “embedded fonts” appears.
The file will open, but only in read-only mode.
If you can open it fully, then some of the characters do not appear correctly on the screen. For example, you may see squares (i.e., □□□) or perhaps something like this--кèðèëëèöà.

So, what happened this time?

In the first instance, the original file was created using a non-core Windows font that isn’t installed on your computer. In this case, you will have to install the correct font on your machine before Word will allow you to open it; Word does not embed these fonts in the document itself because some font files might be as large as 22MB each. Normally, a document file with an embedded TT font will only increase the size of the file by the size of the .ttf font[11].

In the second instance, the file will open in read-only mode. That’s because Word cannot generate all the characters that you may wish to use in editing the document. Not very helpful, but that’s the way Word is designed[12].

In the third case, the file opens but characters are not displayed correctly due to characters not mapping to any particular value; hence, Word cannot generate the proper character to display.

Think of it this way--a font is a table of glyphs and the value of the character is the pointer into the table. When the default font in Word does not have a particular value on the table, there is no character for it to display. Thus, the default character to display is a square. Clearly, there is text to display, but Word does not have a font mapped to the original character in the file, so it has no choice other than to display a square. The solution is to install the correct font, such as Baltica (55 Kbytes), which came embedded in the original text document.
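The pointer-into-a-table idea can be sketched in Python with an ordinary dictionary. Everything here is hypothetical: toy_font, MISSING_GLYPH and render are illustration names, and a real font file is far more elaborate than a two-entry table.

```python
# A hypothetical miniature "font": character values pointing to glyphs.
toy_font = {
    0x41: "Latin capital A glyph",
    0x42: "Latin capital B glyph",
}
MISSING_GLYPH = "\u25a1"  # the square shown when the table has no entry

def render(text, font):
    """Return what would be drawn for each character of the text."""
    return [font.get(ord(ch), MISSING_GLYPH) for ch in text]

# The Cyrillic А (hex 0410) has no entry in the table, so a square comes back.
print(render("AB\u0410", toy_font))
```

Installing the correct font amounts to swapping in a table that does have an entry at 0410.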

But, what can you do about those кèðèëëèöà? It may just be a simple matter of highlighting the text in the original file and clicking on Baltica[13] in the font window for Word2002. The strange text should go away and the Cyrillic text that you were expecting will be there. If it does not, then the text may have been typed into the document using yet another non-Unicode embedded font such as Times New Roman (OT, 340 KB) which has both English and Cyrillic characters. When it was changed, it mapped correctly, but could not survive another conversion. The solution is to rekey those portions of the text that do not display correctly, if there aren’t many, or use the Fix feature.
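One plausible way to produce text of this kind is to store Cyrillic as single-byte values and then read those bytes through a Latin1 table. The Python sketch below assumes the original bytes were CP1251 (see Appendix A); that is an assumption made for illustration, since the encoding of your friend's file may well be different.

```python
original = "кириллица"  # the Russian word for "Cyrillic"

# Store the text as single-byte CP1251 values...
raw_bytes = original.encode("cp1251")

# ...then read the very same bytes through a Latin1 table.
garbled = raw_bytes.decode("latin-1")
print(garbled)  # êèðèëëèöà

# Reading the bytes with the right table restores the text.
repaired = garbled.encode("latin-1").decode("cp1251")
print(repaired == original)  # True
```

Nothing was lost along the way; the values are intact and only the table used to display them was wrong, which is why switching to the right font can make the strange text go away.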

C. Imagine a third scenario of trying to convert a Word document to HTML:

1. Open Word with a file you want to convert to HTML.

2. Click on “File” and then “Save As:”

a) “File Name” can be changed, if you wish

b) “Save as Type:” and select Web Page (*.htm; *.html)

3. At the top of Save As, select “Tools.”

4. Select “Web Options:”

a) On the “General Options” tab: click off “Disable”

b) Click on the “Encoding” tab, and check “Save this document as;” and then

c) Scroll and select the non-intuitive option of US-ASCII[14]

5. Click OK.

6. Save file in directory as you wish.

What happened this time around? By selecting these options, the Word document was saved in a browser-neutral file format. Specifically, Word saved the English and Cyrillic characters as HTML numeric entities allowing the user to select the correct display of characters from inside their browser.
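A browser-neutral file of this kind can be approximated in Python. This is a rough sketch under stated assumptions: to_ascii_html is a hypothetical helper, the meta tag is one common way of declaring the character set, and the output is much simpler than the HTML that Word itself writes.

```python
import html

def to_ascii_html(text):
    """Build a small HTML page that contains only US-ASCII characters."""
    safe = html.escape(text)  # escape &, <, > and quotes first
    # Characters outside US-ASCII become numeric entities such as &#1055;
    body = safe.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">'
        "</head><body><p>" + body + "</p></body></html>"
    )

page = to_ascii_html("Hello, Привет")
print(page)

# A browser reverses the entities when it displays the page.
print(html.unescape(page))
```

Because every byte in the file is plain US-ASCII, the page does not depend on the reader's code page; the browser only needs a font that contains the Cyrillic glyphs.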

Recommendations:

If you are considering posting Word documents to the WWW, start with a true Unicode font such as Arial Unicode MS (TT) or Lucida Sans Unicode (TT).
Encode the Word document in HTML so that it is browser neutral.

ACKNOWLEDGEMENTS

I especially wish to thank Ralph LeVan of OCLC's Office of Research for initially talking with me about these problems, showing me the actual encoding that was going on, and then explaining it again. In addition, I appreciate Andy Houghton helping me to understand this matter at a greater depth as well as taking the time to check and correct the statements above.

Appendix A. Other related international issues

1. UTF-8 is space efficient for English text because it encodes each US-ASCII character into a single 8-bit byte. “BTW, Word 2000 and Word 2002 (XP) allow easier control over this process. When you Save As…in the dialog box you get to either browse to a file or type a file name, there is a non-obvious Tools menu, Top right. After you select the .html file type, then click on the Tools menu and select the Web Options…menu choice. That brings up another dialog that allows you to specify any encoding you desire for the HTML file.”[15]

2. USM-94 aka Marc21 or the Marc8 character set (8 bit bytes). USM used a series of escape sequences that redefine the semantics of the bytes. So, for example, after the appropriate escape sequence, a hex 41 in USMARC-94 falls within the Cyrillic character set and is the lower case Russian a.

3. CP1251 refers to the Microsoft Windows Code Page 1251; for particular values, see http://www.cyrillic.com/ref/cyrillic/cp1251.html. It’s the standard single-byte (8-bit) Cyrillic encoding on Microsoft Windows platforms. Hence, some programs may care. See the following link for a test of the correct encoding, http://www.relcom.ru/English/Russification/WinNetscape/test1251.html.

4. KOI8R is another external encoding scheme. Russian email often arrives using this encoding scheme and may or may not be converted automatically by Microsoft Outlook[16] when receiving or sending mail.
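The differences among the encodings in items 1, 3 and 4 can be checked with a few lines of Python. This sketch assumes that Python's "utf-8", "cp1251" and "koi8-r" codecs match the schemes named above; the sample characters and the sample message are chosen here purely for illustration.

```python
# Item 1: UTF-8 uses one byte for US-ASCII and more for everything else.
for ch in ("A", "\u00e9", "\u0410", "\u20ac"):
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s) in UTF-8")

# Items 3 and 4: CP1251 and KOI8-R both use one byte per Cyrillic letter,
# but they assign different values to the same letter.
cyrillic_a = "\u0410"  # Cyrillic capital letter A
print(cyrillic_a.encode("cp1251").hex())  # c0
print(cyrillic_a.encode("koi8-r").hex())  # e1

# Mail sent as KOI8-R but read as CP1251 comes out scrambled.
message = "Привет"
sent = message.encode("koi8-r")
print(sent.decode("cp1251") == message)  # False
print(sent.decode("koi8-r") == message)  # True
```

This is why a mail program has to know, or guess, which scheme the sender used before it can display Russian text correctly.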

[1] Of course, you can plug another keyboard into your laptop. For example, you might want to have the Cyrillic characters on the keys in front of you. In that case, you would want to purchase multi-lingual English/Cyrillic keyboard caps (from Ron Pusl Key Connections at 800-870-1369, or point your browser to http://www.customkeys.com), or an entire keyboard from them, or a truly Russian keyboard in either a YAWERTY layout or the more standard Russian typewriter layout called JTsUKENG; see examples at http://www.siber.org/sib/russify/fundamentals.html.

[2] If this icon does not appear in Windows XP, then go to Start, Control Panel, Regional and Language Options, Languages, Details, Add, Input Language: Russian, OK, Apply and then OK.

[3] For “A History of TrueType” by Laurence Penney, see http://www.truetype.demon.co.uk/tthist.htm.

[4] One difference is the underlying mathematics used to describe curves—TrueType uses simpler quadratic B-splines while PostScript uses cubic Bézier curves. Unlike bit-mapped fonts (i.e., .fon or .fnt), which are “show this,” TrueType (i.e., .ttf) and PostScript fonts (i.e., .ps or .pfm) work in the “do this” manner. For more about the similarities and differences, see Thomas W. Phinney’s “TrueType, PostScript Type 1, & OpenType: What’s the Difference?” at http://www.font.to/downloads/TT_PS_OT.pdf

[5] The hexadecimal number system (i.e., base 16) is based on 16 digits instead of 10, with the letters A through F used in place of 10 through 15. For example,

DECIMAL SYSTEM 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ETC.

HEXADECIMAL 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 ETC.

[6] See “Unicode: The Wide-Character Set” at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_crt_unicode.3a_.the_wide.2d.character_set.asp.

[7] Encoding simply means that characters (glyphs) must be represented by some physical means—bits or bytes.

[8] In languages such as Chinese, Japanese, and Korean (CJK), a string ten characters long will not be ten Latin characters wide on the screen. But, that’s another issue.

[9] For more information about this OpenType font, see http://www.microsoft.com/typography/fonts/fonttest.asp?FID=24&FNAME=Arial%20Unicode%20MS&FVER=0.84

[10] For more information about this OpenType font, see http://www.microsoft.com/typography/fonts/fonttest.asp?FID=56&FNAME=Lucida%20Sans%20Regular&FVER=1.50

[11] One can embed TrueType fonts by selecting the Tools menu, clicking Options, selecting the Save tab, and then clicking Embed TrueType fonts.

[12] One possible workaround might be to embed some well-known pangram (i.e., a sentence that uses every letter of the alphabet) plus the numerals and punctuation into a document, according to Andrew Houghton of OCLC’s Office of Research.

[13] Designed digitally by Alexander Tarbeev in 1988 at Polygraphmash, which in turn was based on a 1951-52 version of Candida (Ludwig & Mayer, 1936).

[14] US-ASCII, standardized as ANSI X3.4, is 7-bit ASCII, encoding 128 characters. USMARC pairs it with an extended Latin character set known as ANSEL.

[15] Andy Houghton to John Richardson, 2 August 2002.

[16] Navigator version 4.0 and higher as well as MS IE 3.0 and higher automatically convert it; see http://www.siber.org/sib/russify/fundamentals.html.