Multi-Lingual International Collaboration:
Typing English, Russian and Other Characters Together, or
The Unicode Solution to HTML Display of Microsoft Word Documents on the World Wide Web
John V. Richardson Jr., UCLA
GSE&IS DIS
As you well know if you have tried to collaborate with non-English speakers, the typical American computer keyboard is limited when it comes to languages other than English (i.e., those not based on the Latin alphabet).
Nonetheless, it is easy
to envision the following scenario (label it A):
1. Open Word2002 on your computer and start a new
document.
2. Type some text on the QWERTY keyboard.[1] Note that the abbreviation “En” appears in the lower right-hand corner.[2] The text appearing on the screen is in English.
3. However, if you press “Shift+Alt” at the same time, the “En” changes to “Ru.”
4. Type some more characters. These appear in Cyrillic characters on your
screen.
5. Now, save your file as a Word document.
6. Next, start to save your document as an HTML file
by going to “File—Save as Web Page…”
7. The file saved successfully in HTML format.
But, what actually
happened along the way?
The answer depends on the
default font that you have previously selected as part of Word’s default
normal.dot template. Note that Word
allows the user to select from many different fonts.
In general, fonts are of three types: PostScript Type 1 (Type1); TrueType (TT)[3], which came along about six years later; and a blending of the two called OpenType (OT).[4]
In any case, these fonts
have a lookup table of values and a set of characters, called glyphs, which
accompany them to tell the computer how to store and display the information
about the particular font.
Let’s start, for example, with the English capital letter A. In one system, this glyph could be stored as hex 41 or decimal 65.[5] Consider the Cyrillic capital letter А. Note that the two characters may look almost the same on the display.
However, the Cyrillic А could have the hex value of 0410 or decimal 1040. In either case, this information is part of the table of values.
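The two values can be checked with a few lines of Python, used here only for illustration. This is a minimal sketch; the helper name `describe` is hypothetical and not part of any tool discussed in this article.

```python
# Look up the integer value behind each character.
# The helper name `describe` is hypothetical, used only for illustration.

def describe(ch: str) -> str:
    """Return the character's value in hex and decimal."""
    value = ord(ch)  # integer value assigned to the character
    return f"hex {value:04X}, decimal {value}"

latin_a = "A"          # Latin capital letter A
cyrillic_a = "\u0410"  # Cyrillic capital letter A, visually similar

print(describe(latin_a))      # hex 0041, decimal 65
print(describe(cyrillic_a))   # hex 0410, decimal 1040
print(latin_a == cyrillic_a)  # False: similar look, different values
```

Although the two letters look nearly identical on screen, the comparison is false because the stored values differ.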
One can look up these values on a code page. A code page is composed of a character set, numbers, punctuation
marks, etc.
To be more specific about this table of values, the maximum range of values once was limited to 256. More recently, it became common to use up to 65,536 values (0 through 65,535), although that meant much more work to encode a font with all the possible glyphs and values needed to represent more languages.
So, now, you can envision fonts that are encoded with a modest table or a much more complete table. The more modest scheme is usually called Latin1. Latin1 is also known as ISO 8859-1 and stores each character in eight bits (or, one byte). Latin1 uses the values 0 to 255, which can only provide a maximum of 256 glyphs. A maximum of 256 glyphs was probably satisfactory in many cases of monolingual collaboration, or if one was dealing with the more common Western European languages. True international collaboration requires more glyphs and thus Unicode, which as a two-byte standard can support up to 65,536 characters, becomes more desirable and is recognized as an international standard that can represent a broader set of characters. Word is a Unicode-enabled[6] word processor. Word’s encoding[7] scheme is two bytes (16 bits), sometimes called a wide character set[8], and is truly multilingual.
As mentioned above, each glyph is assigned a numeric or, more specifically, an integer value. As used in the example, an English capital letter A could be stored as Unicode 0041 (i.e., as a Latin1 character, the capital A simply got a leading zero byte added in front to stretch it into two bytes), whereas the Cyrillic A is hex 0410, or decimal 1040.
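The difference between the one-byte and two-byte schemes can be sketched in Python. The example assumes Python's built-in `latin-1` and `utf-16-be` codecs as stand-ins for Latin1 and a two-byte Unicode encoding; it illustrates the idea and does not describe Word's internals.

```python
# Compare a one-byte encoding (Latin1) with a two-byte encoding (UTF-16).

latin_a = "A"
cyrillic_a = "\u0410"  # Cyrillic capital letter A

# One byte allows 256 values; two bytes allow 65,536.
print(2 ** 8, 2 ** 16)                  # 256 65536
print(latin_a.encode("latin-1").hex())  # 41

# Cyrillic has no slot in the Latin1 table, so encoding fails.
try:
    cyrillic_a.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)  # False

# In big-endian UTF-16, Latin A gains a leading zero byte.
print(latin_a.encode("utf-16-be").hex())     # 0041
print(cyrillic_a.encode("utf-16-be").hex())  # 0410
```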
Unicode encoded fonts are
not common, however, due to the effort required to map all the glyphs and
values into a table. And, to make it
even more complicated, TrueType fonts can be either Unicode encoded or
non-Unicode encoded. In the latter
case, it means the glyphs could be mapped with a table of values that might be
correct for Latin1, or not, as well as containing randomly placed values for
other glyphs.
In short, if you wish to use a Unicode font inside Microsoft Word, look for Arial Unicode MS[9] or Lucida Sans Unicode[10]; these are the only two Unicode-enabled fonts available to you without going to a third party. To make matters worse, HTML uses Latin1 as its default encoding scheme, so that a non-Unicode font may not survive the conversion with all the characters displaying correctly.
Hence, this situation of Unicode versus non-Unicode encoding of fonts explains why, when you switch fonts, you don’t necessarily get the same characters as in the original document. It is also why you may encounter display problems when you don’t have the original font and Word chooses an alternate one for you.
Generally, an encoding
problem (i.e., different integer values) only arises when converting to other
formats. When Word tries to convert the
saved document to HTML, it may encounter non-Latin characters; in the example,
the Cyrillic characters may be problematic and can’t always be mapped to new
characters.
On occasion, when you are using US-ASCII, Word may recommend UTF-8 in the conversion process. If you click OK, then Word will place any characters outside the range of US-ASCII (7-bit) into the file as UTF-8 encoded values, each of which takes two, three, or even four 8-bit bytes. But, you can now display both languages. When you click OK and change the properties, Word seems to add a metadata tag to the file. When this situation occurs, Word is forcing non-US-ASCII Unicode characters into HTML numeric escape sequences, which consist of an ampersand, pound sign, four decimal digits, and a semicolon. Each such character then occupies seven bytes, so the resulting HTML files could be as much as seven times as large as they might be with a more efficient encoding scheme. However, Word is smart, too smart sometimes: if the Unicode character has an HTML named escape sequence, then Word will output it as its HTML named escape sequence, which consists of an ampersand, name, and semicolon. As you can see, this situation can be problematic for later processing of the HTML files.
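The size difference between these encodings can be measured directly. The sketch below uses Python's `xmlcharrefreplace` error handler as a stand-in for the numeric escape sequences described above, and Python's HTML 4 entity table for the named ones. Both are assumptions made for illustration; which characters Word itself chooses to name is not shown here.

```python
from html.entities import codepoint2name

text = "A\u0410"  # Latin A followed by Cyrillic capital A

# UTF-8: one byte for US-ASCII, two bytes for a Cyrillic letter.
utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes))  # 3

# A one-byte Cyrillic code page stores each letter in a single byte.
print(len(text.encode("cp1251")))  # 2

# US-ASCII with numeric escape sequences: ampersand, pound sign,
# decimal digits, semicolon.
escaped = text.encode("ascii", "xmlcharrefreplace")
print(escaped)       # b'A&#1040;'
print(len(escaped))  # 8

# Some characters also have a named escape sequence.
print(codepoint2name[0x00E9])  # eacute
```

The Cyrillic letter takes one byte in a Cyrillic code page, two in UTF-8, and seven as a numeric escape sequence.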
B. Imagine another scenario on your computer:
So, what happened this
time?
In the first instance, the original file was created using a non-core Windows font that isn’t installed on your computer. In this case, you will have to install the correct font on your machine before Word will allow you to open it; Word does not embed these fonts in the document itself because some font files might be as large as 22MB each. Normally, a document file with an embedded TT font will only increase the size of the file by the size of the .ttf font[11].
In the second instance,
the file will open in read-only mode.
That’s because Word cannot generate all the characters that you may wish
to use in editing the document. Not
very helpful, but that’s the way Word is designed[12].
In the third case, the
file opens but characters are not displayed correctly due to characters not
mapping to any particular value; hence, Word cannot generate the proper
character to display.
Think of it this way: a font is a table of glyphs and the value of the character is the pointer into the table. When the default font in Word does not have a particular value in the table, there is no character for it to display. Thus, the default character to display is a square.
Clearly, there is text to display, but Word does not have a font mapped
to the original character in the file, so it has no choice other than to
display a square. The solution is to
install the correct font, such as Baltica (55 Kbytes), which came embedded in
the original text document.
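The table-and-pointer picture can be sketched as a dictionary lookup with a fallback. Everything in this example is hypothetical: the two-entry table `toy_font`, the function `render`, and the choice of the white square U+25A1 as the missing-glyph marker. Real fonts and Word's rendering are far more involved.

```python
# A toy "font": a table mapping character values to glyphs.
# The table contents and the names `toy_font` and `render` are hypothetical.

MISSING_GLYPH = "\u25A1"  # white square, shown when no glyph exists

toy_font = {
    0x0041: "A",  # Latin capital A
    0x0042: "B",  # Latin capital B
}

def render(text: str, font: dict) -> str:
    """Look up each character value; fall back to a square."""
    return "".join(font.get(ord(ch), MISSING_GLYPH) for ch in text)

print(render("AB", toy_font))       # AB
print(render("A\u0410", toy_font))  # A□
```

Installing the correct font corresponds to adding the missing entries to the table.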
But, what can you do
about those êèðèëëèöà? It may
just be a simple matter of highlighting the text in the original file and
clicking on Baltica[13]
in the font window for Word2002. The
strange text should go away and the Cyrillic text that you were expecting will
be there. If it does not, then the text
may have been typed into the document using yet another non-Unicode embedded
font such as Times New Roman (OT, 340 KB) which has both English and Cyrillic
characters. When it was changed, it
mapped correctly, but could not survive another conversion. The solution is to rekey those portions of
the text that do not display correctly, if there aren’t many, or use the Fix
feature.
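The strange text above is what results when bytes meant for a Cyrillic table are read through a Latin1 table. The sketch below assumes the text was stored as Windows Code Page 1251 bytes, which is consistent with the garbled sample but is not stated in the original file. It mirrors in code what switching to the correct font does on screen.

```python
# The same nine bytes, read through two different tables.
raw = bytes.fromhex("eae8f0e8ebebe8f6e0")

as_latin1 = raw.decode("latin-1")  # what a Latin1 table shows
as_cp1251 = raw.decode("cp1251")   # what a Cyrillic table shows

print(as_latin1)  # êèðèëëèöà
print(as_cp1251)  # кириллица

# Repairing already-garbled text: go back to bytes, then re-read.
repaired = as_latin1.encode("latin-1").decode("cp1251")
print(repaired == as_cp1251)  # True
```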
C. Imagine a third scenario of trying to
convert a Word document to HTML:
a) “File Name” can be changed, if you wish
b) “Save as Type:” and select Web Page (*.htm; *.html)
a) On the “General Options” tab: click off “Disable”
b) Click on the “Encoding” tab, and check “Save this document as;” and then
c) Scroll and select the non-intuitive option of US-ASCII[14]
What happened this time around? By selecting these options, the Word document was saved in a browser-neutral file format. Specifically, Word saved the English and Cyrillic characters as HTML numeric entities allowing the user to select the correct display of characters from inside their browser.
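The effect of saving as US-ASCII with numeric entities can be imitated in Python. This is a sketch under stated assumptions: the sample string is invented for illustration, and `html.unescape` stands in for the decoding a browser performs. It is not a reproduction of Word's actual output.

```python
import html

body = "Hello, \u041f\u0440\u0438\u0432\u0435\u0442"  # "Hello, Привет"

# Every non-ASCII character becomes a numeric entity, so the
# result contains only 7-bit US-ASCII characters.
ascii_html = body.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_html)
# Hello, &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;

print(ascii_html.isascii())  # True

# A browser reverses the step; html.unescape stands in for it here.
print(html.unescape(ascii_html) == body)  # True
```

Because the file holds nothing but US-ASCII, it passes unchanged through any system that understands 7-bit text, and the Cyrillic is recovered only at display time.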
Recommendations:
ACKNOWLEDGEMENTS
I especially wish to
thank Ralph LeVan of OCLC's Office of Research for initially talking with me
about these problems, showing me the actual encoding that was going on, and
then explaining it again. In addition,
I appreciate Andy Houghton helping me to understand this matter at a greater
depth as well as taking the time to check and correct the statements above.
Appendix A. Other related international issues
1. UTF-8 is space efficient for English text because it encodes each US-ASCII character into a single 8-bit byte. “BTW, Word 2000 and Word 2002 (XP) allow easier control over this process. When you Save As…in the dialog box you get to either browse to a file or type a file name, there is a non-obvious Tools menu, Top right. After you select the .html file type, then click on the Tools menu and select the Web Options…menu choice. That brings up another dialog that allows you to specify any encoding you desire for the HTML file.”[15]
2. USM-94 aka Marc21 or the Marc8 character set (8 bit bytes). USM used a series of escape sequences that redefine the semantics of the bytes. So, for example, a hex 41 in USMARC-94 falls within the Cyrillic language and is the lower case Russian a.
3. CP1251 refers to the Microsoft Windows Code Page 1251; for particular values, see http://www.cyrillic.com/ref/cyrillic/cp1251.html. It’s the standard encoding for internal machine hardware on Microsoft Windows platforms. Hence, some programs may care. See the following link for a test of the correct encoding, http://www.relcom.ru/English/Russification/WinNetscape/test1251.html.
4. KOI8R is another external encoding scheme. Russian email often arrives using this
encoding scheme and may or may not be converted automatically by Microsoft
Outlook[16]
when receiving or sending mail.
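The difference between CP1251 and KOI8R shows up as soon as one letter is encoded both ways. A minimal sketch, assuming Python's built-in `cp1251` and `koi8_r` codecs:

```python
cyrillic_small_a = "\u0430"  # Cyrillic small letter a

cp1251_byte = cyrillic_small_a.encode("cp1251")
koi8r_byte = cyrillic_small_a.encode("koi8_r")

print(cp1251_byte.hex())  # e0
print(koi8r_byte.hex())   # c1

# Reading KOI8-R bytes as CP1251 yields a different letter.
print(koi8r_byte.decode("cp1251"))  # Б
```

This is why mail that is not converted on arrival displays as the wrong Cyrillic letters rather than as squares.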
[1] Of course, you can plug another keyboard into your laptop. For example, you might want to have the Cyrillic characters on the keys in front of you. In that case, then, you would want to purchase multi-lingual English/Cyrillic keyboard caps (from Ron Pusl Key Connections at 800- 870-1369—or point your browser to http://www.customkeys.com), or an entire keyboard from them, or a truly Russian keyboard in either a YAWERTY layout or the more standard Russian typewriter layout called JTsUKENG--see examples, at http://www.siber.org/sib/russify/fundamentals.html.
[2] If this icon does not appear in Windows XP, then go to Start, Control Panel, Regional and Language Options, Languages, Details, Add, Input Language: Russian, OK, Apply and then OK.
[3] For “A History of TrueType” by Laurence Penney, see http://www.truetype.demon.co.uk/tthist.htm.
[4] One difference is the underlying mathematics used to describe curves—TrueType uses simpler quadratic B-splines while PostScript uses cubic Bézier curves. Unlike bit-mapped fonts (i.e., .fon or .fnt), which are “show this,” TrueType (i.e., .ttf) and PostScript fonts (i.e., .ps or .pfm) work in the “do this” manner. For more about the similarities and differences, see Thomas W. Phinney’s “TrueType, PostScript Type 1, & OpenType: What’s the Difference?” at http://www.font.to/downloads/TT_PS_OT.pdf
[5] The hexadecimal number system (i.e., base 16) is based on 16 digits instead of 10, with the letters A through F used in place of 10 through 15. For example,
DECIMAL SYSTEM 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ETC.
HEXADECIMAL 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 ETC.
[6] See “Unicode: The Wide-Character Set” at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_crt_unicode.3a_.the_wide.2d.character_set.asp.
[7] Encoding simply means that characters (glyphs) must be represented by some physical means—bits or bytes.
[8] In languages such as Chinese, Japanese, and Korean (CJK), a string of ten characters will not be ten Latin characters wide on the screen. But, that’s another issue.
[9] For more information about this OpenType font, see http://www.microsoft.com/typography/fonts/fonttest.asp?FID=24&FNAME=Arial%20Unicode%20MS&FVER=0.84
[10] For more information about this OpenType font, see http://www.microsoft.com/typography/fonts/fonttest.asp?FID=56&FNAME=Lucida%20Sans%20Regular&FVER=1.50
[11] One can embed TrueType fonts by selecting the Tools menu, clicking Options, selecting the Save tab, and then clicking Embed TrueType fonts.
[12] One possible workaround might be to embed some well-known pangram (i.e., a sentence that uses every letter of the alphabet) plus the numerals and punctuation into a document, according to Andrew Houghton of OCLC’s Office of Research.
[13] Designed digitally by Alexander Tarbeev in 1988 at Polygraphmash, which in turn was based on a 1951-52 version of Candida (Ludwig & Mayer, 1936).
[14] US-ASCII is 7-bit ASCII, encoding 128 characters. USMARC is a variant 7-bit ASCII, aka standard ANSI X3.4 or ANSEL.
[16] Navigator version 4.0 and higher as well as MS IE 3.0 and higher automatically convert it; see http://www.siber.org/sib/russify/fundamentals.html.