Ask Leo! by Leo A. Notenboom

Why do I get odd characters instead of quotes in my documents?

Summary: The way characters are represented within computer documents and email isn't always the same everywhere, and things often get misinterpreted.

I have noticed for years that certain emails and documents have strange characters where punctuation and other characters should be. An example is this word: yesterday’s, where the characters ’ should clearly be an apostrophe. Why is this happening, and what can I do to stop it? I suspect that it happens more often when the originating computer system is a Mac.

It's all about character encoding.

And that simple sentence represents a bit of complexity.

Let me cover a few concepts, and throw out a few tips on how it can sometimes be avoided.

Encoding

As I've discussed before, typically in the context of email, there are several ways to "encode" the characters - the letters and numbers and symbols - you see on the screen.

The fundamental concept is that all characters are actually stored as numbers. The uppercase letter "A", for example, is the number 65. "B" is 66, and so on.
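
If you'd like to see those numbers for yourself, here's a minimal sketch in Python (my choice of language for illustration; any language with an equivalent lookup would do). The variable names are made up for the example.

```python
# ord() gives the number stored for a character; chr() goes the other way.
code_for_a = ord("A")
code_for_b = ord("B")
char_for_67 = chr(67)

print(code_for_a, code_for_b, char_for_67)  # 65 66 C
```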

The "ASCII" character set or encoding stores each character in a single byte. A byte can hold values from 0 to 255, enough for 256 different characters, but ASCII itself only uses 7 bits of that byte, or values from 0 to 127. The most common true 8-bit encoding used on the internet today is "ISO-8859-1".

The problem, of course, is that there are way more than 256 possible characters. While we might spend most of our time with common characters like A-Z, a-z, 0-9 and a handful of punctuation, in reality there are thousands of other possible characters - particularly if you think globally.

At the other end of the spectrum is the "Unicode" encoding, which uses two (or more) bytes, giving many more possible different characters. "A" is still 65, but if we look at it in hexadecimal the single-byte ASCII "A" is 41, while the two-byte Unicode "A" is 0041.
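
Here's a rough sketch of that one-byte versus two-byte difference. I'm assuming Python, and I'm using its "utf-16-be" codec as a stand-in for "two-byte Unicode"; that codec name is my assumption, not something the description above specifies.

```python
letter = "A"

# One byte per character in ASCII.
one_byte = letter.encode("ascii")

# Two bytes per character (for common characters) in big-endian UTF-16.
two_bytes = letter.encode("utf-16-be")

ascii_hex = one_byte.hex()
unicode_hex = two_bytes.hex()

print(ascii_hex)    # 41
print(unicode_hex)  # 0041
```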

At this point, it should be clear that switching from ASCII to Unicode would immediately double the size of every email, every document, and everything else that stored text. Possible, and in some cases even the right solution, but when you consider that the majority of communications, particularly in the western world, focus on the basic Roman alphabet and a few numbers and punctuation, it starts to seem wasteful.

Enter "UTF-8", for "8 bit Unicode Transformation Format".

In UTF-8 the entire Unicode character set is broken down by an algorithm into byte sequences that are either 1, 2, 3 or 4 bytes long. The reason is simple: the vast majority of characters in common usage in Western languages fall into the 1 byte range. Messages remain smaller, but should one of those "other" characters be needed it can be incorporated by using its "longer" representation.
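
To make the 1, 2, 3 or 4 byte idea concrete, here's a small sketch (Python again, with sample characters I picked myself) that counts how many bytes UTF-8 needs for each one.

```python
# Sample characters chosen for illustration, written as escapes so the
# example doesn't depend on how this page itself is encoded.
samples = {
    "A": "A",                    # basic Latin letter
    "e-acute": "\u00e9",         # accented Latin letter
    "euro sign": "\u20ac",       # currency symbol
    "emoji": "\U0001F600",       # a character outside the first 65,536
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, length in utf8_lengths.items():
    print(name, length)
```

The plain "A" takes one byte, exactly as it would in ASCII; only the less common characters pay for the longer forms.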

All that is a lot of back story to the problem.

Mis-Interpretation

When you see funny characters, it's most likely because data encoded using UTF-8 is being interpreted as ISO-8859-1.

Let's use an example: that apostrophe.

First, let's be clear as mud: there are apostrophes, and apostrophes. In reality the characters we often refer to as apostrophes could be:

  • the apostrophe: (')

  • the acute accent: (´)

  • the grave accent: (`)

  • the right single quote: (’)

  • the left single quote: (‘)

(Those might look similar, different, or not appear at all depending on the fonts and character sets available on your computer. I told you this was complex.)

Each, of course, has a different encoding. Let's take the right single quote (for reasons I'll explain below):

  • ASCII: doesn't exist

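  • ISO-8859-1: doesn't exist (0xB4 is the acute accent; Windows-1252, Microsoft's extension of ISO-8859-1, puts the right single quote at 0x92 in hexadecimal)

  • Unicode: 0x2019 in hexadecimal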

  • UTF-8: 0xE28099

I don't expect you to care about the actual numbers there, but simply notice how dramatically different they are.
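
If you do care about the numbers, here's a hedged sketch that asks Python's built-in codecs how each encoding represents the right single quote. The codec names ("latin-1" for ISO-8859-1, "cp1252" for Windows-1252, "utf-16-be" for two-byte Unicode) and the helper try_encode are my own choices for illustration.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

def try_encode(text, encoding):
    """Return the encoded bytes as hex, or None if the encoding lacks the character."""
    try:
        return text.encode(encoding).hex()
    except UnicodeEncodeError:
        return None

results = {
    name: try_encode(quote, name)
    for name in ("ascii", "latin-1", "cp1252", "utf-16-be", "utf-8")
}

for name, hex_bytes in results.items():
    print(name, hex_bytes)
```

A result of None means that encoding simply has no way to write the character.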

Now, what happens when the UTF-8 series of numbers is interpreted as if it were ISO-8859-1?

’

Look familiar?

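0xE28099 breaks down as 0xE2 (â), 0x80 (€) and 0x99 (™). What was one character in UTF-8 (’) gets mistakenly displayed as three (’) when misinterpreted as ISO-8859-1. (Strictly speaking, the € and ™ come from Windows-1252, which most software quietly uses when text is labeled ISO-8859-1; in pure ISO-8859-1 those two byte values are invisible control characters.)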
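
You can reproduce the whole mix-up in a few lines. This is only a sketch: I'm assuming Python and its "cp1252" codec, since that's how most programs actually treat text labeled ISO-8859-1.

```python
original = "yesterday\u2019s"        # written with a right single quote

# The sender stores the text as UTF-8 bytes...
raw_bytes = original.encode("utf-8")

# ...and the receiver wrongly reads those bytes as Windows-1252.
garbled = raw_bytes.decode("cp1252")
print(garbled)

# Undoing the mistake: turn the garbled text back into the same bytes,
# then read them as the UTF-8 they always were.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The repair step only works when nothing was lost along the way; once a program has replaced the odd characters with question marks or boxes, the original bytes are gone.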

The Culprits

There are typically two.

Email programs: email messages can include, as part of the header information you don't see, the type of encoding used to represent the contents of the message. The problem is that some get it wrong, or, as you compose mail you enter characters that cannot actually be represented by the current encoding scheme. In the latter case the email program has to do "something", and that may include sending the character anyway, in one encoding scheme, even though the message is flagged as being in another.
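
For the curious, here's a small sketch of that hidden header, using Python's standard email library. The subject line and message text are invented for the example; your own mail program may build its headers differently.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Encoding test"

# Ask for the body to be stored and labeled as UTF-8.
msg.set_content("yesterday\u2019s news", charset="utf-8")

# The Content-Type header is where the encoding is declared.
print(msg["Content-Type"])

declared_charset = msg.get_content_charset()
print(declared_charset)  # utf-8

# A reader that honors the declared charset gets the quote back intact.
decoded_body = msg.get_content()
```

When the label and the actual bytes agree, as they do here, the receiving program has what it needs to show the right character.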

I can hear you saying "but I didn't type in any special characters!".

Use Word to edit your email or your web page? Then you probably did. Microsoft Word is culprit number 2.

In particular, the "Smart Quotes" option in Word will often replace a plain apostrophe (') with a left single quote (‘) or - as we saw above - a right single quote (’). When that gets sent in or displayed using ISO-8859-1 encoding, you get the results above.

The solution? Ideally, watch what you're typing. I know that "Smart Quotes", while nice in printed documents, causes me enough grief elsewhere that it's one of the first options I turn off when configuring Microsoft Word.
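
If the text has already been typed with Smart Quotes on, one workaround is to swap the curly characters back to plain ones before sending. Here's a minimal sketch in Python; the table SMART_TO_PLAIN and the function plain_quotes are hypothetical names of my own, and this is not something Word or your email program does for you.

```python
# Map the four common "smart" quote characters to their plain equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def plain_quotes(text):
    """Return text with smart quotes replaced by plain ASCII quotes."""
    return text.translate(SMART_TO_PLAIN)

print(plain_quotes("yesterday\u2019s \u201cnews\u201d"))  # yesterday's "news"
```

Plain quotes exist in ASCII, so they survive no matter which of the encodings above the other end assumes.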

If you can, configure your email program to send in UTF-8 encoding (many, if not most, don't make this easily configurable).

But regardless of how you got here, at least now you'll know why.

Article C3868 - September 13, 2009

Recent Comments
10 Comments

I just liked to blame Word, because MSFT is an easy target. Thanks for the detailed explanation.

Posted by: Greg Bulmash at September 13, 2009 11:59 PM

Outlook is culprit #3, since it can be configured (maybe the default config) to use Word as the email editor.

Posted by: Shawn at September 15, 2009 10:30 AM

The first clear explanation I've seen for why I get the strange "a Euro-sign TM" characters -- which I see a lot. Thanks.

Posted by: Don Taber at September 15, 2009 11:36 AM

UTF-8 - I need a lesson on this too. Many thanks

Posted by: Sue at September 15, 2009 1:30 PM

I rarely get this in emails. However, I come across this very often while navigating the web. If a site has SOME, for example, Japanese type characters included, I get the little boxes with numbers. Even if I use Google's "translate this page" function there will still be "little boxes w/ numbers in them". Can be really frustrating.

Posted by: Dan at September 15, 2009 4:11 PM

There's another type of cause for these unanticipated character swaps that database developers are accustomed to dealing with. Each database system has it's own set of special characters, which need to be 'escaped' with other special characters or sequences whenever they are used in a text representation - in order to prevent them from being interpreted as instructions. When working with multiple database types, it's easy to use the wrong escape sequence. Also, escaping can be overlooked, which, technically is a 'bug'. These are just some additional reasons quote characters get messed up - especially on mass-produced interactive web pages.

Posted by: Chris at September 15, 2009 6:41 PM

I think we've all been seeing these more lately, and your explanation was about as clear as it could be, I guess. It was also the first I've ever seen where anybody even attempted to detail it out - and now I see why. Good job!

One thing, though - you didn't tell us what you thought would be the best "fix". What should savvy PC folks be working towards?

Good question, and I'm not sure there is a simple fix. Ideally, I suppose, we'd all just use a single character encoding, like perhaps Unicode, all the time. Today I guess I'd be happy if everyone just settled on UTF-8, but because of all the different combinations of systems, tools and legacy documents it's not very likely.
Leo
16-Sep-2009
Posted by: Michael Smith at September 15, 2009 9:36 PM

Not helped, of course, by SOME people not knowing when (and when not) to use an apostrophe !

"It's" - as shown a cuppla times on this page - Acksherly is the abbreviation for "it is". The possesive "its" has no apostrophe.

<grin>

It's also my Achilles heel. Fortunately I have about three people that pounce every time I get it wrong.
Leo
16-Sep-2009
Posted by: Robin Clay at September 16, 2009 4:25 AM

It's worse than that.

'Unicode' is an encoding of characters as integers, which it calls code points.

UTF-8 is a method of encoding each code point as a variable number of (8-bit) bytes, at least one and possibly as many as three.

UCS-2 is actually a subset of Unicode, which encodes the first 64,000 code points (there are more!) as two bytes.

Where UTF-8 needs only one byte, UCS-2 wastes one byte; however, the fixed character width of UCS-2 is easier to process for such apps as Word.

ISO-8859-1 is a one-byte encoding that happens to be identical to UTF-8, for the first 128 characters only. If an interpreter thinks it's looking at 8859-1, it goes wrong when it sees a byte with the top bit set (i.e. a character beyond 127). And vice versa of course.

If you want more, there's an extensive article on Unicode at Wikipedia. It may not help much. I have the problem with Thunderbird, which allows me to specify the character encoding, though it will Auto-Detect. I suspect the problem lies at the sender, for example, by pasting UCS-2 text into a UTF-8 message.

Posted by: James at September 16, 2009 9:11 AM

When I received an email (in Eudora) with those odd characters in it, I copied it into MS word - the correct characters appeared. Then I copied it back into Eudora and the weird characters were gone!

Posted by: George Jensen at September 30, 2009 5:51 PM
