Helping people with computers... one answer at a time.
The way characters are represented within computer documents and email isn't always the same everywhere, and things often get misinterpreted.
I have noticed for years that certain emails and documents have strange characters where punctuation and other characters should be. An example is this word: yesterdayâ€™s Where the characters â€™ should clearly be an apostrophe. Why is this happening and what can I do to eliminate this occurring? I suspect that it happens more often when the originating computer system is a Mac.
It's all about character encoding.
And that simple sentence represents a bit of complexity.
Let me cover a few concepts, and throw out a few tips on how it can sometimes be avoided.
Encoding
As I've discussed before, typically in the context of email, there are several ways to "encode" the characters - the letters and numbers and symbols - you see on the screen.
The fundamental concept is that all characters are actually stored as numbers. The uppercase letter "A", for example, is the number 65. "B" is 66, and so on.
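As a quick illustration, here is a minimal sketch in Python 3 (my choice of language for the examples, not something the article depends on) that looks up the numbers behind a couple of letters and converts a number back into a letter. The variable names are just for illustration.

```python
# Every character has a number behind it; ord() and chr() convert between them.
code_a = ord("A")   # character -> number
code_b = ord("B")
letter = chr(65)    # number -> character

print(code_a, code_b, letter)  # prints: 65 66 A
```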
The "ASCII" character set is normally stored one character per byte. A byte can hold values from 0 to 255, or up to 256 different characters, but ASCII itself only uses 7 bits of that byte: values from 0 to 127. (The most common true 8-bit encoding used on the internet today is "ISO-8859-1".)
The problem, of course, is that there are way more than 256 possible characters. While we might spend most of our time with common characters like A-Z, a-z, 0-9 and a handful of punctuation, in reality there are thousands of other possible characters - particularly if you think globally.
At the other end of the spectrum is "Unicode", which in its most straightforward encodings uses two (or more) bytes per character, giving many more possible different characters. "A" is still 65, but if we look at it in hexadecimal the single-byte ASCII "A" is 41, while the two-byte Unicode "A" is 0041.
At this point, it should be clear that switching from ASCII to two-byte Unicode would immediately double the size of every email, every document, and everything else that stored text. Possible, and in some cases even the right solution, but when you consider that the majority of communications, particularly in the western world, focus on the basic Roman alphabet and a few numbers and punctuation, it starts to seem wasteful.
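To make the doubling concrete, here is a small sketch that encodes the same short text both ways. It assumes Python 3 and uses "utf-16-be" as a stand-in for the two-byte Unicode form described above; the sample text is made up.

```python
text = "Hello, world"

ascii_bytes = text.encode("ascii")       # one byte per character
utf16_bytes = text.encode("utf-16-be")   # two bytes per character, no byte-order mark

print(len(ascii_bytes), len(utf16_bytes))  # prints: 12 24

# The letter "A" on its own, shown in hexadecimal in each form.
a_ascii_hex = "A".encode("ascii").hex()
a_utf16_hex = "A".encode("utf-16-be").hex()
print(a_ascii_hex, a_utf16_hex)            # prints: 41 0041
```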
Enter "UTF-8", for "8 bit Unicode Transformation Format".
In UTF-8 the entire Unicode character set is broken down by an algorithm into byte sequences that are either 1, 2, 3 or 4 bytes long. The reason is simple: the vast majority of characters in common usage in Western languages fall into the 1 byte range. Messages remain smaller, but should one of those "other" characters be needed it can be incorporated by using its "longer" representation.
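A short sketch, assuming Python 3, shows the variable lengths in practice. The four sample characters are my own picks (a plain letter, an accented letter, the right single quote, and an emoji), chosen to land in the 1, 2, 3 and 4 byte ranges respectively.

```python
# One sample character for each UTF-8 length.
samples = ["A", "\u00e9", "\u2019", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# prints:
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3a9
# U+2019 -> 3 byte(s): e28099
# U+1F600 -> 4 byte(s): f09f9880
```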
All that is a lot of back story to the problem.
Mis-Interpretation
When you see funny characters, it's most likely because data encoded using UTF-8 is being interpreted as ISO-8859-1.
Let's use an example: that apostrophe.
First, let's be clear as mud: there are apostrophes, and apostrophes. In reality the characters we often refer to as apostrophes could be:
the apostrophe: (')
the acute accent: (´)
the grave accent: (`)
the right single quote (’)
the left single quote (‘)
(Those might look similar, different, or not appear at all depending on the fonts and character sets available on your computer. I told you this was complex.)
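Each, of course, has a different encoding. Let's take the right single quote (for reasons I'll explain below):
ASCII: doesn't exist
ISO-8859-1: doesn't exist either (0xB4 is the acute accent; Windows-1252, the Microsoft variant often used in place of ISO-8859-1, has the right single quote at 0x92 in hexadecimal)
Unicode: 0x2019 in hexadecimal
UTF-8: 0xE28099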
I don't expect you to care about the actual numbers there, but simply notice how dramatically different they are.
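If you'd like to check how the right single quote fares in each encoding yourself, here is a minimal sketch assuming Python 3. The helper name try_encode is hypothetical; "latin-1" is Python's name for ISO-8859-1 and "cp1252" is its name for Windows-1252.

```python
quote = "\u2019"  # the right single quote

def try_encode(text, encoding):
    """Return the encoded bytes as hex, or None if the encoding lacks the character."""
    try:
        return text.encode(encoding).hex()
    except UnicodeEncodeError:
        return None

print(f"Unicode code point: U+{ord(quote):04X}")  # prints: Unicode code point: U+2019

results = {}
for name in ("ascii", "latin-1", "cp1252", "utf-8"):
    results[name] = try_encode(quote, name)
    print(name, results[name])

# prints:
# ascii None
# latin-1 None
# cp1252 92
# utf-8 e28099
```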
Now, what happens when the UTF-8 series of numbers is interpreted as if it were ISO-8859-1?
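â€™
Look familiar?
0xE28099 breaks down as 0xE2 (â), 0x80 (€) and 0x99 (™). What was one character in UTF-8 (’) gets mistakenly displayed as three (â€™) when misinterpreted as ISO-8859-1. (Strictly speaking, 0x80 and 0x99 are invisible control codes in ISO-8859-1 itself; the € and ™ come from Windows-1252, which many programs quietly use whenever ISO-8859-1 is requested.)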
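The misinterpretation is easy to reproduce. The sketch below, assuming Python 3, encodes the word from the question as UTF-8 and then decodes those same bytes as Windows-1252 ("cp1252" in Python), the ISO-8859-1 variant that many programs actually use. It also shows that the damage can be undone as long as no bytes were lost along the way.

```python
original = "yesterday\u2019s"           # the word with a right single quote

raw = original.encode("utf-8")          # what gets stored or sent
garbled = raw.decode("cp1252")          # what a reader sees with the wrong encoding
print(garbled)                          # prints: yesterdayâ€™s

# Pure ISO-8859-1 turns the last two bytes into invisible control codes instead.
garbled_latin1 = raw.decode("latin-1")
print(repr(garbled_latin1))             # prints: 'yesterdayâ\x80\x99s'

# Reversing the mistake: re-encode with the wrong encoding, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)             # prints: True
```

The repair step only works when the text was garbled exactly once and every byte survived; text that has been through several conversions may need more care.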
The Culprits
There are typically two.
Email programs: email messages can include, as part of the header information you don't see, the type of encoding used to represent the contents of the message. The problem is that some programs get it wrong, or you enter characters as you compose mail that cannot actually be represented by the current encoding scheme. In the latter case the email program has to do "something", and that may include sending the character anyway, in one encoding scheme, even though the message is flagged as being in another.
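To show what that hidden header looks like when things go right, here is a small sketch using the email package from Python 3's standard library. The subject and body are made up for the example; real mail programs build their messages in their own ways.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

body = "It\u2019s here."                # contains a right single quote

# Sender side: flag the message as UTF-8 and encode the body to match.
msg = EmailMessage()
msg["Subject"] = "Encoding test"
msg.set_content(body, charset="utf-8")
print(msg["Content-Type"])              # the header that names the encoding

# Receiver side: read the declared encoding and decode the body with it.
received = message_from_bytes(msg.as_bytes(), policy=policy.default)
declared = received.get_content_charset()
decoded = received.get_content().strip()
print(declared, decoded == body)        # prints: utf-8 True
```

When the declared encoding and the actual bytes agree, as they do here, the apostrophe survives the trip. The funny characters show up when they don't.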
I can hear you saying "but I didn't type in any special characters!".
Use Word to edit your email or your web page? Then you probably did. Microsoft Word is culprit number 2.
In particular, the "Smart Quotes" option in Word will often replace a plain apostrophe (') with a left single quote (‘) or - as we saw above - a right single quote (’). When that gets sent in or displayed using ISO-8859-1 encoding, you get the results above.
The solution? Ideally, watch what you're typing. I know that "Smart Quotes", while nice in printed documents, causes me enough grief elsewhere that it's one of the first options I turn off when configuring Microsoft Word.
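If the text has already been typed with Smart Quotes on, one option is to swap the curly characters back to plain ones before sending or publishing. This is only a sketch, assuming Python 3; the function name plain_quotes and the choice of which characters to replace are mine, not a feature of Word or of any email program.

```python
# Map the four common "smart" quote characters to their plain ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def plain_quotes(text):
    """Return text with smart quotes replaced by plain ASCII quotes."""
    return text.translate(SMART_TO_PLAIN)

cleaned = plain_quotes("yesterday\u2019s \u201cnews\u201d")
print(cleaned)  # prints: yesterday's "news"
```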
If you can, configure your email program to send in UTF-8 encoding (many, if not most, don't make this easily configurable).
But regardless of how you got here, at least now you'll know why.
Article C3868 - September 13, 2009
First, thank you, Leo. I'd read this article before which provided a basic understanding. Now I need to address it, so the extended explanation gives me a direction.
Second, Dick, your automated response reminds ME of another automated response: "Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime."
Lazy people want a singular, simplistic answer to their singular local problem. Since it's never singular, and it's never simplistic, and the audience SURE isn't local, Leo's site is a learning experience that helps deal with future problems encountered that are related to an earlier one.
Since my issue can't be resolved with one simple answer, I've learned how to apply the general principle to dealing with it. And if I still can't, at least now I'm armed with more information to search for an answer that does apply to my particular situation.
Posted by: Mike at April 26, 2011 12:13 PM
Thanks for your answer, Leo. That was very informative.
And for those who complained, the question did not ask how to fix the problem, it asked why the characters were appearing.
If you are seeing the â€™ characters, chances are the fault is with the source, not your browser. The source was probably written using something like Word that uses "smart quotes" and when converted to HTML those smart quotes were simply converted into what the coding conversion indicated should be used. In the case of a smart apostrophe, it is converted to the â€™ characters.
When creating HTML code, be sure to use a text-based editor or convert it into plain text before saving it as HTML.
As far as what shows up on someone else's web page, if they have a comment page or a means to contact them, bring it to their attention in a courteous and humble manner, and perhaps they will fix the problem.
Posted by: Jonathan at June 19, 2011 9:51 AM
Leo's info is correct but I finally found the solution. Open WORD, go to TOOLS then AutoCorrectOptions. Select the AutoFormat Tab and de-select "Plain Text Wordmail documents" under Always Autoformat. This stops reformatting of a plain text email message!!
Posted by: Jaxsun at June 29, 2011 6:49 AM
Thanks Leo for this explanation. Now (I think) I understand what's going on. My client has a website that is Joomla (a Content Management System, or CMS) based, which stores page content in an SQL database (in this case, MySQL). They switched over to a new hosting provider back in November 2011, and their content (about 1400 articles, newsletters, ezines, etc) was displaying just fine - until last weekend. Someone inadvertently deleted the domain and the files, directories, etc. and they had to be restored. After they were restored, and Joomla re-installed, suddenly these strange characters started popping up in almost every single article - yup, you got it, the â€™ in place of every single apostrophe. So I'm trying to figure out how to help them resolve this issue, without going in to over 1400 documents and editing/modifying them all manually. I first checked Joomla, and it's got its encoding character set as UTF-8. Then I checked MySQL, thinking it must be set for some different character encoding set, but no, it's actually UTF-8 also! But when I look into the MySQL database and run search queries against the entire database for the ' â€™ ' designation, sure enough, it finds that exact string, right throughout the database. Hundreds and hundreds of places.
Posted by: Scott at January 28, 2012 12:51 PM
So, I'm trying to figure out how the data got there like that in the first place, and it must be that, PRIOR to the "restore", it was being stored and displayed, encoded and decoded, as ISO-8859-1 because it was being displayed properly. So now what can I do to fix the problem?!? Anyone?!? HELP?!?
@Scott
Posted by: connie at January 28, 2012 1:36 PM
Does a new article end up with the characters, or just the restored database? Since you can find the characters in the database, you could do a search and replace for them all. But if new documents are having the same error, you are going to have to go deeper.