I just liked to blame Word, because MSFT is an easy target. Thanks for the detailed explanation.
Posted by: Greg Bulmash at September 13, 2009 11:59 PM
Outlook is culprit #3, since it can be configured (maybe the default config) to use Word as the email editor.
Posted by: Shawn at September 15, 2009 10:30 AM
The first clear explanation I've seen for why I get the strange "a Euro-sign TM" characters -- which I see a lot. Thanks.
Posted by: Don Taber at September 15, 2009 11:36 AM
UTF-8 - I need a lesson on this too. Many thanks
Posted by: Sue at September 15, 2009 1:30 PM
I rarely get this in emails. However, I come across it very often while navigating the web. If a site has SOME, for example, Japanese-type characters included, I get the little boxes with numbers. Even if I use Google's "translate this page" function, there will still be "little boxes w/ numbers in them". Can be really frustrating.
Posted by: Dan at September 15, 2009 4:11 PM
There's another type of cause for these unanticipated character swaps that database developers are accustomed to dealing with. Each database system has it's own set of special characters, which need to be 'escaped' with other special characters or sequences whenever they are used in a text representation - in order to prevent them from being interpreted as instructions. When working with multiple database types, it's easy to use the wrong escape sequence. Also, escaping can be overlooked, which, technically is a 'bug'. These are just some additional reasons quote characters get messed up - especially on mass-produced interactive web pages.
Posted by: Chris at September 15, 2009 6:41 PM
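To make Chris's point about escaping concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `comments` and the sample text are made up for illustration. It assumes SQLite's rule that a single quote inside a string literal is escaped by doubling it, shows a backslash-style escape (used by some other database systems) failing there, and then shows a parameterized query, which sidesteps manual escaping altogether.

```python
import sqlite3

# In-memory database; the "comments" table is hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

text = "It's easy to get this wrong"

# Correct manual escaping for SQLite: double the single quote.
escaped = text.replace("'", "''")
conn.execute(f"INSERT INTO comments (body) VALUES ('{escaped}')")

# Wrong escape sequence for SQLite: a backslash does not escape the quote,
# so the string literal ends early and the statement no longer parses.
wrong = text.replace("'", "\\'")
wrong_escape_failed = False
try:
    conn.execute(f"INSERT INTO comments (body) VALUES ('{wrong}')")
except sqlite3.OperationalError:
    wrong_escape_failed = True

# Parameterized query: the driver passes the value separately from the SQL,
# so no escaping is needed at all.
conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))

rows = [row[0] for row in conn.execute("SELECT body FROM comments")]
print(rows)
print("backslash escape rejected:", wrong_escape_failed)
```

Where the database driver supports it, the parameterized form is usually the safer habit, since the same code then works regardless of which escape rules the database happens to use.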
I think we've all been seeing these more lately, and your explanation was about as clear as it could be, I guess. It was also the first I've ever seen where anybody even attempted to detail it out - and now I see why. Good job!
One thing, though - you didn't tell us what you thought would be the best "fix". What should savvy PC folks be working towards?
Good question, and I'm not sure there is a simple fix. Ideally, I suppose, we'd all just use a single character encoding, like perhaps Unicode, all the time. Today I guess I'd be happy if everyone just settled on UTF-8, but because of all the different combinations of systems, tools and legacy documents it's not very likely.
16-Sep-2009
Posted by: Michael Smith at September 15, 2009 9:36 PM
Not helped, of course, by SOME people not knowing when (and when not) to use an apostrophe !
"It's" - as shown a cuppla times on this page - Acksherly is the abbreviation for "it is". The possessive "its" has no apostrophe.
<grin>
It's also my Achilles heel. Fortunately I have about three people that pounce every time I get it wrong.
16-Sep-2009
Posted by: Robin Clay at September 16, 2009 4:25 AM
It's worse than that.
'Unicode' is an encoding of characters as integers, which it calls code points.
UTF-8 is a method of encoding each code point as a variable number of (8-bit) bytes, at least one and possibly as many as four.
UCS-2 actually covers only a subset of Unicode: it encodes the first 65,536 code points (there are more!) as two bytes each.
Where UTF-8 needs only one byte, UCS-2 wastes one byte; however, the fixed character width of UCS-2 is easier to process for such apps as Word.
ISO-8859-1 is a one-byte encoding that happens to be identical to UTF-8, for the first 128 characters only. If an interpreter thinks it's looking at 8859-1, it goes wrong when it sees a byte with the top bit set (i.e. a character beyond 127). And vice versa of course.
If you want more, there's an extensive article on Unicode at Wikipedia. It may not help much. I have the problem with Thunderbird, which allows me to specify the character encoding, though it will Auto-Detect. I suspect the problem lies at the sender, for example, by pasting UCS-2 text into a UTF-8 message.
Posted by: James at September 16, 2009 9:11 AM
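The mismatch James describes can be reproduced in a few lines of Python. This is a rough sketch, not a diagnosis of any particular mail program. It assumes the reader decodes the bytes as Windows-1252 (Python's `cp1252` codec), the Windows superset of ISO-8859-1 that puts the Euro sign and the TM sign in the byte range where ISO-8859-1 has control characters. That assumption is what yields the "a Euro-sign TM" sequence mentioned earlier in the thread.

```python
# A right single quotation mark (U+2019), the "curly" apostrophe
# that word processors often substitute for a plain one.
curly = "\u2019"

# Encoded as UTF-8 it becomes three bytes: E2 80 99.
utf8_bytes = curly.encode("utf-8")

# A reader that wrongly assumes one byte per character (Windows-1252 here)
# turns those three bytes into three separate characters.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # a-circumflex, Euro sign, TM sign

# UTF-8 is variable width: count the bytes for a few sample characters.
samples = {"A": "A", "e-acute": "\u00e9", "euro": "\u20ac", "emoji": "\U0001F600"}
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# If nothing was lost along the way, the damage can be undone by
# reversing the wrong step: re-encode as cp1252, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == curly)
```

The repair step only works when every garbled character survives the round trip, so it is a recovery trick rather than a fix. The real cure is for sender and reader to agree on the encoding.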
When I received an email (in Eudora) with those odd characters in it, I copied it into MS Word - the correct characters appeared. Then I copied it back into Eudora and the weird characters were gone!
Posted by: George Jensen at September 30, 2009 5:51 PM