
Content with Style

Web Technique

UTF-8: Documents with a lot of character

by Pascal Opitz on June 21 2006, 05:21

Sloppy?

Did you ever build a webpage in Homesite without encoding the HTML entities? Then, probably when the client had a look at it, all the German umlaut characters looked awkward on a Mac? And did you figure out why? It's because of the charset and the encoding of the characters in the saved file!

Charset?

Yeah, a charset! Actually a charset is the first thing you have to be aware of when you start using any kind of characters on a computer. Most of the time we still use single-byte encodings like “Latin-1” (ISO-8859-1), an extension of ASCII, where every character is encoded using 8 bits, meaning 1 byte, meaning at most 256 characters. Obviously this doesn't cover all European languages, which is why, when using Latin-1, some characters have to be escaped as entities. Or you could use another single-byte encoding than “Latin-1”; there are many of them, for languages like Turkish or Polish.

Wait a minute! Did you say european?

Yeah, and what about Asian languages? Or Russian? Or Chinese? I heard they have thousands of signs, how does that fit into those 256 possibilities?
Here we go, good question! And you're not the first one to ask it. That's why, from the information-technology side, some people moved on and created the Unicode character set. And again there are many encodings for Unicode, but two really relevant ones: UTF-16 and UTF-8. Both can represent every Unicode character, which is far more than the 256 possibilities above. But UTF-16 stores each character in 16-bit units, so a page of plain English text suddenly takes up twice the memory. That's why UTF-8 goes a more intelligent way: it saves the ASCII range as single bytes and everything beyond as multi-byte sequences. This saves memory.

Alright, got the concept, but how do I use it?

If you are coding plain HTML you can put this bit of code at the top, to ensure your browser gets it right:

 <meta http-equiv="Content-type" content="text/html; charset=utf-8">

Once you've done that you need to save the files you work on as UTF-8, which means that you need a text editor that can save UTF-8, obviously. And you won't believe how rare those are! Anyway, BBEdit or Homesite should do the job. And wordpad as well. Ohh, and for those of you who decided to stick to ISO Latin-1, please put that in as the charset then, rather than giving the browser the possibility to use the wrong encoding. For those who stick to what I said: from now on there's no need to escape German umlaut characters or Chinese signs anymore.
And if you use PHP you can set the content type with a header, before any output is sent:

 <?php
   header("Content-type: text/html; charset=utf-8");
 ?>

BOM trouble

A big source of trouble can be the BOM (byte order mark), because some older browsers don't get what it means, so they render it as characters, which has the quite painful effect of showing "" at the top of the page. My workaround for that would be a page saved in UTF-8, set as UTF-8 in the meta tag, but without the BOM. I know it sounds odd, and you have to find an editor that can do that. But once you use templates and dynamic data it might be worth the effort.
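If your templates come from files you can't re-save, you can also strip the BOM on the server side before output. Here's a minimal sketch (not from the article; `strip_utf8_bom` is a hypothetical helper name), relying on the fact that the UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF:

```php
<?php
// Hypothetical helper: strip a leading UTF-8 BOM (bytes 0xEF 0xBB 0xBF)
// from a string before echoing it, so older browsers never see it
// rendered as stray characters at the top of the page.
function strip_utf8_bom($text)
{
    if (substr($text, 0, 3) === "\xEF\xBB\xBF") {
        return substr($text, 3);
    }
    return $text;
}

// Hypothetical usage with an inlined template string:
echo strip_utf8_bom("\xEF\xBB\xBF<p>Hello</p>");
```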

Where does XSLT come in?

I think that UTF-8 really starts making sense when you start using XML and XSLT. Because suddenly, within XSL, you can define the output encoding:

 <xsl:output method="html" indent="no" encoding="utf-8"/> 

The good thing is that no matter which encoding the XML has, the output will be transformed into UTF-8. Or, if you want a specific language encoding, iso-8859-9 for example, you can feed UTF-8 data via XML into the transformation process and get the correct encoding out.
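In PHP such a transformation can be run with the XSL extension, roughly along these lines. This is a sketch assuming PHP 5 with the XSL extension enabled; the document and stylesheet are inlined only to keep the example self-contained:

```php
<?php
// Load a source document and a stylesheet. The <xsl:output> element
// in the stylesheet pins the result encoding to UTF-8, regardless of
// the encoding the source XML was stored in.
$xml = new DOMDocument();
$xml->loadXML('<page><title>Über uns</title></page>');

$xsl = new DOMDocument();
$xsl->loadXML('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no" encoding="utf-8"/>
  <xsl:template match="/page">
    <h1><xsl:value-of select="title"/></h1>
  </xsl:template>
</xsl:stylesheet>');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

// The transformed result comes out as UTF-8.
echo $proc->transformToXML($xml);
```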

Server stuff

It starts getting really interesting when a database comes on top of that. It makes perfect sense to have a UTF-8 or UTF-16 capable database, so the stored content can contain any kind of language. MySQL supports UTF-8 since version 4.1, and Microsoft SQL Server has built-in UTF-16 support (nvarchar, ntext).
The sad bit is that dealing with different encodings in PHP is not really big fun, but the iconv extension makes it possible to convert between encodings. In ASP the support for codepages is much more tightly integrated, and it takes just one line of code to set the codepage that ASP works with internally, or the charset it outputs the data with.
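A minimal sketch of the iconv route in PHP, assuming (hypothetically) a string that came out of a legacy ISO-8859-1 database column and needs to go onto a UTF-8 page:

```php
<?php
// "München" as stored in an ISO-8859-1 database column
// (0xFC is the Latin-1 byte for ü).
$latin1 = "M\xFCnchen";

// Convert it to UTF-8 before sending it to a UTF-8 page.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

header('Content-type: text/html; charset=utf-8');
echo $utf8;
```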

Comments

  • nice one. wanna see all the funny unicode characters used on this planet? http://decodeunicode.org/ wiki is the right one for you!

    by tim on May 20 2005, 15:00 - #

  • sweet!
    I like the smooth fluffy navigation through the different character sets, although, I admit, I don’t really get the idea behind all this, besides entertainment. What kind of information would you add to the wiki (besides more sets)?

    by Matthias on May 20 2005, 15:14 - #

  • I just found this site, or rather documentation, by accident. Useful for all kinds of utf8 conversion and replacement crapola.

    by Matthias on June 14 2005, 14:11 - #

  • Are you sure Homesite supports UTF-8?

    I used Homesite from 3.0 to 5.5 for years, but finally ditched it because of its lack of Unicode support. The HomeSite web page also admits that it doesn’t save Unicode (though I haven’t checked in about 6 months).

    I switched to JEdit (http://www.jedit.org) as it is also cross-platform, and highly customizable.

    One issue with the BOM problem I have found is that if you are using server side processing, e.g. ASP and XML/XSL without output set to UTF-8, it tends not to be a problem.

    by Anup Shah on June 16 2005, 14:26 - #

  • Anup, Homesite plus has UTF-8 support, but you have to activate it in the settings.
    Apart from that I didn’t really find a nice editor that supports UTF-8. Notepad does! EditPad is nice as well; I use EditPad Lite as a notepad replacement. And obviously UltraEdit supports UTF-8 and UTF-16.
    A nice tool for XSLT work is Cooktop.

    by Pascal Opitz on June 16 2005, 17:48 - #

  • Hi Pascal,

    I have a question, which I’m hoping you can help shed some light on.

    I’m just about to create a multilingual website, but very confused about the whole issue regarding UTF-8 characters.

    For instance, I am about to create a webpage to display the following language titles in their own language:

    Albanian
    Amharic
    Arabic
    Czech
    Farsi
    French
    Kurdish
    Lingala
    Mandarin
    Pashto
    Polish
    Portuguese
    Romanian
    Russian
    Somali
    Sorani
    Spanish
    Tamil
    Tigrinya
    Turkish

    Is this possible at all using plain text? Also, if a computer hasn’t got the language set installed I presume it will not display properly?

    To make sure that all the text is displayed properly, is the best way then to use .GIFs? I am, however, aware that this is not accessible.

    Hope you can point me in the right direction.

    Regards,
    Nicholas

    by Nicholas Saxlund on September 2 2005, 10:13 - #

  • Hi Nicholas.
    theoretically UTF-8 can display all these characters in one set. Obviously this would require the right fonts to display the pages correctly. You just need to make sure that you store your data the correct way and avoid conversions to ISO Latin-1. And don’t forget about the doctype, as well.

    Also, there is a method to make a browser download the right font, which I haven’t been able to test yet. But I guess this could work in your case.

    The approach of replacing characters with GIFs sounds to me like a ridiculous overhead that will be impossible to maintain.

    Another way to deal with multiple languages is to set each language version to a different language charset.
    But as you can see on the page I posted, the W3C recommends UTF-8 since it is so much more versatile.

    If you have any questions let me know.

    by Pascal Opitz on September 5 2005, 09:55 - #

  • Just playing devil’s advocate here but what about installing the right fonts on the server and generating the gifs on-the-fly server-side? As long as the docs are saved as UTF-8 the alt text will be correct for systems with the right font sets but the screen display will be correct for everyone. Or sIFR with the entire font outline stored?

    I’ve never had to deal with this so I don’t know what’s best…

    by Mike Stenhouse on September 6 2005, 07:23 - #

  • Thanks for the reply! The link you suggested to “a method to make a browser download the right font” sounds great, only problem is that it seems that the links to both TrueDoc and HexMac are no longer working.

    With regards to using UTF-8, I believe that not all the languages that I have specified are covered. Furthermore, the problem is that the majority of users will be using public web access (mainly libraries), which won’t allow the download of new character sets…

    The idea Mike suggested doesn’t sound too bad. Any idea where to find more info and if this is feasible?

    by Nicholas Saxlund on September 6 2005, 14:12 - #

  • You can read about sIFR on Mike Davidson’s site:
    www.mikeindustries.com/sifr/

    The catch is that you’d have to find a font that supports all the characters you’re going to need to use. There are probably only one or two out there… If you can find one though, the rest is easy…

    by Mike Stenhouse on September 8 2005, 07:22 - #

  • Nicholas, why would UTF-8 not cover all the languages you specified? AFAIK UTF-8 covers pretty much every language out there that has ever been defined as an ISO charset anyway.

    The problem I see with sIFR is that the Flash itself might not be able to cope with the funny characters, but I’m not sure about that one. If the site is used from libraries and public spaces, a plugin-based solution might be as bad as a missing font, since you wouldn’t be able to render the page without Flash.

    Just as a thought, and based on speculation only:
    If I am Somali and I go to a UTF-8 page where my font is not specified, just serif, sans-serif or something like that, wouldn’t the Somali copy automatically be rendered in the right font? Maybe the Chinese wouldn’t be rendered at all, but would that matter to me as a Somali user?

    I definitely know that Arial Unicode can handle Chinese and Japanese and Cyrillic, so you might have a try with this one as well.

    by Pascal Opitz on September 12 2005, 06:27 - #

  • Ohh, found that one for IE only:

    Embedded fonts testpage

    by Pascal Opitz on September 12 2005, 07:11 - #

  • I came across this interesting blog article about PHP and the BOM while checking referrers:

    http://juicystudio.com/article/utf-byte-order-mark.php

    There are a couple of nice editor suggestions in there.
    Have a read!

    by Pascal Opitz on September 26 2005, 09:13 - #

  • Homesite supports UTF-8????
    Where do I enable it? I can’t find it in the settings. There is the editor charset, but no UTF-8 option.

    by ufku on January 23 2006, 13:00 - #

  • In the preferences … see this

    by Pascal Opitz on January 30 2006, 09:40 - #

  • I have read some of the articles in this blog and I found them very interesting. English is not my first language, so I am very sorry for my horrible spelling. Now to come to the point and my question: I have a problem with some letters in the Pashto language. Some of the characters are not shown as expected. For example, U+067C (&#1660;). When I start Character Map in Windows, I can see those letters are there and the character code assigned to them is also fine. Then what is wrong?

    by Besmellah on June 21 2006, 03:34 - #

  • Besmellah, what kind of typeface are you using? Not every typeface has all characters available. Try Arial Unicode or some other unicode font and see what happens then?

    by Pascal Opitz on June 21 2006, 05:15 - #

  • “Anyway, BBedit or Homesite should do the job. And wordpad as well.”

    You meant notepad, right? Unfortunately Wordpad should be better than Notepad, but in this case it’s a lot worse :/

    by Nuno Oliveira on July 15 2006, 11:27 - #

  • Yeah, I did

    by Pascal Opitz on July 17 2006, 05:59 - #

  • I am planning to update my ecommerce application to support multilanguage.

    The point is, the system is already completed and has been deployed on many web sites. So could anyone guide me on what the approach should be? I am not at all looking forward to rewriting the whole thing again.

    If you want to see the system its here
    http://www.ecommerce-xperts.com/demo
    http://www.ecommerce-xperts.com/demo/index2.php

    I really need good suggestions please. I have been given one month’s time to provide multilanguage support in this system.

    by Designer Pro on December 1 2006, 01:48 - #

  • Hi there,

    The description of what you wanna achieve is a bit vague. Often you don’t need to change much to enable UTF-8 based output and input.
    Most of the time this can be achieved by changing the charset meta tag or the Apache/PHP header for the front and back end interfaces.

    However, this doesn’t mean that the application itself will be multilanguage, plus, unless you change the database collation, you will loose some sorting possibility as well.

    To incorporate true language support for every button and label you will have to work with configurations in some way, I am afraid, either in a config file or a database.

    by Pascal Opitz on December 1 2006, 05:41 - #


Comments for this article are closed.