PHP4/5 - LFS Hostname codepage converter v2


Victor
2nd January 2008, 23:29
The "PHP4/5 - LFS Hostname codepage converter v2" is a function that converts the different character codepages used in LFS hostnames into something you can use on a webpage (i.e. it converts everything to a single codepage, the one you use on your website).

This function is a new one. I posted one a while back that required the inclusion of a codepage conversion table, but that is no longer needed. Instead, you need to have the mbstring extension compiled into your PHP install. It takes care of all conversions.
After you have converted a hostname with this function, the only thing left to do is convert the colour codes (not done in this function).
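
Since the colour codes are left in the string, here is a rough sketch of one way to deal with them. It assumes that a colour code is a caret followed by a single digit (^0 to ^9) and that you only want to strip the codes, not render them. The helper name lfs_strip_colours() is made up for this example. It is meant to run before codepage_convert(), while a literal caret is still escaped as ^^:

```php
<?php
// Hypothetical helper, not part of the converter itself.
// Assumption: LFS colour codes are '^' followed by one digit (0-9).
function lfs_strip_colours($str) {
    // The first alternative matches an escaped caret (^^) and keeps it via $1.
    // The second alternative matches a colour code; $1 is then empty,
    // so the code is removed.
    return preg_replace('/(\^\^)|\^[0-9]/', '$1', $str);
}
```

Matching ^^ first keeps a name like a^^1b intact, because the 1 there follows an escaped caret and is not a colour code.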

Two notes about the function below :

1) You will notice this function also replaces the escaped special characters in LFS text, such as ^d (\), ^s (/), etc. This is required because not all of these characters sit at the same place in every codepage. So they need to be converted from the LFS escape to the actual character first, and then to the appropriate codepage.

2) The function takes two parameters: $str and $conv_to, which is UTF-8 by default. $conv_to is where you indicate which codepage you want to convert your hostname to.
If you use a single-byte codepage such as ISO-8859-1 on your website, you should use the HTML-ENTITIES codepage as the conversion target.
For a list of all codepages supported by mbstring, see here : http://nl3.php.net/manual/en/ref.mbstring.php
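
One caveat for newer installs: PHP 8.2 and later flag the HTML-ENTITIES target in mbstring as deprecated. A possible alternative, sketched here rather than tested against real hostnames, is to convert to UTF-8 first and then turn every non-ASCII character into a numeric entity with mb_encode_numericentity(). The helper name utf8_to_numeric_entities() is made up for this example:

```php
<?php
// Hypothetical helper: encode everything above ASCII as numeric HTML entities.
// Assumes $utf8 is valid UTF-8, e.g. the default output of codepage_convert().
function utf8_to_numeric_entities($utf8) {
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// "Café" in UTF-8; the é (U+00E9) comes out as a numeric entity.
echo utf8_to_numeric_entities("Caf\xC3\xA9"), "\n"; // prints "Caf&#233;"
```

The result is plain ASCII, so it can be embedded in a page served in any single-byte codepage.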

// LFS codepage markers (^L, ^G, ...) and the character sets they select:
// L = Latin 1
// G = Greek
// C = Cyrillic
// E = Central Europe
// T = Turkish
// B = Baltic
// J = Japanese
// S = Simplified Chinese
// K = Korean
// H = Traditional Chinese
function codepage_convert ($str, $conv_to = 'UTF-8') {
    $sets = array ('L' => 'CP1252',
                   'G' => 'ISO-8859-7',
                   'C' => 'CP1251',
                   'E' => 'ISO-8859-2',
                   'T' => 'ISO-8859-9',
                   'B' => 'ISO-8859-13',
                   'J' => 'SJIS-win',
                   'S' => 'CP936',
                   'K' => 'CP949',
                   'H' => 'CP950');

    // Unescape the LFS special characters before the codepage conversion.
    // Plain string replacement is enough here; no regular expressions needed.
    $tr_from = array ('^d', '^s', '^c', '^a', '^q', '^t', '^l', '^r', '^v');
    $tr_to   = array ("\\", "/", ":", "*", "?", "\"", "<", ">", "|");
    $str = str_replace ($tr_from, $tr_to, $str);

    $newstr = $tmp = '';
    $current_cp = 'L';
    $len = strlen ($str);
    for ($i = 0; $i < $len; $i++) {
        // A codepage marker is an unescaped ^ followed by a key of $sets.
        // Square brackets replace the old $str{$i} syntax, which PHP 8 rejects.
        if ($str[$i] == '^' && $i + 1 < $len && isset ($sets[$str[$i + 1]])
            && ($i == 0 || $str[$i - 1] != '^')) {
            // Convert the text collected so far with the codepage it was written in
            if ($tmp != '') {
                $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
                $tmp = '';
            }
            $current_cp = $str[++$i];
        }
        // Filter out every character below 0x20
        else if (ord ($str[$i]) > 31) {
            $tmp .= $str[$i];
        }
    }
    if ($tmp != '') {
        $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
    }

    // Final special char to convert - could not do that before codepage conversion
    return str_replace ('^^', '^', $newstr);
}
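
To show what the loop does with each piece of text, here is the conversion step on its own. The hostname is made up: "Team " in Latin 1, followed by two bytes that would come after a ^C marker. The segments are listed by hand here, whereas the function above finds them itself by scanning for the markers:

```php
<?php
// Made-up example data: each entry is (source codepage, raw bytes).
// In CP1251 the bytes 0xE0 0xE1 are the first two lowercase Cyrillic letters.
$segments = array(
    array('CP1252', 'Team '),
    array('CP1251', "\xE0\xE1"),
);

$utf8 = '';
foreach ($segments as $segment) {
    list($codepage, $bytes) = $segment;
    // Same call as in codepage_convert(): target first, source last.
    $utf8 .= mb_convert_encoding($bytes, 'UTF-8', $codepage);
}

echo bin2hex($utf8), "\n"; // prints "5465616d20d0b0d0b1"
```

The two Cyrillic bytes grow to two bytes each in UTF-8 (d0b0 and d0b1), while the Latin part is unchanged.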

St4Lk3R
3rd January 2008, 17:40
Hey Vic,

nice little function there. However, I'd suggest one single change: while mbstring is not part of a "vanilla" PHP build (meaning a PHP build without any configure options), the iconv extension is. This means you can make the function much more portable across PHP installations by changing one line of your script:


// BEFORE:
$newstr .= mb_convert_encoding($tmp, $conv_to, $sets[$current_cp]);
// AFTER:
$newstr .= iconv($sets[$current_cp], $conv_to, $tmp);

suggested change is untested

however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.

kanutron
3rd January 2008, 18:04
however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.

It doesn't support it.
You must use htmlentities() instead.
http://www.php.net/htmlentities

But the results seem better with mbstring + UTF-8.

If you're using UTF-8, iconv is OK, but if you're using a one-byte charset, iconv + htmlentities() doesn't produce good output. At least for me.

It's hard to provide a universal solution for that, and converter v2 seems to be close.

Maybe checking for mbstring with function_exists() and falling back to iconv + htmlentities() would improve the function.
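
A rough, untested-in-production sketch of that check, covering only the conversion call itself and only real charsets such as UTF-8 as the target (the HTML-ENTITIES target would still need separate handling on the iconv path). The helper name convert_segment() is made up, and the CP932 mapping is an assumption about what iconv calls the codepage that mbstring names SJIS-win:

```php
<?php
// Hypothetical helper, not part of the posted function: convert one run of
// bytes, preferring mbstring and falling back to iconv when it is missing.
function convert_segment($bytes, $to, $from) {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($bytes, $to, $from);
    }
    if (function_exists('iconv')) {
        // Assumption: iconv does not know mbstring's 'SJIS-win' name and
        // expects CP932 for the same Windows codepage.
        if ($from == 'SJIS-win') {
            $from = 'CP932';
        }
        // Note the different argument order: source, target, string.
        return iconv($from, $to, $bytes);
    }
    return false; // neither extension is available
}
```

Inside the converter, the two mb_convert_encoding() calls would then become convert_segment($tmp, $conv_to, $sets[$current_cp]).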

Victor
3rd January 2008, 18:50
It's hard to provide a universal solution for that, and converter v2 seems to be close.

I have never actually used iconv before. But looking at your and St4Lk3R's suggestions, I think mbstring cures a lot of problems. I don't think it's for nothing that software like phpMyAdmin, for example, also uses mbstring instead of iconv.
OK, maybe mbstring is not enabled in default PHP configs, but then enabling it should be strongly encouraged if you don't have it.

avellis
3rd January 2008, 18:50
however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.

It doesn't, AFAIK.

Another alternative is the recode extension, which also has some entities-"charsets".

Of all these extensions (mbstring, iconv and recode), I've concluded that mbstring is, IMHO, the best and safest for UTF-8 stuff. iconv() has some peculiarities on some platforms.

If someone wants to write portable code, there is the 'function_exists()' function to check for existence of these functions/extensions.

Lalala.

Krammeh
14th April 2008, 12:43
Why do some names output as: Datscher ��� (http://www.livetocruise.net/profile.php?uname=Datscher)

and the like?

Krammeh
15th April 2008, 12:10
After I updated to the newer release of the code, the problem was in fact resolved.