PHP4/5 - LFS Hostname codepage converter v2
Victor
2nd January 2008, 23:29
The "PHP4/5 - LFS Hostname codepage converter v2" is a function that allows you to convert the different character codepages in LFS hostnames to something you can use on a webpage (i.e. conversion to a single codepage, the one you use on your website).
This function is a new one. I had posted one a while back that required the inclusion of a codepage conversion table, but that is no longer required. Instead you need to have compiled the mbstring library into your PHP install. This will take care of all conversions.
After you have converted a hostname with this function, the only thing left is converting the colour codes (not done in this function).
Two notes about the function below :
1) You will notice this function also replaces the special characters in LFS text such as ^d (\), ^s (/), etc. This is required because not all of these converted characters exist in the same place in every codepage. So they need to be converted from LFS escaped to actual character and then to the appropriate codepage.
2) The function takes two parameters: $str and $conv_to, which is UTF-8 by default. It is here that you indicate to which codepage you want to convert your hostname.
If you use a single byte codepage such as ISO-8859-1 on your website, you should use the HTML-ENTITIES codepage to convert your hostnames into.
For a list of all codepages supported by mbstring, see here : http://nl3.php.net/manual/en/ref.mbstring.php
// L = Latin 1
// G = Greek
// C = Cyrillic
// E = Central Europe
// T = Turkish
// B = Baltic
// J = Japanese
// S = Simplified Chinese
// K = Korean
// H = Traditional Chinese
function codepage_convert ($str, $conv_to = 'UTF-8') {
    // LFS codepage marker => encoding name passed to mbstring
    $sets = array ('L' => 'CP1252',
                   'G' => 'ISO-8859-7',
                   'C' => 'CP1251',
                   'E' => 'ISO-8859-2',
                   'T' => 'ISO-8859-9',
                   'B' => 'ISO-8859-13',
                   'J' => 'SJIS-win',
                   'S' => 'CP936',
                   'K' => 'CP949',
                   'H' => 'CP950');
    // Replace the LFS escape sequences with the characters they stand for
    $tr_ptrn = array ("/\^d/", "/\^s/", "/\^c/", "/\^a/", "/\^q/", "/\^t/", "/\^l/", "/\^r/", "/\^v/");
    $tr_ptrn_r = array ("\\", "/", ":", "*", "?", "\"", "<", ">", "|");
    $str = preg_replace ($tr_ptrn, $tr_ptrn_r, $str);
    $newstr = $tmp = '';
    $current_cp = 'L';
    $len = strlen ($str);
    for ($i=0; $i<$len; $i++) {
        // A codepage switch is '^' plus a known marker, not preceded by an escaping '^'.
        // The bounds checks avoid reading before the start or past the end of the string.
        // Square brackets replace the {} string offsets, which were removed in PHP 8.
        if ($str[$i] == '^' && $i+1 < $len && isset ($sets[$str[$i+1]]) && ($i == 0 || $str[$i-1] != '^')) {
            // Convert what was collected so far with the codepage that was active
            if ($tmp != '') {
                $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
                $tmp = '';
            }
            $current_cp = $str[++$i];
        }
        // Filter out every character below 0x20
        else if (ord ($str[$i]) > 31)
            $tmp .= $str[$i];
    }
    if ($tmp != '')
        $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
    // Final special char to convert - could not do that before codepage conversion
    return str_replace ('^^', '^', $newstr);
}
St4Lk3R
3rd January 2008, 17:40
Hey Vic,
nice little function there. However, I'd suggest one single change: while mbstring is not part of a "vanilla" PHP build (meaning a PHP build without any configure options), the iconv extension is. This means you can make the function much more portable across PHP installations by changing one line of your script:
// BEFORE:
$newstr .= mb_convert_encoding($tmp, $conv_to, $sets[$current_cp]);
// AFTER:
$newstr .= iconv($sets[$current_cp], $conv_to, $tmp);
suggested change is untested
however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.
kanutron
3rd January 2008, 18:04
however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.
It doesn't support it.
You must use htmlentities() instead.
http://www.php.net/htmlentities
But the results seem better with mbstring + UTF-8.
If you're using UTF-8, iconv is OK, but if you're using a single-byte charset, iconv + htmlentities doesn't produce good output. At least for me.
It is hard to provide a universal solution for that, and converter v2 seems to be close.
Maybe a "function_exists" check for mbstring, with a fallback to iconv + htmlentities, would improve the function.
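A minimal sketch of such a check, assuming the target is UTF-8 and the source encoding name is one that both extensions accept. The helper name convert_segment is hypothetical and not part of the converter posted above; the htmlentities step for single-byte sites is left out.

```php
<?php
// Hypothetical helper: prefer mbstring, fall back to iconv when mbstring is missing.
// Assumption: $from is an encoding name known to whichever extension is used.
function convert_segment($text, $from, $to = 'UTF-8') {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($text, $to, $from);
    }
    if (function_exists('iconv')) {
        // Note the different argument order: iconv(from, to, string)
        return iconv($from, $to, $text);
    }
    // Neither extension is available
    return false;
}

// 0xB9 is a lowercase s with caron in ISO-8859-2
$utf8 = convert_segment("\xB9", 'ISO-8859-2');
```

Inside codepage_convert, the two mb_convert_encoding calls could then be swapped for this helper.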
Victor
3rd January 2008, 18:50
It is hard to provide a universal solution for that, and converter v2 seems to be close.
I never used iconv before actually. But looking at your and St4Lk3R's suggestions, I think mbstring cures a lot of problems. I don't think it's for nothing that, for example, phpMyAdmin also uses mbstring instead of iconv.
OK, maybe mbstring is not supported by default PHP configs. Maybe it should be highly encouraged to get mbstring enabled if you don't have it.
avellis
3rd January 2008, 18:50
however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.
It doesn't, AFAIK.
Another alternative is the recode extension, which also has some entities-"charsets".
Of all these extensions (mbstring, iconv and recode), I've concluded that IMHO the best and safest for UTF-8 stuff is mbstring. iconv() has some peculiarities on some platforms.
If someone wants to write portable code, there is the 'function_exists()' function to check for existence of these functions/extensions.
Lalala.
Krammeh
14th April 2008, 13:43
why do some names output as: Datscher (http://www.livetocruise.net/profile.php?uname=Datscher)
and alike.
Krammeh
15th April 2008, 13:10
After I updated to the newer release of the code, the problem was in fact resolved.
EQ Worry
11th December 2009, 14:36
Anyone attempting to implement full conversion from LFS bytes to Unicode string and from Unicode string back to LFS bytes should try using codepage 1250 for ^E instead of the 28592 mentioned above. At least for me, CP1250 gives the correct results... :)
Dygear
15th December 2009, 07:50
How do we not have a function that will convert the colors from the LFS strings into HTML strings?!? I mean, they have been posted in other threads, but no one has posted it in its own thread here. And I find that quite discouraging.
avetere
1st February 2010, 15:39
I mean, they have been posted in other threads, but no one has posted it in its own thread here. And I find that quite discouraging.
Actually I use color conversion only after codepage conversion (in fact I use the one posted here) ... so, to some extent, I might find it useful to combine both of them into one ...
Currently I use this on converted names:
### define colors for HTML-output ###
function getColorCode($col) {
    switch ($col) {
        case 0 : return "#000000";
        case 1 : return "#ff0000";
        case 2 : return "#00ff00";
        case 3 : return "#ffff00";
        case 4 : return "#0000ff";
        case 5 : return "#ff00ff";
        case 6 : return "#00ffff";
        case 7 : return "#ffffff";
        case 8 : return "#000000";
        case 9 : return "#000000";
        default : return $col;
    }
}
### strip colors from names ###
function nameblank($name) {
    // stripslashes() dropped: it would also remove real backslashes from a name
    return preg_replace("/\^[0-9]/","",htmlspecialchars($name));
}
### build one coloured span from a regex match ###
function namecolored_span($m) {
    return '<span style="color:'.getColorCode($m[1]).';">'.$m[2].'</span>';
}
### get colored names for HTML-output ###
function namecolored($name) {
    // preg_replace_callback replaces the /e modifier, which was removed in PHP 7
    return preg_replace_callback("/\^([0-9])(.[^\^]*)/",'namecolored_span',htmlspecialchars($name));
}
We could as well combine all that to get it all racked up into one single query :shrug:
Dygear
2nd February 2010, 18:53
We could as well combine all that to get it all racked up into one single query :shrug:
I've been doing that for around a month now in the LFSWorldSDK, as so many people ask how to correctly handle both the color codes and UTF-8 formatting of the multi-character strings. It is heavily based on the function from here, and another function from someone else (whom I've forgotten at this time, but I'm pretty sure I gave them the credit in the code).
morpha
8th July 2010, 16:07
Anyone attempting to implement full conversion from LFS bytes to Unicode string and from Unicode string back to LFS bytes should try using codepage 1250 for ^E instead of the 28592 mentioned above. At least for me, CP1250 gives the correct results... :)
To extend this, MSDN's codepage list (http://msdn.microsoft.com/en-us/library/dd317756%28v=VS.85%29.aspx) made me wonder if the codepages used by LFS are really the ISO equivalents or indeed the Windows code pages. In other words, shouldn't it be:
J -> Japanese CP932 (Shift-JIS)
S -> Simplified Chinese CP936 (PRC, Singapore, GB2312)
K -> Korean CP949 (Unified Hangul Code)
H -> Traditional Chinese CP950 (Taiwan, Hong Kong SAR, PRC, Big5)
E -> Central European CP1250 (instead of CP28592 -> ISO-8859-2)
C -> Cyrillic CP1251 (instead of CP28595 -> ISO-8859-5)
L -> Latin 1 CP1252 (instead of CP28591 -> ISO-8859-1)
G -> Greek CP1253 (instead of CP28597 -> ISO-8859-7)
T -> Turkish CP1254 (instead of CP28599 -> ISO-8859-9)
B -> Baltic CP1257 (instead of CP28594 -> ISO-8859-4)
Thoughts?
Dygear
8th July 2010, 17:16
To extend this, MSDN's codepage list (http://msdn.microsoft.com/en-us/library/dd317756%28v=VS.85%29.aspx) made me wonder if the codepages used by LFS are really the ISO equivalents or indeed the Windows code pages. In other words, shouldn't it be:
J -> Japanese CP932 (Shift-JIS)
S -> Simplified Chinese CP936 (PRC, Singapore, GB2312)
K -> Korean CP949 (Unified Hangul Code)
H -> Traditional Chinese CP950 (Taiwan, Hong Kong SAR, PRC, Big5)
E -> Central European CP1250 (instead of CP28592 -> ISO-8859-2)
C -> Cyrillic CP1251 (instead of CP28595 -> ISO-8859-5)
L -> Latin 1 CP1252 (instead of CP28591 -> ISO-8859-1)
G -> Greek CP1253 (instead of CP28597 -> ISO-8859-7)
T -> Turkish CP1254 (instead of CP28599 -> ISO-8859-9)
B -> Baltic CP1257 (instead of CP28594 -> ISO-8859-4)
Thoughts?
Is there a way we can compare this, programmatically?
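One way to compare two single-byte tables programmatically is to decode every high byte with both codepages and list the bytes where the results differ. The sketch below assumes PHP's iconv extension accepts the names 'CP1250' and 'ISO-8859-2' (glibc iconv does); the helper name codepage_diff is hypothetical. It only compares the two tables with each other. Checking them against what LFS actually draws still needs the in-game table.

```php
<?php
// Hypothetical helper: list the bytes 0x80-0xFF that two single-byte
// codepages decode to different UTF-8 output.
function codepage_diff($cp_a, $cp_b) {
    $diff = array();
    for ($b = 0x80; $b <= 0xFF; $b++) {
        // @ hides the notice iconv raises for bytes a codepage leaves undefined;
        // in that case iconv returns false, which also counts as a difference.
        $a = @iconv($cp_a, 'UTF-8', chr($b));
        $c = @iconv($cp_b, 'UTF-8', chr($b));
        if ($a !== $c) {
            $diff[$b] = array($a, $c);
        }
    }
    return $diff;
}

$diff = codepage_diff('CP1250', 'ISO-8859-2');
foreach ($diff as $byte => $pair) {
    printf("0x%02X: %s | %s\n", $byte, var_export($pair[0], true), var_export($pair[1], true));
}
```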
morpha
8th July 2010, 19:30
Is there a way we can compare this, programmatically?
Well yes, there is, but I went the graphical route :razz:
Comparing LFS's ingame table with MSCPs and ISO tables.
CP1250 (http://www.microsoft.com/typography/unicode/1250.gif) - ISO 8859-2 (http://www.charset.org/images/iso-8859-2.gif)
CP1251 (http://www.microsoft.com/typography/unicode/1251.gif) - ISO 8859-5 (http://www.charset.org/images/iso-8859-5.gif)
CP1252 (http://www.microsoft.com/typography/unicode/1252.gif) - ISO 8859-1 (http://www.charset.org/images/iso-8859-1.gif)
CP1253 (http://www.microsoft.com/typography/unicode/1253.gif) - ISO 8859-7 (http://www.charset.org/images/iso-8859-7.gif)
CP1254 (http://www.microsoft.com/typography/unicode/1254.gif) - ISO 8859-9 (http://www.charset.org/images/iso-8859-9.gif)
CP1257 (http://www.microsoft.com/typography/unicode/1257.gif) - ISO 8859-4 (http://www.charset.org/images/iso-8859-4.gif)
I came to the conclusion that MS codepages are correct, so it should indeed be:
J -> Japanese CP932
S -> Simplified Chinese CP936
K -> Korean CP949
H -> Traditional Chinese CP950
E -> Central European CP1250
C -> Cyrillic CP1251
L -> Latin 1 CP1252
G -> Greek CP1253
T -> Turkish CP1254
B -> Baltic CP1257
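As a rough sketch, the list above could be written as a lookup table for iconv. The encoding names are an assumption: they are the ones glibc's iconv accepts, and other iconv implementations or mbstring may not know all of them. The helper name lfs_segment_to_utf8 is hypothetical.

```php
<?php
// LFS codepage marker => Windows codepage, written as iconv encoding names.
// Assumption: the iconv build behind PHP knows all of these names.
$lfs_codepages = array(
    'J' => 'CP932',   // Japanese
    'S' => 'CP936',   // Simplified Chinese
    'K' => 'CP949',   // Korean
    'H' => 'CP950',   // Traditional Chinese
    'E' => 'CP1250',  // Central European
    'C' => 'CP1251',  // Cyrillic
    'L' => 'CP1252',  // Latin 1
    'G' => 'CP1253',  // Greek
    'T' => 'CP1254',  // Turkish
    'B' => 'CP1257',  // Baltic
);

// Hypothetical helper: decode one segment that follows a codepage marker.
function lfs_segment_to_utf8($bytes, $marker, $codepages) {
    return iconv($codepages[$marker], 'UTF-8', $bytes);
}

// 0x9A is a lowercase s with caron in CP1250
$s = lfs_segment_to_utf8("\x9A", 'E', $lfs_codepages);
```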
EQ Worry
9th July 2010, 12:21
Interesting, I'll try to change the codepages in the Aegio library I'm using, and see if some warnings concerning undecodable byte sequences disappear or are multiplied.
morpha
9th July 2010, 13:59
Interesting, I'll try to change the codepages in the Aegio library I'm using, and see if some warnings concerning undecodable byte sequences disappear or are multiplied.
Please do :)
Converting from LFS to Unicode should be fine, but the other way round could be problematic in some cases because LFS doesn't include all chars. Attached is the chart I made :thumbsup:
EQ Worry
12th July 2010, 09:58
Well, I tried changing the code pages, but I was receiving some errors then. It seems the Greek code page cannot be changed this way. But I concede I did just a quick check, it may very well be a follow-up error, a character formerly incorrectly encoded from LFS to Unicode which now throws error when attempting to change it back from Unicode to LFS using new code pages.
Currently the InSim library running under Airio uses the codepages as suggested by Victor with two exceptions: 01250 is used instead of 28592, and 01257 is used as a fallback if 28603 is not available (that is on Linux). I think these settings work rather well, conversion to Unicode and back is smooth...
morpha
12th July 2010, 12:20
Some discrepancies exist. As I said, not all characters are present in LFS, but all characters that are present match with MSCPs and not ISO tables. However, the complex languages are neither MSCP nor ISO. For example, in Japanese both CP932 and Shift-JIS map 0x7E to an overline instead of a backslash, yet in LFS it is a backslash. However, they also map 0x5C to the yen sign, which is "correctly" mapped in LFS.
We'd need Scawen to clarify this and perhaps provide his original code for the mapping :tilt:
Dygear
12th July 2010, 17:23
We'd need Scawen to clarify this and perhaps provide his original code for the mapping :tilt:
I'd love for Scawen to utilize the already established standards; it would be a nightmare to have code page proliferation.
DarkTimes
7th October 2010, 15:42
To extend this, MSDN's codepage list (http://msdn.microsoft.com/en-us/library/dd317756%28v=VS.85%29.aspx) made me wonder if the codepages used by LFS are really the ISO equivalents or indeed the Windows code pages. In other words, shouldn't it be:
J -> Japanese CP932 (Shift-JIS)
S -> Simplified Chinese CP936 (PRC, Singapore, GB2312)
K -> Korean CP949 (Unified Hangul Code)
H -> Traditional Chinese CP950 (Taiwan, Hong Kong SAR, PRC, Big5)
E -> Central European CP1250 (instead of CP28592 -> ISO-8859-2)
C -> Cyrillic CP1251 (instead of CP28595 -> ISO-8859-5)
L -> Latin 1 CP1252 (instead of CP28591 -> ISO-8859-1)
G -> Greek CP1253 (instead of CP28597 -> ISO-8859-7)
T -> Turkish CP1254 (instead of CP28599 -> ISO-8859-9)
B -> Baltic CP1257 (instead of CP28594 -> ISO-8859-4)
Thoughts?
Yeah, I'm slow. :p
I realised the other day that the codepage for each language is listed at the top of that language's translation file in the LFS directory. Using this source I can confirm that these are correct. I see no reason to presume that the translation files within LFS would specify anything other than the correct codepage.
LFS Translation file (Japanese)
Translated by: AE100, gilles_jpn, highbridge, Sae Kazamori, takaryo, yamakawa
tx_codepage 932
tx_langname 日本語
tx_noun_adj an
3g_tr_selct TRAINING
3g_tr_title Select Lesson
...
I've been messing around with the encodings over the last few days and to be honest I've learned more about single-byte and double-byte character sets than I can stand :p.
Buddha bless UTF-8 and long may it reign supreme. :)
Victor
7th October 2010, 17:06
the reason i originally chose some iso's instead of cp's is because mb_string does not know about all the cp's.
http://nl3.php.net/manual/en/mbstring.encodings.php
Maybe this is not the case on Windows versions of php, but it surely is on *nix.
The codepages mentioned in the language files refer to what LFS uses and what the online LFS Translator tool uses as browser cp setting. But when talking about mb_string functions and conversions, I had to use some ISO's instead of CP's.
Dygear
8th October 2010, 05:29
I wonder if there is a way we can get this to work without having to have the mb extension installed or loaded. I wonder if filur would be willing to do this.
Victor
8th October 2010, 11:33
seeing that LFS still only uses single byte codepages you could potentially just create lookup tables like in my first version of the converter.
You don't need filur to do that ;)
For the rest I don't see other options. Iconv is said to not work 100%, sooo....
DarkTimes
8th October 2010, 11:50
Japanese, Korean, Traditional and Simplified Chinese are double-byte (or really multi-byte) character sets. You can check by typing a Japanese character into LFS and comparing the lead and tail bytes to the CP932 codepage.
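A small sketch of that lead/tail byte check for CP932, where lead bytes are 0x81-0x9F and 0xE0-0xFC. The function names are hypothetical. Tail bytes can fall in 0x40-0x7E, which overlaps printable ASCII (including 0x5E, the caret), so a purely byte-wise scan for '^' could in principle misread the second half of a character. Whether LFS ever sends such sequences unescaped is not something this thread settles.

```php
<?php
// True when $byte starts a two-byte CP932 character.
function is_cp932_lead_byte($byte) {
    return ($byte >= 0x81 && $byte <= 0x9F) || ($byte >= 0xE0 && $byte <= 0xFC);
}

// Hypothetical helper: split a CP932 byte string into one- and two-byte characters.
function cp932_split($str) {
    $chars = array();
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        if (is_cp932_lead_byte(ord($str[$i])) && $i + 1 < $len) {
            // Lead byte: keep it together with its tail byte
            $chars[] = $str[$i] . $str[$i + 1];
            $i++;
        } else {
            $chars[] = $str[$i];
        }
    }
    return $chars;
}

// Two double-byte characters followed by an ASCII 'A'.
// The second character has the tail byte 0x7B, which is '{' in ASCII.
$chars = cp932_split("\x93\xFA\x96\x7BA");
```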
avetere
2nd July 2011, 19:24
I'm quite sure, this has been asked before, but since I can't find that anymore:
@Victor: Do you in any way have a unicode-font (preferably .ttf) which contains all those characters used by LfS or even better: One that even looks like the fonts ingame?
If not, is it possible to create one?
And most important of all: If there is such a font, could you make it available to us, so we could use it?
(I'm asking especially because I need to create some images that contain player-names using gd-library and I'd like to avoid shipping arialuni with the script ...)
Dygear
2nd July 2011, 20:18
That's a fantastic idea. +20!
avetere
5th July 2011, 19:29
Well, I'll take that as a no, so:
Does anybody know an open source font that
- includes all the required characters
- isn't too large
- doesn't look too shitty
?
avetere
3rd August 2011, 14:54
I encountered some strange behaviour in color converting lately:
I've been playing with some old files and came across this (see attached pic):
which was given as: ^1^2^4Zs^3oo^4ti instead of - as I would expect -:
^1^^2^4Zs^3oo^4ti
Now the question is:
Is my colour conversion wrong and it should display without "^2" (meaning it would once have been possible to "row-up" "color-switches")
Or is this correct and a literal "^" wasn't escaped back then?
Unfortunately I cannot have a look ingame as the corresponding replay was of patch P, which I can't unlock :(
I got this 3 or 4 times (from different drivers) but always from replays of patch P or Q ... so no checking possible on my side :shrug:
EDIT:
Well, by now I figured it out myself by simply naming a new player that way and checking with the unlocked version in demo mode :D
For those wondering: It was handled as consecutive color changes, so no red "^2" in that name and the error was on my part ... I never trusted those regex-things anyway *grmblfx* ;)
Dygear
5th August 2011, 01:37
This ^1^2^4Zs^3oo^4ti, should really be sanitized down to ^4Zs^3oo^4ti that should be Zsooti. The way I read it, the first two colors should have no effect.
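A minimal sketch of that sanitizing step, assuming a colour code is '^' plus a digit and that a code directly followed by another code has no visible effect. The function name collapse_color_codes is hypothetical; the lookbehind is there so that a digit after an escaped caret (^^) is left alone.

```php
<?php
// Hypothetical helper: drop colour codes that are immediately overridden
// by another colour code, keeping only the last one of each run.
function collapse_color_codes($name) {
    return preg_replace('/(?<!\^)(?:\^[0-9])+(?=\^[0-9])/', '', $name);
}

echo collapse_color_codes('^1^2^4Zs^3oo^4ti'); // prints ^4Zs^3oo^4ti
```

Running this before the colour conversion means the regex there never sees two codes in a row.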
avetere
5th August 2011, 21:49
Yeah, quite so ... I also already fixed my functions ;)
Anyway, I wonder, how and why it was ever possible to get such a name as nowadays this doesn't seem to happen anymore :shrug:
Krammeh
6th August 2011, 01:13
Yeah, quite so ... I also already fixed my functions ;)
Anyway, I wonder, how and why it was ever possible to get such a name as nowadays this doesn't seem to happen anymore :shrug:
cfg file backup or editing?
Dygear
6th August 2011, 02:27
cfg file backup or editing?
Manual input into the config file that was not sanitized by the LFS game name parser?
[edit] Oh never mind, apparently Krammeh beat me to it.
avetere
6th August 2011, 09:28
LOL ... yeah ... this was simply too obvious, as I tested it the same way :)