PHP4/5 - LFS Hostname codepage converter v2
DEPRECATED FUNCTIONS
It is better to use the class found here: https://www.lfsforum.net/showthread.php?t=82787
--------------

The "PHP4/5 - LFS Hostname codepage converter v2" is a function that converts the different character codepages in LFS hostnames to something you can use on a webpage (i.e. conversion to a single codepage, the one you use on your website).

This function is a new one. I posted one a while back that required the inclusion of a codepage conversion table, but that is no longer required. Instead you need to have the mbstring library compiled into your PHP install. This will take care of all conversions.
After you have converted a hostname with this function, the only thing left is converting the colour codes (not done in this function).

Two notes about the function below :

1) You will notice this function also replaces the special characters in LFS text such as ^d (\), ^s (/), etc. This is required because not all of these converted characters exist in the same place in every codepage. So they need to be converted from the LFS escape to the actual character first, and then to the appropriate codepage.

2) The function takes two parameters: $str and $conv_to, which is UTF-8 by default. $conv_to is where you indicate which codepage you want to convert your hostname to.
If you use a single-byte codepage such as ISO-8859-1 on your website, you should convert your hostnames to the HTML-ENTITIES codepage.
For a list of all codepages supported by mbstring, see here : http://nl3.php.net/manual/en/ref.mbstring.php
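
As a rough illustration of what the $conv_to choice means, the sketch below converts a single Cyrillic (CP1251) byte with mbstring, once to UTF-8 and once to numeric HTML entities. It assumes mbstring is loaded. The helper name to_numeric_entities() is made up for this example, and mb_encode_numericentity() is used for the entity form because, as far as I know, newer PHP versions (8.2 and up) deprecate 'HTML-ENTITIES' as an mbstring encoding.

```php
<?php
// Hypothetical helper (not part of the original function): turn a UTF-8
// string into plain ASCII plus numeric entities.
function to_numeric_entities($utf8) {
    // Map every code point from 0x80 upwards to &#NNNN; (offset 0, full mask).
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// 0xCF is the Cyrillic capital letter PE in CP1251.
$utf8 = mb_convert_encoding("\xCF", 'UTF-8', 'CP1251');

echo bin2hex($utf8), "\n";             // raw UTF-8 bytes, for a UTF-8 page
echo to_numeric_entities($utf8), "\n"; // entity form, for a single-byte page
```

The entity form is pure ASCII, so it survives on a page served in any single-byte codepage.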


<?php
// L = Latin 1
// G = Greek
// C = Cyrillic
// E = Central Europe
// T = Turkish
// B = Baltic
// J = Japanese
// S = Simplified Chinese
// K = Korean
// H = Traditional Chinese
function codepage_convert ($str, $conv_to = 'UTF-8') {
    $sets = array ('L' => 'CP1252',
                   'G' => 'ISO-8859-7',
                   'C' => 'CP1251',
                   'E' => 'ISO-8859-2',
                   'T' => 'ISO-8859-9',
                   'B' => 'ISO-8859-13',
                   'J' => 'SJIS-win',
                   'S' => 'CP936',
                   'K' => 'CP949',
                   'H' => 'CP950');

    // Turn the LFS escapes (^d, ^s, ...) into the actual characters first
    $tr_ptrn = array ("/\^d/", "/\^s/", "/\^c/", "/\^a/", "/\^q/", "/\^t/", "/\^l/", "/\^r/", "/\^v/");
    $tr_ptrn_r = array ("\\", "/", ":", "*", "?", "\"", "<", ">", "|");
    $str = preg_replace ($tr_ptrn, $tr_ptrn_r, $str);

    $newstr = $tmp = '';
    $current_cp = 'L';
    $len = strlen ($str);
    for ($i=0; $i<$len; $i++) {
        // ^X switches the codepage, unless the ^ is itself escaped (^^).
        // The bounds checks avoid reading before the start or past the end of $str.
        if ($str[$i] == '^' && $i+1 < $len && isset ($sets[$str[$i+1]]) && ($i == 0 || $str[$i-1] != '^')) {
            if ($tmp != '') {
                $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);
                $tmp = '';
            }
            $current_cp = $str[++$i];
        }
        // Filter out every character below 0x20
        else if (ord ($str[$i]) > 31)
            $tmp .= $str[$i];
    }
    if ($tmp != '')
        $newstr .= mb_convert_encoding ($tmp, $conv_to, $sets[$current_cp]);

    // Final special char to convert - could not do that before codepage conversion
    return str_replace ('^^', '^', $newstr);
}
?>

Hey Vic,

nice little function there. However, I'd suggest one single change: while mbstring is not part of a "vanilla" PHP build (meaning a PHP build without any configure options), the iconv extension is. This means you can make the function much more portable across PHP installations by changing one line of your script:


<?php
// BEFORE:
$newstr .= mb_convert_encoding($tmp, $conv_to, $sets[$current_cp]);
// AFTER:
$newstr .= iconv($sets[$current_cp], $conv_to, $tmp);
?>

suggested change is untested

however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.
Quote from St4Lk3R :
however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.

It doesn't support it.
You must use htmlentities() instead.
http://www.php.net/htmlentities

But the results seem better with mbstring + UTF-8.

If you're using UTF-8, iconv is OK, but if you're using a one-byte charset, iconv + htmlentities() doesn't produce good output. At least for me.

It is hard to provide a universal solution for that, and converter v2 seems to be close.

Maybe checking for mbstring with function_exists(), and then falling back to iconv + htmlentities(), would improve the function.
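
A minimal sketch of that function_exists() check, under the assumption that the caller wants a UTF-8 (or other real charset) target. The helper name convert_segment() is invented for illustration. It deliberately does not handle the 'HTML-ENTITIES' target, since iconv does not understand it according to the posts above, so entity conversion would still have to be done separately.

```php
<?php
// Hypothetical helper: convert $bytes from codepage $from to $to with
// whichever extension happens to be available.
function convert_segment($bytes, $from, $to = 'UTF-8') {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($bytes, $to, $from);
    }
    if (function_exists('iconv')) {
        // Note the different argument order: iconv(from, to, string).
        return iconv($from, $to, $bytes);
    }
    // Neither extension is loaded: signal failure instead of guessing.
    return false;
}

// 0xE9 is "e with acute" in ISO-8859-1.
$result = convert_segment("\xE9", 'ISO-8859-1');
var_dump($result === false ? false : bin2hex($result));
```

Inside codepage_convert() the two mb_convert_encoding() calls could then be swapped for convert_segment($tmp, $sets[$current_cp], $conv_to), keeping in mind that the two extensions do not necessarily accept the same codepage names.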
Quote from kanutron :It is hard to provide a universal solution for that, and converter v2 seems to be close.

I never used iconv before, actually. But looking at your and St4Lk3R's suggestions, I think mbstring cures a lot of problems. I don't think it's for nothing that, for example, phpMyAdmin also uses mbstring instead of iconv.
OK, maybe mbstring is not enabled in default PHP configs, but then it should be highly encouraged to get mbstring enabled if you don't have it.
Quote from St4Lk3R :however, I do not know if iconv also supports the HTML-ENTITIES-codepage you talked about in your post.

It doesn't, AFAIK.

Another alternative is the recode extension, which also has some entities-"charsets".

Of all these extensions (mbstring, iconv and recode), I've concluded that the best and safest for UTF-8 stuff is, IMHO, mbstring. iconv() has some peculiarities on some platforms.

If someone wants to write portable code, there is the 'function_exists()' function to check for existence of these functions/extensions.

Lalala.
After I updated to the newer release of the code, the problem was in fact resolved.
Anyone attempting to implement full conversion from LFS bytes to a Unicode string and back should try using codepage 1250 for ^E instead of the 28592 (ISO-8859-2) mentioned above. At least for me, CP1250 gives the correct results...
How do we not have a function that will convert the colors from the LFS strings into HTML strings?!? I mean, they have been posted in other threads, but no one has posted it in its own thread here. And I find that quite discouraging.
Quote from Dygear :I mean, they have been posted in other threads, but no one has posted it in its own thread here. And I find that quite discouraging.

Actually I use color conversion only after codepage conversion (in fact I use the one posted here) ... so, to some extent, I may find it useful to combine both of them into one ...

Currently I use this on converted names:

<?php
### define colors for HTML-output ###
function getColorCode($col) {
    switch ($col) {
        case 0 : return "#000000";
        case 1 : return "#ff0000";
        case 2 : return "#00ff00";
        case 3 : return "#ffff00";
        case 4 : return "#0000ff";
        case 5 : return "#ff00ff";
        case 6 : return "#00ffff";
        case 7 : return "#ffffff";
        case 8 : return "#000000";
        case 9 : return "#000000";
        default : return $col;
    }
}

### strip colors from names ###
function nameblank($name) {
    return stripslashes(preg_replace("/\^[0-9]/", "", htmlspecialchars($name)));
}

### get colored names for HTML-output ###
function namecolored($name) {
    // preg_replace_callback() stands in for the /e modifier, which no longer
    // exists in PHP 7+; stripslashes() was only needed to undo the quoting /e added.
    return preg_replace_callback("/\^([0-9])(.[^\^]*)/", function ($m) {
        return '<span style="color:' . getColorCode($m[1]) . ';">' . $m[2] . '</span>';
    }, htmlspecialchars($name));
}
?>

We could as well combine all that to get it all racked up into one single query
Quote from avetere :We could as well combine all that to get it all racked up into one single query

I've been doing that for around a month now in the LFSWorldSDK, as so many people ask how to correctly handle both the color codes and the UTF-8 formatting of the multi-character strings. It is heavily based on the function from here and another function from someone else (whom I've forgotten at this time, but I'm pretty sure I gave them the credit in the code).
Quote from EQ Worry :Anyone attempting to implement full conversion from LFS bytes to Unicode string and from Unicode string back to LFS bytes should try using for ^E codepage 1250 instead of 28592 mentioned above. At least to me CP1250 gives the correct results...

To extend this, MSDN's codepage list made me wonder if the codepages used by LFS are really the ISO equivalents or indeed the Windows code pages. In other words, shouldn't it be:
J -> Japanese CP932 (Shift-JIS)
S -> Simplified Chinese CP936 (PRC, Singapore, GB2312)
K -> Korean CP949 (Unified Hangul Code)
H -> Traditional Chinese CP950 (Taiwan, Hong Kong SAR, PRC, Big5)
E -> Central European CP1250 (instead of CP28592 -> ISO-8859-2)
C -> Cyrillic CP1251 (instead of CP28595 -> ISO-8859-5)
L -> Latin 1 CP1252 (instead of CP28591 -> ISO-8859-1)
G -> Greek CP1253 (instead of CP28597 -> ISO-8859-7)
T -> Turkish CP1254 (instead of CP28599 -> ISO-8859-9)
B -> Baltic CP1257 (instead of CP28594 -> ISO-8859-4)

Thoughts?
Quote from morpha :To extend this, MSDN's codepage list made me wonder if the codepages used by LFS are really the ISO equivalents or indeed the Windows code pages. In other words, shouldn't it be:
J -> Japanese CP932 (Shift-JIS)
S -> Simplified Chinese CP936 (PRC, Singapore, GB2312)
K -> Korean CP949 (Unified Hangul Code)
H -> Traditional Chinese CP950 (Taiwan, Hong Kong SAR, PRC, Big5)
E -> Central European CP1250 (instead of CP28592 -> ISO-8859-2)
C -> Cyrillic CP1251 (instead of CP28595 -> ISO-8859-5)
L -> Latin 1 CP1252 (instead of CP28591 -> ISO-8859-1)
G -> Greek CP1253 (instead of CP28597 -> ISO-8859-7)
T -> Turkish CP1254 (instead of CP28599 -> ISO-8859-9)
B -> Baltic CP1257 (instead of CP28594 -> ISO-8859-4)

Thoughts?

Is there a way we can compare this, programmatically?
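
One possible programmatic comparison, sketched under the assumption that mbstring is available: decode every byte in the upper half under two candidate encodings and list the bytes where the results differ. The helper name codepage_diff() is made up, and the example pair (Windows-1252 against ISO-8859-1, i.e. the two candidates for ^L) was picked only because both names are known to mbstring; other pairs depend on what the local build supports.

```php
<?php
// Hypothetical helper: list the bytes 0x80-0xFF that decode differently
// under two single-byte encodings (names as understood by mbstring).
function codepage_diff($enc_a, $enc_b) {
    $diff = array();
    for ($byte = 0x80; $byte <= 0xFF; $byte++) {
        $a = mb_convert_encoding(chr($byte), 'UTF-8', $enc_a);
        $b = mb_convert_encoding(chr($byte), 'UTF-8', $enc_b);
        if ($a !== $b) {
            $diff[$byte] = array($a, $b);
        }
    }
    return $diff;
}

// Print each differing byte with the UTF-8 bytes both encodings produce.
foreach (codepage_diff('Windows-1252', 'ISO-8859-1') as $byte => $pair) {
    printf("0x%02X: %s vs %s\n", $byte, bin2hex($pair[0]), bin2hex($pair[1]));
}
```

The differing bytes can then be typed into LFS and checked against what the game actually displays, which is the part no script can do on its own.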
Quote from Dygear :Is there a way we can compare this, programmatically?

Well yes, there is, but I went the graphical route.
Comparing LFS's ingame table with the MSCPs and ISO tables, I came to the conclusion that the MS codepages are correct, so it should indeed be
J -> Japanese CP932
S -> Simplified Chinese CP936
K -> Korean CP949
H -> Traditional Chinese CP950
E -> Central European CP1250
C -> Cyrillic CP1251
L -> Latin 1 CP1252
G -> Greek CP1253
T -> Turkish CP1254
B -> Baltic CP1257

Interesting, I'll try to change the codepages in the Aegio library I'm using, and see if some warnings concerning undecodable byte sequences disappear or are multiplied.
Quote from EQ Worry :Interesting, I'll try to change the codepages in the Aegio library I'm using, and see if some warnings concerning undecodable byte sequences disappear or are multiplied.

Please do
Converting from LFS to Unicode should be fine, but the other way round could be problematic in some cases because LFS doesn't include all chars. Attached is the chart I made
Attached images
lfs_cp_comparison.jpg
Well, I tried changing the code pages, but I was receiving some errors then. It seems the Greek code page cannot be changed this way. But I concede I did just a quick check; it may very well be a follow-up error, a character formerly incorrectly encoded from LFS to Unicode which now throws an error when attempting to change it back from Unicode to LFS using the new code pages.

Currently the InSim library running under Airio uses the codepages as suggested by Victor with two exceptions: 01250 is used instead of 28592, and 01257 is used as a fallback if 28603 is not available (that is on Linux). I think these settings work rather well, conversion to Unicode and back is smooth...
Some discrepancies exist. As I said, not all characters are present in LFS, but all characters that are present match with MSCPs and not ISO tables. However, the complex languages are neither MSCP nor ISO. For example, in Japanese both CP932 and Shift-JIS map 0x7E to an overline instead of a backslash, yet in LFS it is a backslash. However, they also map 0x5C to the yen sign, which is "correctly" mapped in LFS.

We'd need Scawen to clarify this and perhaps provide his original code for the mapping
Quote from morpha :We'd need Scawen to clarify this and perhaps provide his original code for the mapping

I'd love for Scawen to utilize the already established standards; it would be a nightmare to have code page proliferation.
Quote from morpha :To extend this, MSDN's codepage list made me wonder if the codepages used by LFS are really the ISO equivalents or indeed the Windows code pages. In other words, shouldn't it be:
J -> Japanese CP932 (Shift-JIS)
S -> Simplified Chinese CP936 (PRC, Singapore, GB2312)
K -> Korean CP949 (Unified Hangul Code)
H -> Traditional Chinese CP950 (Taiwan, Hong Kong SAR, PRC, Big5)
E -> Central European CP1250 (instead of CP28592 -> ISO-8859-2)
C -> Cyrillic CP1251 (instead of CP28595 -> ISO-8859-5)
L -> Latin 1 CP1252 (instead of CP28591 -> ISO-8859-1)
G -> Greek CP1253 (instead of CP28597 -> ISO-8859-7)
T -> Turkish CP1254 (instead of CP28599 -> ISO-8859-9)
B -> Baltic CP1257 (instead of CP28594 -> ISO-8859-4)

Thoughts?

Yeah, I'm slow.

I realised the other day that the codepage for each language is listed at the top of that language's translation file in the LFS directory. Using this source I can confirm that these are correct. I see no reason to presume that the translation files within LFS would specify anything other than the correct codepage.

LFS Translation file (Japanese)
Translated by: AE100, gilles_jpn, highbridge, Sae Kazamori, takaryo, yamakawa

tx_codepage 932
tx_langname 日本語
tx_noun_adj an
3g_tr_selct TRAINING
3g_tr_title Select Lesson

...

I've been messing around with the encodings over the last few days and to be honest I've learned more about single-byte and double-byte character sets than I can stand.

Buddha bless UTF-8 and long may it reign supreme.
The reason I originally chose some ISOs instead of CPs is that mbstring does not know about all the CPs.
http://nl3.php.net/manual/en/mbstring.encodings.php

Maybe this is not the case on Windows versions of PHP, but it surely is on *nix.

The codepages mentioned in the language files refer to what LFS uses and what the online LFS Translator tool uses as its browser codepage setting. But when it comes to mbstring functions and conversions, I had to use some ISOs instead of CPs.
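
Which names a given mbstring build knows can be checked at runtime, so the choice between a Windows codepage and its ISO stand-in does not have to be hard-coded. The sketch below is only one way to do it; mbstring_knows() and pick_encoding() are made-up helper names, and the candidate list for ^E is an example, not a statement about what any particular build supports.

```php
<?php
// Hypothetical helper: does this mbstring build know $name, either as a
// canonical encoding name or as one of its aliases?
function mbstring_knows($name) {
    foreach (mb_list_encodings() as $enc) {
        if (strcasecmp($enc, $name) == 0) {
            return true;
        }
        foreach (mb_encoding_aliases($enc) as $alias) {
            if (strcasecmp($alias, $name) == 0) {
                return true;
            }
        }
    }
    return false;
}

// Hypothetical helper: first candidate this build understands, or false.
function pick_encoding($candidates) {
    foreach ($candidates as $name) {
        if (mbstring_knows($name)) {
            return $name;
        }
    }
    return false;
}

// Prefer the Windows codepage for ^E, fall back to the ISO set.
var_dump(pick_encoding(array('CP1250', 'Windows-1250', 'ISO-8859-2')));
```

The $sets array in codepage_convert() could be filled this way once per request, with the ISO names kept as the last resort.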
I wonder if there is a way we can get this to work without having to have the mb extension installed or loaded. I wonder if filur would be willing to do this.
Seeing that LFS still only uses single-byte codepages, you could potentially just create lookup tables like in my first version of the converter.

You don't need filur to do that

For the rest I don't see other options. Iconv is said to not work 100%, sooo....
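
For what it's worth, the lookup-table idea needs no extension at all for the single-byte sets. Below is a minimal sketch with a deliberately partial CP1250 table (four entries only, enough to show the mechanism); the names utf8_chr(), table_convert() and $cp1250_partial are invented for this example and are not taken from the first version of the converter.

```php
<?php
// Encode one Unicode code point (up to U+FFFF) as UTF-8 without any extension.
function utf8_chr($cp) {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Deliberately partial table for ^E (CP1250): byte => Unicode code point.
// A real table would need every entry of the upper half.
$cp1250_partial = array(
    0x8A => 0x0160, // S with caron
    0x9A => 0x0161, // s with caron
    0xE8 => 0x010D, // c with caron
    0xF8 => 0x0159, // r with caron
);

// Hypothetical helper: table-driven single-byte to UTF-8 conversion.
function table_convert($bytes, $table) {
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= $bytes[$i];           // ASCII passes through unchanged
        } elseif (isset($table[$b])) {
            $out .= utf8_chr($table[$b]); // mapped via the lookup table
        } else {
            $out .= '?';                  // byte missing from the table
        }
    }
    return $out;
}

echo bin2hex(table_convert("\xE8a\x9A", $cp1250_partial)), "\n";
```

This only covers one-byte-per-character sets; anything with lead and tail bytes would need a different table layout.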
Japanese, Korean, Traditional and Simplified Chinese are double-byte (or really multi-byte) character sets. You can check by typing a Japanese character into LFS and comparing the lead and tail bytes to the CP932 codepage.
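
To make the lead/tail byte check concrete, here is a small sketch that splits a CP932 byte string into characters using the usual Shift-JIS lead byte ranges (0x81-0x9F and 0xE0-0xFC). The helper names are made up, and the sample bytes are, I believe, the CP932 encoding of 日本語; it says nothing about how LFS itself parses the text.

```php
<?php
// Lead bytes of two-byte characters in CP932 (Shift-JIS): 0x81-0x9F and 0xE0-0xFC.
function is_cp932_lead_byte($byte) {
    return ($byte >= 0x81 && $byte <= 0x9F) || ($byte >= 0xE0 && $byte <= 0xFC);
}

// Hypothetical helper: split a CP932 byte string into characters.
function split_cp932($bytes) {
    $chars = array();
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        if (is_cp932_lead_byte(ord($bytes[$i])) && $i + 1 < $len) {
            // Lead byte: take the tail byte along with it.
            $chars[] = $bytes[$i] . $bytes[$i + 1];
            $i++;
        } else {
            $chars[] = $bytes[$i];
        }
    }
    return $chars;
}

// Three two-byte characters; print each one as hex.
foreach (split_cp932("\x93\xFA\x96\x7B\x8C\xEA") as $char) {
    echo bin2hex($char), "\n";
}
```

Note that the tail byte of the second character is 0x7B, which is "{" in ASCII. Tail bytes can fall in the ASCII range, so a byte-by-byte scan for "^" may in principle hit the second half of a double-byte character.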
I'm quite sure, this has been asked before, but since I can't find that anymore:

@Victor: Do you in any way have a Unicode font (preferably .ttf) which contains all those characters used by LFS, or even better, one that even looks like the fonts ingame?
If not, is it possible to create one?
And most important of all: If there is such a font, could you make it available to us, so we could use it?
(I'm asking especially because I need to create some images that contain player names using the GD library and I'd like to avoid shipping arialuni with the script ...)
This thread is closed
