How to detect if a string is UTF-8 in PHP?
by z3n on Apr.24, 2010, under Coding, Tips & Hints
Problem:
While debugging UTF-8 strings I came across a string that might or might not be UTF-8, thanks to IE. PHP has no built-in is_utf8() function, so I needed a way to check whether a string is actually valid UTF-8.
Solution:
define('_is_utf8_split', 5000);

function is_utf8($string) { // v1.01
    $len = strlen($string);
    if ($len > _is_utf8_split) {
        // Based on: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
        // Test the string in chunks, starting at offset 0; every chunk must be valid.
        for ($s = 0; $s < $len; $s += $n) {
            $n = min(_is_utf8_split, $len - $s);
            // Shorten the chunk while the next one would start on a
            // continuation byte (10xxxxxx), so no character is cut in half.
            while ($n > 1 && $s + $n < $len && (ord($string[$s + $n]) & 0xC0) === 0x80) {
                $n--;
            }
            if (!is_utf8(substr($string, $s, $n)))
                return false;
        }
        return true;
    } else {
        // From http://w3.org/International/questions/qa-forms-utf-8.html
        return preg_match('%^(?:
              [\x09\x0A\x0D\x20-\x7E]            # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$%xs', $string) === 1; // 0 (no match) and false (PCRE error) both mean "not UTF-8"
    }
}
Notes:
According to some PHP posts and this specific posting, there is a bug that affects strings longer than 5000 characters, so this function splits such strings and tests their parts.
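If the length limit is a concern, a check that does not rely on a long repeated regex group may be worth trying. The sketch below is only an assumption about what could work on your setup, not the method used above: it calls mb_check_encoding() when the mbstring extension is loaded, and otherwise falls back to an empty pattern with the /u modifier, which makes PCRE validate the subject as UTF-8 before matching. The function name is_utf8_builtin is made up for this example.

```php
<?php
// Hypothetical helper name; not part of the original function above.
function is_utf8_builtin($string) {
    if (function_exists('mb_check_encoding')) {
        // Only available when the mbstring extension is loaded.
        return mb_check_encoding($string, 'UTF-8');
    }
    // Fallback: with /u, preg_match() returns false (not 0 or 1)
    // when the subject is not valid UTF-8.
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_builtin("caf\xC3\xA9"));                  // UTF-8 bytes for "café"
var_dump(is_utf8_builtin("caf\xE9"));                      // ISO-8859-1 bytes for "café"
var_dump(is_utf8_builtin(str_repeat("\xC3\xA9", 10000)));  // well over 5000 bytes
```

On a current PHP build I would expect the first and third calls to report true and the second false. Note that this is not a drop-in replacement: unlike the W3C pattern, these checks also accept control characters such as NUL, so pick whichever fits your input.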