~ overflow ~

How to detect if a string is UTF-8 in PHP?

by z3n on Apr.24, 2010, under Coding, Tips & Hints

Problem:
While debugging UTF-8 strings, I came across a string that might or might not be UTF-8, thanks to IE. PHP has no is_utf8 function in its core, and a specific check is only available through the optional mbstring extension (mb_check_encoding).

Solution:

define('_is_utf8_split',5000);

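function is_utf8($string) { // v1.01
	$len = strlen($string);
	if ($len > _is_utf8_split) {
		// Based on: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
		// Test the string chunk by chunk, starting at offset 0; every chunk must be valid.
		for ($start = 0; $start < $len; $start = $end) {
			$end = min($start + _is_utf8_split, $len);
			// Never cut inside a multi-byte sequence: back up past continuation bytes (10xxxxxx).
			while ($end < $len && $end > $start && (ord($string[$end]) & 0xC0) === 0x80)
				$end--;
			// A chunk made only of continuation bytes, or one that fails the test, is invalid.
			if ($end === $start || !is_utf8(substr($string, $start, $end - $start)))
				return false;
		}
		return true;
	} else {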
		// From http://w3.org/International/questions/qa-forms-utf-8.html
		return preg_match('%^(?:
				[\x09\x0A\x0D\x20-\x7E]            # ASCII
			| [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
			|  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
			| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
			|  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
			|  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
			| [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
			|  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
		)*$%xs', $string);
	}
}  

Notes:

According to some posts about PHP and this specific posting, there is a bug that affects strings longer than 5000 characters, so this function splits such strings into chunks of at most 5000 bytes and tests each part.
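
For comparison, here is a minimal sketch of the same check without the long regex, so the 5000-character limit should not come into play. It assumes either that the optional mbstring extension is loaded (mb_check_encoding) or that PCRE was built with UTF-8 support, in which case preg_match with the "u" modifier fails on a subject that is not valid UTF-8. The name is_utf8_builtin is made up for this example and is not part of the original function.

```php
<?php
// Hypothetical helper (not from the original post): validates UTF-8 with
// built-in facilities instead of the hand-written pattern.
function is_utf8_builtin($string) {
	// mbstring is an optional extension, so only use it when it is present.
	if (function_exists('mb_check_encoding')) {
		return mb_check_encoding($string, 'UTF-8');
	}
	// With the "u" modifier, preg_match() returns false (not 0) when the
	// subject is not valid UTF-8; the empty pattern matches any valid subject.
	return preg_match('//u', $string) === 1;
}

$latin1 = "caf\xE9";     // "café" in ISO-8859-1: a lone 0xE9 byte
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8

var_dump(is_utf8_builtin($utf8));                    // bool(true)
var_dump(is_utf8_builtin($latin1));                  // bool(false)
var_dump(is_utf8_builtin(str_repeat($utf8, 5000)));  // bool(true), 25000 bytes
```

The two approaches are not strictly equivalent: the W3C pattern also rejects ASCII control characters other than tab, line feed and carriage return, while the built-in checks accept any well-formed UTF-8.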
