~ overflow ~

Tag: utf8

UTF8 Conversion Functions

by z3n on Apr.24, 2010, under Coding, Tips & Hints

Problem:

Things get messy when you run into a string that isn't utf8, and messier still with a mixed variable (an array, say) in which some values are utf8 and others aren't.

Solution:

Building on my previous posts, I've improved my utf8 functions, using this great post as a reference.

define('_is_utf8_split',5000);

function utf8_encode_array(&$x) {
	if (is_array($x)) {
		foreach ($x as &$v) // loop through arrays and/or items
			$v=utf8_encode_array($v);
		return $x;
	} else // not array
		return !is_utf8($x) ? utf8_encode($x) : $x;
}
function to_utf8($x) { // v1.01
	/*
		This function will convert a string or an array to utf8.
		The input can have mixed encodings.

		-- 100424
	*/
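	if (is_array($x)) {
		foreach ($x as &$v) { // convert each item on its own, so mixed encodings are handled
			$v=to_utf8($v);
		}
		unset($v); // drop the dangling reference left by foreach
	} elseif (!is_utf8($x)) {
		$x=utf8_encode($x); // assumes anything that isn't utf8 is ISO-8859-1
	}
	return $x;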
}
function is_utf8($string) { // v1.03
	if (is_array($string)) {
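		foreach ($string as $v) {
			if (!is_utf8($v)) // test the item; testing $string again would recurse forever
				return false;
		}
		return true; // an array is utf8 only if every item is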
	} elseif (strlen($string) > _is_utf8_split) {
		// Based on: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
		for ($s=$i=0,$j=ceil(strlen($string)/_is_utf8_split);$i < $j;$i++,$s+=_is_utf8_split) {
			if (!is_utf8(substr($string,$s,_is_utf8_split)))
				return false;
		}
		return true;
	} else {
		// From http://w3.org/International/questions/qa-forms-utf-8.html
		return preg_match('%^(?:
				[\x09\x0A\x0D\x20-\x7E]            # ASCII
			| [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
			|  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
			| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
			|  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
			|  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
			| [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
			|  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
		)*$%xs', $string);
	}
}
function _r_json($x) {
	$x=utf8_encode_array($x);
	echo json_encode($x);
}
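
For comparison, here is a minimal sketch of the same idea built on the mbstring extension instead of the regex. It assumes mbstring is loaded and, like utf8_encode above, that anything which isn't valid utf8 is ISO-8859-1. The name to_utf8_mb is made up for this example and is not part of the functions above.

```php
<?php
// Hypothetical helper, not from the original post.
// Assumes the mbstring extension is loaded and that non-utf8 input is ISO-8859-1.
function to_utf8_mb($x) {
    if (is_array($x)) {
        // Convert every item on its own so mixed encodings are handled
        foreach ($x as $k => $v) {
            $x[$k] = to_utf8_mb($v);
        }
        return $x;
    }
    // Leave non-strings and already valid utf8 strings untouched
    if (!is_string($x) || mb_check_encoding($x, 'UTF-8')) {
        return $x;
    }
    return mb_convert_encoding($x, 'UTF-8', 'ISO-8859-1');
}

// One latin1 string, one utf8 string and a nested array with a number
$mixed = array("S\xE3o Paulo", "S\xC3\xA3o Paulo", array('n' => 42));
$fixed = to_utf8_mb($mixed);
echo json_encode($fixed), "\n";
```

Passing the array by value and returning a copy avoids the by-reference foreach, at the cost of some extra memory on large arrays.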

How to detect if a string is utf8 on php?

by z3n on Apr.24, 2010, under Coding, Tips & Hints

Problem:
While debugging utf8 strings I came across a string that might or might not be utf8, thanks to IE. PHP core has no is_utf8 function to tell whether a string is actually utf8.

Solution:

define('_is_utf8_split',5000);

function is_utf8($string) { // v1.01
	if (strlen($string) > _is_utf8_split) {
		// Based on: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
		for ($i=0,$s=0,$j=ceil(strlen($string)/_is_utf8_split);$i < $j;$i++,$s+=_is_utf8_split) { // start at offset 0 so the first chunk is tested too
			if (!is_utf8(substr($string,$s,_is_utf8_split)))
				return false; // one bad chunk makes the whole string invalid
		}
		return true;
	} else {
		// From http://w3.org/International/questions/qa-forms-utf-8.html
		return preg_match('%^(?:
				[\x09\x0A\x0D\x20-\x7E]            # ASCII
			| [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
			|  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
			| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
			|  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
			|  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
			| [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
			|  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
		)*$%xs', $string);
	}
}  

Notes:

According to some posts on php and this specific posting, there's a bug that hits strings longer than about 5000 bytes, so this function splits such strings and tests the parts one by one.
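
One thing to watch with fixed-size splitting: substr() counts bytes, so a cut can land in the middle of a multi-byte character and make a valid string look invalid. Below is a rough sketch of a splitter that backs up to the nearest character boundary. The function name utf8_safe_chunks is hypothetical, and the preg_match('//u', ...) check is used here only as a quick validity test for the demo.

```php
<?php
// Hypothetical helper, not from the original post.
// Splits $s into pieces of at most $size bytes without cutting a utf8 character.
function utf8_safe_chunks($s, $size = 5000) {
    $chunks = array();
    $len = strlen($s);
    $pos = 0;
    while ($pos < $len) {
        $end = min($pos + $size, $len);
        // If the next byte is a continuation byte (10xxxxxx), the cut would
        // split a character: back up at most 3 bytes to reach its lead byte.
        $back = 0;
        while ($end < $len && $back < 3 && (ord($s[$end]) & 0xC0) === 0x80) {
            $end--;
            $back++;
        }
        if ($end <= $pos) {
            // Chunk size smaller than one character: fall back to a plain cut
            $end = min($pos + $size, $len);
        }
        $chunks[] = substr($s, $pos, $end - $pos);
        $pos = $end;
    }
    return $chunks;
}

// Four two-byte characters, split with a deliberately awkward size of 3 bytes
$text = str_repeat("\xC3\xA3", 4);
$parts = utf8_safe_chunks($text, 3);
foreach ($parts as $p) {
    echo bin2hex($p), ' ', preg_match('//u', $p) === 1 ? 'valid' : 'invalid', "\n";
}
```

On invalid input the splitter simply gives up backing off after three bytes, and the bad bytes then end up at the start of a chunk where the validity test rejects them.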


MySQL importing .sql with accents causing issues

by z3n on Dec.30, 2009, under Uncategorized

Problem:

When importing a .sql file whose entries contain accents (anything beyond plain English), you may get garbled text, like:

‘São Paulo’ instead of ‘São Paulo’

Solution:

Even though mysqld's default charset is latin1, it sometimes doesn't work with accents, depending on the import you're doing.

So you may need to force it to switch to utf8. In my case I just added this to the beginning of the .sql file I was importing:

charset utf8 \c

and it worked just fine.

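Note: MySQL's utf8 charset stores at most three bytes per character, so if you are using Asian chars (Japanese/Chinese specific) that fall outside the Basic Multilingual Plane, utf8 will not cover them; utf8mb4 (MySQL 5.5.3 and later) does.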
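
The garbled text in the Problem above is what utf8 bytes look like when they are read as latin1 and then re-encoded. The short sketch below reproduces the symptom in PHP. It assumes the mbstring extension is loaded, and it only illustrates the byte-level effect, not what the mysql client does internally.

```php
<?php
// Illustration only; assumes the mbstring extension is loaded.
$utf8 = "S\xC3\xA3o Paulo"; // "São Paulo" encoded as utf8

// Treat those bytes as latin1 and convert them to utf8 a second time
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $garbled, "\n";          // shows as "São Paulo" on a utf8 terminal
echo bin2hex($garbled), "\n"; // the two-byte "ã" has become four bytes

// Undoing the extra conversion brings the original string back
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
echo $restored === $utf8 ? "restored\n" : "still broken\n";
```

Telling the client the real charset of the file, as the charset line above does, avoids that second conversion in the first place.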

