広告が嫌いですか? 行く 広告なし 今日

UTF-8 and Unicode Why That Emoji Broke Your Database

更新日

Your app inserted an emoji and MySQL threw Incorrect string value. Here's why — code points vs bytes, the MySQL utf8 vs utf8mb4 lie, JavaScript surrogate pairs, and how to actually fix it.

UTF-8 and Unicode: Why That Emoji Broke Your Database 1

Your app was working fine. Then a user typed an emoji in a text field, and MySQL threw Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'. Or maybe the emoji silently disappeared. Or the entire INSERT failed and you lost data. All because of a four-byte character your database column didn’t expect.

This isn’t a MySQL quirk or a PHP bug. It’s a consequence of how Unicode → UTF-8 encoding actually works, and once you understand it, you’ll never be surprised by it again.

Code points vs bytes: the actual difference

Unicode assigns every character a code point — a number. The letter A is U+0041. The Euro sign is U+20AC. The 😀 emoji is U+1F600. That’s the abstract identity of the character.

UTF-8 is an エンコーディング — a way to store code points as bytes. The trick is that UTF-8 is variable-width: it uses 1 to 4 bytes depending on the code point value. This is how it stays backward-compatible with ASCII (all ASCII characters are 1 byte in UTF-8) while also encoding every character in existence.

The encoding rules:

  • U+0000 to U+007F (ASCII) → 1 byte
  • U+0080 to U+07FF (Latin extended, Arabic, Hebrew, etc.) → 2 bytes
  • U+0800 to U+FFFF (most of CJK, punctuation, symbols) → 3 bytes
  • U+10000 to U+10FFFF (emoji, rare scripts, math symbols) → 4 bytes

This is why the 😀 emoji (U+1F600) takes 4 bytes: its code point is above U+FFFF.

UTF-8 byte sizes: a reference table

Here’s what common characters actually cost in bytes:

文字説明ユニコードコードポイントUTF-8 Bytes (hex)バイト数
Latin capital AU+0041411
éLatin e with acuteU+00E9C3 A92
Euro signU+20ACE2 82 AC3
Chinese character “middle”U+4E2DE4 B8 AD3
😀Grinning face emojiU+1F600F0 9F 98 804
🔥Fire emojiU+1F525F0 9F 94 A54
𝕳Mathematical fraktur HU+1D573F0 9D 95 B34

To verify this yourself, use the 文字列長計算機 — it shows both character count and byte count for any text you paste in. Paste 😀 and you’ll see 1 character but 4 bytes.

The MySQL utf8 lie

Here’s where developers get burned. MySQL has a charset called utf8. Sounds right. It’s wrong — MySQL’s utf8 only supports up to 3-byte sequences. Emoji (4 bytes) are not supported.

The actual full UTF-8 charset in MySQL is utf8mb4 (introduced in MySQL 5.5.3, released 2010). If your column uses utf8 and someone inserts an emoji, MySQL either silently truncates the data or throws:

Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio' at row 1

The fix:

-- Convert the table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Or for a specific column
ALTER TABLE users MODIFY bio TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- And set your connection charset
SET NAMES utf8mb4;

Also update your application’s database connection config. In MySQL PDO:

$pdo = new PDO($dsn, $user, $pass, [
    PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4"
]);

The VARCHAR(255) trap

VARCHAR(255) in MySQL means 255 characters, not 255 bytes — but the storage limit for a single row is calculated in bytes. With utf8mb4, each character can take up to 4 bytes, so a VARCHAR(255) column reserves up to 1,020 bytes. This matters when you’re using InnoDB’s default prefix index limit (767 bytes) for indexing varchar columns:

-- This fails on older MySQL with default innodb_large_prefix=OFF
CREATE INDEX idx_email ON users (email);  -- email is VARCHAR(255) utf8mb4

-- Fix: use a prefix index
CREATE INDEX idx_email ON users (email(191));  -- 191 * 4 = 764 bytes, under 767

-- Or enable large prefixes (MySQL 5.7+, on by default in 8.0)
-- Set innodb_large_prefix = ON in my.cnf

JavaScript and the surrogate pair problem

JavaScript uses UTF-16 internally, not UTF-8. And UTF-16 has its own multi-unit encoding for code points above U+FFFF: surrogate pairs — two 16-bit code units that together represent one character.

This means String.length in JavaScript counts UTF-16 code units, not characters:

'😀'.length        // → 2 (two UTF-16 surrogate code units)
[...'😀'].length   // → 1 (spread operator uses Unicode code points)

// Checking the character at index 0
'😀'[0]            // → '\uD83D' (the high surrogate, not the emoji)
'😀'.codePointAt(0) // → 128512 (0x1F600, correct)

For string operations that need to be character-aware, use the spread operator or Intl.Segmenter:

// Count actual grapheme clusters
const segmenter = new Intl.Segmenter();
const chars = [...segmenter.segment('👨‍👩‍👧‍👦')];
chars.length     // → 1 (family emoji is one grapheme cluster)
'👨‍👩‍👧‍👦'.length  // → 11 (UTF-16 code units)

The family emoji example is worth pausing on. 👨‍👩‍👧‍👦 is four emoji joined by Zero Width Joiners (U+200D). A naïve .length gives you 11. Actual grapheme clusters: 1. This matters if you’re implementing character limits — a limit based on String.length will behave unexpectedly when users type emoji sequences.

How to check encoding in practice

パイソン

s = '😀'
print(len(s))                    # 1 (Python 3 counts code points)
print(len(s.encode('utf-8')))    # 4 bytes
print(s.encode('utf-8').hex())   # f09f9880

PHP

$s = '😀';
echo strlen($s);          // 4 (bytes, not characters)
echo mb_strlen($s);       // 1 (characters)
echo mb_strlen($s, '8bit'); // 4 (bytes, explicit)

strlen() in PHP counts bytes, not characters. This catches PHP developers constantly when dealing with multi-byte strings — a 10-emoji string will report a length of 40. Use mb_strlen() when you care about character count.

MySQL

-- Check charset of a table
SHOW CREATE TABLE users\G

-- Check charset of a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'users' AND COLUMN_NAME = 'bio';

-- Character count vs byte count
SELECT CHAR_LENGTH('😀'), LENGTH('😀');
-- → 1, 4

Quick sanity check

If you want to see byte vs character count for arbitrary text without writing code, the 文字列長計算機 handles it instantly — paste in any text and it shows character count, word count, and byte count side by side.

The encoding bug checklist

  • MySQL charset: Is it utf8mb4、ではなく utf8? Check with SHOW CREATE TABLE.
  • MySQL connection: Is your app sending SET NAMES utf8mb4? Check your DSN or connection config.
  • PHP strlen vs mb_strlen: Are you using byte-count functions where you need character counts?
  • JavaScript .length: Are you counting code units where you need grapheme clusters?
  • HTTP headers: Is your response sending Content-Type: text/html; charset=utf-8?
  • File encoding: Are your source files and SQL dumps saved as UTF-8 without BOM?
広告なしで楽しみたいですか? 今すぐ広告なしで

拡張機能をインストールする

お気に入りのブラウザにIOツールを追加して、すぐにアクセスし、検索を高速化します。

に追加 Chrome拡張機能 に追加 エッジ拡張 に追加 Firefox 拡張機能 に追加 Opera 拡張機能

スコアボードが到着しました!

スコアボード ゲームを追跡する楽しい方法です。すべてのデータはブラウザに保存されます。さらに多くの機能がまもなく登場します!

ニュースコーナー 技術ハイライト付き

参加する

価値ある無料ツールの提供を継続するためにご協力ください

コーヒーを買って