Why does MySQL have utf8 if it doesn’t support all of UTF-8?

Historical mistake. MySQL added UTF-8 support in version 4.1 (2003) with a 3-byte limit, predating wide emoji adoption. When 4-byte characters became common, they added utf8mb4 instead of fixing utf8 — changing utf8 would have broken existing data. The utf8 alias still exists for backward compatibility and causes confusion to this day.

Does switching to utf8mb4 break existing data?

No. Existing ASCII and 3-byte UTF-8 data is fully compatible with utf8mb4 — the encoding is a superset. You won’t corrupt anything already stored. The risk is index key length (767-byte limit on older InnoDB configs), so audit your indexes before running a table conversion in production.

Is UTF-8 the same as Unicode?

No. Unicode is the standard that assigns code points to characters. UTF-8 is one encoding of those code points into bytes. UTF-16 and UTF-32 are two other encodings of the same Unicode code points. JavaScript uses UTF-16 internally; most web APIs, files, and databases use UTF-8.

What’s the difference between a character, a code point, and a grapheme cluster?

A code point is a number in the Unicode standard (e.g. U+1F600). A “character” in programmer-speak usually means a code point. A grapheme cluster is what a human perceives as a single visible character — which can be multiple code points joined together (e.g. an emoji with a skin tone modifier, or a family emoji with ZWJ sequences). Intl.Segmenter in JavaScript handles grapheme clusters correctly; String.length does not.

Tidak suka iklan? Pergi Bebas Iklan Hari ini 

UTF-8 and Unicode Why That Emoji Broke Your Database

Diperbarui pada Jun 2, 2026

Your app inserted an emoji and MySQL threw Incorrect string value. Here's why — code points vs bytes, the MySQL utf8 vs utf8mb4 lie, JavaScript surrogate pairs, and how to actually fix it.

UTF-8 and Unicode: Why That Emoji Broke Your Database 1

IKLAN · HAPUS?

Your app was working fine. Then a user typed an emoji in a text field, and MySQL threw Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'. Or maybe the emoji silently disappeared. Or the entire INSERT failed and you lost data. All because of a four-byte character your database column didn’t expect.

This isn’t a MySQL quirk or a PHP bug. It’s a consequence of how Unicode → UTF-8 encoding actually works, and once you understand it, you’ll never be surprised by it again.

Code points vs bytes: the actual difference

Unicode assigns every character a code point — a number. The letter A is U+0041. The Euro sign is U+20AC. The 😀 emoji is U+1F600. That’s the abstract identity of the character.

UTF-8 is an pengkodean — a way to store code points as bytes. The trick is that UTF-8 is variable-width: it uses 1 to 4 bytes depending on the code point value. This is how it stays backward-compatible with ASCII (all ASCII characters are 1 byte in UTF-8) while also encoding every character in existence.

The encoding rules:

U+0000 to U+007F (ASCII) → 1 byte
U+0080 to U+07FF (Latin extended, Arabic, Hebrew, etc.) → 2 bytes
U+0800 to U+FFFF (most of CJK, punctuation, symbols) → 3 bytes
U+10000 to U+10FFFF (emoji, rare scripts, math symbols) → 4 bytes

This is why the 😀 emoji (U+1F600) takes 4 bytes: its code point is above U+FFFF.

UTF-8 byte sizes: a reference table

Here’s what common characters actually cost in bytes:

Karakter	Keterangan	Titik Kode Unicode	UTF-8 Bytes (hex)	Jumlah Byte
A	Latin capital A	U+0041	41	1
é	Latin e with acute	U+00E9	C3 A9	2
€	Euro sign	U+20AC	E2 82 AC	3
中	Chinese character “middle”	U+4E2D	E4 B8 AD	3
😀	Grinning face emoji	U+1F600	F0 9F 98 80	4
🔥	Fire emoji	U+1F525	F0 9F 94 A5	4
𝕳	Mathematical fraktur H	U+1D573	F0 9D 95 B3	4

To verify this yourself, use the Kalkulator Panjang String — it shows both character count and byte count for any text you paste in. Paste 😀 and you’ll see 1 character but 4 bytes.

The MySQL utf8 lie

Here’s where developers get burned. MySQL has a charset called utf8. Sounds right. It’s wrong — MySQL’s utf8 only supports up to 3-byte sequences. Emoji (4 bytes) are not supported.

The actual full UTF-8 charset in MySQL is utf8mb4 (introduced in MySQL 5.5.3, released 2010). If your column uses utf8 and someone inserts an emoji, MySQL either silently truncates the data or throws:

Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio' at row 1

The fix:

-- Convert the table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Or for a specific column
ALTER TABLE users MODIFY bio TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- And set your connection charset
SET NAMES utf8mb4;

Also update your application’s database connection config. In MySQL PDO:

$pdo = new PDO($dsn, $user, $pass, [
    PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4"
]);

The VARCHAR(255) trap

VARCHAR(255) in MySQL means 255 characters, not 255 bytes — but the storage limit for a single row is calculated in bytes. With utf8mb4, each character can take up to 4 bytes, so a VARCHAR(255) column reserves up to 1,020 bytes. This matters when you’re using InnoDB’s default prefix index limit (767 bytes) for indexing varchar columns:

-- This fails on older MySQL with default innodb_large_prefix=OFF
CREATE INDEX idx_email ON users (email);  -- email is VARCHAR(255) utf8mb4

-- Fix: use a prefix index
CREATE INDEX idx_email ON users (email(191));  -- 191 * 4 = 764 bytes, under 767

-- Or enable large prefixes (MySQL 5.7+, on by default in 8.0)
-- Set innodb_large_prefix = ON in my.cnf

JavaScript and the surrogate pair problem

JavaScript uses UTF-16 internally, not UTF-8. And UTF-16 has its own multi-unit encoding for code points above U+FFFF: surrogate pairs — two 16-bit code units that together represent one character.

This means String.length in JavaScript counts UTF-16 code units, not characters:

'😀'.length        // → 2 (two UTF-16 surrogate code units)
[...'😀'].length   // → 1 (spread operator uses Unicode code points)

// Checking the character at index 0
'😀'[0]            // → '\uD83D' (the high surrogate, not the emoji)
'😀'.codePointAt(0) // → 128512 (0x1F600, correct)

For string operations that need to be character-aware, use the spread operator or Intl.Segmenter:

// Count actual grapheme clusters
const segmenter = new Intl.Segmenter();
const chars = [...segmenter.segment('👨‍👩‍👧‍👦')];
chars.length     // → 1 (family emoji is one grapheme cluster)
'👨‍👩‍👧‍👦'.length  // → 11 (UTF-16 code units)

The family emoji example is worth pausing on. 👨‍👩‍👧‍👦 is four emoji joined by Zero Width Joiners (U+200D). A naïve .length gives you 11. Actual grapheme clusters: 1. This matters if you’re implementing character limits — a limit based on String.length will behave unexpectedly when users type emoji sequences.

How to check encoding in practice

Ular piton

s = '😀'
print(len(s))                    # 1 (Python 3 counts code points)
print(len(s.encode('utf-8')))    # 4 bytes
print(s.encode('utf-8').hex())   # f09f9880

PHP

$s = '😀';
echo strlen($s);          // 4 (bytes, not characters)
echo mb_strlen($s);       // 1 (characters)
echo mb_strlen($s, '8bit'); // 4 (bytes, explicit)

strlen() in PHP counts bytes, not characters. This catches PHP developers constantly when dealing with multi-byte strings — a 10-emoji string will report a length of 40. Use mb_strlen() when you care about character count.

MySQL

-- Check charset of a table
SHOW CREATE TABLE users\G

-- Check charset of a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'users' AND COLUMN_NAME = 'bio';

-- Character count vs byte count
SELECT CHAR_LENGTH('😀'), LENGTH('😀');
-- → 1, 4

Quick sanity check

If you want to see byte vs character count for arbitrary text without writing code, the Kalkulator Panjang String handles it instantly — paste in any text and it shows character count, word count, and byte count side by side.

The encoding bug checklist

MySQL charset: Is it utf8mb4, bukan utf8? Check with SHOW CREATE TABLE.
MySQL connection: Is your app sending SET NAMES utf8mb4? Check your DSN or connection config.
PHP strlen vs mb_strlen: Are you using byte-count functions where you need character counts?
JavaScript .length: Are you counting code units where you need grapheme clusters?
HTTP headers: Is your response sending Content-Type: text/html; charset=utf-8?
File encoding: Are your source files and SQL dumps saved as UTF-8 without BOM?