UTF-8 and Unicode Why That Emoji Broke Your Database
Your app inserted an emoji and MySQL threw Incorrect string value. Here's why — code points vs bytes, the MySQL utf8 vs utf8mb4 lie, JavaScript surrogate pairs, and how to actually fix it.
Your app was working fine. Then a user typed an emoji in a text field, and MySQL threw Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'. Or maybe the emoji silently disappeared. Or the entire INSERT failed and you lost data. All because of a four-byte character your database column didn’t expect.
This isn’t a MySQL quirk or a PHP bug. It’s a consequence of how Unicode → UTF-8 encoding actually works, and once you understand it, you’ll never be surprised by it again.
Code points vs bytes: the actual difference
Unicode assigns every character a code point — a number. The letter A is U+0041. The Euro sign is U+20AC. The 😀 emoji is U+1F600. That’s the abstract identity of the character.
UTF-8 is an pengkodean — a way to store code points as bytes. The trick is that UTF-8 is variable-width: it uses 1 to 4 bytes depending on the code point value. This is how it stays backward-compatible with ASCII (all ASCII characters are 1 byte in UTF-8) while also encoding every character in existence.
The encoding rules:
- U+0000 to U+007F (ASCII) → 1 byte
- U+0080 to U+07FF (Latin extended, Arabic, Hebrew, etc.) → 2 bytes
- U+0800 to U+FFFF (most of CJK, punctuation, symbols) → 3 bytes
- U+10000 to U+10FFFF (emoji, rare scripts, math symbols) → 4 bytes
This is why the 😀 emoji (U+1F600) takes 4 bytes: its code point is above U+FFFF.
UTF-8 byte sizes: a reference table
Here’s what common characters actually cost in bytes:
| Karakter | Keterangan | Titik Kode Unicode | UTF-8 Bytes (hex) | Jumlah Byte |
|---|---|---|---|---|
| A | Latin capital A | U+0041 | 41 | 1 |
| é | Latin e with acute | U+00E9 | C3 A9 | 2 |
| € | Euro sign | U+20AC | E2 82 AC | 3 |
| 中 | Chinese character “middle” | U+4E2D | E4 B8 AD | 3 |
| 😀 | Grinning face emoji | U+1F600 | F0 9F 98 80 | 4 |
| 🔥 | Fire emoji | U+1F525 | F0 9F 94 A5 | 4 |
| 𝕳 | Mathematical fraktur H | U+1D573 | F0 9D 95 B3 | 4 |
To verify this yourself, use the Kalkulator Panjang String — it shows both character count and byte count for any text you paste in. Paste 😀 and you’ll see 1 character but 4 bytes.
The MySQL utf8 lie
Here’s where developers get burned. MySQL has a charset called utf8. Sounds right. It’s wrong — MySQL’s utf8 only supports up to 3-byte sequences. Emoji (4 bytes) are not supported.
The actual full UTF-8 charset in MySQL is utf8mb4 (introduced in MySQL 5.5.3, released 2010). If your column uses utf8 and someone inserts an emoji, MySQL either silently truncates the data or throws:
Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio' at row 1
The fix:
-- Convert the table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Or for a specific column
ALTER TABLE users MODIFY bio TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- And set your connection charset
SET NAMES utf8mb4;
Also update your application’s database connection config. In MySQL PDO:
$pdo = new PDO($dsn, $user, $pass, [
PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4"
]);
The VARCHAR(255) trap
VARCHAR(255) in MySQL means 255 characters, not 255 bytes — but the storage limit for a single row is calculated in bytes. With utf8mb4, each character can take up to 4 bytes, so a VARCHAR(255) column reserves up to 1,020 bytes. This matters when you’re using InnoDB’s default prefix index limit (767 bytes) for indexing varchar columns:
-- This fails on older MySQL with default innodb_large_prefix=OFF
CREATE INDEX idx_email ON users (email); -- email is VARCHAR(255) utf8mb4
-- Fix: use a prefix index
CREATE INDEX idx_email ON users (email(191)); -- 191 * 4 = 764 bytes, under 767
-- Or enable large prefixes (MySQL 5.7+, on by default in 8.0)
-- Set innodb_large_prefix = ON in my.cnf
JavaScript and the surrogate pair problem
JavaScript uses UTF-16 internally, not UTF-8. And UTF-16 has its own multi-unit encoding for code points above U+FFFF: surrogate pairs — two 16-bit code units that together represent one character.
This means String.length in JavaScript counts UTF-16 code units, not characters:
'😀'.length // → 2 (two UTF-16 surrogate code units)
[...'😀'].length // → 1 (spread operator uses Unicode code points)
// Checking the character at index 0
'😀'[0] // → '\uD83D' (the high surrogate, not the emoji)
'😀'.codePointAt(0) // → 128512 (0x1F600, correct)
For string operations that need to be character-aware, use the spread operator or Intl.Segmenter:
// Count actual grapheme clusters
const segmenter = new Intl.Segmenter();
const chars = [...segmenter.segment('👨👩👧👦')];
chars.length // → 1 (family emoji is one grapheme cluster)
'👨👩👧👦'.length // → 11 (UTF-16 code units)
The family emoji example is worth pausing on. 👨👩👧👦 is four emoji joined by Zero Width Joiners (U+200D). A naïve .length gives you 11. Actual grapheme clusters: 1. This matters if you’re implementing character limits — a limit based on String.length will behave unexpectedly when users type emoji sequences.
How to check encoding in practice
Ular piton
s = '😀'
print(len(s)) # 1 (Python 3 counts code points)
print(len(s.encode('utf-8'))) # 4 bytes
print(s.encode('utf-8').hex()) # f09f9880
PHP
$s = '😀';
echo strlen($s); // 4 (bytes, not characters)
echo mb_strlen($s); // 1 (characters)
echo mb_strlen($s, '8bit'); // 4 (bytes, explicit)
strlen() in PHP counts bytes, not characters. This catches PHP developers constantly when dealing with multi-byte strings — a 10-emoji string will report a length of 40. Use mb_strlen() when you care about character count.
MySQL
-- Check charset of a table
SHOW CREATE TABLE users\G
-- Check charset of a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'users' AND COLUMN_NAME = 'bio';
-- Character count vs byte count
SELECT CHAR_LENGTH('😀'), LENGTH('😀');
-- → 1, 4
Quick sanity check
If you want to see byte vs character count for arbitrary text without writing code, the Kalkulator Panjang String handles it instantly — paste in any text and it shows character count, word count, and byte count side by side.
The encoding bug checklist
- MySQL charset: Is it
utf8mb4, bukanutf8? Check withSHOW CREATE TABLE. - MySQL connection: Is your app sending
SET NAMES utf8mb4? Check your DSN or connection config. - PHP strlen vs mb_strlen: Are you using byte-count functions where you need character counts?
- JavaScript .length: Are you counting code units where you need grapheme clusters?
- HTTP headers: Is your response sending
Content-Type: text/html; charset=utf-8? - File encoding: Are your source files and SQL dumps saved as UTF-8 without BOM?
Instal Ekstensi Kami
Tambahkan alat IO ke browser favorit Anda untuk akses instan dan pencarian lebih cepat
恵 Papan Skor Telah Tiba!
Papan Skor adalah cara yang menyenangkan untuk melacak permainan Anda, semua data disimpan di browser Anda. Lebih banyak fitur akan segera hadir!
Alat Wajib Coba
Lihat semua Pendatang baru
Lihat semuaMemperbarui: Kita alat terbaru was added on Jun 2, 2026
