UTF-8 and Unicode Why That Emoji Broke Your Database
Your app inserted an emoji and MySQL threw Incorrect string value. Here's why — code points vs bytes, the MySQL utf8 vs utf8mb4 lie, JavaScript surrogate pairs, and how to actually fix it.
Your app was working fine. Then a user typed an emoji in a text field, and MySQL threw Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'. Or maybe the emoji silently disappeared. Or the entire INSERT failed and you lost data. All because of a four-byte character your database column didn’t expect.
This isn’t a MySQL quirk or a PHP bug. It’s a consequence of how Unicode → UTF-8 encoding actually works, and once you understand it, you’ll never be surprised by it again.
Code points vs bytes: the actual difference
Unicode assigns every character a code point — a number. The letter A is U+0041. The Euro sign is U+20AC. The 😀 emoji is U+1F600. That’s the abstract identity of the character.
UTF-8 is an ترميز — a way to store code points as bytes. The trick is that UTF-8 is variable-width: it uses 1 to 4 bytes depending on the code point value. This is how it stays backward-compatible with ASCII (all ASCII characters are 1 byte in UTF-8) while also encoding every character in existence.
The encoding rules:
- U+0000 to U+007F (ASCII) → 1 byte
- U+0080 to U+07FF (Latin extended, Arabic, Hebrew, etc.) → 2 bytes
- U+0800 to U+FFFF (most of CJK, punctuation, symbols) → 3 bytes
- U+10000 to U+10FFFF (emoji, rare scripts, math symbols) → 4 bytes
This is why the 😀 emoji (U+1F600) takes 4 bytes: its code point is above U+FFFF.
UTF-8 byte sizes: a reference table
Here’s what common characters actually cost in bytes:
| حرف | وصف | نقطة رمز يونيكود | UTF-8 Bytes (hex) | عدد البايتات |
|---|---|---|---|---|
| أ | Latin capital A | U+0041 | 41 | 1 |
| é | Latin e with acute | U+00E9 | C3 A9 | 2 |
| € | Euro sign | U+20AC | E2 82 AC | 3 |
| 中 | Chinese character “middle” | U+4E2D | E4 B8 AD | 3 |
| 😀 | Grinning face emoji | U+1F600 | F0 9F 98 80 | 4 |
| 🔥 | Fire emoji | U+1F525 | F0 9F 94 A5 | 4 |
| 𝕳 | Mathematical fraktur H | U+1D573 | F0 9D 95 B3 | 4 |
To verify this yourself, use the حاسبة طول السلسلة — it shows both character count and byte count for any text you paste in. Paste 😀 and you’ll see 1 character but 4 bytes.
The MySQL utf8 lie
Here’s where developers get burned. MySQL has a charset called utf8. Sounds right. It’s wrong — MySQL’s utf8 only supports up to 3-byte sequences. Emoji (4 bytes) are not supported.
The actual full UTF-8 charset in MySQL is utf8mb4 (introduced in MySQL 5.5.3, released 2010). If your column uses utf8 and someone inserts an emoji, MySQL either silently truncates the data or throws:
Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio' at row 1
The fix:
-- Convert the table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Or for a specific column
ALTER TABLE users MODIFY bio TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- And set your connection charset
SET NAMES utf8mb4;
Also update your application’s database connection config. In MySQL PDO:
$pdo = new PDO($dsn, $user, $pass, [
PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4"
]);
The VARCHAR(255) trap
VARCHAR(255) in MySQL means 255 characters, not 255 bytes — but the storage limit for a single row is calculated in bytes. With utf8mb4, each character can take up to 4 bytes, so a VARCHAR(255) column reserves up to 1,020 bytes. This matters when you’re using InnoDB’s default prefix index limit (767 bytes) for indexing varchar columns:
-- This fails on older MySQL with default innodb_large_prefix=OFF
CREATE INDEX idx_email ON users (email); -- email is VARCHAR(255) utf8mb4
-- Fix: use a prefix index
CREATE INDEX idx_email ON users (email(191)); -- 191 * 4 = 764 bytes, under 767
-- Or enable large prefixes (MySQL 5.7+, on by default in 8.0)
-- Set innodb_large_prefix = ON in my.cnf
JavaScript and the surrogate pair problem
JavaScript uses UTF-16 internally, not UTF-8. And UTF-16 has its own multi-unit encoding for code points above U+FFFF: surrogate pairs — two 16-bit code units that together represent one character.
This means String.length in JavaScript counts UTF-16 code units, not characters:
'😀'.length // → 2 (two UTF-16 surrogate code units)
[...'😀'].length // → 1 (spread operator uses Unicode code points)
// Checking the character at index 0
'😀'[0] // → '\uD83D' (the high surrogate, not the emoji)
'😀'.codePointAt(0) // → 128512 (0x1F600, correct)
For string operations that need to be character-aware, use the spread operator or Intl.Segmenter:
// Count actual grapheme clusters
const segmenter = new Intl.Segmenter();
const chars = [...segmenter.segment('👨👩👧👦')];
chars.length // → 1 (family emoji is one grapheme cluster)
'👨👩👧👦'.length // → 11 (UTF-16 code units)
The family emoji example is worth pausing on. 👨👩👧👦 is four emoji joined by Zero Width Joiners (U+200D). A naïve .length gives you 11. Actual grapheme clusters: 1. This matters if you’re implementing character limits — a limit based on String.length will behave unexpectedly when users type emoji sequences.
How to check encoding in practice
بايثون
s = '😀'
print(len(s)) # 1 (Python 3 counts code points)
print(len(s.encode('utf-8'))) # 4 bytes
print(s.encode('utf-8').hex()) # f09f9880
PHP
$s = '😀';
echo strlen($s); // 4 (bytes, not characters)
echo mb_strlen($s); // 1 (characters)
echo mb_strlen($s, '8bit'); // 4 (bytes, explicit)
strlen() in PHP counts bytes, not characters. This catches PHP developers constantly when dealing with multi-byte strings — a 10-emoji string will report a length of 40. Use mb_strlen() when you care about character count.
MySQL
-- Check charset of a table
SHOW CREATE TABLE users\G
-- Check charset of a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'users' AND COLUMN_NAME = 'bio';
-- Character count vs byte count
SELECT CHAR_LENGTH('😀'), LENGTH('😀');
-- → 1, 4
Quick sanity check
If you want to see byte vs character count for arbitrary text without writing code, the حاسبة طول السلسلة handles it instantly — paste in any text and it shows character count, word count, and byte count side by side.
The encoding bug checklist
- MySQL charset: Is it
utf8mb4القيم المنطقية تبقى كنصوص.utf8? Check withSHOW CREATE TABLE. - MySQL connection: Is your app sending
SET NAMES utf8mb4? Check your DSN or connection config. - PHP strlen vs mb_strlen: Are you using byte-count functions where you need character counts?
- JavaScript .length: Are you counting code units where you need grapheme clusters?
- HTTP headers: Is your response sending
Content-Type: text/html; charset=utf-8? - File encoding: Are your source files and SQL dumps saved as UTF-8 without BOM?
تثبيت ملحقاتنا
أضف أدوات IO إلى متصفحك المفضل للوصول الفوري والبحث بشكل أسرع
恵 وصلت لوحة النتائج!
لوحة النتائج هي طريقة ممتعة لتتبع ألعابك، يتم تخزين جميع البيانات في متصفحك. المزيد من الميزات قريبا!
