UTF-8 and Unicode Why That Emoji Broke Your Database

تحديث في

Your app inserted an emoji and MySQL threw Incorrect string value. Here's why — code points vs bytes, the MySQL utf8 vs utf8mb4 lie, JavaScript surrogate pairs, and how to actually fix it.

UTF-8 and Unicode: Why That Emoji Broke Your Database 1
إعلان · حذف؟

Your app was working fine. Then a user typed an emoji in a text field, and MySQL threw Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'. Or maybe the emoji silently disappeared. Or the entire INSERT failed and you lost data. All because of a four-byte character your database column didn’t expect.

This isn’t a MySQL quirk or a PHP bug. It’s a consequence of how Unicode → UTF-8 encoding actually works, and once you understand it, you’ll never be surprised by it again.

Code points vs bytes: the actual difference

Unicode assigns every character a code point — a number. The letter A is U+0041. The Euro sign is U+20AC. The 😀 emoji is U+1F600. That’s the abstract identity of the character.

UTF-8 is an ترميز — a way to store code points as bytes. The trick is that UTF-8 is variable-width: it uses 1 to 4 bytes depending on the code point value. This is how it stays backward-compatible with ASCII (all ASCII characters are 1 byte in UTF-8) while also encoding every character in existence.

The encoding rules:

  • U+0000 to U+007F (ASCII) → 1 byte
  • U+0080 to U+07FF (Latin extended, Arabic, Hebrew, etc.) → 2 bytes
  • U+0800 to U+FFFF (most of CJK, punctuation, symbols) → 3 bytes
  • U+10000 to U+10FFFF (emoji, rare scripts, math symbols) → 4 bytes

This is why the 😀 emoji (U+1F600) takes 4 bytes: its code point is above U+FFFF.

UTF-8 byte sizes: a reference table

Here’s what common characters actually cost in bytes:

حرفوصفنقطة رمز يونيكودUTF-8 Bytes (hex)عدد البايتات
أLatin capital AU+0041411
éLatin e with acuteU+00E9C3 A92
Euro signU+20ACE2 82 AC3
Chinese character “middle”U+4E2DE4 B8 AD3
😀Grinning face emojiU+1F600F0 9F 98 804
🔥Fire emojiU+1F525F0 9F 94 A54
𝕳Mathematical fraktur HU+1D573F0 9D 95 B34

To verify this yourself, use the حاسبة طول السلسلة — it shows both character count and byte count for any text you paste in. Paste 😀 and you’ll see 1 character but 4 bytes.

The MySQL utf8 lie

Here’s where developers get burned. MySQL has a charset called utf8. Sounds right. It’s wrong — MySQL’s utf8 only supports up to 3-byte sequences. Emoji (4 bytes) are not supported.

The actual full UTF-8 charset in MySQL is utf8mb4 (introduced in MySQL 5.5.3, released 2010). If your column uses utf8 and someone inserts an emoji, MySQL either silently truncates the data or throws:

Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio' at row 1

The fix:

-- Convert the table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Or for a specific column
ALTER TABLE users MODIFY bio TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- And set your connection charset
SET NAMES utf8mb4;

Also update your application’s database connection config. In MySQL PDO:

$pdo = new PDO($dsn, $user, $pass, [
    PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4"
]);

The VARCHAR(255) trap

VARCHAR(255) in MySQL means 255 characters, not 255 bytes — but the storage limit for a single row is calculated in bytes. With utf8mb4, each character can take up to 4 bytes, so a VARCHAR(255) column reserves up to 1,020 bytes. This matters when you’re using InnoDB’s default prefix index limit (767 bytes) for indexing varchar columns:

-- This fails on older MySQL with default innodb_large_prefix=OFF
CREATE INDEX idx_email ON users (email);  -- email is VARCHAR(255) utf8mb4

-- Fix: use a prefix index
CREATE INDEX idx_email ON users (email(191));  -- 191 * 4 = 764 bytes, under 767

-- Or enable large prefixes (MySQL 5.7+, on by default in 8.0)
-- Set innodb_large_prefix = ON in my.cnf

JavaScript and the surrogate pair problem

JavaScript uses UTF-16 internally, not UTF-8. And UTF-16 has its own multi-unit encoding for code points above U+FFFF: surrogate pairs — two 16-bit code units that together represent one character.

This means String.length in JavaScript counts UTF-16 code units, not characters:

'😀'.length        // → 2 (two UTF-16 surrogate code units)
[...'😀'].length   // → 1 (spread operator uses Unicode code points)

// Checking the character at index 0
'😀'[0]            // → '\uD83D' (the high surrogate, not the emoji)
'😀'.codePointAt(0) // → 128512 (0x1F600, correct)

For string operations that need to be character-aware, use the spread operator or Intl.Segmenter:

// Count actual grapheme clusters
const segmenter = new Intl.Segmenter();
const chars = [...segmenter.segment('👨‍👩‍👧‍👦')];
chars.length     // → 1 (family emoji is one grapheme cluster)
'👨‍👩‍👧‍👦'.length  // → 11 (UTF-16 code units)

The family emoji example is worth pausing on. 👨‍👩‍👧‍👦 is four emoji joined by Zero Width Joiners (U+200D). A naïve .length gives you 11. Actual grapheme clusters: 1. This matters if you’re implementing character limits — a limit based on String.length will behave unexpectedly when users type emoji sequences.

How to check encoding in practice

بايثون

s = '😀'
print(len(s))                    # 1 (Python 3 counts code points)
print(len(s.encode('utf-8')))    # 4 bytes
print(s.encode('utf-8').hex())   # f09f9880

PHP

$s = '😀';
echo strlen($s);          // 4 (bytes, not characters)
echo mb_strlen($s);       // 1 (characters)
echo mb_strlen($s, '8bit'); // 4 (bytes, explicit)

strlen() in PHP counts bytes, not characters. This catches PHP developers constantly when dealing with multi-byte strings — a 10-emoji string will report a length of 40. Use mb_strlen() when you care about character count.

MySQL

-- Check charset of a table
SHOW CREATE TABLE users\G

-- Check charset of a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'users' AND COLUMN_NAME = 'bio';

-- Character count vs byte count
SELECT CHAR_LENGTH('😀'), LENGTH('😀');
-- → 1, 4

Quick sanity check

If you want to see byte vs character count for arbitrary text without writing code, the حاسبة طول السلسلة handles it instantly — paste in any text and it shows character count, word count, and byte count side by side.

The encoding bug checklist

  • MySQL charset: Is it utf8mb4القيم المنطقية تبقى كنصوص. utf8? Check with SHOW CREATE TABLE.
  • MySQL connection: Is your app sending SET NAMES utf8mb4? Check your DSN or connection config.
  • PHP strlen vs mb_strlen: Are you using byte-count functions where you need character counts?
  • JavaScript .length: Are you counting code units where you need grapheme clusters?
  • HTTP headers: Is your response sending Content-Type: text/html; charset=utf-8?
  • File encoding: Are your source files and SQL dumps saved as UTF-8 without BOM?
هل تريد حذف الإعلانات؟ تخلص من الإعلانات اليوم

تثبيت ملحقاتنا

أضف أدوات IO إلى متصفحك المفضل للوصول الفوري والبحث بشكل أسرع

أضف لـ إضافة كروم أضف لـ امتداد الحافة أضف لـ إضافة فايرفوكس أضف لـ ملحق الأوبرا

وصلت لوحة النتائج!

لوحة النتائج هي طريقة ممتعة لتتبع ألعابك، يتم تخزين جميع البيانات في متصفحك. المزيد من الميزات قريبا!

إعلان · حذف؟
إعلان · حذف؟
إعلان · حذف؟

ركن الأخبار مع أبرز التقنيات

شارك

ساعدنا على الاستمرار في تقديم أدوات مجانية قيمة

اشتري لي قهوة
إعلان · حذف؟