#1112 ✓stale
Jamis Buck

"redundant UTF-8 sequence" in String#to_json

Reported by Jamis Buck | September 25th, 2008 @ 04:41 PM | in 2.x

Certain strings (which are otherwise valid UTF-8 sequences) will cause String#to_json to raise an ArgumentError (redundant UTF-8 sequence). Upon investigating, it turns out to be due to String#to_json's use of String#unpack("U*").


Further investigation showed that any two-byte sequence beginning with 0xC0 or 0xC1, with the second byte in the range 0x80..0xBF, causes String#unpack("U*") to raise that exception. Because String#to_json explicitly includes 0xC0 and 0xC1 in its gsub regex, it seems simplest to have it check only for 0xC2 and up, which avoids the ArgumentError. (The alternative would be to find some way to normalize the redundant sequences to their shorter equivalents, but I'm not clear on how to make that happen.)
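The ranges described above can be checked with a short script. This is only a sketch, not the script attached to this ticket: `overlong_lead_bytes` is a hypothetical helper name, and the two regexes are simplified stand-ins for the two-byte branch of the pattern in String#to_json (before and after the proposed change), not copies of the actual source.

```ruby
# Sketch only: reports which two-byte lead bytes make unpack("U*") raise.
# overlong_lead_bytes is a hypothetical helper, not part of ActiveSupport.
def overlong_lead_bytes(leads = 0xC0..0xDF, trails = 0x80..0xBF)
  leads.select do |lead|
    trails.any? do |trail|
      begin
        # Build the raw two-byte string and decode it as UTF-8 code points.
        [lead, trail].pack("C*").unpack("U*")
        false
      rescue ArgumentError
        # Raised for sequences that unpack considers redundant (overlong).
        true
      end
    end
  end
end

# Simplified stand-ins for the two-byte branch of the to_json pattern.
# These are assumptions for illustration, not the real ActiveSupport regex.
TWO_BYTE_BEFORE = /[\xC0-\xDF][\x80-\xBF]/n  # still matches 0xC0 and 0xC1
TWO_BYTE_AFTER  = /[\xC2-\xDF][\x80-\xBF]/n  # proposed: start at 0xC2

redundant = [0xC0, 0x80].pack("C*")
puts overlong_lead_bytes.map { |b| format("0x%02X", b) }.join(", ")
puts "old pattern matches redundant sequence: #{!(redundant =~ TWO_BYTE_BEFORE).nil?}"
puts "new pattern matches redundant sequence: #{!(redundant =~ TWO_BYTE_AFTER).nil?}"
```

With the narrower pattern, a sequence starting with 0xC0 or 0xC1 is never handed to unpack("U*"), so the exception cannot be triggered from that branch. The sequence is then passed through unescaped rather than normalized.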

I've attached a patch making this (trivial) change, as well as a script that can be used to demonstrate the error for the ranges mentioned above.
