Inflector#transliterate fails on many European characters
Reported by Norman Clarke | April 12th, 2010 @ 08:32 PM | in 3.0.2
Inflector.transliterate fails with many common European characters:
Inflector.transliterate("Ærøskøbing") # "rskbing"
In this case, the example is ironically a city in Denmark, DHH's
home country.
The full list of failing characters is as follows:
Æ Ð Ø Þ ß æ ð ø þ Đ đ Ħ ħ ı Ĳ ĳ ĸ Ŀ ŀ Ł ł ŉ Ŋ ŋ Œ œ Ŧ ŧ
The reason for the failure is that ActiveSupport::Inflector::Transliterate relies either on Iconv, which gives variable and often insufficient results, or on UTF-8 decomposition, and the characters listed above (some surprisingly) do not decompose.
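The decomposition gap can be reproduced with Ruby's built-in `String#unicode_normalize`. That method arrived in Ruby 2.2, so it is newer than the Rubies discussed in this ticket and is used here only as a stand-in for the ActiveSupport::Multibyte normalization. The helper name is hypothetical. A minimal sketch:

```ruby
# encoding: utf-8

# Hypothetical helper: decompose, then strip whatever is still non-ASCII,
# mirroring the normalize(:kd) + gsub approach of the current code.
def strip_after_decomposition(string)
  string.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]+/, '')
end

# "ö" decomposes to "o" plus a combining diaeresis, so the base letter survives.
puts strip_after_decomposition("Malmö")

# "Æ" and "ø" have no decomposition, so they are removed outright.
puts strip_after_decomposition("Ærøskøbing")
```

The second call loses the same letters as the example at the top of this ticket.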
The patch I will attach after filing this ticket resolves this issue by removing Iconv, and relying on UTF-8 decomposition plus a check against a hash of approximations.
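The idea can be sketched roughly as follows. This is not the attached patch: the table is a small illustrative subset, the method name is hypothetical, the order of the two steps is my own choice, and `String#unicode_normalize` (Ruby 2.2+) stands in for the ActiveSupport::Multibyte normalization.

```ruby
# encoding: utf-8

# Illustrative subset only; the patch's actual table is not reproduced here.
APPROXIMATIONS = {
  "Æ" => "AE", "æ" => "ae", "Ø" => "O", "ø" => "o",
  "ß" => "ss", "Ł" => "L", "ł" => "l", "Œ" => "OE", "œ" => "oe"
}.freeze

# Hypothetical sketch: look each non-ASCII character up in the table first,
# then decompose what is left and strip the remaining non-ASCII characters.
def transliterate_sketch(string)
  approximated = string.gsub(/[^\x00-\x7F]/) { |char| APPROXIMATIONS.fetch(char, char) }
  approximated.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]+/, '')
end

puts transliterate_sketch("Ærøskøbing")
puts transliterate_sketch("Malmö, Sweden")
```

Characters found in the table never reach the stripping step, which is what keeps letters such as "Æ" and "ø" from disappearing.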
The patch also invokes #tidy_bytes before transliterating. This can rescue ISO-8859-1 and CP-1252 letter characters that the current implementation simply deletes, and allows for the removal of some code from Inflector#parameterize.
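To illustrate what rescuing legacy bytes means, here is a deliberately simplified stand-in written with plain Ruby encoding calls. It is not the #tidy_bytes implementation: the method name is hypothetical, and it reinterprets the whole string as Windows-1252 whenever the string is not valid UTF-8, which is only reasonable when the input does not mix encodings.

```ruby
# encoding: utf-8

# Hypothetical, whole-string simplification of the "tidy bytes" idea.
def rescue_legacy_bytes(string)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: assume the bytes are Windows-1252 (a superset of the
  # ISO-8859-1 letters) and convert them, replacing undefined bytes with "?".
  string.dup.force_encoding(Encoding::Windows_1252)
        .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

legacy = "Malm\xF6".b # 0xF6 is "ö" in ISO-8859-1 and CP-1252
puts rescue_legacy_bytes(legacy)
```

Once the bytes are valid UTF-8, the "ö" can be transliterated like any other accented letter instead of being deleted.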
I have also added several tests that put #transliterate through its paces with all characters that have reasonable approximations in order to avoid future regressions.
A few observations:
Performance
The new code performs a bit more slowly than the current implementation:
benchmark do
  10_000.times do
    ActiveSupport::Inflector.transliterate("Malmö, Sweden")
  end
end
# 1.9.1 before patch: 472.4ms
# 1.9.1 after patch: 704.8ms
# ree before patch: 932.6ms
# ree after patch: 1597.9ms
I think the improvement in reliability and correctness makes up for this difference in performance - it's still fast, just not AS fast.
Why I removed Iconv
The current code tries to use Iconv when available:
# Replaces accented characters with their ascii equivalents.
def transliterate(string)
  Iconv.iconv('ascii//ignore//translit', 'utf-8', string).to_s
end
if RUBY_VERSION >= '1.9'
  undef_method :transliterate
  def transliterate(string)
    proxy = ActiveSupport::Multibyte.proxy_class.new(string)
    proxy.normalize(:kd).gsub(/[^\x00-\x7F]+/, '')
  end
# The iconv transliteration code doesn't function correctly
# on some platforms, but it's very fast where it does function.
elsif "foo" != (Inflector.transliterate("föö") rescue nil)
  undef_method :transliterate
  def transliterate(string)
    string.mb_chars.normalize(:kd). # Decompose accented characters
      gsub(/[^\x00-\x7F]+/, '')     # Remove anything non-ASCII entirely (e.g. diacritics).
  end
end
but in practice rarely uses it:
# encoding: utf-8
require "iconv"
p Iconv.iconv('ascii//translit//ignore', 'utf-8', "föö").to_s
# Ubuntu 9.10, REE: "f??" (not used)
# Ubuntu 9.10, 1.9.1: "[\"foo\"]" (transliterated per the conditional above, but not used)
# Snow Leopard, Jruby 1.4: "f\303\266\303\266" (not used)
# Snow Leopard, REE: "f\"o\"o" (not used)
# Snow Leopard, 1.8.6: "f\"o\"o" (not used)
# Snow Leopard, 1.9.1: "[\"f\\\"o\\\"o\"]" (not used)
# Snow Leopard, 1.9.2: "[\"f\\\"o\\\"o\"]" (not used)
So many, many people are already using the non-Iconv code right now.
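The reasoning can be checked mechanically by replaying the conditional against the outputs listed above. This sketch does not call Iconv, which is no longer bundled with current Rubies. It assumes that the outputs of the test script are a fair proxy for the probe in the conditional, and that REE, 1.8.6 and JRuby 1.4 all take the pre-1.9 branch. The constant and method names are hypothetical.

```ruby
# encoding: utf-8

# [platform, runs the 1.9 branch? (inferred from the label), observed output]
OBSERVED = [
  ["Ubuntu 9.10, REE",        false, "f??"],
  ["Ubuntu 9.10, 1.9.1",      true,  "[\"foo\"]"],
  ["Snow Leopard, Jruby 1.4", false, "f\303\266\303\266"],
  ["Snow Leopard, REE",       false, "f\"o\"o"],
  ["Snow Leopard, 1.8.6",     false, "f\"o\"o"],
  ["Snow Leopard, 1.9.1",     true,  "[\"f\\\"o\\\"o\"]"],
  ["Snow Leopard, 1.9.2",     true,  "[\"f\\\"o\\\"o\"]"]
].freeze

# Mirrors the conditional: 1.9 always redefines the method; older Rubies
# keep the Iconv version only when the probe returns exactly "foo".
def keeps_iconv?(ruby19, probe_output)
  return false if ruby19
  probe_output == "foo"
end

kept = OBSERVED.select { |_platform, ruby19, output| keeps_iconv?(ruby19, output) }
puts "Platforms still using Iconv: #{kept.length} of #{OBSERVED.length}"
```

Under those assumptions none of the seven listed platforms ends up with the Iconv-based method.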
Even if the code were modified to ensure that Iconv is used, it still would not work reliably on Ubuntu because many of the characters in the approximation table are not handled.
So while Iconv could in theory offer excellent performance, its behavior is so variable and unreliable that I think it should not be used.
Comments and changes to this ticket
Norman Clarke April 12th, 2010 @ 08:35 PM
- Tag changed from transliterate parameterize, activesupport, multibyte to transliterate parameterize, activesupport, multibyte, patch
Patch attached.
Jeremy Kemper April 13th, 2010 @ 07:21 AM
- State changed from new to open
- Milestone cleared.
Repository April 13th, 2010 @ 07:21 AM
- State changed from open to resolved
(from [dceef0828a23e8298dd9a9aab1a33c49e84f17d6]) Improve reliability of Inflector.transliterate. [#4374 state:resolved]
Signed-off-by: Jeremy Kemper <jeremy@bitsweat.net>
http://github.com/rails/rails/commit/dceef0828a23e8298dd9a9aab1a33c...
Jeremy Kemper October 15th, 2010 @ 11:01 PM
- Milestone set to 3.0.2
- Importance changed from to Low