#4374 ✓resolved
Norman Clarke

Inflector#transliterate fails on many European characters

Reported by Norman Clarke | April 12th, 2010 @ 08:32 PM | in 3.0.2

Inflector.transliterate fails with many common European characters:

Inflector.transliterate("Ærøskøbing") # "rskbing"

In this case, the example is ironically a city in Denmark, DHH's home country.

The full list of failing characters is as follows:

Æ Ð Ø Þ ß æ ð ø þ Đ đ Ħ ħ ı Ĳ ĳ ĸ Ŀ ŀ Ł ł ŉ Ŋ ŋ Œ œ Ŧ ŧ

The reason for the failure is that ActiveSupport::Inflector::Transliterate relies either on Iconv, which gives variable and often insufficient results, or on UTF-8 decomposition, and the characters listed above (some surprisingly) do not decompose.
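To illustrate the decomposition problem, here is a minimal sketch using Ruby's built-in String#unicode_normalize (Ruby 2.2 and later) in place of ActiveSupport's normalize(:kd). The helper name decomposes? is hypothetical and is not part of Rails or of the patch.

```ruby
# Hypothetical helper, for illustration only: true when compatibility
# decomposition (NFKD) changes the character.
def decomposes?(char)
  char.unicode_normalize(:nfkd) != char
end

o_umlaut = "\u00F6" # ö: decomposes into "o" plus a combining diaeresis
o_stroke = "\u00F8" # ø: has no decomposition

decomposes?(o_umlaut) # => true
decomposes?(o_stroke) # => false

# Stripping non-ASCII after decomposition keeps the "o" from ö,
# but deletes ø entirely, which is the failure described above.
o_umlaut.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]+/, "") # => "o"
o_stroke.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]+/, "") # => ""
```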

The patch I will attach after filing this ticket resolves this issue by removing Iconv, and relying on UTF-8 decomposition plus a check against a hash of approximations.
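The approach can be sketched roughly as follows. This is not the attached patch: the method name sketch_transliterate, the APPROXIMATIONS constant, its (deliberately partial) contents, and the "?" fallback are all assumptions made for illustration, and String#unicode_normalize stands in for ActiveSupport's own normalization.

```ruby
# Hypothetical, partial table of approximations for characters
# that do not decompose. The real patch covers many more.
APPROXIMATIONS = {
  "Æ" => "AE", "æ" => "ae", "Ø" => "O",  "ø" => "o",
  "Ð" => "D",  "ð" => "d",  "Þ" => "Th", "þ" => "th",
  "ß" => "ss", "Ł" => "L",  "ł" => "l",  "Œ" => "OE", "œ" => "oe"
}.freeze

# Illustrative only: check the table first, then fall back to
# decomposition, and use a replacement if nothing ASCII remains.
def sketch_transliterate(string, replacement = "?")
  string.each_char.map do |char|
    next char if char.ascii_only?

    approximation = APPROXIMATIONS[char]
    next approximation if approximation

    # Decompose, then drop combining marks and other non-ASCII leftovers.
    base = char.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "")
    base.empty? ? replacement : base
  end.join
end

sketch_transliterate("Ærøskøbing")    # => "AEroskobing"
sketch_transliterate("Malmö, Sweden") # => "Malmo, Sweden"
```

Checking the table before decomposing means an explicit approximation always wins, and a character with neither a table entry nor an ASCII base is replaced visibly rather than silently deleted.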

The patch also invokes #tidy_bytes before transliterating. This can rescue ISO-8859-1 and CP-1252 letter characters that the current implementation simply deletes, and allows for the removal of some code from Inflector#parameterize.
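As a rough picture of what tidying bytes means here, the sketch below reinterprets bytes that are invalid as UTF-8 as CP-1252 and converts them. It is a simplified stand-in written with plain Ruby's String#scrub and String#encode, not ActiveSupport's actual #tidy_bytes implementation, and the name sketch_tidy_bytes is hypothetical.

```ruby
# Hypothetical stand-in for #tidy_bytes, for illustration only.
# Bytes that are not valid UTF-8 are treated as CP-1252 and converted;
# valid UTF-8 passes through unchanged.
def sketch_tidy_bytes(string)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  utf8.scrub do |bad_bytes|
    bad_bytes.encode(Encoding::UTF_8, Encoding::Windows_1252,
                     invalid: :replace, undef: :replace)
  end
end

latin1 = "Malm\xF6".b      # "Malmö" as ISO-8859-1 / CP-1252 bytes
sketch_tidy_bytes(latin1)  # => "Malmö", now valid UTF-8
```

Without a step like this, the stray 0xF6 byte is simply removed by the non-ASCII filter, so the letter is lost before transliteration ever sees it.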

I have also added several tests that put #transliterate through its paces with all characters that have reasonable approximations in order to avoid future regressions.

A few observations:

Performance

The new code is slower than the current implementation, by roughly 1.5x on Ruby 1.9.1 and 1.7x on REE in this benchmark:

benchmark do
  10_000.times do
    ActiveSupport::Inflector.transliterate("Malmö, Sweden")
  end
end

# 1.9.1 before patch: 472.4ms
# 1.9.1 after patch: 704.8ms
# ree before patch: 932.6ms
# ree after patch: 1597.9ms

I think the improvement in reliability and correctness makes up for this difference in performance - it's still fast, just not AS fast.

Why I removed Iconv

The current code tries to use Iconv when available:

# Replaces accented characters with their ascii equivalents.
def transliterate(string)
  Iconv.iconv('ascii//ignore//translit', 'utf-8', string).to_s
end

if RUBY_VERSION >= '1.9'
  undef_method :transliterate
  def transliterate(string)
    proxy = ActiveSupport::Multibyte.proxy_class.new(string)
    proxy.normalize(:kd).gsub(/[^\x00-\x7F]+/, '')
  end

# The iconv transliteration code doesn't function correctly
# on some platforms, but it's very fast where it does function.
elsif "foo" != (Inflector.transliterate("föö") rescue nil)
  undef_method :transliterate
  def transliterate(string)
    string.mb_chars.normalize(:kd). # Decompose accented characters
      gsub(/[^\x00-\x7F]+/, '')     # Remove anything non-ASCII entirely (e.g. diacritics).
  end
end

but in practice rarely uses it:

# encoding: utf-8
require "iconv"
p Iconv.iconv('ascii//translit//ignore', 'utf-8', "föö").to_s

# Ubuntu 9.10, REE: "f??" (not used)
# Ubuntu 9.10, 1.9.1: "[\"foo\"]" (transliterated per the conditional above, but not used)
# Snow Leopard, Jruby 1.4: "f\303\266\303\266" (not used)
# Snow Leopard, REE: "f\"o\"o" (not used)
# Snow Leopard, 1.8.6: "f\"o\"o" (not used)
# Snow Leopard, 1.9.1: "[\"f\\\"o\\\"o\"]" (not used)
# Snow Leopard, 1.9.2: "[\"f\\\"o\\\"o\"]" (not used)

So many, many people are already using the non-Iconv code right now.

Even if the code were modified to ensure that Iconv is used, it still would not work reliably on Ubuntu because many of the characters in the approximation table are not handled.

So while Iconv could in theory offer excellent performance, its behavior is so variable and unreliable that I think it should not be used.


