#4374 Inflector#transliterate fails on many European characters - Ruby on Rails


This project is archived and is in readonly mode.

#4374 ✓resolved

Inflector#transliterate fails on many European characters

Reported by Norman Clarke | April 12th, 2010 @ 08:32 PM | in 3.0.2

Inflector.transliterate fails with many common European characters:

Inflector.transliterate("Ærøskøbing") # "rskbing"

In this case, the example is ironically a city in Denmark, DHH's home country.

The full list of failing characters is as follows:

Æ Ð Ø Þ ß æ ð ø þ Đ đ Ħ ħ ı Ĳ ĳ ĸ Ŀ ŀ Ł ł ŉ Ŋ ŋ Œ œ Ŧ ŧ

The failure occurs because ActiveSupport::Inflector::Transliterate relies either on Iconv, which gives variable and often insufficient results, or on Unicode (KD) decomposition, and the characters listed above (some surprisingly) do not decompose.

The patch I will attach after filing this ticket resolves this issue by removing Iconv and relying on Unicode decomposition plus a check against a hash of approximations.
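
As a rough illustration of that approach (not the attached patch itself), the sketch below decomposes with Ruby's built-in String#unicode_normalize, which only exists from Ruby 2.2 onward and so postdates the Ruby versions discussed in this ticket. It then strips the combining marks and falls back to a small lookup table for whatever is left. The method name sketch_transliterate, the constant APPROXIMATIONS, and the few mappings in it are assumptions made for the example; a real table would need an entry for each character listed above.

```ruby
# Hypothetical approximation table: only a few of the failing characters,
# with mappings picked for illustration.
APPROXIMATIONS = {
  "Æ" => "AE", "æ" => "ae",
  "Ø" => "O",  "ø" => "o",
  "ß" => "ss",
  "Ł" => "L",  "ł" => "l",
  "Œ" => "OE", "œ" => "oe"
}.freeze

# Sketch only: decompose, drop combining marks, then approximate
# (or replace) any character that is still outside ASCII.
def sketch_transliterate(string, replacement = "?")
  decomposed = string.unicode_normalize(:nfkd)    # "ö" becomes "o" + U+0308
  without_marks = decomposed.gsub(/\p{Mn}/, "")   # remove the combining marks
  without_marks.gsub(/[^\x00-\x7F]/) { |char| APPROXIMATIONS[char] || replacement }
end

puts sketch_transliterate("Ærøskøbing")     # AEroskobing
puts sketch_transliterate("Malmö, Sweden")  # Malmo, Sweden
```

Characters that neither decompose nor appear in the table come out as the replacement string instead of silently disappearing, which makes gaps in the table easy to spot.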

The patch also invokes #tidy_bytes before transliterating. This can rescue ISO-8859-1 and CP-1252 letter characters that the current implementation simply deletes, and allows for the removal of some code from Inflector#parameterize.
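
To show the kind of input this is meant to rescue, here is a simplified sketch. It is not the #tidy_bytes implementation: it reinterprets the whole string as Windows-1252 whenever the bytes are not valid UTF-8, which is much coarser than repairing individual bytes and would mangle a string that mixes valid UTF-8 with stray bytes. The method name sketch_tidy_bytes is hypothetical.

```ruby
# Sketch only: treat a string that is not valid UTF-8 as Windows-1252
# (a superset of the printable ISO-8859-1 range) and re-encode it as UTF-8.
def sketch_tidy_bytes(string)
  utf8 = string.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  # Bytes that Windows-1252 leaves undefined are replaced rather than raising.
  string.dup.force_encoding("Windows-1252")
        .encode("UTF-8", invalid: :replace, undef: :replace)
end

# "\xF6" is "ö" in ISO-8859-1 and CP-1252, but an invalid byte in UTF-8.
puts sketch_tidy_bytes("Malm\xF6")  # Malmö
```

Without a step like this, the lone \xF6 byte is simply dropped by the non-ASCII filter, so the input above would come out as "Malm" rather than "Malmo".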

I have also added several tests that put #transliterate through its paces with all characters that have reasonable approximations in order to avoid future regressions.

A few observations:

Performance

The new code is slower than the current implementation, by roughly 1.5x on 1.9.1 and 1.7x on REE in this benchmark:

benchmark do
  10_000.times do
    ActiveSupport::Inflector.transliterate("Malmö, Sweden")
  end
end

# 1.9.1 before patch: 472.4ms
# 1.9.1 after patch: 704.8ms
# ree before patch: 932.6ms
# ree after patch: 1597.9ms

I think the improvement in reliability and correctness makes up for this difference in performance - it's still fast, just not AS fast.

Why I removed Iconv

The current code tries to use Iconv when available:

# Replaces accented characters with their ascii equivalents.
def transliterate(string)
  Iconv.iconv('ascii//ignore//translit', 'utf-8', string).to_s
end

if RUBY_VERSION >= '1.9'
  undef_method :transliterate
  def transliterate(string)
    proxy = ActiveSupport::Multibyte.proxy_class.new(string)
    proxy.normalize(:kd).gsub(/[^\x00-\x7F]+/, '')
  end

# The iconv transliteration code doesn't function correctly
# on some platforms, but it's very fast where it does function.
elsif "foo" != (Inflector.transliterate("föö") rescue nil)
  undef_method :transliterate
  def transliterate(string)
    string.mb_chars.normalize(:kd). # Decompose accented characters
      gsub(/[^\x00-\x7F]+/, '')     # Remove anything non-ASCII entirely (e.g. diacritics).
  end
end

but in practice rarely uses it:

# encoding: utf-8
require "iconv"
p Iconv.iconv('ascii//translit//ignore', 'utf-8', "föö").to_s

# Ubuntu 9.10, REE: "f??" (not used)
# Ubuntu 9.10, 1.9.1: "[\"foo\"]" (transliterated per the conditional above, but not used)
# Snow Leopard, Jruby 1.4: "f\303\266\303\266" (not used)
# Snow Leopard, REE: "f\"o\"o" (not used)
# Snow Leopard, 1.8.6: "f\"o\"o" (not used)
# Snow Leopard, 1.9.1: "[\"f\\\"o\\\"o\"]" (not used)
# Snow Leopard, 1.9.2: "[\"f\\\"o\\\"o\"]" (not used)

So many, many people are already using the non-Iconv code right now.

Even if the code were modified to ensure that Iconv is used, it still would not work reliably on Ubuntu because many of the characters in the approximation table are not handled.

So while Iconv could in theory offer excellent performance, its behavior is so variable and unreliable that I think it should not be used.

Comments and changes to this ticket

Norman Clarke April 12th, 2010 @ 08:35 PM
- Tag changed from “transliterate parameterize, activesupport, multibyte” to “transliterate parameterize, activesupport, multibyte, patch”
Patch attached.
- improve-transliterate.diff 7.7 KB
Jeremy Kemper April 13th, 2010 @ 07:21 AM
- State changed from “new” to “open”
- Milestone cleared.
Repository April 13th, 2010 @ 07:21 AM
- State changed from “open” to “resolved”
(from [dceef0828a23e8298dd9a9aab1a33c49e84f17d6]) Improve reliability of Inflector.transliterate. [#4374 state:resolved]

Signed-off-by: Jeremy Kemper jeremy@bitsweat.net
http://github.com/rails/rails/commit/dceef0828a23e8298dd9a9aab1a33c...
Jeremy Kemper October 15th, 2010 @ 11:01 PM
- Milestone set to “3.0.2”
- Importance changed from “” to “Low”
[bulk edit]
