#2476 ✓stale
Hector E. Gomez Morales

ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1

Reported by Hector E. Gomez Morales | April 10th, 2009 @ 04:36 PM | in 2.3.10

From #2188:

Hello! We've got the same problem! Only the error occurs when we fetch data from the database. We're using Mysql and Charset is UTF-8, but the Active Record returns ASCII-8BIT. Is it possible to do similar changes to the activerecord as you did to the actionpack? Seems as we're not the only ones with that problem

Problem

Fetching data from any database (MySQL, PostgreSQL, SQLite 2 & 3), each configured to use UTF-8 as its character set, returns the data tagged as ASCII-8BIT in ruby 1.9.1 and rails 2.3.2.1.

This has been reported in #2188 and in the rails talk group (1).

Possible Solution

Again, as in #2188, rails is not the culprit here; its only problem is its inherent trust that all the data it gets is UTF-8. Problems arise when the data has another encoding.

The real problem is that all the current adapters use native C extensions as glue, and these call the rb_str_new function, which in ruby 1.9.1 creates a String with ASCII-8BIT encoding (2). That is why all the data is returned with this encoding.
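The effect can be reproduced in plain Ruby without a database. This is only a minimal sketch, not adapter code: `simulate_driver_string` is a hypothetical helper that stands in for a C extension calling `rb_str_new`, by tagging valid UTF-8 bytes as ASCII-8BIT.

```ruby
# Hypothetical helper: mimics a driver that builds result strings with
# rb_str_new, i.e. correct UTF-8 bytes tagged as ASCII-8BIT (binary).
def simulate_driver_string(text)
  text.dup.force_encoding(Encoding::ASCII_8BIT)
end

raw = simulate_driver_string("\u043C\u0438\u0440")  # the Cyrillic word used later in this ticket
puts raw.encoding == Encoding::ASCII_8BIT  # true
puts raw.length                            # 6: every byte counts as one character

# Relabelling the same bytes as UTF-8 restores character semantics.
fixed = raw.dup.force_encoding(Encoding::UTF_8)
puts fixed.length            # 3
puts fixed.valid_encoding?   # true
```

Note that `force_encoding` only changes the label on the string; the bytes stay exactly as the driver delivered them.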

Because the initial problems were detected in MySQL, I made the needed modifications and created a fork on github (3) of mysql-ruby. This fork is only 1.9.1 compatible; it returns ASCII-8BIT for binary fields and UTF-8 for all other fields.

With this modified mysql-ruby gem, all activerecord tests for mysql pass except test_validate_case_sensitive_uniqueness. This test will fail for all adapters that are encoding aware, because the implementation of this validation converts the value that needs to be unique to lowercase and then executes a query using LOWER(#{field}) on the unique field. In ruby 1.8.1 the downcase of non-ASCII strings is done with Multibyte; given that in ruby 1.9.1 downcase still does nothing for non-ASCII characters, I use Multibyte#downcase to do the conversion.

I attach a patch so that validates_uniqueness uses Multibyte#downcase on the string if we are using ruby 1.9.1. With this patch all tests pass for test_mysql in activerecord.
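The mismatch behind the failing test can be illustrated in plain Ruby. This is a sketch rather than the attached patch, and it assumes Ruby 2.4 or later, where `String#downcase` is Unicode aware by default and the `:ascii` option restricts it to A-Z, which is how downcase behaved on 1.9.1. `ascii_only_downcase` is a hypothetical name.

```ruby
# Assumption: Ruby >= 2.4. The :ascii option restricts case mapping to A-Z,
# which matches how String#downcase behaved on Ruby 1.9.1.
def ascii_only_downcase(value)
  value.downcase(:ascii)
end

value = "\u00C9COLE"  # "ECOLE" with an accented capital E

# ASCII-only downcase leaves the accented capital untouched, so the result
# would not match what a Unicode-aware LOWER() in the database returns.
puts ascii_only_downcase(value) == "\u00C9cole"  # true

# A Unicode-aware downcase (the role Multibyte plays in the patch) lowers it too.
puts value.downcase == "\u00E9cole"              # true
```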

TODO

  • Make all the other adapters 1.9.1 compatible AND encoding aware.
  • Remove hardcoded encoding use of UTF-8 and use the character set used by the DB.

Links

  1. Rails Group Post
  2. ASCII-8BIT default in rb_str_new
  3. mysql-ruby fork

Comments and changes to this ticket

  • Hector E. Gomez Morales

    Hector E. Gomez Morales April 10th, 2009 @ 04:38 PM

    • Tag changed from 2.3.2, activerecord, bug, patch to 2.3.2, activerecord, bug, patch, ruby19
  • Manfred Stienstra

    Manfred Stienstra April 11th, 2009 @ 02:13 PM

    Instead of

    
    nvalue = ActiveSupport::Multibyte::Chars.new(value)
    condition_params = [nvalue.downcase]
    

    You can just do

    
    condition_params  = [value.mb_chars.downcase]
    
  • Hector E. Gomez Morales

    Hector E. Gomez Morales April 11th, 2009 @ 04:51 PM

    Well, the problem is that $KCODE is not set in ruby19, so the call to mb_chars doesn't proxy the string. That's why I did the explicit wrapping.

    # activesupport/lib/active_support/core_ext/string/multibyte.rb
    def mb_chars
      if ActiveSupport::Multibyte.proxy_class.wants?(self)
        ActiveSupport::Multibyte.proxy_class.new(self)
      else
        self
      end
    end

    # activesupport/lib/active_support/multibyte/chars.rb
    def self.wants?(string)
      $KCODE == 'UTF8' && consumes?(string)
    end

    # railties/lib/initializer.rb
    def initialize_encoding
      $KCODE='u' if RUBY_VERSION < '1.9'
    end
  • Hector E. Gomez Morales

    Hector E. Gomez Morales April 11th, 2009 @ 04:54 PM

    Sorry, posting the code again:

    
    
    # activesupport/lib/active_support/core_ext/string/multibyte.rb
    def mb_chars
      if ActiveSupport::Multibyte.proxy_class.wants?(self)
        ActiveSupport::Multibyte.proxy_class.new(self)
      else
        self
      end
    end
    
    # activesupport/lib/active_support/multibyte/chars.rb
    def self.wants?(string)
      $KCODE == 'UTF8' && consumes?(string)
    end
    
    # railties/lib/initializer.rb 
    def initialize_encoding
      $KCODE='u' if RUBY_VERSION < '1.9'
    end
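The guard quoted above can be modelled with a toy class to show why `mb_chars` hands back a plain String on 1.9. This is a sketch, not ActiveSupport code: `ToyChars` and `toy_mb_chars` are hypothetical, the `consumes?` check is left out, and the `$KCODE` value is passed in as an argument because the global has no effect on Ruby 1.9 and later.

```ruby
# Toy model of the proxy guard; not ActiveSupport code.
class ToyChars
  attr_reader :wrapped

  def initialize(string)
    @wrapped = string
  end

  # Mirrors the $KCODE == 'UTF8' half of wants?; kcode is passed explicitly.
  def self.wants?(_string, kcode)
    kcode == 'UTF8'
  end
end

def toy_mb_chars(string, kcode)
  ToyChars.wants?(string, kcode) ? ToyChars.new(string) : string
end

puts toy_mb_chars("abc", 'UTF8').class  # ToyChars: Ruby 1.8 after $KCODE = 'u'
puts toy_mb_chars("abc", nil).class     # String: on 1.9 $KCODE is never set
puts ToyChars.new("abc").class          # ToyChars: explicit wrapping skips the guard
```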
    
    
  • Manfred Stienstra

    Manfred Stienstra April 11th, 2009 @ 09:52 PM

    Ah, right. I'll try to think of a way to access the proxy in Ruby 1.9.

  • smixok (at gmail)

    smixok (at gmail) May 4th, 2009 @ 11:06 PM

    is there any patch for activerecord or sqlite and postgre-pr gem?

  • Dimitrij Denissenko

    Dimitrij Denissenko May 9th, 2009 @ 04:30 PM

    Sorry Hector, but your patch to Ruby-MySQL doesn't fully solve the problem.

    if (fields[i].type == MYSQL_TYPE_BLOB)

    The comparison with MYSQL_TYPE_BLOB also includes TEXT (TINYTEXT, MEDIUMTEXT, LONGTEXT) fields. All content stored in these also comes back ASCII-8BIT encoded.
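One way a driver could tell TEXT and BLOB columns apart is sketched here in Ruby rather than C. It relies on the note in the MySQL C API documentation that binary string columns report a `charsetnr` of 63. The `Field` struct, the `:blob` symbol and `encoding_for` are hypothetical and are not taken from the linked fork; returning UTF-8 for every non-binary column is a simplification.

```ruby
# Hypothetical stand-in for the per-column metadata the C API exposes.
Field = Struct.new(:name, :type, :charsetnr)

BINARY_CHARSETNR = 63  # per the MySQL C API docs, marks binary data

def encoding_for(field)
  # TEXT and BLOB columns share MYSQL_TYPE_BLOB, so the type alone is not enough.
  field.charsetnr == BINARY_CHARSETNR ? Encoding::ASCII_8BIT : Encoding::UTF_8
end

blob = Field.new("avatar", :blob, 63)
text = Field.new("body",   :blob, 33)  # 33 is used only as an example of a non-binary charset id

puts encoding_for(blob) == Encoding::ASCII_8BIT  # true
puts encoding_for(text) == Encoding::UTF_8       # true
```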

    Link: http://github.com/hectoregm/mysq...

  • qoobaa

    qoobaa May 13th, 2009 @ 09:05 AM

    I've fixed the sqlite3-ruby gem (version 1.2.5): http://github.com/qoobaa/sqlite3-ruby/tree/master. The ASCII-8BIT encoding problem also exists in Rack; I've created a patch to fix it.

  • James Healy

    James Healy June 16th, 2009 @ 12:59 PM

    I agree with the goal of a database driver that correctly sets the encoding of strings, however hard coding all strings to UTF-8 seems like we'd be shooting ourselves in the foot.

    The problem with encoding issues is that often the wrong solution does the right thing 95% of the time. Sure most of us writing web apps these days operate in UTF-8, but when some poor sod rocks up with a UTF-16 encoded database we'd break her data.

    I've never used the MySQL C API, but surely there's a way to detect the encoding of the current DB/table/column?

  • Brendan Schwartz

    Brendan Schwartz June 16th, 2009 @ 07:51 PM

    I second James Healy's approach. Blindly setting the encoding of all strings from the database to UTF-8 is short-sighted.

  • Ken Collins

    Ken Collins June 16th, 2009 @ 07:57 PM

    Agreed. I do the same thing in the SQL Server adapter.

  • Manfred Stienstra

    Manfred Stienstra June 16th, 2009 @ 09:30 PM

    James & Brendan, patches are very welcome!

  • James Healy

    James Healy June 17th, 2009 @ 02:15 AM

    I decided to investigate further and picked MySQL as my guinea pig.

    I can see two possible approaches.

    1. We patch the MySQL/Ruby driver to return strings marked with an appropriate encoding. To do so it would need to track the value of the character_set_results MySQL variable which indicates the character set MySQL will return results in. I'm not sure what 'tracking' that variable would involve. Since it can be changed at any time, the driver would need to regularly check (and cache?) the value.

    2. We patch the ActiveRecord MySQL Adapter. This leaves the driver encoding unaware - it more or less just passes byte arrays between AR and the MySQL server in blissful ignorance of the encoding. If the AR MySQL Adapter notices it has encoding: set in its config, it can take the ASCII-8BIT/BINARY strings the driver hands it and force the encoding to something appropriate.

    Either approach looks achievable without too much work. I'm keen to attempt one of them, but I think I'll mull over the options for a while first.

    Any thoughts? Ken - it sounds like you modified the SQL Server Adapter and not the driver?

  • James Healy

    James Healy July 4th, 2009 @ 04:45 PM

    It took me a little while to get it, but I've got a proposed patch for the MysqlAdapter on github @ http://github.com/yob/rails/commit/986b8c99331d68087eaa0a703f4121c5....

    I took approach (2) from my earlier comment. The Mysql driver remains encoding unaware, and all results are stored in the AR model attributes hash marked as "BINARY" encoding.

    Traditionally non string attributes are type cast on demand (converted to ints, dates, etc) and strings are left untouched. This patch adds a type casting process for string attributes that "fixes" the encoding to match what the user has specified in database.yml.

    The commit message has a few extra details.

  • runa
  • James Healy

    James Healy July 19th, 2009 @ 02:21 AM

    What's the best way for me to get feedback on my patch?

    I'm still in two minds about whether the encoding of strings from the database should be fixed in the DB driver or ActiveRecord, so it would be nice to get some discussion going.

  • Michael H Buselli

    Michael H Buselli August 4th, 2009 @ 04:21 PM

    I think ActiveRecord is the right place to handle encoding, though some DB drivers may also be able to help if there is a non-standard way the particular database handles encoding. For databases that do nothing with encoding and just push raw bits, ActiveRecord should provide a configuration option for the encoding and return Strings properly encoded, perhaps even transformed to another encoding if the data is encoded differently than the user wants.
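The difference between tagging a string and transforming it can be sketched in a few lines. This is an illustration only: `from_database` is a hypothetical helper, and the stored and wanted encodings are passed in by hand where a real implementation would read them from configuration.

```ruby
# Hypothetical helper: first relabel the raw bytes with the encoding they were
# stored in (bytes untouched), then transcode to the encoding the user wants
# (bytes rewritten).
def from_database(raw, stored_as, wanted)
  raw.dup.force_encoding(stored_as).encode(wanted)
end

# Latin-1 bytes as a binary-tagged string: 0xE9 is an accented "e" in ISO-8859-1.
raw = "caf\xE9".b

result = from_database(raw, Encoding::ISO_8859_1, Encoding::UTF_8)
puts result == "caf\u00E9"  # true
puts result.bytesize        # 5: the accented letter takes two bytes in UTF-8
```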

    That's my 2¢, anyway.

  • Michael H Buselli

    Michael H Buselli August 5th, 2009 @ 10:13 PM

    I wrote a gem to enhance ActiveRecord::Base as described above: http://github.com/cosine/active_record_encoding/tree/master

    It assumes the database just pushes bits and doesn't understand its encoding, which is true in my case.

    I have not thoroughly tested it, yet. Proceed with caution if you use it. Even so, I would love some feedback.

  • Manfred Stienstra

    Manfred Stienstra August 6th, 2009 @ 07:42 AM

    "It assumes the database just pushes bits and doesn't understand its encoding"

    Unfortunately that's not the case. For instance, when you try to store certain UTF-8 characters in a Latin-1 database you will lose information. The database storage engine and the database protocol are generally encoding aware so we will have to deal with that.
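The kind of loss described here can be shown with Ruby's own transcoding. This is a sketch using Ruby's ISO-8859-1 as a stand-in; how a particular database reacts (error, truncation or replacement) is not covered, and `to_latin1` is a hypothetical helper.

```ruby
# Hypothetical helper: returns nil when a character has no Latin-1 equivalent.
def to_latin1(text)
  text.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError
  nil
end

euro = "\u20AC"  # EURO SIGN, not representable in ISO-8859-1

puts to_latin1("caf\u00E9").nil?  # false: every character fits
puts to_latin1(euro).nil?         # true: conversion is refused

# With replacement the conversion goes through, but the character is gone.
puts euro.encode(Encoding::ISO_8859_1, undef: :replace) == "?"  # true
```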

    "Even so, I would love some feedback."

    If you're serious in pursuing this plugin I would recommend writing a lot of tests.

  • Yugui (Yuki Sonoda)

    Yugui (Yuki Sonoda) August 7th, 2009 @ 07:20 AM

    "The database storage engine and the database protocol are generally encoding aware so we will have to deal with that."

    Yes. So approach (1) is ideal.

    I sent a patch for ruby-pg. The next release of ruby-pg will be encoding-aware.
    Mysql/Ruby should be fixed the same way as ruby-pg. I think my article http://yugui.jp/articles/838 can help with fixing database drivers.

  • pyromaniac

    pyromaniac September 6th, 2009 @ 07:17 PM

    Hi. I tried to fix this issue and there are two points:
    1. First, the db driver gives us UTF-8 strings with a forced ASCII-8BIT encoding.
    2. Some UTF-8 views are forced to ASCII-8BIT during compilation.
    So we have more than one problem.
    Let's look:

    >> p = "привет"
    => "привет"
    >> t = "мир"
    => "мир"
    >> t.force_encoding Encoding::ASCII_8BIT
    => "\xD0\xBC\xD0\xB8\xD1\x80"
    >> p << t
    Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

        from (irb):4
        from /home/pyromaniac/.rvm/ruby-1.9.1-p243/bin/irb:12:in `<main>'

    So, when concatenating such strings, we get the exception "incompatible character encodings: UTF-8 and ASCII-8BIT".

    This hack works well for projects that are UTF-8 throughout:

    class String
      alias_method(:orig_concat, :concat)
      def concat(value)
        orig_concat value.force_encoding(Encoding::UTF_8)
      end
    end

    But the problem is still not solved. We need to patch the db drivers (by the way, http://github.com/jzajpt/mysql-ruby/tree/master is a patched mysql), and we need to solve the problem with AV.
    What do you think?
  • Aleksander Pohl

    Aleksander Pohl October 22nd, 2009 @ 12:00 AM

    I tested ruby-mysql posted by pyromaniac and it seems to work fine.

  • Loren Segal

    Loren Segal November 7th, 2009 @ 02:14 AM

    You can use this little pure Ruby hack to get things working. No modifications to the mysql gem are needed, so it's pretty easy to drop into an existing app:

    http://gnuu.org/2009/11/06/ruby19-rails-mysql-utf8/

  • Michael Hasenstein

    Michael Hasenstein April 4th, 2010 @ 10:05 AM

    And now??? I have this exact issue with Rails 3 Beta 2, Ruby 1.9.2-head.

  • Ivan Ukhov

    Ivan Ukhov April 7th, 2010 @ 12:58 AM

    For those who don't want to overwrite String#concat and who use HAML for views, here is my solution (http://gist.github.com/358275):

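    require 'kconv'  # provides String#toutf8

    module Haml
      class Buffer
        # String subclass that converts everything appended to it to UTF-8.
        class UTF8String < String
          def << text; super text.toutf8; end
        end

        alias original_initialize initialize

        # Swap the buffer created by the original initializer for a UTF8String.
        def initialize *args
          original_initialize *args
          @buffer = UTF8String.new
        end
      end
    end
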
  • Jeremy Kemper

    Jeremy Kemper April 24th, 2010 @ 09:23 PM

    • Milestone changed from 2.x to 2.3.6
    • State changed from “new” to “open”
    • Assigned user set to “Jeremy Kemper”

    Any luck getting mysql encoding into the released gem?

  • Cezary Baginski

    Cezary Baginski May 2nd, 2010 @ 12:09 AM

    As for the mysql gem, I might take a shot at it, based on Yugui's article.

  • Cezary Baginski

    Cezary Baginski May 2nd, 2010 @ 10:29 AM

    I have the impression that there are too many "potential" mysql-ruby sources and most of them are unmaintained.

    I'm still investigating, so sorry if I am missing something important.

    Here is an interesting summary as to why:

    http://github.com/luislavena/mysql-gem/issues/labels/documentation#...

    I propose:

    • changing the name from 'mysql', 'ruby-mysql', 'mysql-win' to something that stands out from the rest, just to avoid confusion
    • updating the Gemfile and source files to require this recommended new libmysql wrapper.

    @Hector - Thanks for the great work with the mysql-ruby fork!

    • Should I use it as a base for testing and patches?
    • Can we rename it and make it the official gem for Rails?
  • Rizwan Reza

    Rizwan Reza May 16th, 2010 @ 02:41 AM

    • Tag changed from 2.3.2, activerecord, bug, patch, ruby19 to 2.3.2, activerecord, bug, bugmash, patch, ruby19
  • Jeremy Kemper

    Jeremy Kemper May 23rd, 2010 @ 05:54 PM

    • Milestone changed from 2.3.6 to 2.3.7
  • Jeremy Kemper

    Jeremy Kemper May 24th, 2010 @ 09:40 AM

    • Milestone changed from 2.3.7 to 2.3.8
  • Jeremy Kemper

    Jeremy Kemper May 25th, 2010 @ 11:45 PM

    • Milestone changed from 2.3.8 to 2.3.9
  • Jeremy Kemper

    Jeremy Kemper August 30th, 2010 @ 02:28 AM

    • Milestone changed from 2.3.9 to 2.3.10
  • Santiago Pastorino

    Santiago Pastorino February 2nd, 2011 @ 04:32 PM

    • Tag changed from 2.3.2, activerecord, bug, bugmash, patch, ruby19 to 232, activerecord, bug, bugmash, patch, ruby19

    This issue has been automatically marked as stale because it has not been commented on for at least three months.

    The resources of the Rails core team are limited, and so we are asking for your help. If you can still reproduce this error on the 3-0-stable branch or on master, please reply with all of the information you have about it and add "[state:open]" to your comment. This will reopen the ticket for review. Likewise, if you feel that this is a very important feature for Rails to include, please reply with your explanation so we can consider it.

    Thank you for all your contributions, and we hope you will understand this step to focus our efforts where they are most helpful.

  • Santiago Pastorino

    Santiago Pastorino February 2nd, 2011 @ 04:32 PM

    • State changed from “open” to “stale”

Tickets have moved to Github

The new ticket tracker is available at https://github.com/rails/rails/issues