This project is archived and is in readonly mode.
ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
Reported by Hector E. Gomez Morales | April 10th, 2009 @ 04:36 PM | in 2.3.10
From #2188:
Hello! We've got the same problem! Only the error occurs when we fetch data from the database. We're using Mysql and Charset is UTF-8, but the Active Record returns ASCII-8BIT. Is it possible to do similar changes to the activerecord as you did to the actionpack? Seems as we're not the only ones with that problem
Problem
Fetching data from any database (MySQL, PostgreSQL, SQLite 2 and 3), each configured to use UTF-8 as its character set, returns the data tagged as ASCII-8BIT under Ruby 1.9.1 and Rails 2.3.2.1.
This has been reported in #2188 and in the rails talk group (1).
Possible Solution
As in #2188, Rails is not the culprit here. Its only fault is its inherent trust that all the data it receives is UTF-8; problems arise when the data has another encoding.
The real problem is that all the current adapters use native C extensions as glue, and those extensions call the rb_str_new function, which in Ruby 1.9.1 creates a String with ASCII-8BIT encoding (2). That is why all the data is returned with this encoding.
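To make the symptom concrete, here is a minimal pure-Ruby sketch of what such a driver effectively hands back. It only simulates the effect of rb_str_new with force_encoding; the variable names are illustrative and no real driver is involved.

```ruby
# "caf\u00e9" is "café"; the \u escape guarantees a UTF-8 literal.
utf8_text = "caf\u00e9"

# Simulate the driver: same bytes, but tagged ASCII-8BIT (BINARY).
from_driver = utf8_text.dup.force_encoding(Encoding::ASCII_8BIT)

puts from_driver.encoding                  # the binary encoding
puts from_driver.bytes == utf8_text.bytes  # the bytes are untouched
puts from_driver == utf8_text              # yet the strings do not compare equal

# Re-tagging (not transcoding) is enough to restore the original string.
restored = from_driver.dup.force_encoding(Encoding::UTF_8)
puts restored == utf8_text
```

The bytes are never wrong in this scenario; only the label attached to them is.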
Because the initial problems were detected with MySQL, I made the needed modifications and created a fork of mysql-ruby on GitHub (3). This fork is compatible only with 1.9.1, and returns ASCII-8BIT for binary fields and UTF-8 for all other fields.
With this modified mysql-ruby gem, all ActiveRecord tests for MySQL pass except test_validate_case_sensitive_uniqueness. That test will fail for every encoding-aware adapter, because the validation downcases the value that needs to be unique and then runs a query using LOWER(#{field}) on the unique field. In Ruby 1.8.1 the downcase of non-ASCII strings is done with Multibyte; since downcase in Ruby 1.9.1 still does nothing for non-ASCII characters, I use Multibyte#downcase to do the conversion.
I attach a patch so that validates_uniqueness uses Multibyte#downcase on the string when running on Ruby 1.9.1. With this patch all tests pass for test_mysql in ActiveRecord.
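The mismatch between an ASCII-only downcase and the database's LOWER() can be reproduced on a current Ruby. This is only an approximation: it assumes Ruby 2.4 or later, where downcase(:ascii) mimics the 1.9.1 behaviour described above and plain downcase applies full Unicode case mapping, roughly what Multibyte#downcase provided.

```ruby
value = "\u00C9COLE"   # "ÉCOLE"

# ASCII-only folding, as in Ruby 1.9.1: the leading É is left alone.
ascii_only = value.downcase(:ascii)

# Unicode-aware folding, comparable to what LOWER() returns for UTF-8 data.
unicode = value.downcase

puts ascii_only
puts unicode
puts ascii_only == unicode   # the uniqueness query would compare these two
```

If the database lowercases the stored value fully while Ruby only folds A-Z, the two sides of the comparison differ and the validation misses the duplicate.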
TODO
- Make all the other adapters 1.9.1 compatible AND encoding aware.
- Remove hardcoded encoding use of UTF-8 and use the character set used by the DB.
Links
Comments and changes to this ticket
-
Hector E. Gomez Morales April 10th, 2009 @ 04:38 PM
- Tag changed from 2.3.2, activerecord, bug, patch to 2.3.2, activerecord, bug, patch, ruby19
-
Manfred Stienstra April 11th, 2009 @ 02:13 PM
Instead of
nvalue = ActiveSupport::Multibyte::Chars.new(value)
condition_params = [nvalue.downcase]
You can just do
condition_params = [value.mb_chars.downcase]
-
Hector E. Gomez Morales April 11th, 2009 @ 04:51 PM
Well, the problem is that $KCODE is not set in Ruby 1.9, so the call to mb_chars doesn't proxy the string. That's why I did the explicit wrapping.
# activesupport/lib/active_support/core_ext/string/multibyte.rb
def mb_chars
  if ActiveSupport::Multibyte.proxy_class.wants?(self)
    ActiveSupport::Multibyte.proxy_class.new(self)
  else
    self
  end
end

# activesupport/lib/active_support/multibyte/chars.rb
def self.wants?(string)
  $KCODE == 'UTF8' && consumes?(string)
end

# railties/lib/initializer.rb
def initialize_encoding
  $KCODE='u' if RUBY_VERSION < '1.9'
end
-
Hector E. Gomez Morales April 11th, 2009 @ 04:54 PM
Sorry, posting the code again:
# activesupport/lib/active_support/core_ext/string/multibyte.rb
def mb_chars
  if ActiveSupport::Multibyte.proxy_class.wants?(self)
    ActiveSupport::Multibyte.proxy_class.new(self)
  else
    self
  end
end

# activesupport/lib/active_support/multibyte/chars.rb
def self.wants?(string)
  $KCODE == 'UTF8' && consumes?(string)
end

# railties/lib/initializer.rb
def initialize_encoding
  $KCODE='u' if RUBY_VERSION < '1.9'
end
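One conceivable direction for Ruby 1.9 is to decide from the string's own encoding rather than from $KCODE. The sketch below is purely hypothetical: ProxySketch and its wants? are made-up names for illustration and are not the ActiveSupport implementation.

```ruby
# Hypothetical stand-in for the $KCODE-based check; not ActiveSupport code.
module ProxySketch
  # Proxy only strings that are tagged UTF-8 and actually contain valid UTF-8.
  def self.wants?(string)
    string.encoding == Encoding::UTF_8 && string.valid_encoding?
  end
end

utf8    = "caf\u00e9"                                   # valid UTF-8
binary  = utf8.dup.force_encoding(Encoding::ASCII_8BIT) # same bytes, BINARY tag
invalid = "\xFF".dup.force_encoding(Encoding::UTF_8)    # UTF-8 tag, bad bytes

puts ProxySketch.wants?(utf8)
puts ProxySketch.wants?(binary)
puts ProxySketch.wants?(invalid)
```

With a check like this, strings that still arrive from the driver as ASCII-8BIT would not be proxied, so it does not remove the need to fix the driver or adapter.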
-
Manfred Stienstra April 11th, 2009 @ 09:52 PM
Ah, right. I'll try to think of a way to access the proxy in Ruby 1.9.
-
smixok (at gmail) May 4th, 2009 @ 11:06 PM
Is there any patch for ActiveRecord, or for the sqlite and postgres-pr gems?
-
Dimitrij Denissenko May 9th, 2009 @ 04:30 PM
Sorry Hector, but your patch to Ruby-MySQL doesn't fully solve the problem.
if (fields[i].type == MYSQL_TYPE_BLOB)
The comparison with MYSQL_TYPE_BLOB also includes TEXT (TINYTEXT, MEDIUMTEXT, LONGTEXT) fields. All content stored in these also comes back ASCII-8BIT encoded.
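The MySQL C API reports BLOB and TEXT columns with the same type code and distinguishes them through the field's charsetnr, where 63 means the binary character set. The following Ruby sketch models that decision; Field is a hypothetical stand-in for MYSQL_FIELD, the helper names are invented, and UTF-8 is hardcoded for non-binary data just as in the fork under discussion.

```ruby
MYSQL_TYPE_BLOB  = 252  # type code shared by BLOB and TEXT columns
BINARY_CHARSETNR = 63   # charsetnr of MySQL's "binary" character set

# Hypothetical stand-in for the C struct MYSQL_FIELD.
Field = Struct.new(:name, :type, :charsetnr)

# A column holds binary data only if it is a BLOB type AND uses the binary charset.
def binary_field?(field)
  field.type == MYSQL_TYPE_BLOB && field.charsetnr == BINARY_CHARSETNR
end

# Tag a raw value from the driver according to the field metadata.
def tag(value, field)
  encoding = binary_field?(field) ? Encoding::ASCII_8BIT : Encoding::UTF_8
  value.dup.force_encoding(encoding)
end

text_field = Field.new("body", MYSQL_TYPE_BLOB, 33)                # TEXT, utf8_general_ci
blob_field = Field.new("data", MYSQL_TYPE_BLOB, BINARY_CHARSETNR)  # real BLOB

puts tag("caf\xC3\xA9".b, text_field).encoding
puts tag("\xFF\xD8".b, blob_field).encoding
```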
-
qoobaa May 13th, 2009 @ 09:05 AM
I've fixed the sqlite3-ruby gem (version 1.2.5): http://github.com/qoobaa/sqlite3-ruby/tree/master. The ASCII-8BIT encoding problem is also present in Rack; I've created a patch to fix it.
-
James Healy June 16th, 2009 @ 12:59 PM
I agree with the goal of a database driver that correctly sets the encoding of strings, however hard coding all strings to UTF-8 seems like we'd be shooting ourselves in the foot.
The problem with encoding issues is that often the wrong solution does the right thing 95% of the time. Sure most of us writing web apps these days operate in UTF-8, but when some poor sod rocks up with a UTF-16 encoded database we'd break her data.
I've never used the MySQL C API, but surely there's a way to detect the encoding of the current DB/table/column?
-
Brendan Schwartz June 16th, 2009 @ 07:51 PM
I second James Healy's approach. Blindly setting the encoding of all strings from the database to UTF-8 is short-sighted.
-
James Healy June 17th, 2009 @ 02:15 AM
I decided to investigate further and picked MySQL as my guinea pig.
I can see two possible approaches.
1. We patch the MySQL/Ruby driver to return strings marked with an appropriate encoding. To do so it would need to track the value of the character_set_results MySQL variable, which indicates the character set MySQL will return results in. I'm not sure what 'tracking' that variable would involve. Since it can be changed at any time, the driver would need to regularly check (and cache?) the value.
2. We patch the ActiveRecord MySQL Adapter. This leaves the driver encoding unaware: it more or less just passes byte arrays between AR and the MySQL server in blissful ignorance of the encoding. If the AR MySQL Adapter notices it has encoding: set in its config, it can take the ASCII-8BIT/BINARY strings the driver hands it and force the encoding to something appropriate.
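Either approach needs some mapping from the character set name MySQL uses to a Ruby encoding. A minimal sketch of that step follows; the table is deliberately tiny and illustrative, and MYSQL_TO_RUBY and tag_result are hypothetical names, not part of any driver or of ActiveRecord.

```ruby
# Illustrative, incomplete mapping from MySQL character set names to Ruby encodings.
MYSQL_TO_RUBY = {
  "utf8"   => Encoding::UTF_8,
  "latin1" => Encoding::Windows_1252,  # MySQL's latin1 is cp1252
  "binary" => Encoding::ASCII_8BIT
}

# Re-tag the raw bytes; unknown character sets are left as BINARY rather than guessed.
def tag_result(raw, charset_name)
  encoding = MYSQL_TO_RUBY.fetch(charset_name) { Encoding::ASCII_8BIT }
  raw.dup.force_encoding(encoding)
end

puts tag_result("caf\xC3\xA9".b, "utf8").encoding
puts tag_result("caf\xE9".b, "latin1").encode(Encoding::UTF_8)
puts tag_result("abc".b, "some_unknown_charset").encoding
```

Falling back to BINARY for unknown names keeps the current behaviour instead of mislabelling data, which is the failure mode the comments above warn about.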
Either approach looks achievable without too much work. I'm keen to attempt one of them, but I think I'll mull over the options for a while first.
Any thoughts? Ken - it sounds like you modified the SQL Server Adapter and not the driver?
-
-
James Healy July 4th, 2009 @ 04:45 PM
It took me a little while to get it, but I've got a proposed patch for the MysqlAdapter on GitHub @ http://github.com/yob/rails/commit/986b8c99331d68087eaa0a703f4121c5....
I took approach (2) from my earlier comment. The Mysql driver remains encoding unaware, and all results are stored in the AR model attributes hash marked as "BINARY" encoding.
Traditionally, non-string attributes are type cast on demand (converted to ints, dates, etc.) and strings are left untouched. This patch adds a type casting process for string attributes that "fixes" the encoding to match what the user has specified in database.yml.
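The idea of casting string attributes on read can be sketched outside Rails. The class below is a toy model, not the code from the patch: RecordSketch, its constructor arguments, and read_attribute are assumptions made for illustration, and it expects a Ruby encoding name such as "UTF-8" rather than a database.yml value such as utf8.

```ruby
# Toy model of on-demand type casting; not ActiveRecord code.
class RecordSketch
  def initialize(raw_attributes, encoding:, binary_columns: [])
    @raw = raw_attributes                  # values exactly as the driver returned them
    @encoding = Encoding.find(encoding)    # e.g. "UTF-8"
    @binary_columns = binary_columns       # columns that must stay BINARY
    @cast = {}
  end

  # Cast lazily on first read and cache the result.
  def read_attribute(name)
    @cast.fetch(name) { @cast[name] = cast(name, @raw.fetch(name)) }
  end

  private

  def cast(name, value)
    return value unless value.is_a?(String)          # ints, dates, etc. untouched here
    return value if @binary_columns.include?(name)   # real binary data keeps its tag
    value.dup.force_encoding(@encoding)              # re-tag, leaving the raw copy alone
  end
end

record = RecordSketch.new(
  { "name" => "caf\xC3\xA9".b, "avatar" => "\xFF\xD8".b, "age" => 3 },
  encoding: "UTF-8", binary_columns: ["avatar"]
)
puts record.read_attribute("name").encoding
puts record.read_attribute("avatar").encoding
```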
The commit message has a few extra details.
-
runa July 10th, 2009 @ 07:28 PM
(The correct URL for James's patch is http://github.com/yob/rails/commit/986b8c99331d68087eaa0a703f4121c5... )
-
James Healy July 19th, 2009 @ 02:21 AM
What's the best way for me to get feedback on my patch?
I'm still in two minds about whether the encoding of strings from the database should be fixed in the DB driver or ActiveRecord, so it would be nice to get some discussion going.
-
Michael H Buselli August 4th, 2009 @ 04:21 PM
I think ActiveRecord is the right place to handle encoding, though some DB drivers may also be able to help if there is a non-standard way the particular database handles encoding. For databases that do nothing with encoding and just push raw bits, ActiveRecord should provide a configuration option for the encoding and return Strings properly encoded, perhaps even transformed to another encoding if the data is encoded differently than the user wants.
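The distinction between returning strings "properly encoded" and "transformed to another encoding" corresponds to force_encoding versus encode in Ruby 1.9. A small sketch, assuming the database holds Latin-1 bytes and the application wants UTF-8:

```ruby
# Bytes for "café" as a Latin-1 database would store them, tagged BINARY by the driver.
raw = "caf\xE9".b

# Step 1: label the bytes with the encoding they are really in (bytes unchanged).
tagged = raw.dup.force_encoding(Encoding::ISO_8859_1)

# Step 2: transcode to the encoding the application wants (bytes change).
transcoded = tagged.encode(Encoding::UTF_8)

puts tagged.bytes.inspect
puts transcoded.bytes.inspect
puts transcoded
```

Skipping step 1 and forcing UTF-8 directly onto these bytes would produce a string that fails valid_encoding?.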
That's my 2¢, anyway.
-
Michael H Buselli August 5th, 2009 @ 10:13 PM
I wrote a gem to enhance ActiveRecord::Base as described above: http://github.com/cosine/active_record_encoding/tree/master
It assumes the database just pushes bits and doesn't understand its encoding, which is true in my case.
I have not thoroughly tested it, yet. Proceed with caution if you use it. Even so, I would love some feedback.
-
Manfred Stienstra August 6th, 2009 @ 07:42 AM
"It assumes the database just pushes bits and doesn't understand its encoding"
Unfortunately that's not the case. For instance, when you try to store certain UTF-8 characters in a Latin-1 database you will lose information. The database storage engine and the database protocol are generally encoding aware, so we will have to deal with that.
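The information loss is easy to demonstrate in Ruby alone, without a database. This sketch only uses String#encode to stand in for what happens when a character has no Latin-1 representation:

```ruby
text = "\u20AC5"   # "€5"; the euro sign does not exist in ISO-8859-1

# A strict conversion refuses outright.
strict_failed = false
begin
  text.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError => e
  strict_failed = true
  puts "cannot convert: #{e.message}"
end

# A lenient conversion substitutes a replacement character, and the original is gone.
lossy      = text.encode(Encoding::ISO_8859_1, undef: :replace)
round_trip = lossy.encode(Encoding::UTF_8)
puts round_trip
puts round_trip == text
```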
"Even so, I would love some feedback."
If you're serious in pursuing this plugin I would recommend writing a lot of tests.
-
Yugui (Yuki Sonoda) August 7th, 2009 @ 07:20 AM
"The database storage engine and the database protocol are generally encoding aware so we will have to deal with that."
Yes. So approach (1) is ideal.
I sent a patch for ruby-pg. The next release of ruby-pg will be encoding-aware.
Mysql/Ruby should be fixed in the same way as ruby-pg. I think my article http://yugui.jp/articles/838 can help with fixing database drivers.
-
pyromaniac September 6th, 2009 @ 07:17 PM
Hi. I tried to fix this issue and found two points:
1. First, the DB driver gives us UTF-8 strings with a forced ASCII-8BIT encoding.
2. Some UTF-8 views are forced to ASCII-8BIT during compilation.
So we have a compound problem.
Let's look:
>> p = "привет"
=> "привет"
>> t = "мир"
=> "мир"
>> t.force_encoding Encoding::ASCII_8BIT
=> "\xD0\xBC\xD0\xB8\xD1\x80"
>> p << t
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
from (irb):4
from /home/pyromaniac/.rvm/ruby-1.9.1-p243/bin/irb:12:in `<main>'
This hack works well for fully UTF-8 projects:
class String
  alias_method(:orig_concat, :concat)
  def concat(value)
    orig_concat value.force_encoding(Encoding::UTF_8)
  end
end
What do you think?
-
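One thing to keep in mind with the hack above is that force_encoding changes the argument in place, so the caller's string is re-tagged as a side effect. A more conservative variant, shown here only as a sketch with a hypothetical helper name (utf8_append), works on a copy and refuses bytes that are not valid UTF-8:

```ruby
# Hypothetical helper; appends without patching String or mutating `value`.
def utf8_append(buffer, value)
  if value.encoding == Encoding::ASCII_8BIT
    candidate = value.dup.force_encoding(Encoding::UTF_8)
    # Genuinely binary data should not be silently glued into text.
    raise ArgumentError, "binary data is not valid UTF-8" unless candidate.valid_encoding?
    value = candidate
  end
  buffer << value
end

buffer = +"\u043F\u0440\u0438\u0432\u0435\u0442 "   # "привет "
piece  = "\u043C\u0438\u0440".b                     # "мир" tagged as BINARY

utf8_append(buffer, piece)
puts buffer
puts piece.encoding   # the caller's string keeps its original tag
```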
Aleksander Pohl October 22nd, 2009 @ 12:00 AM
I tested ruby-mysql posted by pyromaniac and it seems to work fine.
-
Loren Segal November 7th, 2009 @ 02:14 AM
You can use this little pure Ruby hack to get things working. No modifications to the mysql gem are needed, so it's pretty easy to drop into an existing app:
-
Michael Hasenstein April 4th, 2010 @ 10:05 AM
And now??? I have this exact issue with Rails 3 Beta 2, Ruby 1.9.2-head.
-
Ivan Ukhov April 7th, 2010 @ 12:58 AM
For those who don't want to overwrite String::concat and who use HAML for views, here is my solution (http://gist.github.com/358275):
module Haml
  class Buffer
    class UTF8String < String
      def << text; super text.toutf8; end
    end
    alias original_initialize initialize
    def initialize *args
      original_initialize *args
      @buffer = UTF8String.new
    end
  end
end
-
Jeremy Kemper April 24th, 2010 @ 09:23 PM
- Milestone changed from 2.x to 2.3.6
- State changed from new to open
- Assigned user set to Jeremy Kemper
Any luck getting mysql encoding into the released gem?
-
Cezary Baginski May 2nd, 2010 @ 12:09 AM
As for the mysql gem, I might take a shot at it ... based on Yugui's article.
-
Cezary Baginski May 2nd, 2010 @ 10:29 AM
I have the impression that there are too many "potential" mysql-ruby sources and most of them are unmaintained.
I'm still investigating, so sorry if I am missing something important.
Here is an interesting summary as to why:
http://github.com/luislavena/mysql-gem/issues/labels/documentation#...
I propose:
- changing the name from 'mysql','ruby-mysql','mysql-win' into something that stands out from the rest, just to avoid confusion
- update the Gemfile and source files to require this recommended new libmysql wrapper.
@Hector - Thanks for the great work with the mysql-ruby fork!
- Should I use it as a base for testing and patches?
- Can we rename it and make it the official gem for Rails?
-
Rizwan Reza May 16th, 2010 @ 02:41 AM
- Tag changed from 2.3.2, activerecord, bug, patch, ruby19 to 2.3.2, activerecord, bug, bugmash, patch, ruby19
-
Jeremy Kemper August 30th, 2010 @ 02:28 AM
- Milestone changed from 2.3.9 to 2.3.10
- Importance changed from to
-
Santiago Pastorino February 2nd, 2011 @ 04:32 PM
- Tag changed from 2.3.2, activerecord, bug, bugmash, patch, ruby19 to 232, activerecord, bug, bugmash, patch, ruby19
This issue has been automatically marked as stale because it has not been commented on for at least three months.
The resources of the Rails core team are limited, and so we are asking for your help. If you can still reproduce this error on the 3-0-stable branch or on master, please reply with all of the information you have about it and add "[state:open]" to your comment. This will reopen the ticket for review. Likewise, if you feel that this is a very important feature for Rails to include, please reply with your explanation so we can consider it.
Thank you for all your contributions, and we hope you will understand this step to focus our efforts where they are most helpful.
-
Santiago Pastorino February 2nd, 2011 @ 04:32 PM
- State changed from open to stale
Tickets have moved to Github
The new ticket tracker is available at https://github.com/rails/rails/issues
Attachments
Referenced by
- 2188 Encoding error in Ruby1.9 for templates Hi, sorry to be so late but I got some solutions to this ...