#951 Multibyte revisited - Ruby on Rails

This project is archived and is in readonly mode.

#951 ✓resolved

Multibyte revisited

Reported by Manfred Stienstra | September 1st, 2008 @ 02:59 PM

Multibyte Revisited is an attempt to clean up the current implementation of ActiveSupport::Multibyte and make it compatible with Ruby 1.9.

http://github.com/Manfred/rails/...

Comments and changes to this ticket

Michael Koziarski September 3rd, 2008 @ 03:27 PM
- Assigned user set to “Jeremy Kemper”
Could you write up the changes in a little more detail? We also need to think about how we handle the migration from chars to .mb_chars for people using chars, and which release streams need it.
Manfred Stienstra September 4th, 2008 @ 06:01 PM
The most significant change in the API is the change from String#chars to String#mb_chars. Because Ruby 1.8.7 and Ruby 1.9 use String#chars as an iterator over the characters in a string we can no longer use that method for our purposes.

All uses of String#chars in other parts of Rails have been changed to use the new method. In 1.8.6 String#chars is aliased to String#mb_chars.

| Method | Ruby <= 1.8.6 | Ruby 1.8.7 | Ruby 1.9 |
|---|---|---|---|
| String#chars | multibyte accessor | iterator | iterator |
| String#mb_chars | multibyte accessor | multibyte accessor | multibyte accessor |

The other big change is that a level of indirection has been removed. In the old version you would define a handler and the Chars proxy class would delegate methods to the handler.
```
String#chars --> ActiveSupport::Multibyte::Chars --> ActiveSupport::Multibyte::Handlers::UTF8Handler
```
In the new version methods are called directly on the proxy class.
```
String#mb_chars --> ActiveSupport::Multibyte::Chars
```
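As a rough illustration of the difference, the two shapes could be sketched like this. All class and module names here (`SketchUTF8Handler`, `OldStyleChars`, `NewStyleChars`) are hypothetical stand-ins, not the actual ActiveSupport classes, and only a single method is shown.

```ruby
# Old shape (sketch): the proxy hands every call to a separate handler.
module SketchUTF8Handler
  # Reverse by codepoint rather than by byte.
  def self.reverse(string)
    string.unpack("U*").reverse.pack("U*")
  end
end

class OldStyleChars
  def initialize(string, handler = SketchUTF8Handler)
    @wrapped_string = string
    @handler = handler
  end

  # Extra hop: proxy -> handler -> result wrapped in a new proxy.
  def reverse
    self.class.new(@handler.reverse(@wrapped_string), @handler)
  end

  def to_s
    @wrapped_string
  end
end

# New shape (sketch): the proxy implements the method itself.
class NewStyleChars
  def initialize(string)
    @wrapped_string = string
  end

  # No handler lookup; the work happens directly on the proxy.
  def reverse
    self.class.new(@wrapped_string.unpack("U*").reverse.pack("U*"))
  end

  def to_s
    @wrapped_string
  end
end
```

Both variants return the same result; the second simply skips one method dispatch per call.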
Of course you can still define your own proxy class if you want to add support for another encoding:
```
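# Custom proxy for UTF-32 data, where every character occupies four bytes.
class UTF32Chars < ActiveSupport::Multibyte::Chars
  # Character count, assuming String#length counts bytes (as on Ruby 1.8).
  def size
    @wrapped_string.length / 4
  end
end

# Use UTF32Chars as the proxy class from now on.
ActiveSupport::Multibyte.proxy_class = UTF32Chars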
```
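To show how a setting like `proxy_class` might be consulted when a string is wrapped, here is a minimal self-contained sketch. `SketchMultibyte`, its `wrap` method and `UTF32SketchChars` are hypothetical names made up for illustration; the real lookup in ActiveSupport may differ. It uses `String#bytesize`, which is not available on Ruby 1.8.6.

```ruby
# Hypothetical stand-in for the Multibyte module; not the real API.
module SketchMultibyte
  # Default proxy: counts UTF-8 codepoints.
  class Chars
    def initialize(string)
      @wrapped_string = string
    end

    def size
      @wrapped_string.unpack("U*").size
    end
  end

  class << self
    attr_writer :proxy_class

    # Falls back to the default proxy until another class is assigned.
    def proxy_class
      @proxy_class ||= Chars
    end

    # Wraps a string in whichever proxy class is currently configured.
    def wrap(string)
      proxy_class.new(string)
    end
  end
end

# Custom proxy for fixed-width UTF-32 data (four bytes per character).
class UTF32SketchChars < SketchMultibyte::Chars
  def size
    @wrapped_string.bytesize / 4
  end
end
```

With this in place, assigning `SketchMultibyte.proxy_class = UTF32SketchChars` changes what `SketchMultibyte.wrap` returns without touching any calling code.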
This change means less code and less indirection during execution. Because of this, the new implementation is faster than the old one. The speedup varies between zero and 25%, depending on the method.

We can choose to deprecate the call to String#chars on 1.8.6, but there is no pressing reason why people should stop using it. When they decide to run their application on 1.8.7 or 1.9, this is probably going to be a conscious decision anyway, and other stuff will break because the Ruby API changed. I'm not sure if we need to do anything besides warn people about it.
```
> "Hello".chars.upcase
NoMethodError: undefined method `upcase' for #<Enumerator:0x3ab600>
  from (irb):1
  from /opt/ruby19/bin/irb:12:in `<main>'
```
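The error above comes from the native String#chars returning a collection of one-character strings rather than a string-like proxy: an Enumerator on Ruby 1.9 when called without a block, and an Array on Ruby 2.0 and later. A small sketch of what code relying on the native method has to do instead (`upcase_each_char` is a hypothetical helper name):

```ruby
# Hypothetical helper: upcase a string by walking the native String#chars.
def upcase_each_char(string)
  # chars gives one-character strings; whether it is an Enumerator or an
  # Array depends on the Ruby version, and both respond to map.
  string.chars.map { |char| char.upcase }.join
end

native = "Hello".chars
puts native.respond_to?(:upcase)   # false: the collection is not a String
puts upcase_each_char("Hello")     # HELLO
```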
Michael Koziarski September 11th, 2008 @ 02:53 PM
This seems good to me, though there's a bunch of commented-out code in the parsing code.

As for handling the deprecation of #chars, I think that for 1.8.6 and earlier we define a #chars method which is deprecated and delegates to mb_chars; for 1.8.7 and up we just define the .mb_chars method.
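A minimal sketch of that idea, assuming nothing about the eventual patch: `CharsDeprecationSketch`, `LegacyString` and the warning text are hypothetical, warnings are collected in a plain array where real code would presumably go through ActiveSupport::Deprecation, and the `legacy` flag is forced to `true` so the deprecated path can be exercised on a modern Ruby.

```ruby
# Hypothetical sketch; none of these names are Rails API.
module CharsDeprecationSketch
  # Collected deprecation messages (stand-in for a real deprecation logger).
  def self.warnings
    @warnings ||= []
  end

  # True on interpreters without a native String#chars (Ruby <= 1.8.6).
  def self.legacy_ruby?
    !String.method_defined?(:chars)
  end

  # Adds mb_chars to klass, plus a deprecated chars on legacy Rubies only.
  def self.install(klass, legacy = legacy_ruby?)
    # Placeholder return value standing in for a multibyte proxy object.
    klass.send(:define_method, :mb_chars) { "multibyte proxy for #{self}" }
    return unless legacy

    klass.send(:define_method, :chars) do
      CharsDeprecationSketch.warnings << "chars is deprecated, use mb_chars instead"
      mb_chars
    end
  end
end

# A String subclass keeps the sketch from touching the real String#chars.
class LegacyString < String; end
CharsDeprecationSketch.install(LegacyString, true)
```

Calling `LegacyString.new("abc").chars` then records one warning and returns whatever `mb_chars` returns.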

Am I missing anything?
Manfred Stienstra September 11th, 2008 @ 03:20 PM
The commented lines in the parsing code are for Unicode character properties we don't use. They're also in the current parsing code. I can remove them if they really bother you?

I'm fine with deprecating #chars. I guess it'll save people some trouble if they move over to Ruby 1.9.
Michael Koziarski September 11th, 2008 @ 03:22 PM
Nah, I don't mind about the commented code, was just wondering if it was deliberate or not. That stuff's only used when generating the tables anyway, right?

Ironically we chose chars so people could easily move to 1.9... So can you update with the deprecation behaviour, then I think we're probably good to go.

Jeremy, any thoughts?
Manfred Stienstra September 11th, 2008 @ 03:31 PM
Yes, the parsing code is only used to generate the unicode tables.
Manfred Stienstra September 11th, 2008 @ 11:36 PM
Updated patch.
- multibyte-revisited.diff 580.7 KB
Michael Koziarski September 12th, 2008 @ 02:36 PM
I don't see any tests which assert_deprecated 'mb_chars' { "something".chars}

Apart from that this is looking great. Thanks again for all your awesome work on this stuff.

Feel free to push this to a rebased branch on github if it's easier for you.
Manfred Stienstra September 12th, 2008 @ 02:46 PM
Hmm, I must have messed up the last merge. I'll fix it this weekend.
Manfred Stienstra September 21st, 2008 @ 05:06 PM
- Tag changed from “multibyte, patch” to “multibyte, patch”
Michael, I pushed my changes to Manfred/rails:multibyte-revisited.

Do you guys have any idea where this fits into the Rails timeline? Is this 2.1 or 2.2 material?
Michael Koziarski September 22nd, 2008 @ 07:51 AM
- Milestone cleared.
I think we should probably merge this for 2.2 as it helps with the 1.9 / 1.8.7 compatibility stuff.

Any objections, Jeremy? Otherwise I'll get to it this week.
Michael Koziarski September 22nd, 2008 @ 08:58 PM
Applied and merged and all that.

Really nice work!
Pratik October 17th, 2008 @ 05:16 PM
- Assigned user changed from “Jeremy Kemper” to “Michael Koziarski”
@koz : This is already resolved afaik ?
Michael Koziarski October 17th, 2008 @ 05:17 PM
- State changed from “new” to “resolved”
indeed