This project is archived and is in readonly mode.
Multibyte cleanup cleanup - performance and readability
Reported by Sam Ruby | September 7th, 2009 @ 03:34 PM | in 2.x
The following code is difficult to follow, and may be inefficient.
if expression = valid_character
stripped = []; for c in string.split(//)
stripped << c if valid_character.match(c)
end; stripped.join
At a minimum, a comment that KCODE='UTF-8' is presumed would be in order, use of the value of 'expression' makes sense, and there are builtin functions which will do the iteration in a way that likely will be faster on both MRI and on other Ruby implementations. A patch is attached.
Comments and changes to this ticket
-
Manfred Stienstra September 7th, 2009 @ 03:59 PM
- Milestone set to 2.x
- State changed from new to open
Just to cover my ass, the code was meant as a first iteration to be able to do a security release. Improvements are most welcome!
Actually the code also works when $KCODE is set to 'SJIS' so a better comment would be: "Splits the string on character boundaries, which are determined based on $KCODE."
-
Manfred Stienstra September 7th, 2009 @ 04:01 PM
Some simple benchmarks on 100.000 iterations on a relatively short string:
Before: Clean valid UTF-8 : 2.620000 0.020000 2.640000 ( 2.631768) Clean invalid UTF-8: 2.570000 0.010000 2.580000 ( 2.586888) Clean valid UTF-8 : 1.390000 0.010000 1.400000 ( 1.397572) Clean invalid UTF-8: 2.350000 0.000000 2.350000 ( 2.354005) After: Clean valid UTF-8 : 1.780000 0.010000 1.790000 ( 1.786033) Clean invalid UTF-8: 1.680000 0.020000 1.700000 ( 1.698262) Clean valid UTF-8 : 1.300000 0.010000 1.310000 ( 1.315102) Clean invalid UTF-8: 1.840000 0.030000 1.870000 ( 1.866965)
-
Sam Ruby September 7th, 2009 @ 05:40 PM
I assume that smaller numbers are better in those benchmarks?
If so, this change does not produce a regression in performance, and doesn't have much effect on sustained calls on short strings. But in all cases, the performance effect is an improvement (again, presuming that I'm reading this correctly), and in some cases a noticeable one (on the order of 20%).
And, yes your comment is much better than the one I proposed. Rereading the rdocs on String#split, it does say characters, so perhaps a comment is not needed at all (I would still argue that this is not obvious, but perhaps that's just because it caught me).
-
Manfred Stienstra September 7th, 2009 @ 05:56 PM
I assume that smaller numbers are better in those benchmarks?
Ah yes, I must have been spacing when I posted the benchmarks, sorry about that. It's total runtime for User, System, Total (Wall clock). Top two rows are for the clean method and bottom two rows are for the verify method.
So you can read it as 10% to 40% faster for short strings. Currently the cleanup method is only used in Rails to clean up attribute values, we can assume most string will be relatively short.
I would still argue that this is not obvious, but perhaps that's just because it caught me.
I don't think many people know that regular expressions changes depending on how you set $KCODE, so it's rather esoteric and should probably be commented on.
Michael, can you commit this if you approve?
-
Repository November 4th, 2009 @ 03:19 PM
- State changed from open to committed
(from [62a022891b5f12284205920a03a3cbbe5b8ecd27]) Improve performance of Multibyte::Utils.
Replace explicit for-loops by faster enumeration methods.
[#3158 state:committed] http://github.com/rails/rails/commit/62a022891b5f12284205920a03a3cb...
-
Repository March 16th, 2010 @ 05:04 PM
(from [2d3c58068c2f35e0a63d67a11685d1e5057412ff]) Improve performance of Multibyte::Utils.
Replace explicit for-loops by faster enumeration methods.
[#3158]
Signed-off-by: Jeremy Kemper jeremy@bitsweat.net
http://github.com/rails/rails/commit/2d3c58068c2f35e0a63d67a11685d1...
Create your profile
Help contribute to this project by taking a few moments to create your personal profile. Create your profile »
<h2 style="font-size: 14px">Tickets have moved to Github</h2>
The new ticket tracker is available at <a href="https://github.com/rails/rails/issues">https://github.com/rails/rails/issues</a>
People watching this ticket
Attachments
Referenced by
- 3158 Multibyte cleanup cleanup - performance and readability [#3158 state:committed] http://github.com/rails/rails/co...
- 3158 Multibyte cleanup cleanup - performance and readability [#3158]