[PATCH] XmlMini - Fixed bugs and improved speed of LibXML and Nokogiri backend
Reported by Willem van Bergen | December 31st, 2009 @ 11:04 AM | in 2.3.6
While I was at it, I decided to fix the bugs in the LibXML backend of XmlMini as well, because it is still the fastest option available. This implementation fixes the following bugs, which made its output differ from that of the REXML backend.
Parsing CDATA elements works.
The previous version simply ignored all CDATA blocks, e.g.:
<tag><![CDATA[foo "bar" & <baz>]]></tag>
Content that is mixed with tags is also parsed correctly.
<tag> text <other-tag /> more text </tag>
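To make the expected behaviour concrete, here is a minimal sketch that reads the two snippets above with REXML, the reference backend. It is not XmlMini code: `root_text` is a hypothetical helper that only joins the text and CDATA children of the root element, and it assumes the `rexml` library is available to your Ruby.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of XmlMini): joins the text and CDATA
# children of the root element and skips child elements.
def root_text(xml)
  root = REXML::Document.new(xml).root
  root.children
      .select { |node| node.is_a?(REXML::Text) } # REXML::CData is a subclass of REXML::Text
      .map(&:value)
      .join
end

cdata = root_text('<tag><![CDATA[foo "bar" & <baz>]]></tag>')
mixed = root_text('<tag> text <other-tag /> more text </tag>')

puts cdata.inspect # the CDATA payload, unescaped
puts mixed.inspect # both text fragments around <other-tag />
```

A backend that ignores CDATA nodes would give an empty string for the first document, which is the bug described above.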
Now, Rails's test suite runs fine with this backend set as the default, and all equality tests also run correctly (see http://github.com/stepheneb/rails_hash_from_xml). Moreover, the code is more readable (IMHO), and I added specific tests for the issues the old version of the backend had.
Performance benchmark result
Still the fastest backend. See http://github.com/stepheneb/rails_hash_from_xml for the benchmark tool.
Rehearsal ---------------------------------------------
rexml     24.300000   0.110000  24.410000 ( 24.408301)
libxml     0.970000   0.020000   0.990000 (  0.981194)
nokogiri  10.740000   0.060000  10.800000 ( 10.805523)
----------------------------------- total: 37.290000sec

               user     system      total        real
rexml     23.430000   0.050000  23.480000 ( 23.468661)
libxml     0.490000   0.000000   0.490000 (  0.490357)
nokogiri  10.710000   0.020000  10.730000 ( 10.734065)
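For readers unfamiliar with the report format: the "Rehearsal" block is what Ruby's `Benchmark.bmbm` prints before the measured run. The sketch below uses two stand-in workloads (an assumption for illustration); the actual benchmark tool times `Hash.from_xml` once per XmlMini backend, which needs ActiveSupport, libxml-ruby and Nokogiri installed.

```ruby
require 'benchmark'

# Stand-in workloads (assumptions, not the real XML parsing jobs).
WORKLOADS = {
  'slow' => -> { 2_000.times.map { |i| i.to_s * 10 }.join },
  'fast' => -> { 2_000.times.map(&:to_s).join }
}

# bmbm runs every job twice: a "Rehearsal" pass first, then the
# measured pass whose timings are returned as Benchmark::Tms objects.
results = Benchmark.bmbm(10) do |bench|
  WORKLOADS.each do |label, job|
    bench.report(label) { 20.times { job.call } }
  end
end

WORKLOADS.keys.zip(results).each do |label, tms|
  puts format('%-10s %.6f s real', label, tms.real)
end
```

The rehearsal pass is there to reduce the influence of warm-up and memory allocation on the second, reported pass.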
Comments and changes to this ticket
-
Jeremy Kemper December 31st, 2009 @ 08:18 PM
- State changed from new to open
- Milestone set to 2.3.6
Nice work. This pulls the implementation out of sync with master, however. Could you patch master instead and backport to 2-3-stable?
-
Willem van Bergen January 1st, 2010 @ 11:26 AM
When I finally set up a working Rails 3 environment, I found out that most of these bugs have already been fixed in Rails 3. However, these fixes have reduced the performance of the backend quite a bit. Moreover, I did find a new incompatibility with the REXML backend: Whitespace in the content of a tag is not preserved correctly when the tag has attributes, e.g.:
<product type="file"> </product>
I found that the Nokogiri backend has the same issue. I have attached a new patch, which fixes this issue in both backends. It actually is pretty much a rewrite of both backends. Except for some small API differences between LibXML and Nokogiri, both backends now use the exact same code to build the hash. This makes it easier to fix future bugs in both backends at the same time. Moreover, in both cases, the code is cleaner than the previous versions, and for both backends the performance is significantly improved.
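As an illustration of the shared hash-building idea, here is a minimal sketch and not the code from the patch. `Node` is a hypothetical stand-in for a parsed LibXML or Nokogiri node, text and CDATA children are plain strings, and the `__content__` key name is an assumption made for this example.

```ruby
# Hypothetical stand-in for a parsed XML node; a real backend would wrap
# LibXML or Nokogiri node objects behind the same three accessors.
Node = Struct.new(:name, :attributes, :children)

CONTENT_KEY = '__content__' # assumed key under which text content is stored

def node_to_hash(node, hash = {})
  node_hash = {}

  # Repeated sibling names collapse into an array of hashes.
  case hash[node.name]
  when Array then hash[node.name] << node_hash
  when Hash  then hash[node.name] = [hash[node.name], node_hash]
  when nil   then hash[node.name] = node_hash
  end

  node.children.each do |child|
    if child.is_a?(Node)
      node_to_hash(child, node_hash)
    else
      # Text and CDATA fragments are appended in document order.
      node_hash[CONTENT_KEY] = (node_hash[CONTENT_KEY] || '') + child
    end
  end

  # Drop whitespace-only content only when the node also has child elements.
  # This runs before attributes are added, so attributes cannot trigger it.
  if node_hash.length > 1 && node_hash[CONTENT_KEY].to_s.strip.empty?
    node_hash.delete(CONTENT_KEY)
  end

  node.attributes.each { |key, value| node_hash[key] = value }
  hash
end

# <product type="file"> </product>
product = Node.new('product', { 'type' => 'file' }, [' '])
p node_to_hash(product)
```

In this sketch the blank-content check happens before the attributes are merged, so the single space in the `<product type="file"> </product>` example survives. With only the node accessors differing per library, a fix to the builder applies to both backends at once.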
Performance
Running the performance tests on the master branch:
               user     system      total        real
REXML     17.200000   0.080000  17.280000 ( 17.327603)
LibXML     2.100000   0.110000   2.210000 (  2.218409)
Nokogiri   5.290000   0.030000   5.320000 (  5.343453)
Now, with my patch applied:
               user     system      total        real
rexml     24.170000   0.110000  24.280000 ( 24.334927)
libxml     0.670000   0.000000   0.670000 (  0.674913)
nokogiri   2.390000   0.020000   2.410000 (  2.417661)
-
Willem van Bergen January 1st, 2010 @ 11:28 AM
- Tag changed from activesupport, backend, libxml, patch, xmlmini to activesupport, backend, libxml, nokogiri, patch, xmlmini
- Title changed from [PATCH] XmlMini - Fixed LibXML backend issues to [PATCH] XmlMini - Fixed bugs and improved speed of LibXML and Nokogiri backend
The last patch was for the master branch. I backported the changes to the 2-3-stable branch, basically by copy-pasting the relevant bits of the code and tests. See the attachment. Can this be done in a smarter way with git?
-
Willem van Bergen January 1st, 2010 @ 11:35 AM
Note that I also wrote a SAX-based backend using Nokogiri, which is even faster. See my other ticket at https://rails.lighthouseapp.com/projects/8994/tickets/3636. But let's first fix the current backends before adding new ones :-)
-
Willem van Bergen January 1st, 2010 @ 01:13 PM
I see I ran the benchmark with my patch applied on my less powerful notebook (see the difference in REXML performance). So the performance is even more impressive with the patch applied, compared to the master branch :-)
               user     system      total        real
REXML     17.420000   0.080000  17.500000 ( 17.617890)
LibXML     0.520000   0.000000   0.520000 (  0.535704)
Nokogiri   1.840000   0.020000   1.860000 (  1.871491)
-
Jeremy Kemper January 1st, 2010 @ 09:15 PM
Thanks Willem! To backport quickly, you can check out 2-3-stable and then cherry-pick the revision you made to master.
-
Repository January 1st, 2010 @ 09:19 PM
(from [12f6fd0f2687f083bc23ad63fdc82c7e65cb8984]) Bugfixes, speed improvements and code cleanup for Nokogiri's and LibXML's XmlMini backend
[#3641]
Signed-off-by: Jeremy Kemper jeremy@bitsweat.net
http://github.com/rails/rails/commit/12f6fd0f2687f083bc23ad63fdc82c...
-
Repository January 1st, 2010 @ 09:19 PM
- State changed from open to committed
(from [34b03cebf9c9f2ce2a53511a4b2160485e270f12]) Code cleanup, bugfixes and speed improvements for the Nokogiri and LibXML XmlMini backends
[#3641 state:committed]
Signed-off-by: Jeremy Kemper jeremy@bitsweat.net
http://github.com/rails/rails/commit/34b03cebf9c9f2ce2a53511a4b2160...
-
Stephen Bannasch January 2nd, 2010 @ 06:47 AM
I updated my benchmark and testing suite here for Rails 3 master: http://github.com/stepheneb/rails_hash_from_xml.
All my tests comparing REXML equality pass on both the Rails 3 master and 2-3-stable branches.
Running in MRI I compare Nokogiri and libxml-ruby to REXML.
In JRuby I compare the JDOM implementation to REXML.
I also tested the FFI version of Nokogiri in JRuby, but all of the tests fail.
It's very nice to see the Nokogiri speedups on MRI. Here are my results:
# Ruby 1.8.6 (2008-08-11 rev 287), platform: universal-darwin9.0
$ ruby -I$RAILS_SOURCE/activesupport/lib bench_hash_from_xml.rb
               user     system      total        real
REXML     10.960000   0.270000  11.230000 ( 11.464308)
Nokogiri   1.230000   0.020000   1.250000 (  1.256476)
LibXML     0.430000   0.000000   0.430000 (  0.434689)
There are also some nice speedups in JRuby for both REXML and JDOM. Nokogiri (FFI) is quite slow, however; one reason is that Nokogiri in JRuby requires ObjectSpace to be turned on.
$ jruby --server -I$RAILS_SOURCE/activesupport/lib bench_hash_from_xml.rb 2
# OpenJDK Server VM 1.7.0-internal:
               user     system      total        real
REXML      3.448000   0.000000   3.448000 (  3.448000)
JDOM       0.761000   0.000000   0.761000 (  0.761000)
Nokogiri   6.083000   0.000000   6.083000 (  6.082000)
For more details see: http://github.com/stepheneb/rails_hash_from_xml/blob/master/readme.txt
Tickets have moved to GitHub
The new ticket tracker is available at https://github.com/rails/rails/issues
Referenced by
- 3636 SAX/Nokogiri backend for XmlMini I have also rewrote the current LibXML and Nokogiri backe...
- 3636 SAX/Nokogiri backend for XmlMini The following results are for the Rails 3 branch. The REX...
- 3641 [PATCH] XmlMini - Fixed bugs and improved speed of LibXML and Nokogiri backend [#3641]
- 3641 [PATCH] XmlMini - Fixed bugs and improved speed of LibXML and Nokogiri backend [#3641 state:committed]