
#3641 ✓committed
Willem van Bergen

[PATCH] XmlMini - Fixed bugs and improved speed of LibXML and Nokogiri backend

Reported by Willem van Bergen | December 31st, 2009 @ 11:04 AM | in 2.3.6

While I was at it, I decided to fix the bugs in the LibXML backend of XmlMini as well, because it is still the fastest option available. This implementation fixes the following incompatibilities with the REXML backend.

Parsing CDATA elements works.

The previous version simply ignored all CDATA blocks, e.g.:

<tag><![CDATA[foo "bar" & <baz>]]></tag>

Content that is mixed with tags is also parsed correctly.

<tag> text <other-tag /> more text </tag>

Rails's test suite now runs fine with this backend set as the default, and all equality tests also pass (see http://github.com/stepheneb/rails_hash_from_xml). Moreover, the code is more readable (IMHO), and I added specific tests for the issues the old version of the backend had.
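
For reference, the two cases above can be checked directly against REXML, the backend used as the baseline for compatibility. This is only a minimal sketch: the helper name `element_text` is made up for this example and is not part of XmlMini, and the snippet reads the text straight from the parsed document rather than going through any XmlMini backend.

```ruby
require "rexml/document"

# Hypothetical helper (not part of XmlMini): concatenate every text and CDATA
# child of the root element, in document order.
def element_text(xml)
  root = REXML::Document.new(xml).root
  # REXML::CData is a subclass of REXML::Text, so #texts returns both kinds.
  root.texts.map(&:value).join
end

cdata = element_text('<tag><![CDATA[foo "bar" & <baz>]]></tag>')
mixed = element_text('<tag> text <other-tag /> more text </tag>')

puts cdata.inspect
puts mixed.inspect
```

A backend that matches REXML should see the CDATA payload as ordinary content and should keep both text fragments around `<other-tag />`.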

Performance benchmark result

Still the fastest backend. See http://github.com/stepheneb/rails_hash_from_xml for the benchmark tool.

Rehearsal ---------------------------------------------
rexml      24.300000   0.110000  24.410000 ( 24.408301)
libxml      0.970000   0.020000   0.990000 (  0.981194)
nokogiri   10.740000   0.060000  10.800000 ( 10.805523)
----------------------------------- total: 37.290000sec

                user     system      total        real
rexml      23.430000   0.050000  23.480000 ( 23.468661)
libxml      0.490000   0.000000   0.490000 (  0.490357)
nokogiri   10.710000   0.020000  10.730000 ( 10.734065)
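
The layout above (a "Rehearsal" pass followed by the real measurements) is what Ruby's standard `Benchmark.bmbm` prints. The sketch below shows how a table of that shape can be produced. It is not the benchmark tool linked above: the sample document, the iteration count and the `backends` hash are assumptions for illustration, and it times raw document parsing rather than `Hash.from_xml`. The LibXML and Nokogiri rows only appear if those gems happen to be installed.

```ruby
require "benchmark"
require "rexml/document"

# Hypothetical input document: 200 small <product> elements.
xml = "<products>" +
      ('<product type="file"><name>x</name></product>' * 200) +
      "</products>"

# Label => parser. REXML is always available; the others are optional.
backends = { "rexml" => ->(doc) { REXML::Document.new(doc) } }

begin
  require "libxml"
  backends["libxml"] = ->(doc) { LibXML::XML::Parser.string(doc).parse }
rescue LoadError
  # libxml-ruby is not installed; skip this row.
end

begin
  require "nokogiri"
  backends["nokogiri"] = ->(doc) { Nokogiri::XML(doc) }
rescue LoadError
  # nokogiri is not installed; skip this row.
end

iterations = 20 # arbitrary, kept small so the sketch finishes quickly

# bmbm runs every report twice: once as a rehearsal, then for the real timing.
results = Benchmark.bmbm(10) do |bm|
  backends.each do |label, parse|
    bm.report(label) { iterations.times { parse.call(xml) } }
  end
end
```

The rehearsal pass exists to warm up the process before the measured run, which is why the second table is the one to compare.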

Comments and changes to this ticket

  • Jeremy Kemper

    Jeremy Kemper December 31st, 2009 @ 08:18 PM

    • State changed from “new” to “open”
    • Milestone set to 2.3.6

    Nice work. This pulls the implementation out of sync with master, however. Could you patch master instead and backport to 2-3-stable?

  • Willem van Bergen

    Willem van Bergen January 1st, 2010 @ 11:26 AM

    When I finally set up a working Rails 3 environment, I found out that most of these bugs have already been fixed in Rails 3. However, these fixes have reduced the performance of the backend quite a bit. Moreover, I did find a new incompatibility with the REXML backend: Whitespace in the content of a tag is not preserved correctly when the tag has attributes, e.g.:

    <product type="file">    </product>
    

    I found that the Nokogiri backend has the same issue. I have attached a new patch, which fixes this issue in both backends. It actually is pretty much a rewrite of both backends. Except for some small API differences between LibXML and Nokogiri, both backends now use the exact same code to build the hash. This makes it easier to fix future bugs in both backends at the same time. Moreover, in both cases, the code is cleaner than the previous versions, and for both backends the performance is significantly improved.
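
As a rough illustration of the shared hash-building idea described above, here is a minimal Ruby sketch. It is not the code from the attached patch: `HashBuilder`, `build_hash`, `FakeElement`, `FakeText` and the rule for dropping blank content are all assumptions made for this example, and the structs stand in for the node classes that LibXML and Nokogiri would provide.

```ruby
CONTENT_KEY = "__content__" # key for text content, assumed for this sketch

# Hypothetical mixin: any node class that provides #name, #attributes (a Hash),
# #children and #element? can reuse the same hash-building code.
module HashBuilder
  def build_hash(parent = {})
    node_hash = {}

    # Repeated sibling tags are collected into an array.
    case parent[name]
    when Array then parent[name] << node_hash
    when Hash  then parent[name] = [parent[name], node_hash]
    when nil   then parent[name] = node_hash
    end

    children.each do |child|
      if child.element?
        child.build_hash(node_hash)
      else
        (node_hash[CONTENT_KEY] ||= +"") << child.text
      end
    end

    # Assumed rule: drop blank content only when child tags are present.
    if node_hash.length > 1 && node_hash[CONTENT_KEY].to_s.strip.empty?
      node_hash.delete(CONTENT_KEY)
    end

    # Attributes are merged last, so they cannot influence the blank check.
    attributes.each { |key, value| node_hash[key] = value }
    parent
  end
end

# Stand-ins for a backend's element and text nodes.
FakeElement = Struct.new(:name, :attributes, :children) do
  include HashBuilder

  def element?
    true
  end
end

FakeText = Struct.new(:text) do
  def element?
    false
  end
end

# <product type="file">    </product>
whitespace_only =
  FakeElement.new("product", { "type" => "file" }, [FakeText.new("    ")]).build_hash

# <tag> text <other-tag /> more text </tag>
mixed = FakeElement.new(
  "tag", {},
  [FakeText.new(" text "), FakeElement.new("other-tag", {}, []), FakeText.new(" more text ")]
).build_hash

p whitespace_only
p mixed
```

With one mixin doing the work, a fix to the hash-building rules only has to be made once, and each backend is left with a thin adapter over its own node API.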

    Performance

    Running the performance tests on the master branch:

                   user     system      total        real
    REXML     17.200000   0.080000  17.280000 ( 17.327603)
    LibXML     2.100000   0.110000   2.210000 (  2.218409)
    Nokogiri   5.290000   0.030000   5.320000 (  5.343453)
    

    Now, with my patch applied:

                    user     system      total        real
    rexml      24.170000   0.110000  24.280000 ( 24.334927)
    libxml      0.670000   0.000000   0.670000 (  0.674913)
    nokogiri    2.390000   0.020000   2.410000 (  2.417661)
    
  • Willem van Bergen

    Willem van Bergen January 1st, 2010 @ 11:28 AM

    • Tag changed from activesupport, backend, libxml, patch, xmlmini to activesupport, backend, libxml, nokogiri, patch, xmlmini
    • Title changed from “[PATCH] XmlMini - Fixed LibXML backend issues” to “[PATCH] XmlMini - Fixed bugs and improved speed of LibXML and Nokogiri backend”

    The last patch was for the master branch. I backported the changes to the 2-3-stable branch, basically by copy-pasting the relevant bits of the code and tests. See the attachment. Is there a smarter way to do this with git?

  • Willem van Bergen

    Willem van Bergen January 1st, 2010 @ 11:35 AM

    Note that I also wrote a SAX-based backend using Nokogiri, which is even faster. See my other ticket at https://rails.lighthouseapp.com/projects/8994/tickets/3636. But let's first fix the current backends before adding new ones :-)

  • Willem van Bergen

    Willem van Bergen January 1st, 2010 @ 01:13 PM

    I just noticed that I ran the benchmark with my patch applied on my less powerful notebook (compare the REXML numbers in the two tables above). So, compared to the master branch, the performance with the patch applied is even more impressive :-)

                   user     system      total        real
    REXML     17.420000   0.080000  17.500000 ( 17.617890)
    LibXML     0.520000   0.000000   0.520000 (  0.535704)
    Nokogiri   1.840000   0.020000   1.860000 (  1.871491)
    
  • Aaron Patterson
  • Jeremy Kemper

    Jeremy Kemper January 1st, 2010 @ 09:15 PM

    Thanks Willem! To backport quickly, you can checkout 2-3-stable then cherry-pick the revision you made to master.

  • Repository

    Repository January 1st, 2010 @ 09:19 PM

    (from [12f6fd0f2687f083bc23ad63fdc82c7e65cb8984]) Bugfixes, speed improvements and code cleanup for Nokogiri's and LibXML's XmlMini backend

    [#3641]

    Signed-off-by: Jeremy Kemper jeremy@bitsweat.net
    http://github.com/rails/rails/commit/12f6fd0f2687f083bc23ad63fdc82c...

  • Repository

    Repository January 1st, 2010 @ 09:19 PM

    • State changed from “open” to “committed”

    (from [34b03cebf9c9f2ce2a53511a4b2160485e270f12]) Code cleanup, bugfixes and speed improvements for the Nokogiri and LibXML XmlMini backends

    [#3641 state:committed]

    Signed-off-by: Jeremy Kemper jeremy@bitsweat.net
    http://github.com/rails/rails/commit/34b03cebf9c9f2ce2a53511a4b2160...

  • Stephen Bannasch

    Stephen Bannasch January 2nd, 2010 @ 06:47 AM

    I updated my benchmark and testing suite here for Rails 3 master: http://github.com/stepheneb/rails_hash_from_xml.

    All my tests comparing REXML equality pass on both the Rails 3 master and 2-3-stable branches.

    Running in MRI I compare Nokogiri and libxml-ruby to REXML.

    In JRuby I compare the JDOM implementation to REXML.

    I also tested the FFI version of Nokogiri in JRuby but all of the tests fail.

    It's very nice to see the Nokogiri speedups on MRI. Here are my results:

    # Ruby 1.8.6 (2008-08-11 rev 287), platform: universal-darwin9.0
    
    $ ruby -I$RAILS_SOURCE/activesupport/lib bench_hash_from_xml.rb
    
                   user     system      total        real
    REXML     10.960000   0.270000  11.230000 ( 11.464308)
    Nokogiri   1.230000   0.020000   1.250000 (  1.256476)
    LibXML     0.430000   0.000000   0.430000 (  0.434689)
    

    There are also some nice speedups in JRuby for both REXML and JDOM. Nokogiri (FFI) is quite slow, however (one reason is that Nokogiri in JRuby requires ObjectSpace to be turned on).

    $ jruby --server -I$RAILS_SOURCE/activesupport/lib bench_hash_from_xml.rb 2
    
    # OpenJDK Server VM 1.7.0-internal:
    
                   user     system      total        real
    REXML      3.448000   0.000000   3.448000 (  3.448000)
    JDOM       0.761000   0.000000   0.761000 (  0.761000)
    Nokogiri   6.083000   0.000000   6.083000 (  6.082000)
    

    For more details see: http://github.com/stepheneb/rails_hash_from_xml/blob/master/readme.txt


