Err the Blog
Rubyisms and Railities
  • “Parse XML with Hpricot”
    – PJ on January 08, 2007

    Given a piece of XML:

    <Export>
      <Product>
        <SKU>403276</SKU>
        <ItemName>Trivet</ItemName>
        <CollectionNo>0</CollectionNo>
        <Pages>0</Pages>
      </Product>
    </Export>
    

    One might assume that REXML is the way to parse it, but we all know how slow it is.
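
    For comparison, here is a rough sketch of what the REXML route might look like for the XML above. The method name `parse_products` and the inlined `sample_xml` string are made up for illustration, and the result is a plain hash per product rather than a saved model.

```ruby
require 'rexml/document'

# The sample XML from the post, inlined so the sketch is self-contained.
sample_xml = <<~DATA
  <Export>
    <Product>
      <SKU>403276</SKU>
      <ItemName>Trivet</ItemName>
      <CollectionNo>0</CollectionNo>
      <Pages>0</Pages>
    </Product>
  </Export>
DATA

# Hypothetical helper: returns one hash per <Product> element,
# keyed by the field names listed in `fields`.
def parse_products(xml, fields = %w[SKU ItemName CollectionNo Pages])
  doc = REXML::Document.new(xml)
  products = []
  doc.elements.each('Export/Product') do |node|
    row = {}
    # Read the text content of each child element by name.
    fields.each { |field| row[field] = node.elements[field].text }
    products << row
  end
  products
end

parse_products(sample_xml).each { |row| puts row.inspect }
```

    This is only a baseline to measure against; it says nothing about speed by itself.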

    Enter _why’s HTML parser, Hpricot. It’s written in C, and since XHTML is itself XML, there’s no reason it shouldn’t be able to parse my file.

    Turns out it can: it’s really fast, and the code is dead simple.

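    require 'hpricot'

    # Child elements to copy from each <Product> node.
    FIELDS = %w[SKU ItemName CollectionNo Pages]

    # Product is not defined in this snippet; it is assumed to be a model
    # (e.g. ActiveRecord) that supports []= and save.
    doc = Hpricot.parse(File.read("my.xml"))
    (doc/:product).each do |xml_product|
      product = Product.new
      FIELDS.each do |field|
        # Take the first matching child and copy its inner text.
        product[field] = (xml_product/field.intern).first.innerHTML
      end
      product.save
    end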
    

    Update: Slight refactoring of the code above. Chris figured out last night that you can use innerHTML, which eliminated the only ugly part of the code.

  • J. Weir, 3 months later:

    Works great, replaced a project which was using REXML with Hpricot. The tests used to take 4.6 seconds, now they take 0.6. Big improvement.

  • piggybox, 2 months later:

    That saves a ton of time indeed. Thanks.

  • edvardg, 6 months later:

    I have found an error: if you have a tag in your file (as in the KML file type), then it is not recognized. Perhaps because this is markup for CSS style in an HTML document?

  • Nicholas, about 1 year later:

    Hi, it is really good, but when I run this code I get an error saying “uninitialized constant Product (NameError)”.

    How to solve this?

    -Nicholas I

  • Michael Johnston, about 1 year later:

    @Nicholas I

    you can solve that error by putting the following code in first:

    class Product;def initialize;puts "Clue Missing";end;end;
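
    For anyone hitting the same NameError: the post never defines Product. It is presumably a model from a Rails app (an assumption; the post does not say). To try the parsing loop outside Rails, a minimal stand-in along these lines might do. Everything in it is hypothetical, including the `saved?` helper, and `save` only flips a flag instead of touching a database.

```ruby
# Hypothetical stand-in for the post's Product model; not ActiveRecord.
class Product
  attr_reader :attributes

  def initialize
    @attributes = {}
    @saved = false
  end

  # Lets the loop in the post write product[field] = value.
  def []=(field, value)
    @attributes[field.to_s] = value
  end

  def [](field)
    @attributes[field.to_s]
  end

  # A real model would write to the database here; this only sets a flag.
  def save
    @saved = true
  end

  def saved?
    @saved
  end
end

product = Product.new
product["SKU"] = "403276"
product.save
puts product["SKU"]
```

    Define it before the parsing loop runs so the constant lookup succeeds.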


This is Err, the weblog of PJ Hyett and Chris Wanstrath.
All original content copyright ©2006-2008 the aforementioned.