Err the Blog
Rubyisms and Railities
  • “Parse XML with Hpricot”
    – PJ on July 31, 2006

    Given a piece of XML:

    <Export>
      <Product>
        <SKU>403276</SKU>
        <ItemName>Trivet</ItemName>
        <CollectionNo>0</CollectionNo>
        <Pages>0</Pages>
      </Product>
    </Export>
    

    One might assume that REXML is the way to parse it, but we all know how slow it is.
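
    For reference, the REXML route might look something like the sketch below. It is only an illustration: the XML is inlined as a string instead of read from a file, and each row goes into a plain hash (the `products` array is a stand-in, not a real model).

```ruby
require 'rexml/document'

FIELDS = %w[SKU ItemName CollectionNo Pages]

# The sample export, inlined so the sketch is self-contained.
xml = <<-XML
<Export>
  <Product>
    <SKU>403276</SKU>
    <ItemName>Trivet</ItemName>
    <CollectionNo>0</CollectionNo>
    <Pages>0</Pages>
  </Product>
</Export>
XML

doc = REXML::Document.new(xml)

# Collect one hash per <Product> element, keyed by field name.
products = []
doc.elements.each("Export/Product") do |xml_product|
  product = {}
  FIELDS.each do |field|
    product[field] = xml_product.elements[field].text
  end
  products << product
end
```

    Every value comes back as a string, so anything numeric would still need converting before it is stored.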

    Enter _why’s HTML parser, Hpricot. It’s written in C, and since XHTML is itself XML, there’s no reason a parser that copes with XHTML shouldn’t be able to handle my file.

    Turns out it can, it’s really fast, and the code is dead simple.

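    require 'hpricot'

    FIELDS = %w[SKU ItemName CollectionNo Pages]

    # Product is assumed to be a model defined in your own app.
    doc = Hpricot.parse(File.read("my.xml"))
    (doc/:product).each do |xml_product|
      product = Product.new
      FIELDS.each do |field|
        # Copy the contents of each child element into the matching attribute.
        product[field] = (xml_product/field.intern).first.innerHTML
      end
      product.save
    end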
    

    Update: Slight refactoring of the code above. Chris figured out last night that you can use innerHTML which eliminated the only ugly part of the code.

  • J. Weir, about 1 month later:

    Works great. I switched a project from REXML to Hpricot, and the tests that used to take 4.6 seconds now take 0.6. Big improvement.

  • piggybox, 3 months later:

    That saves a ton of time indeed. Thanks.

  • edvardg, 12 months later:

    I have found an error: if you have a style tag in your file (as in the KML file type), it is not recognized. Perhaps because this is the markup for CSS styles in an HTML document?

  • Nicholas, about 1 year later:

    Hi, it is really good, but when I run this code I get an error saying “uninitialized constant Product (NameError)”.

    How do I solve this?

    -Nicholas I

  • Four people have commented.
This is Err, the weblog of PJ Hyett and Chris Wanstrath.
All original content copyright ©2006-2008 the aforementioned.