Hey, here’s a fun one. Just last week cdcarter needed to scrape and Rubyify the SciFi channel’s listings. (Wow, those guys really like tables, huh? And nondescript markup. And PHP3. (PHP3 was sooo the best.))
Quickly carter and I dusted off Hpricot and, with it, scraped the hell out of the listing page. We then turned each listing into a Show object, easy. With OpenStruct.
%w[open-uri rubygems hpricot ostruct].each { |f| require f }

class Show < OpenStruct
  LISTINGS = 'http://www.scifi.com/schedulebot/index.php3?feed_req=US:Central:E'

  def to_s
    "#{time}: #{title}" << (program ? " [#{program}]" : '')
  end

  def self.find_all_from_today
    shows = []
    doc = Hpricot open(LISTINGS)
    tds = (doc/:td).select { |td| td.respond_to?(:[]) && td['class'] == 'text' }
    tds.each_with_index do |td, i|
      next unless td.innerHTML =~ /:.+(AM|PM)/
      time = td.innerHTML
      program = tds[i+1].innerHTML.gsub(/<a.+>(.+)<\/a>/, '\1')
      title = tds[i+2].innerHTML
      shows << new(:time => time, :program => program, :title => title)
    end
    shows
  end
end

# print all found shows for today
Show.find_all_from_today.each { |show| puts show.to_s }
Run it. I get something like this:
5:00 AM: [PAID PROGRAMMING]
7:00 AM: SHADOW PLAY [TWILIGHT ZONE, THE]
7:30 AM: BLACK MARKET [BATTLESTAR GALACTICA (SEASON 2)]
8:30 AM: SCAR [BATTLESTAR GALACTICA (SEASON 2)]
9:30 AM: SACRIFICE [BATTLESTAR GALACTICA (SEASON 2)]
...
Way cool (even though I’m dying to slip in some returning action). You can imagine how this might be expanded into a nice little pirate RSS feed or something.
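If you want to poke at the OpenStruct half without hitting the network, here is a minimal sketch. It reuses the to_s from the Show class above and feeds it hand-written hashes instead of scraped cells. The channel attribute and the MYSTERY HOUR listing are made up purely for illustration.

```ruby
require 'ostruct'

# Same shape as the Show class above, minus the scraping.
class Show < OpenStruct
  def to_s
    "#{time}: #{title}" << (program ? " [#{program}]" : '')
  end
end

# Hand-written sample data (borrowed from the output above), not a live scrape.
show = Show.new(:time => '7:00 AM', :program => 'TWILIGHT ZONE, THE', :title => 'SHADOW PLAY')
puts show.to_s

# OpenStruct grows attributes on demand; `channel` is a hypothetical extra field.
show.channel = 'SciFi'
puts show.channel

# Attributes that were never set come back nil, which is why to_s can test `program`.
# This listing is invented for the example.
bare = Show.new(:time => '5:00 AM', :title => 'MYSTERY HOUR')
puts bare.to_s
```

No attribute list to declare anywhere: whatever keys you hand to new become readers and writers, which is handy when you are still figuring out what the page actually contains.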
Any more cool Struct or OpenStruct uses floating around out there? Jay Fields has done lots of messin’ with OpenStruct and kindly sprinkles a few write-ups throughout his blog. How’s about yous?
Nice article. I’ve been trying to find some “excuse” to try Hpricot. This gives me an idea for stats on nfl.com.
Thanks for giving me the mention :) I think I’m gonna turn the Buggy code to use OpenStruct. It’s so clever. So many classes could just inherit from it, like a superjavabean.
I started playing with Hpricot just today! Me likeyy.
What is the advantage of using OpenStruct in this case? I can see why the opts_parse module uses it, because the methods vary based on program options, but you are just using three methods: time, program, title. No?
I used Hpricot on my twitter gem and absolutely fell in love. Wouldn’t have thought of it without this article. Unfortunately, using OpenStruct slipped my mind. That is a really nice touch. Glad I revisited this.