Re: HTML Parser: Which one is better?



On 2007-05-31 02:36:57 -0700, "Richard Conroy" <richard.conroy@xxxxxxxxx> said:

On 5/31/07, *** Davies <rasputnik@xxxxxxxxx> wrote:
Hpricot is a good starting point.

Yeah, Hpricot is good, but in general the quality of the Ruby web-scraping
choices is pretty impressive. There are variants that are built on top
of Hpricot but provide an even simpler API.

However, your second problem, where you encounter alternate encodings,
is a bit trickier. To do any kind of real work with multiple code
pages, you want to convert everything to Unicode (UTF-8) at fetch time.


I've had great success with this. Just make sure you're using a recent version of Ruby (1.8.5 or later) that includes the NKF library, and you should be fine.
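
For what it's worth, here is a minimal sketch of the "convert at fetch time" idea using NKF. The helper name to_utf8 and the sample bytes are just my own illustration, not anything from Hpricot. Keep in mind that NKF is aimed at Japanese encodings (Shift_JIS, EUC-JP, ISO-2022-JP), so for other code pages you'd probably want something like Iconv instead.

```ruby
require 'nkf'

# Hypothetical helper: normalize a freshly fetched page body to UTF-8.
#   -w   emit UTF-8 (without a BOM)
#   -m0  turn off MIME decoding so the body is not altered in other ways
# With no input flag given, NKF guesses the source encoding itself.
def to_utf8(raw)
  NKF.nkf('-w -m0', raw)
end

# Stand-in for a fetched body: the Shift_JIS bytes for "日本語".
sjis_body = [0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA].pack('C*')

utf8_body = to_utf8(sjis_body)
puts utf8_body
```

Once everything is UTF-8, you can hand the string to the parser and not worry about the original code page again. If you already know the source encoding, passing an explicit input flag (for example -S for Shift_JIS) avoids relying on the guess.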
