Message-ID: <CA+E3k90sMhdh25x+=xH20VihODoboiOfkQGPdzFje7An5o2nvQ@mail.gmail.com>
Date: Tue, 11 Sep 2012 11:45:51 -0800
From: Royce Williams <royce@...ho.org>
To: john-users@...ts.openwall.com
Subject: screen scraper recommendation (was: re: Passphrase Creation)

On Tue, Sep 11, 2012 at 10:45 AM, Matt Weir <cweir@...edu> wrote:
> Scraping web content is pretty much a universal problem from everyone
> I've talked to, (though I'll admit my code tends to achieve a certain
> level of bugginess above and beyond the norm ;). I have to imagine
> there's been a lot of work/research/tools developed to do this for
> other problems besides password cracking though. It might be useful
> for one of us to look into existing solutions rather than re-invent
> the wheel of developing our own scrapers.

Semi-OT, since CDDB data is available in text form -- but for folks
trying to solve similar screen-scrape problems, indeed, reinventing
the wheel is not necessary.

For Python, lxml and BeautifulSoup are both good -- each with pros and
cons, especially in terms of handling invalid markup.  Both will clean
up the markup as best they can to make parsing work better.

I've worked more with BeautifulSoup for a couple of projects.  It lets
you do the real-world queries necessary to do useful screen-scraping
-- things like "Give me the link text from each second href that
follows each h2 that matches this regex", etc.  In my first
BeautifulSoup exposure, I went from knowing zero Python to a usable
screen scrape with ~95% conversion success for slurping in 30K records
from a 300+ page set of HTML in about eight effort-hours.
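
As a rough sketch of that kind of query (this is not the code from the
project above; the sample HTML, the "Album:" pattern, and the function
name second_link_texts are all made up for illustration, and it assumes
the bs4 package used with Python's built-in "html.parser"):

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page. The <li> and <p> tags are deliberately left
# unclosed to mimic the sloppy markup that real pages tend to have.
HTML = """
<html><body>
<h2>Album: First Light</h2>
<ul>
  <li><a href="/a/1">Intro</a>
  <li><a href="/a/2">Morning Song</a>
</ul>
<h2>News</h2>
<p><a href="/n/1">Site update</a> <a href="/n/2">Old news</a>
<h2>Album: Night Drive</h2>
<p><a href="/b/1">Ignition</a> and <a href="/b/2">Highway Lights</a>
</body></html>
"""


def second_link_texts(html, heading_pattern):
    """Return the text of the second link after each matching h2."""
    soup = BeautifulSoup(html, "html.parser")
    regex = re.compile(heading_pattern)
    results = []
    for h2 in soup.find_all("h2"):
        if not regex.search(h2.get_text()):
            continue
        # Walk forward in document order, collecting links until the
        # next h2 starts a new section.
        links = []
        for node in h2.find_all_next(["h2", "a"]):
            if node.name == "h2":
                break
            if node.has_attr("href"):
                links.append(node)
        if len(links) >= 2:
            results.append(links[1].get_text(strip=True))
    return results


if __name__ == "__main__":
    print(second_link_texts(HTML, r"^Album:"))
```

With the sample above this should print ['Morning Song', 'Highway Lights'];
the "News" section is skipped because its heading does not match the
regex.  Walking forward with find_all_next() rather than looking only at
siblings is a deliberate choice here, since with unclosed tags the links
may end up nested somewhere other than where the page author intended.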

Both are well covered by questions and answers on StackExchange sites,
and both have good docs:

http://www.crummy.com/software/BeautifulSoup/
http://lxml.de/

Royce
