Message-ID: <CA+E3k90sMhdh25x+=xH20VihODoboiOfkQGPdzFje7An5o2nvQ@mail.gmail.com>
Date: Tue, 11 Sep 2012 11:45:51 -0800
From: Royce Williams <royce@...ho.org>
To: john-users@...ts.openwall.com
Subject: screen scraper recommendation (was: re: Passphrase Creation)

On Tue, Sep 11, 2012 at 10:45 AM, Matt Weir <cweir@...edu> wrote:
> Scraping web content is pretty much a universal problem for everyone
> I've talked to (though I'll admit my code tends to achieve a certain
> level of bugginess above and beyond the norm ;). I have to imagine
> there's been a lot of work/research/tools developed to do this for
> other problems besides password cracking, though. It might be useful
> for one of us to look into existing solutions rather than re-invent
> the wheel by developing our own scrapers.

Semi-OT, since CDDB data is available in text form -- but for folks
trying to solve similar screen-scrape problems, indeed, reinventing
the wheel is not necessary.

For Python, lxml and BeautifulSoup are both good -- each with pros and
cons, especially in how they handle invalid markup. Both will clean up
the markup as best they can to make parsing work better.

I've worked more with BeautifulSoup, on a couple of projects. It lets
you express the real-world queries that useful screen-scraping needs --
things like "give me the link text from the second href that follows
each h2 matching this regex" (a rough sketch of that kind of query is
below my sig). In my first exposure to BeautifulSoup, I went from
knowing zero Python to a usable screen scrape -- slurping in 30K
records from a 300+ page set of HTML with ~95% conversion success --
in about eight effort-hours.

Both have good questions and answers on the Stack Exchange sites, and
both have good docs:

http://www.crummy.com/software/BeautifulSoup/
http://lxml.de/

Royce
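
P.S. For anyone who wants a concrete starting point, here is a rough
sketch of that kind of query using BeautifulSoup 4. The URL, heading
regex, and page layout are made-up placeholders for illustration, not
the actual CDDB data set:

# Sketch: for every <h2> whose text matches a regex, grab the second
# link that follows it and print its link text and href.
# Assumes BeautifulSoup 4 (pip install beautifulsoup4); URL, regex,
# and tag structure below are hypothetical.

import re
import urllib.request

from bs4 import BeautifulSoup

URL = "http://example.com/albums.html"      # placeholder page
HEADING_PATTERN = re.compile(r"Disc \d+")   # placeholder heading regex

html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")   # the parser repairs sloppy markup

# string= matches <h2> tags whose (plain-text) contents match the regex.
for heading in soup.find_all("h2", string=HEADING_PATTERN):
    # find_all_next() walks the document after this <h2>;
    # limit=2 stops once the first two <a> tags have been found.
    links = heading.find_all_next("a", limit=2)
    if len(links) == 2:
        second = links[1]
        print(second.get_text(strip=True), second.get("href"))

find_all_next() is what makes "the Nth thing after this tag" queries
easy in BeautifulSoup; lxml can express the same idea with XPath's
following:: axis.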