LINGUIST List 23.1026

Wed Feb 29 2012

Qs: Looking for a Web Crawler for Corpus Analysis

Editor for this issue: Zac Smith <zaclinguistlist.org>



Date: 23-Feb-2012
From: Ana Popescu <discusurihotmail.com>
Subject: Looking for a Web Crawler for Corpus Analysis

Dear All,

I would like to know if there is a web crawler that could download websites in text format. I have a list of approx. 100 links from which I would like to collect the text (not the PDFs), and then take the .txt files and run them through a concordancing programme (I already have WordSmith and another one). I want to be able to get word lists, but also to be able to find the file where a particular word originates; this is why I need the text files.

I am interested in a crawler suitable for OS/Windows. Also, I want the crawler to be able to download the sites recursively, if asked to do so.

I have found different free software of this kind on the Internet but they don't do everything I need.
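
To make the requirement concrete, here is a rough Python sketch of the kind of behaviour I have in mind, using only the standard library. It is an illustration under assumptions, not a finished crawler: the function names, the "corpus" output folder and the file-naming scheme are all hypothetical, it does not follow links recursively, and it simply skips anything that is not served as HTML (such as PDFs). Each link ends up in its own .txt file named after the URL, so a word found in a concordance can be traced back to its source page.

```python
import re
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def url_to_filename(url):
    """Hypothetical naming scheme: keep the URL recognisable in the file name."""
    name = re.sub(r"^https?://", "", url)
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return name + ".txt"


def save_pages(urls, out_dir="corpus"):
    """Download each URL and write its text to out_dir (no recursion)."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        with urllib.request.urlopen(url, timeout=30) as response:
            if response.headers.get_content_type() != "text/html":
                continue  # assumption: anything that is not HTML is skipped
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
        (out / url_to_filename(url)).write_text(html_to_text(html), encoding="utf-8")
```

Calling save_pages with the list of links would fill the folder with one text file per page, which could then be loaded into the concordancer. What I am still missing is a tool that does this reliably and can also follow links within a site.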

Thanks.

Linguistic Field(s): Computational Linguistics

Page Updated: 29-Feb-2012