

LINGUIST List 23.1026

Wed Feb 29 2012

Qs: Looking for a Web Crawler for Corpus Analysis

Editor for this issue: Zac Smith <zaclinguistlist.org>


We'd like to remind readers that the responses to queries are usually best posted to the individual asking the question. That individual is then strongly encouraged to post a summary to the list. This policy was instituted to help control the huge volume of mail on LINGUIST; so we would appreciate your cooperating with it whenever it seems appropriate.

In addition to posting a summary, we'd like to remind people that it is usually a good idea to personally thank those individuals who have taken the trouble to respond to the query.

To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.cfm.
Date: 23-Feb-2012
From: Ana Popescu <discusurihotmail.com>
Subject: Looking for a Web Crawler for Corpus Analysis

Dear All,

I would like to know whether there is a web crawler that can download websites
in text format. I have a list of approx. 100 links from which I would like
to collect the text (not the PDFs), and then take the .txt files and run
them through a concordancing programme (I already have WordSmith and another one).
I want to be able to get word lists, but also to be able to find the file
in which a particular word originates; this is why I need the text files.

I am interested in a crawler suitable for OS/Windows. Also, I want the
crawler to be able to download the sites recursively, if asked to do so.

I have found various free programs of this kind on the Internet, but they
don't do everything I need.

Thanks.

Linguistic Field(s): Computational Linguistics
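
The workflow described in the query (fetch each URL from a list, skip non-HTML files such as PDFs, save one .txt file per page so that a word can be traced back to its source, and optionally follow links recursively) might be sketched roughly as follows. This is only a minimal illustration in Python using the standard library, not a recommendation of any particular tool. All names (TextAndLinks, parse_page, url_to_filename, fetch, crawl, max_depth, fetcher) are hypothetical, and the sketch assumes that staying on the same host and limiting the link depth is an acceptable reading of "recursively".

```python
import re
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urljoin, urlparse


class TextAndLinks(HTMLParser):
    """Collect visible text and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1          # ignore code and stylesheets
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def parse_page(html, base_url):
    """Return (plain_text, absolute_links) for an HTML string."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    # Resolve relative links and drop "#fragment" parts.
    links = [urldefrag(urljoin(base_url, h))[0] for h in parser.links]
    return "\n".join(parser.chunks), links


def url_to_filename(url):
    """Turn a URL into a safe .txt file name that still identifies the page."""
    return re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".txt"


def fetch(url):
    """Download a page; return None on errors or non-HTML content (e.g. PDFs)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            if "html" not in resp.headers.get_content_type():
                return None
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except (OSError, ValueError):
        return None


def crawl(start_urls, out_dir, max_depth=0, fetcher=fetch):
    """Save each page as text; follow same-host links up to max_depth."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen, queue = set(), [(u, 0) for u in start_urls]
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetcher(url)
        if html is None:
            continue
        text, links = parse_page(html, url)
        (out / url_to_filename(url)).write_text(text, encoding="utf-8")
        if depth < max_depth:
            host = urlparse(url).netloc
            queue += [(l, depth + 1) for l in links if urlparse(l).netloc == host]
    return seen
```

With the list of links read into start_urls, a call such as crawl(start_urls, "corpus", max_depth=1) would write one text file per page into a "corpus" folder; because each file name is derived from the page's URL, a word found by the concordancer can be traced back to its page. The resulting .txt files could then be loaded into a concordancing programme.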




Page Updated: 29-Feb-2012

Supported in part by the National Science Foundation
While the LINGUIST List makes every effort to ensure the linguistic relevance of sites listed on its pages, it cannot vouch for their contents.