Query Details

Query Subject:   Looking for a Web Crawler for Corpus Analysis
Author:   Ana Popescu

Linguistic Field(s):  Computational Linguistics

Query:   Dear All,

I would like to know if there is a web crawler that can download websites
in text format. I have a list of approx. 100 links from which I would like
to collect the text (not the PDFs), and then take the .txt files and run
them through a concordance programme (I already have WordSmith and another
one). I want to be able to get word lists, but also to find the file in
which a particular word originates; this is why I need the text files.

I am interested in a crawler suitable for OS/Windows. Also, I want the
crawler to be able to download the sites recursively, if asked to do so.

I have found various free programs of this kind on the Internet, but none
of them does everything I need.
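
For illustration only, the task described above (fetch each link, keep the text, write one .txt file per page, optionally follow links recursively) could be sketched in Python using only the standard library. This is a rough sketch under stated assumptions, not a recommendation of a specific tool: the names TextAndLinks, parse_page, filename_for and crawl are hypothetical, pages are assumed to be UTF-8 HTML, links ending in .pdf are skipped, and recursion is limited to the same host.

```python
import re
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class TextAndLinks(HTMLParser):
    """Collect visible text and href targets from one HTML page."""

    SKIP = {"script", "style"}  # content of these tags is not page text

    def __init__(self):
        super().__init__()
        self.chunks, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def parse_page(html, base_url):
    """Return (plain_text, absolute_links) for an HTML string."""
    parser = TextAndLinks()
    parser.feed(html)
    # Resolve relative links and drop "#fragment" parts.
    links = [urldefrag(urljoin(base_url, h)).url for h in parser.links]
    return "\n".join(parser.chunks), links


def filename_for(url):
    """Derive a .txt name from the URL so each file can be traced to its page."""
    parsed = urlparse(url)
    stem = re.sub(r"[^A-Za-z0-9]+", "_", parsed.netloc + parsed.path).strip("_")
    return (stem or "page") + ".txt"


def crawl(urls, out_dir, max_depth=0):
    """Save each page as text; follow same-host links up to max_depth."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen, queue = set(), [(u, 0) for u in urls]
    while queue:
        url, depth = queue.pop(0)
        if url in seen or url.lower().endswith(".pdf"):
            continue
        seen.add(url)
        try:
            with urlopen(url, timeout=30) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip unreachable or malformed URLs
        text, links = parse_page(html, url)
        (out / filename_for(url)).write_text(text, encoding="utf-8")
        if depth < max_depth:
            host = urlparse(url).netloc
            queue += [(l, depth + 1) for l in links if urlparse(l).netloc == host]
```

Calling crawl(list_of_links, "corpus", max_depth=0) would save only the listed pages, while max_depth=1 would also follow the links found on them. Because each file name is built from the page address, a word found by the concordancer can be traced back to its source page. Query strings are ignored when naming files, so two addresses that differ only in their query string would overwrite each other in this sketch.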

LL Issue: 23.1026
Date posted: 29-Feb-2012

