Query Details

Query Subject:   Looking for a Web Crawler for Corpus Analysis
Author:   Ana Popescu

Linguistic Field(s):   Computational Linguistics

Query:   Dear All,

I would like to know if there is a web crawler that can download websites
in text format. I have a list of approx. 100 links from which I would like
to collect the text (not the PDFs), and I would then run the resulting .txt
files through a concordancing programme (I already have WordSmith and
another one). I want to be able to get word lists, but also to find the file
in which a particular word occurs, which is why I need the text files.

I am interested in a crawler suitable for OS/Windows. Also, I want the
crawler to be able to download the sites recursively, if asked to do so.

I have found various free tools of this kind on the Internet, but none of
them does everything I need.
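
To make the requirements above concrete, here is a rough Python sketch of the
kind of tool I have in mind. It is only an illustration under stated
assumptions, not an existing program: the names parse_page, fetch and crawl
are hypothetical, it uses only the Python standard library, it skips anything
that is not served as HTML (such as PDFs), and "recursive" is taken to mean
following links on the same host up to a chosen depth.

```python
import os
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class _TextAndLinks(HTMLParser):
    """Collects visible text and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1          # ignore code and stylesheets
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def parse_page(html, base_url):
    """Return (plain_text, absolute_links) for an HTML string."""
    parser = _TextAndLinks()
    parser.feed(html)
    links = [urldefrag(urljoin(base_url, h))[0] for h in parser.links]
    return "\n".join(parser.text), links


def fetch(url):
    """Download a page; returns '' for non-HTML content such as PDFs."""
    with urlopen(url, timeout=30) as resp:
        if "html" not in resp.headers.get_content_type():
            return ""
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_urls, out_dir, max_depth=0, fetch=fetch):
    """Save each page as a .txt file; follow same-host links up to max_depth.

    Returns a dict mapping each saved file name to the URL it came from.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen, saved = set(), {}
    queue = [(u, 0) for u in start_urls]
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except OSError:
            continue                 # skip unreachable pages
        if not html:
            continue
        text, links = parse_page(html, url)
        # Hypothetical naming scheme: file name derived from the URL.
        name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")[:150] + ".txt"
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
        saved[name] = url
        if depth < max_depth:
            host = urlparse(url).netloc
            queue += [(l, depth + 1) for l in links
                      if urlparse(l).netloc == host]
    return saved
```

With the list of links read into a Python list, crawl(links, "corpus") would
save one .txt file per page, and crawl(links, "corpus", max_depth=1) would
also follow the links found on each page. The returned mapping from file name
to URL is what would let me trace a word in the concordancer back to the page
it came from.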

LL Issue: 23.1026
Date posted: 29-Feb-2012

