Michael at Social Patterns let me know that Alexa has opened up its web index to automated tool builders and users of all kinds. There’s a lot of great information on their site, but the gist is:
Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service…
Size: Each Web crawl consists of 100 Terabytes of Web content spanning 4 billion pages and 8 million sites.
Frequency: In addition to daily crawls of popular Web content, Alexa crawls a broad cross-section of the Web in 2-month snapshots.
Example: Edward wants to create a collection of JPEG images. He utilizes Alexa’s Advanced Search to create a private search collection that locates all documents in the archive with a MIME type of jpeg, response code 200, url extension of jpg and a size of 64k or larger.
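To make that filter concrete, here is a rough Python sketch of the four conditions in Edward's example. It is not Alexa's API, which I haven't used yet. The `CrawlRecord` class, its field names, and the `matches_jpeg_filter` function are all hypothetical, and I'm assuming "jpeg" means the `image/jpeg` MIME type and "64k" means 64 × 1024 bytes.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Assumption: "64k" in the example is read as 64 * 1024 bytes.
MIN_SIZE_BYTES = 64 * 1024


@dataclass
class CrawlRecord:
    """Hypothetical stand-in for one archived document; not an Alexa type."""
    url: str
    mime_type: str
    response_code: int
    size_bytes: int


def matches_jpeg_filter(record: CrawlRecord) -> bool:
    """Return True if the record meets all four conditions from the example."""
    # Look only at the URL path so a query string doesn't hide the extension.
    path = urlsplit(record.url).path.lower()
    return (
        record.mime_type.lower() == "image/jpeg"  # MIME type of jpeg (assumed image/jpeg)
        and record.response_code == 200           # response code 200
        and path.endswith(".jpg")                 # URL extension of jpg
        and record.size_bytes >= MIN_SIZE_BYTES   # size of 64k or larger
    )


# Made-up sample records, for illustration only.
records = [
    CrawlRecord("http://example.com/photos/cat.jpg", "image/jpeg", 200, 80_000),
    CrawlRecord("http://example.com/photos/thumb.jpg", "image/jpeg", 200, 4_000),
    CrawlRecord("http://example.com/photos/gone.jpg", "image/jpeg", 404, 90_000),
    CrawlRecord("http://example.com/logo.png", "image/png", 200, 120_000),
]

collection = [r for r in records if matches_jpeg_filter(r)]
```

On the real service, the equivalent query would presumably run against Alexa's stored crawl rather than a list in memory, but the logic of the filter should be the same.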
It sounds like an incredible service, though I note that it’s not free. Michael’s write-up has more details. I wish I had more time to investigate and use it in the coming weeks and months, but, sadly, we’ve got a fairly full schedule. I did apply for an account, so if we end up with some spare developer time, I’ll ask someone to look into it.