1. Write a web spider that extracts a list of all Wikipedia article names. (You can start at [login to view URL]:AllPages.)
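A sketch of step 1, assuming the standard MediaWiki API endpoint (`/w/api.php` with `list=allpages`) rather than scraping the Special:AllPages HTML; the API paginates cleanly through an `apcontinue` marker, which also makes the restart requirement below easier to satisfy:

```python
# Sketch: enumerate article titles via the MediaWiki API (list=allpages).
# Scraping Special:AllPages HTML is an alternative, but the API's
# "apcontinue" marker gives a clean, resumable pagination cursor.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def allpages_url(apcontinue=None, limit=500):
    """Build one paginated list=allpages request URL."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": str(limit),
        "apnamespace": "0",   # main (article) namespace only
        "format": "json",
    }
    if apcontinue:
        params["apcontinue"] = apcontinue
    return API + "?" + urllib.parse.urlencode(params)

def iter_titles():
    """Yield every article title, following continuation markers."""
    cont = None
    while True:
        with urllib.request.urlopen(allpages_url(cont)) as resp:
            data = json.load(resp)
        for page in data["query"]["allpages"]:
            yield page["title"]
        cont = data.get("continue", {}).get("apcontinue")
        if cont is None:
            break
```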
2. For each article name, determine whether the name is the primary name of an article or a redirect. (For example, "Jackie Onassis" is a redirect to the primary name "Jacqueline Kennedy Onassis".)
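One way to sketch step 2: a redirect page's raw wikitext begins with `#REDIRECT [[Target]]`, so a small pattern match recovers the primary name (the MediaWiki API's `redirects` resolution would be an alternative):

```python
import re

# A redirect page's wikitext starts with "#REDIRECT [[Target]]"
# (case-insensitive, optional leading whitespace). The capture stops
# at "|" or "#" to drop display text and section anchors.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_target(wikitext):
    """Return the primary article name if this page is a redirect, else None."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip() if m else None
```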
3. For each article, calculate the relevance by looking at the first page of history (e.g., [login to view URL]). From this page (look only at the first page of history), extract the number of revisions and the dates of the first and last revisions.
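Once the revision timestamps from that first history page have been scraped, the three relevance values fall out of a small pure helper. A sketch, assuming MediaWiki's ISO-8601 timestamps (which sort correctly as plain strings); note the count is capped by however many revisions one history page shows:

```python
def relevance(timestamps):
    """Given the revision timestamps visible on the first page of an
    article's history, return (revision_count, first_date, last_date).
    ISO-8601 timestamps compare lexicographically, so min/max suffice."""
    if not timestamps:
        return (0, "", "")
    return (len(timestamps), min(timestamps), max(timestamps))
```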
4. For each article, extract the geo coordinates (if they exist).
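For step 4, one common source of coordinates is the `{{coord|LAT|LON|...}}` template in the article wikitext. The sketch below handles only the decimal form of that template, not the degrees/minutes/seconds variants or coordinates supplied through infobox parameters:

```python
import re

# Matches only the decimal {{coord|lat|lon|...}} form, e.g.
# {{coord|37.461853|-121.0968|display=title}}. DMS variants and
# infobox-supplied coordinates would need separate handling.
COORD_RE = re.compile(
    r"\{\{coord\|(-?\d+(?:\.\d+)?)\|(-?\d+(?:\.\d+)?)", re.IGNORECASE
)

def latlon(wikitext):
    """Return 'lat,lon' in the deliverable's format, or '' if absent."""
    m = COORD_RE.search(wikitext)
    return f"{m.group(1)},{m.group(2)}" if m else ""
```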
5. Deliver results in a tab-separated flat file with columns (name, primary name, url, latlon, relevance1, relevance2, relevance3) as defined below.
name - the name of the article (primary name or redirect name)
primary name - the primary name of the article
url - the url (e.g., [login to view URL])
latlon - the latlon in this format: 37.461853,-121.0968 (or empty)
relevance1 - number of revisions
relevance2 - date of first revision
relevance3 - date of last revision
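The seven columns above can be emitted with a small formatter. A sketch; the field sanitization (replacing embedded tabs/newlines with spaces) is an assumption, since the spec does not say how to escape them:

```python
def tsv_row(name, primary, url, latlon, rev_count, first_date, last_date):
    """Format one output line of the deliverable. Tabs and newlines inside
    field values are replaced with spaces (an assumption, not specified)
    so the flat file stays parseable."""
    fields = [name, primary, url, latlon, str(rev_count), first_date, last_date]
    return "\t".join(f.replace("\t", " ").replace("\n", " ") for f in fields)
```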
ADDED:
Because the spidering will take a long time to run, the program should save its state as it goes. It should be able to restart from where it left off after a crash.
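A minimal sketch of the save-and-restart requirement, assuming a JSON state file (the filename and state keys here are hypothetical): progress is written atomically via a temp file and `os.replace`, so a crash mid-write never corrupts the checkpoint, and a fresh run starts from a default state.

```python
import json
import os
import tempfile

STATE_FILE = "spider_state.json"  # hypothetical filename

def save_state(state, path=STATE_FILE):
    """Atomically persist progress (e.g. the last pagination marker and
    how many output rows were written). Writing to a temp file and then
    renaming means a crash never leaves a half-written state file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_state(path=STATE_FILE):
    """Return saved progress, or a fresh state on first run."""
    if not os.path.exists(path):
        return {"apcontinue": None, "rows_written": 0}
    with open(path) as f:
        return json.load(f)
```

On restart, the spider reads the state, skips ahead to the saved pagination marker, and appends to the output file rather than truncating it.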
## Deliverables
This version of the spider program should deal with English Wikipedia only ([login to view URL]). A follow-on project will extend the scope of the spider program to other languages.
* * *

This broadcast message was sent to all bidders on Thursday Jun 3, 2010 3:22:16 PM:
I have received many bid requests and am inclined to accept one (or more) bids at or below $250. Given the great deal of interest, I want to focus on delivery time: please let me know if you can get a first result finished by Thu. June 10 (midnight Pacific Time), with any necessary tuning finished the following week.