Hi,
I'm not sure how to demonstrate that I understand what you're trying to accomplish, other than restating what you already said in your brief: you want a histogram of phrase occurrences within a particular site, and each page on the site should be indexed (I assume by looking for valid anchors on each page and following those links).
This seems like a fairly straightforward project. You'd use cURL or LWP::UserAgent, depending on whether you want this in PHP or Perl. You request the principal URL (the one you specify in the input file), scan it for each phrase you want to match (a counting function like substr_count works for literal phrases; regular expressions if you want to match patterns), and record the results in a table. You should also record something like the canonical address of each page, to make sure you don't end up scraping a million URLs that all point to the same page (which can easily happen with several PHP frameworks). So before scraping a page, you check its canonical address; if you've already scraped it, move on.
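Here's a minimal sketch of the fetch-and-count step in PHP, assuming the cURL extension is available. fetchPage and countPhrases are hypothetical helper names I'm using for illustration, not anything from a library:

```php
<?php
// Fetch a page body over HTTP; returns null on failure.
function fetchPage(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Count how many times each (non-empty) phrase occurs in the page text.
function countPhrases(string $html, array $phrases): array {
    $text = strtolower(strip_tags($html)); // crude but serviceable: drop markup, fold case
    $counts = [];
    foreach ($phrases as $phrase) {
        $counts[$phrase] = substr_count($text, strtolower($phrase));
    }
    return $counts;
}
```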
You also look for all anchor elements on the page using some kind of DOM walker, extract the href attribute of each link, and push those URLs onto a stack. You then work your way down the stack iteratively; this is why getting stuck in a loop is something to watch out for, hence the canonical-URL check above. When the stack is empty, your job is complete, and you request the next site. A sketch of that loop follows.
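And here's a sketch of the crawl loop itself, building on the helpers above; crawlSite and canonicalUrl are likewise hypothetical names, and relative-URL resolution is simplified (only root-relative and absolute links are handled) to keep the example short:

```php
<?php
// Pull the canonical URL out of a parsed document, falling back to the request URL.
function canonicalUrl(DOMDocument $doc, string $fallback): string {
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            return $link->getAttribute('href');
        }
    }
    return $fallback; // no canonical tag on the page
}

// Crawl one site from $startUrl, returning total counts per phrase.
function crawlSite(string $startUrl, array $phrases): array {
    $totals = array_fill_keys($phrases, 0);
    $stack  = [$startUrl];
    $seen   = []; // canonical URL => true, so loops and duplicates are skipped
    $base   = parse_url($startUrl);

    while ($stack) {
        $url  = array_pop($stack);
        $html = fetchPage($url);
        if ($html === null) continue;

        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from messy real-world HTML

        $canonical = canonicalUrl($doc, $url);
        if (isset($seen[$canonical])) continue; // already scraped this page
        $seen[$canonical] = true;

        // Accumulate the histogram.
        foreach (countPhrases($html, $phrases) as $phrase => $n) {
            $totals[$phrase] += $n;
        }

        // Push same-site links onto the stack.
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if ($href === '' || $href[0] === '#') continue;
            if ($href[0] === '/') { // root-relative: rebuild an absolute URL
                $href = $base['scheme'] . '://' . $base['host'] . $href;
            }
            if (parse_url($href, PHP_URL_HOST) === $base['host']) {
                $stack[] = $href;
            }
        }
    }
    return $totals;
}
```

The explicit stack is what keeps this iterative rather than recursive, and the $seen map is doing the canonical-address bookkeeping described above.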