This project is for a script to scrape data from a public website.
DO NOT BID UNLESS YOU HAVE DONE THESE TYPES OF PROJECTS BEFORE!!!
The script:
1. must work on Red Hat Linux from the command line, but otherwise can be written in the language of your choice. You must provide any package/installation requirements needed to run the script successfully.
2. must
a) crawl and copy the visited pages from the site first
b) then parse & harvest the HTML for the required data (I will provide the required data)
c) output the data to a comma-separated (CSV) file
3. must use multi-threading to download/crawl the pages in parallel, with a configurable number of threads
The crawler should be able to mask its identity to prevent blocking.
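To illustrate the kind of crawl step I have in mind, here is a rough Python sketch (any language is fine). It downloads pages in parallel with a configurable thread count and saves the raw copies to disk before any parsing. The names fetch_page, save_page and crawl, the USER_AGENTS list, and the idea of masking identity by rotating the User-Agent header are only assumptions for illustration; proxies or other techniques may be needed in practice.

```python
import hashlib
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical pool of User-Agent strings; a real run would use a longer, current list.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def fetch_page(url, timeout=30):
    """Download one page, sending a randomly chosen User-Agent header."""
    request = urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def save_page(url, out_dir, fetch=fetch_page):
    """Fetch a URL and store the raw bytes under a file name derived from the URL."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = Path(out_dir) / name
    path.write_bytes(fetch(url))
    return path


def crawl(urls, out_dir, threads=8, fetch=fetch_page):
    """Download all URLs in parallel using `threads` worker threads.

    Returns the saved file paths in the same order as `urls`.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda u: save_page(u, out_dir, fetch), urls))
```

The threads argument would be exposed as a command-line option. The fetch parameter is there only so the download function can be swapped out, for example for one that goes through a proxy.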
The required data must be scraped from either of the following two websites:
[login to view URL]
[login to view URL]
The following data needs to be scraped from either of the above websites in an efficient way:
- Job Category (this data becomes visible once you click the "Browse all titles" link)
- Location
- Title
- Base Pay: 25th percentile, median, 75th percentile
- Job description
- Bonuses
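
As a rough guide to the output I expect, the sketch below writes one row per job with one column per item listed above. The column names and the write_csv helper are assumptions for illustration only, and the final header spelling can be agreed later. Job descriptions are likely to contain commas and line breaks, so fields must be quoted properly.

```python
import csv

# Column names derived from the list above; the exact spelling is an assumption.
FIELDS = [
    "job_category",
    "location",
    "title",
    "base_pay_25th",
    "base_pay_median",
    "base_pay_75th",
    "job_description",
    "bonuses",
]


def write_csv(records, path):
    """Write dict records to a CSV file with a header row.

    The csv module quotes any field containing commas, quotes or newlines.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for record in records:
            # Missing values become empty cells; unknown keys are dropped.
            writer.writerow({name: record.get(name, "") for name in FIELDS})
```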