Hi,
I'm not sure how to demonstrate that I understand what you're trying to accomplish, other than restating what you already said in your brief: you want a histogram of phrase occurrences within a particular site, and each page on the site should be indexed (I assume by looking for valid anchors on each page and following those links).
This seems like a fairly straightforward project. You'd use cURL or LWP::UserAgent, depending on whether you want this in PHP or Perl. You request the principal URL (the one you specify in the input file), scan it for each phrase you want to match (a counting function like substr_count works for literal phrases; regular expressions if you want to match patterns), and record the results in a table. You should also record something like the canonical address of each page, to make sure you don't end up scraping a million URLs that all point to the same page (which can easily happen with several PHP frameworks). So before scraping a page, you check its canonical address; if you've already scraped it, move on.
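Here's a minimal sketch of the fetch-and-count step in PHP, assuming the cURL extension is available. fetchPage and countPhrases are hypothetical helper names I'm using for illustration, not anything from a library:

```php
<?php
// Fetch a page body over HTTP; returns null on failure.
function fetchPage(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Count how many times each (non-empty) phrase occurs in the page text.
function countPhrases(string $html, array $phrases): array {
    $text = strtolower(strip_tags($html)); // crude but serviceable: drop markup, fold case
    $counts = [];
    foreach ($phrases as $phrase) {
        $counts[$phrase] = substr_count($text, strtolower($phrase));
    }
    return $counts;
}
```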
You also look for all anchor elements on the page using some kind of DOM walker, extract the href attribute of each link, and push those URLs onto a stack. You then work your way down the stack iteratively; this is why getting stuck in a loop is something to watch out for, hence the canonical-URL check above. When the stack is empty, your job is complete, and you request the next site. A sketch of that loop follows.
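And here's a sketch of the crawl loop itself, building on the helpers above; crawlSite and canonicalUrl are likewise hypothetical names, and relative-URL resolution is simplified (only root-relative and absolute links are handled) to keep the example short:

```php
<?php
// Pull the canonical URL out of a parsed document, falling back to the request URL.
function canonicalUrl(DOMDocument $doc, string $fallback): string {
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            return $link->getAttribute('href');
        }
    }
    return $fallback; // no canonical tag on the page
}

// Crawl one site from $startUrl, returning total counts per phrase.
function crawlSite(string $startUrl, array $phrases): array {
    $totals = array_fill_keys($phrases, 0);
    $stack  = [$startUrl];
    $seen   = []; // canonical URL => true, so loops and duplicates are skipped
    $base   = parse_url($startUrl);

    while ($stack) {
        $url  = array_pop($stack);
        $html = fetchPage($url);
        if ($html === null) continue;

        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from messy real-world HTML

        $canonical = canonicalUrl($doc, $url);
        if (isset($seen[$canonical])) continue; // already scraped this page
        $seen[$canonical] = true;

        // Accumulate the histogram.
        foreach (countPhrases($html, $phrases) as $phrase => $n) {
            $totals[$phrase] += $n;
        }

        // Push same-site links onto the stack.
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if ($href === '' || $href[0] === '#') continue;
            if ($href[0] === '/') { // root-relative: rebuild an absolute URL
                $href = $base['scheme'] . '://' . $base['host'] . $href;
            }
            if (parse_url($href, PHP_URL_HOST) === $base['host']) {
                $stack[] = $href;
            }
        }
    }
    return $totals;
}
```

The explicit stack is what keeps this iterative rather than recursive, and the $seen map is doing the canonical-address bookkeeping described above.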