Java expert: Improve webpage scraping solution

$250-750 USD

Closed

Posted: almost 4 years ago

Paid on delivery

I developed a Java program to scrape information from a website. The architecture of the solution involves:

1) using Java Selenium to send requests to the webpage via Chrome WebDriver to trigger authentication and authenticated requests;
2) routing the requests from Chrome (headless) through Java BrowserMobProxy to capture three HTTP headers (Authorization, X-CSRF-TOKEN, and Cookie) and one query string (without these, after some requests the server starts responding with 512); and
3) using these 4 elements in HTTPS requests sent directly from Java to the webpage (i.e. without Selenium, Chrome, or BrowserMobProxy involved) to retrieve the desired information.

This program provides the basic functionality of extracting the information but has a few problems:

- It depends on an external non-Java component: Chrome WebDriver.
- It depends on Java Selenium and Java BrowserMobProxy, two dependencies that I would like to remove.
- It is not optimized (too many refreshes and overly long sleep periods) relative to the limit at which the webpage (Cloudflare) starts responding with 429 errors. Thus, retrieving the information takes much more time than needed.

Deliverables

You will get the current program's Java code and you will need to solve the problems above. To do so, you will need to:

A. Find out how to authenticate and refresh the 3 headers and the query string without depending on Selenium, Chrome WebDriver, and BrowserMobProxy. As most of this data is likely generated in JavaScript, you will need knowledge of JavaScript and of how to execute JavaScript from within Java or convert the JavaScript code to Java (preferable solution).
B. Identify the limit at which the webpage (behind Cloudflare) starts responding with 429 errors, and tune the refresh frequency of the headers and the sleep periods to that limit.

You will need to demonstrate the benefits of your changes by extracting the information currently extracted by the program and measuring how long it takes.
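
To make step 3 of the architecture concrete, here is a minimal sketch of how the four captured elements might be attached to a direct request using the JDK's built-in `java.net.http` API (Java 11+). The class name, the `example.com` URL, and all token values are placeholders assumed for illustration; the actual endpoint and the way the values are obtained are not part of this posting.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplayRequestSketch {

    // Builds a GET request that carries the captured headers and query string.
    // Nothing is sent here; the result can be passed to java.net.http.HttpClient#send.
    public static HttpRequest buildRequest(String baseUrl,
                                           String capturedQuery,
                                           Map<String, String> capturedHeaders) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "?" + capturedQuery))
                .GET();
        // Copy every captured header (Authorization, X-CSRF-TOKEN, Cookie) onto the request.
        capturedHeaders.forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) {
        // Placeholder values: in the real program these come from the capture step.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Authorization", "Bearer PLACEHOLDER_TOKEN");
        headers.put("X-CSRF-TOKEN", "PLACEHOLDER_CSRF");
        headers.put("Cookie", "session=PLACEHOLDER_SESSION");

        HttpRequest request =
                buildRequest("https://example.com/api/items", "q=PLACEHOLDER", headers);

        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Keeping request construction separate from sending makes it easy to swap in refreshed header values without touching the retrieval loop.
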
Note: you will need to create your own login/password on the website. There are no additional requirements to register.
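
One possible way to approach deliverable B is to replace fixed sleep periods with a delay that adapts to the server's responses. The sketch below is only an assumption about how that could look, not the existing program's logic: it shortens the delay by a fixed step after each successful response and doubles it on a 429, honoring a `Retry-After` value when one is given in seconds. The class name `AdaptivePacer` and all numeric values are hypothetical.

```java
public class AdaptivePacer {
    private final long minDelayMs;
    private final long maxDelayMs;
    private final long stepMs;
    private long delayMs;

    public AdaptivePacer(long initialDelayMs, long minDelayMs, long maxDelayMs, long stepMs) {
        this.delayMs = initialDelayMs;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.stepMs = stepMs;
    }

    // Updates and returns the delay to wait before the next request.
    // retryAfterSeconds is the Retry-After header value in seconds, or -1 if absent.
    // (Retry-After may also be an HTTP date; this sketch assumes the seconds form.)
    public long onResponse(int statusCode, long retryAfterSeconds) {
        if (statusCode == 429) {
            long doubled = delayMs * 2;
            long hinted = retryAfterSeconds > 0 ? retryAfterSeconds * 1000 : 0;
            // Back off: at least double, at least the server hint, capped at the maximum.
            delayMs = Math.min(maxDelayMs, Math.max(doubled, hinted));
        } else {
            // Probe toward the limit: shave a fixed step off, never below the minimum.
            delayMs = Math.max(minDelayMs, delayMs - stepMs);
        }
        return delayMs;
    }

    public static void main(String[] args) {
        AdaptivePacer pacer = new AdaptivePacer(1000, 200, 60_000, 100);
        int[] simulatedStatuses = {200, 200, 429, 200, 200};
        for (int status : simulatedStatuses) {
            long next = pacer.onResponse(status, -1);
            System.out.println("status " + status + " -> next delay " + next + " ms");
        }
    }
}
```

Logging the delay at which 429 responses begin to appear gives a rough measurement of the limit, which can then be compared against the total extraction time before and after the change.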

Java

JavaScript

Software Architecture

Web Scraping

PHP

Project ID: 26818705

About the project

8 proposals

Remote project

Active 4 years ago

8 freelancers are bidding on this job, at an average of $476 USD

@serhiilyskin

Hi, sir. This is Serhii from Ukraine. I've been working as a programmer for over 10 years, and I have extensive experience in web & desktop app development. I have checked your requirements in detail, and I think I can finish the project in a short time. Please contact me so I can ask a few questions about it. Regards.

$555 USD in 6 days

5.0

(14 reviews)

6.8

@eightsl

Hello, I'd like to take a look at this. Can you send me the website in question and the information you want to extract? I'll try to mimic the behavior of Selenium by sending the appropriate headers, then parse out the contents with Jsoup.

$400 USD in 7 days