
Scraping data from websites which require login


WebHarvy supports scraping data from websites which require authentication (login with user name and password). When using WebHarvy to scrape data from such websites (for example, www.linkedin.com), follow the steps below.


Method 1 (Recommended)


  1. Open WebHarvy and navigate to the website.

  2. Log in with user name and password (if not shown as logged in).

  3. Configure WebHarvy to scrape data (or open a previously saved configuration file).

  4. Start Mine.

The point here is to make sure that you are logged in to the website from within WebHarvy's browser before starting mining.


Method 2


A disadvantage of the first method is that configurations created with it cannot be scheduled without manual intervention. This is because login is not handled by the configuration, and WebHarvy expects that you have already logged in to the website via WebHarvy's browser. Follow the steps below if you would like to include the login process in the configuration, so that you need not log in separately from WebHarvy's browser when the configuration is run. Configurations created using the method below can be scheduled.

  1. Open WebHarvy and navigate to the login page of the website.

  2. Start Configuration.

  3. Select Configuration menu > Options > Disable pattern detection.

  4. Using the Input Text option, enter the user name and password in the login page.

  5. Use the Click option to click the login button.

  6. Once you have successfully logged in to the website, if required, click on links in the page to navigate to the target page which displays the data you need to extract. After clicking each link, select More Options > Click from the resulting Capture window to follow that link. You can also use other methods to interact with or navigate pages as explained here.

  7. Once the target page which displays the data to be extracted is loaded, untick Configuration menu > Options > Disable pattern detection (to turn the option off).

  8. Now you can select the required data and continue the configuration in the normal way.

  9. Stop Config, Save and Start Mine.


Scraping data from websites which show login in a popup window


Some websites display a popup window as soon as you load them where you can enter user name and password to authenticate and proceed to view the page. In such cases follow the method below.

  1. Open WebHarvy and load the required page in WebHarvy by providing its URL in the address bar in the following format. Here, the username and password are provided in the URL itself.

     http://username:password@webdomain.com/path1/path2/page.php

  2. Start Config.

  3. Select Edit menu > Edit Options > Edit Start URL/PostData.

  4. Paste the same URL entered in Step 1 in the Start URL box. Apply changes.

  5. Now you can proceed by creating the configuration.

This method relies on the browser's support for supplying the login user name and password as part of the URL.
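If the user name or password contains characters such as '@', ':' or '/', placing them in the URL unchanged can make the URL ambiguous. The following Python sketch is not part of WebHarvy; it only illustrates one way to build a URL in the format shown above with the credentials percent-encoded. The function name `url_with_credentials` is hypothetical, and it is an assumption that WebHarvy's browser decodes percent-encoded credentials in the same way common browsers do.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def url_with_credentials(url: str, username: str, password: str) -> str:
    """Return `url` with `username:password@` placed before the host.

    Hypothetical helper for illustration only; not a WebHarvy API.
    """
    parts = urlsplit(url)

    # Percent-encode every reserved character so that '@', ':' or '/'
    # inside the credentials cannot be mistaken for URL delimiters.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"

    # Rebuild the host part, keeping an explicit port if one was given.
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals must be wrapped in brackets inside a URL.
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    return urlunsplit(
        (parts.scheme, f"{userinfo}@{host}", parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Placeholder values taken from the URL format shown in this article.
    print(url_with_credentials(
        "http://webdomain.com/path1/path2/page.php", "username", "password"
    ))
```

The resulting string can then be pasted into the address bar and into the Start URL box as described in the steps above.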


Scraping data from pages which require CAPTCHA


WebHarvy currently does not support solving CAPTCHAs by itself. You will have to load the page which shows the CAPTCHA form in WebHarvy's browser and solve the CAPTCHA manually. Once it is solved, most websites will not display the CAPTCHA form again for the current session.

  1. Open WebHarvy and navigate to the website.

  2. Load the page which shows the CAPTCHA form and enter/solve the CAPTCHA.

  3. Configure WebHarvy to scrape data (or open a previously saved configuration file).

  4. Start Mine.

Please do not hesitate to contact our support team at support@webharvy.com with the necessary details in case you need any assistance.