Scraping data from websites which require login

How to scrape data from websites which require login to view data?

WebHarvy supports scraping data from websites which require authentication (login with user name and password). When using WebHarvy to scrape data from such websites (for example, www.linkedin.com), follow one of the methods below.

Important Note: For all methods explained below, make sure that the 'Disable cookies while mining' option in Browser Settings is not selected.

Method 1 (Recommended)

  1. Open WebHarvy and load the website. Go to the login page.
  2. Log in with your username and password. Select the 'Remember me' or 'Keep me logged in' option if available.
  3. Navigate to the page which displays the data to be extracted.
  4. Copy the URL of the page from the address bar. If WebHarvy does not correctly display the current page address, copy it from another browser such as Google Chrome.
  5. Clear WebHarvy's configuration browser.
  6. Directly load the URL copied in Step 4. If any further navigation is required to display the data to be scraped, it must be performed after starting configuration.
  7. Start Configuration and select the required data. Configure pagination and follow links if required.
  8. Stop Configuration. You may now optionally save the configuration.
  9. Start Mine.

Make sure that you are logged in to the website from within WebHarvy's configuration browser before starting mining. When running previously saved configurations which require login, perform steps 1 and 2 above before starting mining.

Method 2

A disadvantage of the first method is that the configurations created cannot be scheduled without manual intervention. This is because login is not handled by the configuration, and WebHarvy expects that you have already logged in to the website via WebHarvy's browser. Follow the steps below if you would like to include the login process in the configuration, so that you need not log in separately from WebHarvy's browser when the configuration is run. Configurations created using the method below can be scheduled.

  1. Open WebHarvy and navigate to the login page of the website.
  2. Start Configuration.
  3. Select Configuration menu > Options > Disable pattern detection.
  4. Using the Input Text option, enter the user name and password in the login page.
  5. Use the Click option to click the login button.
  6. Once you have successfully logged in to the website, if required, click on links in the page to navigate to the target page which displays the data you need to extract. After clicking each link, select More Options > Click from the resulting Capture window to follow that link. You can also use the other page interaction and navigation methods described in the WebHarvy documentation.
  7. Once the target page which displays the data to be extracted is loaded, untick Configuration menu > Options > Disable pattern detection (to turn the option off).
  8. Now you can select the required data and continue the configuration in the normal way.
  9. Stop Configuration, save the configuration and Start Mine.

Scraping data from websites which show login in a popup window

Some websites display a popup window as soon as you load them where you can enter user name and password to authenticate and proceed to view the page. In such cases follow the method below.

  1. Open WebHarvy and load the required page by providing its URL in the address bar in the following format. Here, the username and password are provided in the URL itself.
     http://username:password@webdomain.com/path1/path2/page.php
  2. Start Config
  3. Select Edit menu > Edit Options > Edit Start URL/PostData
  4. Paste the same URL entered in Step 1 in the Start URL box. Apply changes.
  5. Now you can proceed by creating the configuration.

This method relies on the browser's support for passing the login username and password as part of the URL.
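
The URL format used in Step 1 can be assembled with a short script when the username or password contains characters such as '@', ':' or '/', which would otherwise be read as part of the URL structure. The sketch below is only an illustration in Python: the function name build_login_url and the sample credentials are hypothetical, and percent-encoding the credentials follows the general URL rules rather than anything stated in the WebHarvy documentation, so check that the resulting URL loads correctly in WebHarvy's browser.

```python
from urllib.parse import quote


def build_login_url(username, password, host, path="/", scheme="http"):
    """Return scheme://username:password@host/path with encoded credentials.

    Hypothetical helper for illustration; it is not part of WebHarvy.
    """
    # safe="" makes quote() percent-encode every reserved character,
    # including '@', ':' and '/', so they cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}{path}"


# Sample values only; replace them with your own credentials and page address.
url = build_login_url(
    "jane@example.com",
    "p@ss:word/1",
    "webdomain.com",
    "/path1/path2/page.php",
)
print(url)
# prints: http://jane%40example.com:p%40ss%3Aword%2F1@webdomain.com/path1/path2/page.php
```

Credentials made up only of letters and digits pass through unchanged, so build_login_url("username", "password", "webdomain.com", "/path1/path2/page.php") gives exactly the URL shown in Step 1.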

Scraping data from pages which require CAPTCHA

WebHarvy currently cannot solve CAPTCHAs by itself. You will have to load the page which shows the CAPTCHA form in WebHarvy's browser and solve the CAPTCHA manually. Once it is solved, most websites will not display the CAPTCHA form again for the current session.

  1. Open WebHarvy and navigate to the website.
  2. Load the page which shows the CAPTCHA form and enter/solve the CAPTCHA.
  3. Configure WebHarvy to scrape data (or open a previously saved configuration file).
  4. Start Mine.

Need Help?

Please do not hesitate to contact our support team at support@webharvy.com with the necessary details in case you need any assistance.