Editing Configuration

    1. Editing Configuration
    2. Add / Delete data
    3. Add / Remove URLs from Configuration
    4. Edit Keywords
    5. Edit Start URL, PostData, Headers
    6. Manually editing the configuration XML file
    7. Disable auto pattern detection in start page

    How to edit a configuration?

  • To edit an already saved configuration, open the configuration XML file by clicking the Open button in the Home menu.

  • WebHarvy will then ask whether you want to start mining using the configuration or edit it. Click the Edit configuration button.

  • You may also click the Edit button in the Home menu to start editing a configuration that is already loaded.

  • When the Edit button is clicked, WebHarvy loads the configuration. The starting page of the configuration is loaded and displayed in the browser window, along with a preview of the data already selected for scraping. WebHarvy then automatically switches to configuration mode, and you can select more data to be scraped or delete existing data selections. You may also edit the URLs and keywords associated with the configuration.

    Add / Delete data

  • To select new data, just click on it. To delete already selected data, right click in the 'Captured Data Preview' pane and choose the data to be removed from the 'Delete' menu.

  • Once you have finished editing the configuration, click the Stop button within the Configuration panel of the Home menu. You may now save the configuration by clicking the Save button or run the configuration by clicking the Start-Mine button.

    Add / Remove URLs from Configuration

    During configuration (or while editing a configuration) you may click the URLs button within the Edit panel of the Configuration menu to add or remove additional URLs associated with the configuration.

    In the resulting window, you may add or delete URLs in the configuration. All URLs added will be scraped using the same configuration.

    If you have a list of URLs (all belonging to the same domain and sharing the same page layout), you may use this feature to scrape all of them with a single configuration by following the steps given below.

    1. Open WebHarvy and navigate to the first URL in the list.
    2. Start configuration.
    3. Select the required data.
    4. From the Configuration menu, click the URLs button within the Edit panel.
    5. In the resulting window, paste all the remaining URLs in the list and click 'Apply'.
    6. Stop configuration.
    7. Start Mine. All URLs in the list will be scraped using the same configuration.
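
    Before pasting a long list into the URLs window, it can help to clean it up first. The following is a minimal Python sketch, not part of WebHarvy, that trims and de-duplicates a URL list and checks that every URL is on the same host as the first one. The function name prepare_url_list and the example.com URLs are hypothetical, and printing one URL per line is an assumption about what is convenient to paste.

```python
from urllib.parse import urlsplit


def prepare_url_list(urls):
    """Return (first_url, remaining_urls).

    Blank entries and duplicates are dropped, and a ValueError is raised
    if any URL is on a different host than the first one.
    """
    cleaned = []
    seen = set()
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        cleaned.append(url)

    if not cleaned:
        raise ValueError("no URLs given")

    host = urlsplit(cleaned[0]).netloc.lower()
    for url in cleaned[1:]:
        if urlsplit(url).netloc.lower() != host:
            raise ValueError(f"{url} is not on host {host}")

    return cleaned[0], cleaned[1:]


if __name__ == "__main__":
    # Hypothetical example list; replace with your own URLs.
    urls = [
        "https://example.com/products/1",
        "https://example.com/products/2 ",
        "https://example.com/products/1",
        "https://example.com/products/3",
    ]
    first, rest = prepare_url_list(urls)
    print("Load this URL first (step 1):", first)
    print("Paste these into the URLs window (step 5):")
    print("\n".join(rest))
```

    The host check only confirms that the URLs share a domain. Whether the pages also share the same layout still has to be verified by looking at them.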

    Edit Keywords

    To edit the keywords in a configuration, click the Keywords button within the Edit panel of the Configuration menu while configuring (or while editing the configuration).

    In the resulting window you may add or remove keywords associated with the configuration.

    Edit Start URL, PostData and Headers

    To change the Start URL, PostData and Headers of a configuration, click the Start URL / PostData button within the Edit panel of the Configuration menu during configuration.

    In the resulting window you may change the values of Start URL, PostData and Headers.
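
    If the page is loaded with a form POST, the PostData value is typically a URL-encoded string of field names and values. The sketch below shows one way to build such a string in Python. It assumes the target site expects standard application/x-www-form-urlencoded data; the field names query and page are hypothetical, and the exact format WebHarvy accepts in this window is not confirmed here.

```python
from urllib.parse import urlencode, parse_qsl


def build_post_data(fields):
    """Encode a dict of form fields as a URL-encoded POST body string."""
    return urlencode(fields)


if __name__ == "__main__":
    # Hypothetical form fields for a search page.
    post_data = build_post_data({"query": "wireless mouse", "page": 2})
    print(post_data)  # query=wireless+mouse&page=2

    # Decoding the string again is a quick way to check an existing value.
    print(parse_qsl(post_data))
```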


    Disable auto pattern detection in start page

    WebHarvy automatically finds and extracts repeating patterns of data in the starting page of a configuration. This lets you select and scrape similar data from all records in the start page with a single click. This feature should be turned off when the starting page is not a table or list, that is, when each data column has only a single entry per page.

    For example, if you start configuration after loading the details page of a product listed at Amazon, it is recommended to disable pattern detection (turn the Disable pattern detection option ON), since each selected item (price, rating, ASIN etc.) occurs only once per page (per product).

    You can select the Disable pattern detection option from within the Options panel of the Configuration menu.

    You need to turn this option ON only when the starting page of the configuration is not a list or table. Pattern detection is disabled by default for pages loaded by following links from the start page.
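
    The difference between a listing page and a details page can be illustrated with a small Python sketch. This is not WebHarvy's detection algorithm, only a simplified heuristic under stated assumptions: it counts how often an element with a given tag and class occurs in a page, and treats a single occurrence as a sign that pattern detection should be disabled. The class and function names and the sample HTML are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class SignatureCounter(HTMLParser):
    """Counts how often each (tag, class) pair occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class:
            self.counts[(tag, css_class)] += 1


def occurrences(html, tag, css_class):
    """Return the number of elements matching the tag and class."""
    parser = SignatureCounter()
    parser.feed(html)
    return parser.counts[(tag, css_class)]


def should_disable_pattern_detection(html, tag, css_class):
    """True when the selected data occurs at most once in the page."""
    return occurrences(html, tag, css_class) <= 1


# Hypothetical sample pages.
LISTING_PAGE = """
<ul>
  <li><span class="price">10</span></li>
  <li><span class="price">12</span></li>
  <li><span class="price">15</span></li>
</ul>
"""
DETAILS_PAGE = '<div><h1 class="title">Widget</h1><span class="price">10</span></div>'

if __name__ == "__main__":
    print(should_disable_pattern_detection(LISTING_PAGE, "span", "price"))
    print(should_disable_pattern_detection(DETAILS_PAGE, "span", "price"))
```

    In the listing page the price element repeats once per record, so pattern detection is useful. In the details page it occurs once, which corresponds to the case where the Disable pattern detection option should be turned ON.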