Troubleshooting Guide - DIY
- 1. Page not loading correctly in WebHarvy's browser
- 2. Mining process is slow
- 3. No data during mining
- 4. Mining stops before completion
- 5. Next page not loading / Data extracted only from the first page
- 6. Unable to select data during configuration / Data within frame
- 7. While scraping table, only the first column data is selected
- 8. Mining does not stop. Last page is repeatedly extracted.
- 9. WebHarvy crashes/terminates unexpectedly
- 10. Unlocking the software fails with message "Activation failed due to unknown reason"
Page not loading correctly in WebHarvy's browser
WebHarvy internally uses Internet Explorer (IE) to load and navigate web pages. For modern websites to load correctly, it is recommended that the latest possible version of IE is installed on your system. To check whether the problem lies with IE, load the web page directly in IE and see if it is displayed correctly, then try again in WebHarvy.
Mining process is slow
Try the following :-
1. Make sure that the latest version of IE is installed in your system. WebHarvy is dependent on IE.
2. Turn off all IE add-ons : see http://windows.microsoft.com/en-us/internet-explorer/manage-add-ons.
3. Decrease the 'Page Load Timeout' and 'AJAX Load Wait Time' values in Miner Settings.
4. Turn off loading images in Internet Explorer (IE) settings and pages will load faster during mining. IE > Internet Options > Advanced > Multimedia > Show pictures > Untick.
No data during mining
During configuration, preview of captured data was generated and displayed correctly, but during mining (Start Mine) no data was extracted.
Please try the following to solve this problem :-
1. The first thing you can try is to save the current configuration, restart WebHarvy, open the saved configuration file and start mining again.
2. If the website from which you are trying to scrape data requires login (with a user name and password) to view the data which you need to extract, please make sure that you follow the steps given at Scraping websites which require login.
3. Make sure that the latest version of IE is installed in your system. WebHarvy is dependent on IE. Also, turn off all IE add-ons : see http://windows.microsoft.com/en-us/internet-explorer/manage-add-ons
4. Delete your Internet Explorer's (IE) entire browsing history and try mining (Start Mine) again. WebHarvy > Edit menu > Internet Options > Browsing History > Delete. Delete everything including cookies and then try mining.
5. Increase the 'AJAX Load Wait Time' in Miner Settings. Websites which employ AJAX to load and display elements may require some additional time after page load to display all data. The default value of this setting is 5 seconds. Increase it to 10, 15 or 20 seconds and see if WebHarvy is able to get results during mining.
6. Edit the configuration (Edit menu > Edit Configuration) and see if the configuration start page as well as the preview data is loaded and displayed correctly. If the problem is related to loading the start page then it can be identified here. You can fix problems related to starting page URL and Post data by selecting Edit menu > Edit Options > Edit StartURL / PostData.
7. You may also try scraping via proxy servers.
Mining stops before completion
Mining stops before the requested/total number of pages is scraped.
This usually happens when WebHarvy is unable to load the next page of data by clicking the next page link selected during configuration. Please try the following.
1. Make sure that the latest possible version of IE is installed in your system. WebHarvy is dependent on IE.
2. Turn off all IE add-ons : see http://windows.microsoft.com/en-us/internet-explorer/manage-add-ons. This will improve mining efficiency/speed.
3. Increase 'Page Load Timeout' and 'AJAX Load Wait Time' values in Miner Settings so that the page gets enough time to load all data before it is scraped. Increasing these values will slow down mining but will minimize page load time outs.
4. When mining aborts before completion, you can click the Start button again (without closing the Miner window) and WebHarvy will try to resume mining from where it stopped.
5. You can also directly change the starting URL of the configuration so that mining starts at a different page (where it stopped) than it was originally configured for.
6. Websites can also block you if you access their pages via software for a long time or extract large amounts of data. The solution here is to scrape via proxy servers or a VPN so that you can remain anonymous and avoid getting blocked by websites. Try using proxy servers with WebHarvy.
7. In case you are trying to scrape a relatively large number of records, please refer to 'How to scrape large amounts of data?'
Next page not loading / Data extracted only from first page
During mining data is extracted only from the first page. Mining stops after first page extraction, or more pages are loaded but no more data is extracted.
Try the following :-
1. Sometimes the first page of listings has a slightly different layout than the rest of the pages (for example, many Amazon product listings). In such cases the first page and the remaining pages should be scraped separately. Load the second page and then start configuration. See if data from subsequent pages (3, 4, etc.) can be extracted during the mining stage.
2. Make sure that the first item which you click after starting configuration does not belong to an advertised or sponsored listing. For example, with Yellow Pages Listings the first few ones may be sponsored/advertised listings and will only be present in the first page. Selecting data from these will prevent WebHarvy from getting data from subsequent pages.
3. Try setting the next page link via both the methods explained (see the images) at Selecting pagination links. The next link can be set either by clicking the next link/arrow or by clicking the direct link to load page number 2. Try both these methods.
4. You may also try increasing the 'AJAX Load Wait Time' value in Miner Settings.
5. In case the direct links (URLs) to each page of listings have the page number embedded in them, you can try the URL based pagination method.
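To illustrate the idea behind URL based pagination in step 5, here is a minimal Python sketch that builds one URL per listing page by substituting the page number into a URL pattern. It is not WebHarvy's own implementation. The function name page_urls and the example.com URL pattern are assumptions made for illustration; use the pattern that you actually see in the address bar when moving between pages.

```python
def page_urls(template, first_page, last_page):
    """Build one URL per listing page by substituting the page number.

    `template` must contain the placeholder {page} where the page
    number appears in the real URL.
    """
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Hypothetical URL pattern (an assumption, not taken from any real site).
urls = page_urls("https://www.example.com/listings?page={page}", 1, 3)
for url in urls:
    print(url)
```

If the generated URLs load the expected pages when opened in a browser, the website is a likely candidate for the URL based pagination method.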
Data within Frame / Unable to select data during configuration
After starting configuration, the Capture window is not displayed when an item (text/image) is clicked.
This usually happens when the data to be selected is inside a frame. To select data you will have to find the frame URL and load it directly in WebHarvy. If you have Chrome browser installed then the frame URL can be found as follows.
Load the page in the Chrome browser. Right-clicking on the data item which you need to extract should show you the option "View frame source". Clicking on it opens the frame's source code in a new tab, with its URL shown in the address bar. Remove the "view-source:" prefix from the address bar string to get the frame URL.
Load the frame URL directly in WebHarvy and Start Config. You should be able to select data.
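The prefix removal described above can be sketched in Python as follows. This is only an illustration of the string handling. The function name frame_url_from_view_source and the example address are assumptions and are not part of WebHarvy or Chrome.

```python
VIEW_SOURCE_PREFIX = "view-source:"


def frame_url_from_view_source(address):
    """Return the frame URL given the text copied from Chrome's address bar.

    If the address does not start with "view-source:", it is returned
    unchanged (apart from surrounding whitespace being removed).
    """
    address = address.strip()
    if address.startswith(VIEW_SOURCE_PREFIX):
        return address[len(VIEW_SOURCE_PREFIX):]
    return address


# Hypothetical address copied from the "View frame source" tab.
print(frame_url_from_view_source("view-source:https://www.example.com/frame.html"))
```

The resulting string is the URL to load directly in WebHarvy before starting configuration.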
Scraping data from Table / Grid layout
While scraping items displayed in table/grid layout, only the first column items are selected/extracted.
For example, product listings are often displayed in a grid layout (table - row/column format). While configuring WebHarvy to extract data from such pages, when the first product's title (or any other detail) is selected, only products from the first column are automatically identified. Products from the remaining columns are missed.
To solve this please follow the steps below :-
1. Open WebHarvy Settings : Edit menu > Settings
2. Click on the 'Advanced Miner Options' button
3. In the resulting window adjust the value of the first list option - 'Minimum number of items required in a list'. Select a value which is equal to or less than the number of columns in the table/grid. For example if products are displayed in 4 columns, this value should be set to '3'.
4. Apply Changes.
5. Now start configuration and select the first product's detail. Details of all remaining products (from all columns) should be automatically selected.
Please make sure that you revert the change made in Step 3 above before mining other websites, since this is a global miner setting.
Mining does not stop. Last page data is repeatedly extracted.
This usually happens when WebHarvy is unable to detect the end of pagination. This can be avoided by enabling the 'Automatically remove duplicate records while mining' option in Miner Settings. When this option is enabled, mining is automatically stopped when a page full of duplicate entries is encountered.
This can also be prevented by configuring the next page link by clicking on the direct link to load page number 2 (if present), instead of clicking on the 'next' link. This is shown in the second image displayed at 'How to select pagination links?'
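The stopping rule behind duplicate removal can be illustrated with a short Python sketch: records already seen are skipped, and mining stops at the first page which contributes no new record. This is a simplified model written for this guide, not WebHarvy's actual code. The function name mine_until_duplicate_page and the sample records are assumptions, and records are assumed to be hashable values such as tuples.

```python
def mine_until_duplicate_page(pages):
    """Collect unique records page by page.

    Stops at the first page that contributes no new record, which is
    what happens when a website keeps serving its last page again.
    """
    seen = set()
    collected = []
    for records in pages:
        page_had_new_record = False
        for record in records:
            if record not in seen:
                seen.add(record)
                collected.append(record)
                page_had_new_record = True
        if not page_had_new_record:
            break  # whole page was duplicates (or empty): end of pagination
    return collected


# Hypothetical data: the website repeats its last page (page 2).
pages = [
    [("Item A", "$10"), ("Item B", "$12")],
    [("Item C", "$15"), ("Item D", "$9")],
    [("Item C", "$15"), ("Item D", "$9")],
    [("Item C", "$15"), ("Item D", "$9")],
]
result = mine_until_duplicate_page(pages)
print(len(result), "unique records collected")
```

Without such a check, the repeated last page would be extracted again and again, which is the behaviour described in this section.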
WebHarvy crashes / terminates unexpectedly
Please try the following to solve this issue :-
1. Uninstall WebHarvy and install it again to a different location on your PC. Make sure that you do not install to the same location as where it was previously installed.
2. Make sure that the latest possible version of IE is installed in your system. WebHarvy is dependent on IE.
3. Turn off all IE add-ons : see http://windows.microsoft.com/en-us/internet-explorer/manage-add-ons.
4. Install WebHarvy from a user account with administrative privileges. In case you face the issue again, you may try running WebHarvy in compatibility mode. Right click the WebHarvy desktop icon, select Properties, click on the Compatibility tab, tick the 'Run this program in compatibility mode for' checkbox and select a previous version of Windows from the list box below it. You can also tick the 'Run this program as administrator' box.
5. In case the error message displayed is related to Adobe Flash Player and contains the text 'Security sandbox violation', it is recommended that you either update the installed Adobe Flash Player to the latest version or uninstall/disable it. In the latest versions of IE, Flash Player can be enabled/disabled/uninstalled from the Add-ons window - see http://windows.microsoft.com/en-us/internet-explorer/manage-add-ons. Look for 'Shockwave Flash Object'. In case you are running Windows 10, run 'iexplore' from the Run window (Win + R), or type 'Internet Explorer' in the Start menu. The default browser in Windows 10 is Edge and not Internet Explorer.
When trying to unlock the trial version of WebHarvy using the license key file the error message "Activation failed due to unknown reason" is displayed
WebHarvy registration involves online activation, so an internet connection is required. Make sure that WebHarvy is not blocked by your firewall. The built-in browser of WebHarvy, since it is an IE component, will work without any special permissions, but online activation will fail if your firewall is blocking WebHarvy.