- 1. How to change WebHarvy Browser's emulated IE version ?
- 2. How to scrape pages which load more data when you scroll down or click a 'show more content' link/button ? (Method 2)
- 3. How to scrape large amounts of data ?
- 4. How to scrape multiple images from details pages ?
- 5. How to scrape META tags from HTML source ?
- 6. How to scrape repeating data / tables from details pages ?
- 7. How to scrape a list of URLs using the same configuration ?
How to change WebHarvy Browser's emulated IE version ?
WebHarvy internally uses Internet Explorer (IE) to load and navigate web pages, so it is recommended that the latest version of IE be installed on your system for modern websites to load correctly. By default, WebHarvy emulates the latest version of IE installed on your system.
However, if you need to manually change the version of IE emulated by WebHarvy's browser, please make the following registry key change.
The default value for the above is 11000 (decimal), which denotes IE 11 emulation. This value can be modified as explained at https://msdn.microsoft.com/en-us/library/ee330730(v=vs.85).aspx#browser_emulation. For example, to emulate IE 10 the value should be 10000 (decimal), and for IE 9 it should be 9000 (decimal).
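Registry editors often show DWORD values in hexadecimal, so it can help to see the decimal values quoted above next to their hex form. The following is a small sketch only: the function names are hypothetical, and the table covers just the three values mentioned in this answer (other values are listed on the MSDN page).

```python
# Decimal emulation values quoted in the text above (IE 11, IE 10, IE 9).
# Only these three are taken from this page; see the MSDN page for others.
EMULATION_VALUES = {11: 11000, 10: 10000, 9: 9000}


def emulation_value(ie_major_version: int) -> int:
    """Return the decimal emulation value for an IE major version."""
    try:
        return EMULATION_VALUES[ie_major_version]
    except KeyError:
        raise ValueError(f"no value listed here for IE {ie_major_version}")


def emulation_value_hex(ie_major_version: int) -> str:
    """Return the same value as a hexadecimal string, as a registry editor may show it."""
    return hex(emulation_value(ie_major_version))


if __name__ == "__main__":
    for version in sorted(EMULATION_VALUES, reverse=True):
        print(version, emulation_value(version), emulation_value_hex(version))
```

When entering the value by hand, make sure the base selected in the registry editor (decimal or hexadecimal) matches the form of the number you type.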
Alternate method of mining pages which load more data when you scroll down or click a 'show more data' link
The standard methods of performing multi-page scraping in these cases are explained at the following links:
Pages with 'Load more content' or 'Display more data' link or button
Pages where more data is loaded automatically when you scroll down
In both of the above cases, during the mining stage, WebHarvy first tries to load all pages (by scrolling down or by clicking the 'show more data' link repeatedly) before attempting to scrape. This works for most web pages, but when the requested number of pages is very high, some websites stop serving data beyond a limit, resulting in incomplete mining. This can be solved as follows.
The first step is to try again after increasing the 'AJAX Load Wait Time' value in Miner Settings. Increasing this value gives the page enough time to load data completely after each scroll or click.
If you are still unable to get all the data, try the method below, which involves manually loading the pages.
- 1. Load the page in WebHarvy
- 2. Manually scroll down the page or click the 'show more data' link repeatedly till the required amount of data is displayed
- 3. Scroll up to the start of the page
- 4. Start Config
- 5. Select the required data. When WebHarvy asks whether to generate a complete data preview, select YES
- 6. Do not follow links from the page; this must be handled separately
- 7. Stop Config
- 8. Right click the Preview area and select ‘Copy Preview’ option.
- 9. Open any spreadsheet software (for example, Microsoft Excel) and directly paste the data in table format. This method does not involve the mining stage.
In case you need to follow each listing link and get the resulting data, in Step 6 above click the link and select the ‘Capture Target URL’ option to get the URL of the details page. Alternatively, URLs can also be captured from the HTML source of the link as explained here. In Step 8 you will then get the entire list of URLs, from which you can extract data using a single configuration as explained at https://www.webharvy.com/tour3.html#AddURLs.
How to scrape large amounts of data ?
In case you are planning to scrape an entire website, or to scrape data on the order of several hundred thousand records, it is recommended that instead of attempting to mine the entire database in a single mining session, you split the whole task into smaller, manageable chunks, say of a few thousand records each. Use the following methods for this.
1. You can change the starting URL of a configuration as explained at Edit Starting URL to make the configuration start mining at a different page than it was originally configured for.
2. Use the Auto Save Mined Data option (see Miner Settings) so that the mined data is not lost even if the program terminates unexpectedly. Also use the Inject Pauses during mining option in Miner Settings to avoid sending requests to the web server continuously for long periods.
3. Since there is a chance that the target website will block your IP address during large mining sessions, you may need to Scrape data anonymously via proxy servers or VPN to prevent mining sessions from being aborted prematurely.
4. In case you are using the Category or Keyword scraping features with a large number of keywords or category links, you can split them using the features to edit the keyword list and to edit the URL list associated with the configuration.
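Splitting a long keyword or category URL list into batches can be done outside WebHarvy before pasting each batch into the configuration. The sketch below is illustrative only: the function name, the sample keywords, and the chunk size are assumptions, not part of WebHarvy.

```python
from typing import Iterator, List, Sequence


def split_into_chunks(items: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most chunk_size items, preserving order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Hypothetical keyword list; in practice this would be read from your own file.
keywords = [f"keyword-{i}" for i in range(1, 11)]

# Each batch can be pasted into the configuration and mined in its own session.
chunks = list(split_into_chunks(keywords, 4))

if __name__ == "__main__":
    for number, chunk in enumerate(chunks, start=1):
        print(f"Batch {number}:")
        print("\n".join(chunk))
```

Mining one batch per session keeps each run small, so a blocked or interrupted session costs only that batch.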
How to scrape multiple images from details pages ?
Data extraction from eCommerce websites often requires multiple images of a product to be scraped from the product details page. The following method can be used if directly clicking on the images and selecting the Capture Image option does not work.
1. Click on the first thumbnail image displayed beside the main/large product image on the product details page.
2. From the resulting Capture window, select More Options > Capture More Content (one or more times as required). The idea here is to ensure that the HTML of the selected portion/content contains the image URL.
3. Select More Options > Capture HTML, then check that the HTML displayed contains the image URL.
4. Select More Options > Apply Regular Expression, to select the image URL from the HTML displayed - check this
5. Paste the correct RegEx string (as explained here) and click the Apply button to get only the image URL from the HTML. href="([^"]*) or src="([^"]*) or image-url="([^"]*) usually works, but the right choice depends on the HTML code of the page.
6. Now click the Capture Image button. WebHarvy will automatically identify that there are multiple product images and will ask whether to download all of them. Select Yes.
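To see what the RegEx strings in step 5 capture, they can be tried against a fragment of HTML outside WebHarvy. The sketch below uses Python's re module, which behaves the same as WebHarvy's RegEx box for patterns this simple as far as we are aware; the sample HTML, the example.com URLs, and the helper name are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical HTML, similar to what Capture HTML might show for a thumbnail.
SAMPLE_HTML = (
    '<a href="https://example.com/images/large-1.jpg">'
    '<img src="https://example.com/images/thumb-1.jpg" alt="Front view"></a>'
)

# The three RegEx strings suggested in step 5, copied unchanged.
PATTERNS = {
    "href": r'href="([^"]*)',
    "src": r'src="([^"]*)',
    "image-url": r'image-url="([^"]*)',
}


def first_match(pattern: str, html: str) -> Optional[str]:
    """Return the text captured by the first group, or None if nothing matches."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        print(name, "->", first_match(pattern, SAMPLE_HTML))
```

The group `([^"]*)` captures everything up to the next double quote, which is why only the URL is returned. A pattern whose attribute is absent from the HTML returns nothing, so pick the one that matches the attribute holding the image URL on your page.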
How to scrape META tags from HTML source ?
Follow the steps below to scrape data from META tags from the HTML source code of the web page.
1. During configuration, click any portion of the web page, preferably the main title or the space above it.
2. From the Capture window, select More Options > Capture More Content repeatedly until the entire page content is selected and displayed in the preview. You will typically have to apply this option 8 to 10 times, depending on the website.
3. Select More Options > Capture HTML option. The HTML then displayed in the preview should contain the entire HTML source of the page including META tags.
4. Select More Options > Apply Regular Expression option
5. Paste the correct RegEx string (RegEx Tutorial) to extract the required portion of the HTML and click the Apply button. For example, the RegEx strings to extract the values of the description and keywords META tags are given below.
<meta name="description" content="([^"]*)
<meta name="keywords" content="([^"]*)
6. Click the main Capture HTML button to capture the selected portion of HTML
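The two RegEx strings in step 5 can be checked against a page's HTML before using them in a configuration. This is a minimal sketch using Python's re module; the sample HTML and the helper name are invented for illustration, and the patterns assume the attributes appear in exactly the order name, then content, with double quotes.

```python
import re
from typing import Optional

# Hypothetical page source containing the two META tags discussed above.
SAMPLE_HTML = """<html><head>
<title>Oak Furniture</title>
<meta name="description" content="Handmade oak furniture, delivered worldwide.">
<meta name="keywords" content="oak, furniture, handmade">
</head><body></body></html>"""

# RegEx strings copied unchanged from step 5.
DESCRIPTION_RE = r'<meta name="description" content="([^"]*)'
KEYWORDS_RE = r'<meta name="keywords" content="([^"]*)'


def extract(pattern: str, html: str) -> Optional[str]:
    """Return the captured META content, or None when the tag is not found."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract(DESCRIPTION_RE, SAMPLE_HTML))
    print(extract(KEYWORDS_RE, SAMPLE_HTML))
```

If a site writes the tag differently (for example content before name, or single quotes), these patterns will not match and the RegEx string needs to be adapted to that page's HTML.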
How to scrape repeating data (list/table) from details pages ?
Automatic pattern detection (automatically selecting repeating data) is supported only in the starting page of the configuration. So, if you need to scrape repeating data (data in list or table format) from pages reached by following a link from the starting page of configuration, follow the steps below.
This is a two-stage process. In the first stage, get the URLs of all details pages.
1. Open WebHarvy and load the starting page of configuration
2. Start Config
3. Click and select the next page link.
4. Click the first listing link and select Capture Target URL option.
5. Stop Config
6. Start Mine
At the end of the above step you will have a list of URLs of details pages. In the second stage, scrape the repeating data from these pages.
7. Load the first URL in the list in WebHarvy's browser
8. Start Config
9. First click and select the repeating data displayed in the page, i.e., data displayed in table/list format.
10. When all repeating items have been selected from the page, click and select non-repeating items like title, price etc.
11. Select Edit menu > Edit Options > Add/Remove URLs from Configuration. Paste the remaining URLs obtained in Step 6 above and click Apply.
12. Stop Config
13. Start Mine.
How to scrape data from a list of URLs ?
Using the Add URLs to Configuration feature, you can scrape data from multiple URLs using a single configuration. This requires that all URLs belong to the same domain/website and share the same page layout. The steps involved are as follows.
Only a single row of data from each URL
- 1. Open WebHarvy and navigate to the first URL in the list
- 2. Start Config
- 3. Select Edit menu > Edit Options > Disable start-page pattern detection
- 4. Select required data
- 5. Select Edit menu > Edit Options > Add/Remove URLs from Configuration
- 6. In the resulting window paste all the remaining URLs in the list and click Apply button
- 7. Stop Config
- 8. Start Mine - all URLs in the list will be mined using the same configuration
Multiple rows of data from each URL, spanning multiple pages
- 1. Open WebHarvy and navigate to the first URL in the list
- 2. Start Config
- 3. Select the required data, select the Next Page Link, and follow links and select data if required
- 4. Select Edit menu > Edit Options > Add/Remove URLs from Configuration
- 5. In the resulting window paste all the remaining URLs in the list and click Apply button
- 6. Stop Config
- 7. Start Mine - all URLs in the list will be mined using the same configuration
By enabling the Tag with Category/Keyword option in Miner Settings, an additional column can be added to the data table, filled with the URL from which each row of data was mined.
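Because all URLs in the list must belong to the same domain/website, it can be worth cleaning the list before pasting it into the Add/Remove URLs window. The following sketch is not part of WebHarvy: the function name and the example.com URLs are hypothetical. It removes blank lines and duplicates and reports an error if more than one host name is present.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def prepare_url_list(urls: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blanks and duplicates, and require a single host."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        cleaned.append(url)

    # urlsplit(...).hostname is already lower-cased and excludes any port.
    hosts = {urlsplit(url).hostname for url in cleaned}
    if len(hosts) > 1:
        raise ValueError(f"URLs span more than one host: {sorted(map(str, hosts))}")
    return cleaned


if __name__ == "__main__":
    sample = [
        "https://www.example.com/item/1",
        "https://www.example.com/item/2 ",
        "",
        "https://www.example.com/item/1",
    ]
    print("\n".join(prepare_url_list(sample)))
```

This only checks the host name. Whether the pages share the same layout still has to be confirmed by opening a few of them.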