WebHarvy Web Scraper : How Tos

1. Alternate method of mining pages which load more data when you scroll down or click a 'show more data' link

The standard method of performing multi-page scraping in these cases is explained in the following topics: Pages with 'Load more content' or 'Display more data' link or button, and Pages where more data is loaded automatically when you scroll down. In both cases, during the mining stage, WebHarvy first tries to load all pages (by scrolling down or by clicking the 'show more data' link repeatedly) before attempting to scrape. This works for most web pages, but when the requested number of pages is very high, some websites stop serving data beyond a limit, resulting in incomplete mining. This can be solved as follows.

The first step is to try again after increasing the 'Script Load Wait Time' value in Miner Settings. Increasing this value gives the page enough time to load data completely after each scroll or click.

If you are still unable to get all the data, try the method below, which involves manually loading the pages.

1. Load the page in WebHarvy
2. Manually scroll down the page or click the 'show more data' link repeatedly till the required amount of data is displayed
3. Scroll up to the start of the page
4. Start configuration
5. Select the required data. When WebHarvy asks whether to generate a complete data preview, select YES
6. Do not follow links from the page; this must be handled separately
7. Stop configuration
8. Right click the Preview area and select 'Copy Preview' option.
9. Open any spreadsheet software (for example, Microsoft Excel) and directly paste the data in table format. This method does not involve the mining stage.

If you need to follow each listing link and get the resulting data (Step 6 above), click on the link and select the 'Capture Target URL' option to get the URL of the details page. Alternatively, URLs can be captured from the HTML source of the link. In Step 8 you will then get the entire list of URLs, from which you can extract data using a single configuration as explained at https://www.webharvy.com/docs/handling-pagination.html#AddURLs.

2. How to scrape large amounts of data ?

If you are planning to scrape an entire website, or to scrape data on the order of several hundred thousand records, do not attempt to mine the entire database in a single mining session. Instead, split the task into smaller, manageable chunks of, say, a few thousand records each. Use the following methods for this.

1. You can change the starting URL of a configuration as explained at Edit Starting URL to make the configuration start mining at a different page than it was originally configured for.
2. Use the Auto Save Mined Data option (see Miner Settings) so that the mined data is not lost even if the program terminates unexpectedly. Also use the Inject Pauses during mining option in Miner Settings to avoid sending requests to the web server continuously over a long period.
3. Since the target website may block your IP address during large mining sessions, you may need to Scrape data anonymously via proxy servers or VPN to prevent mining sessions from being aborted prematurely.
4. If you are using the Category or Keyword scraping features with a large number of keywords or category links, you can split them into smaller sets by editing the keyword list or the URL list associated with the configuration.
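
The splitting described above can be prepared outside WebHarvy before pasting each batch into the configuration. The following is a minimal sketch, not part of WebHarvy itself: the function name `chunk`, the batch size and the example.com URLs are all hypothetical and only illustrate the idea of breaking a long URL or keyword list into smaller sets.

```javascript
// Split a list of items (URLs or keywords) into batches of at most `size` entries.
// `chunk` and the sample URLs are hypothetical, used only for illustration.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const result = [];
  for (let i = 0; i < items.length; i += size) {
    result.push(items.slice(i, i + size));
  }
  return result;
}

// Seven made-up category URLs, split into batches of three.
const categoryUrls = Array.from(
  { length: 7 },
  (_, i) => `https://example.com/category/${i + 1}`
);
const urlBatches = chunk(categoryUrls, 3);

// Print each batch one URL per line, ready to paste into a URL list.
urlBatches.forEach((batch, n) => {
  console.log(`Batch ${n + 1}:\n${batch.join("\n")}\n`);
});
```

Each printed batch can then be mined in its own session, which keeps individual sessions short.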

3. How to scrape multiple images from details pages ?

Data extraction from eCommerce websites often requires multiple images of a product to be scraped from the product details page. The following method can be used if directly clicking on the images and selecting the Capture Image option does not work.

1. If you are starting configuration from the product details page (where multiple images are displayed), soon after starting configuration, disable pattern detection.
2. Click on the first thumbnail image displayed beside the main/large product image on the product details page.
3. Select More Options > Capture HTML and make sure that the HTML displayed contains the image URL.
4. If the HTML does not contain the image URL, select More Options > Capture More Content, one or more times as required, till the image URL portion is displayed.
5. Select More Options > Apply Regular Expression to select the image URL from the HTML displayed.
6. Paste the correct RegEx string and click the Apply button to get only the image URL from the HTML. href="([^"]*) or src="([^"]*) or image-url="([^"]*) usually works, but this depends on the HTML code.
7. Now click the Capture Image button. WebHarvy will automatically identify that there are multiple product images and will ask whether to download all of them. Select Yes.
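
To see what the RegEx strings in step 6 select, it can help to try them on a small piece of HTML outside WebHarvy. The sketch below does this in JavaScript. The sample HTML, the example.com image URLs and the helper name `captureAll` are invented for illustration, and WebHarvy's own regular expression engine may differ from JavaScript's in other respects; these patterns only use syntax common to most engines.

```javascript
// Invented sample HTML; real product pages will differ.
const sampleHtml = `
<div class="thumbs">
  <img src="https://example.com/img/shoe-1.jpg" alt="front">
  <img src="https://example.com/img/shoe-2.jpg" alt="side">
  <a href="https://example.com/img/shoe-3-large.jpg">zoom</a>
</div>`;

// The same patterns as in step 6. The part in parentheses (capture group 1)
// is the text between the opening quote and the next double quote.
const srcPattern = /src="([^"]*)/g;
const hrefPattern = /href="([^"]*)/g;

// Return capture group 1 of every match of `pattern` in `html`.
// `captureAll` is a hypothetical helper; `pattern` must have the g flag.
function captureAll(html, pattern) {
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const srcUrls = captureAll(sampleHtml, srcPattern);
const hrefUrls = captureAll(sampleHtml, hrefPattern);

console.log(srcUrls);
console.log(hrefUrls);
```

If the pattern returns nothing for your page, check which attribute actually holds the image URL in the captured HTML and adjust the attribute name in the pattern accordingly.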

Watch Video: WebHarvy Image Scraping Tutorial

4. How to scrape META tags from HTML source ?

Follow the steps below to scrape data from META tags in the HTML source code of the web page.

1. During configuration, click anywhere on the page.
2. In the resulting Capture window displayed, double click on the Capture HTML toolbar button.
3. You should now be able to see the complete HTML source of the page displayed in the preview area of the Capture window. If not, apply Capture More Content option multiple times.
4. Select More Options > Apply Regular Expression option
5. Paste the correct RegEx string (RegEx Tutorial) to extract the required portion of the HTML and click Apply button. For example, the RegEx strings to extract values of description and keywords META tags are copied below.
<meta name="description" content="([^"]*)

<meta name="keywords" content="([^"]*)
6. Click the main Capture HTML button to capture the selected portion of HTML
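
The two RegEx strings above can be checked against a small sample before using them in a configuration. The following is a rough sketch in JavaScript; the page source, its contents and the helper name `metaContent` are made up for illustration and are not part of WebHarvy.

```javascript
// Invented page source for illustration only.
const pageSource = `<html><head>
<title>Example product</title>
<meta name="description" content="Lightweight running shoe with mesh upper">
<meta name="keywords" content="shoes, running, mesh">
</head><body></body></html>`;

// Build the same pattern as shown in step 5 for a given META tag name and
// return the captured content value, or null when the tag is not found.
// `name` is inserted as-is, so this only suits plain names like "description".
function metaContent(html, name) {
  const pattern = new RegExp(`<meta name="${name}" content="([^"]*)`);
  const match = html.match(pattern);
  return match ? match[1] : null;
}

const description = metaContent(pageSource, "description");
const keywords = metaContent(pageSource, "keywords");
const author = metaContent(pageSource, "author"); // not present in the sample

console.log(description);
console.log(keywords);
console.log(author);
```

Note that these patterns expect the name attribute to come directly before the content attribute. If a page writes the attributes in a different order or with different quoting, the RegEx string has to be adapted to the actual HTML.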

5. How to scrape repeating data (list/table) from details pages ?

Automatic pattern detection (automatically selecting repeating data) is supported only in the starting page of the configuration. So, if you need to scrape repeating data (data in list or table format) from pages reached by following a link from the starting page of configuration, follow the steps below.

This is a two-stage process. In the first stage, get the URLs of all details pages.

1. Open WebHarvy and load the starting page
2. Start configuration
3. Click and select the next page link.
4. Click the first listing link and select Capture Target URL option.
5. Stop configuration
6. Start Mine
At the end of the above step you will get a list of URLs of details pages.
7. Load the first URL in the list in WebHarvy's browser
8. Start configuration
9. First click and select the repeating data displayed in the page, i.e., data displayed in table/list format.
10. When all repeating items have been selected from the page, click and select non-repeating items like title, price etc.
11. Click the URLs button within Edit panel of Configuration menu. Paste the remaining URLs obtained in Step 6 above. Apply.
12. Stop configuration
13. Start Mine.
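
Between the two stages, the URL list from Step 6 has to be divided into the first URL (loaded in Step 7) and the remaining URLs (pasted in Step 11). A minimal sketch of that preparation is shown below, assuming the mined URLs have been copied out as plain text with one URL per line. The function name `prepareUrlList` and the sample URLs are hypothetical.

```javascript
// Clean a pasted block of URLs (one per line): trim whitespace, skip blank
// lines, drop duplicates, then separate the first URL from the rest.
// `prepareUrlList` is a hypothetical helper, not a WebHarvy feature.
function prepareUrlList(rawText) {
  const seen = new Set();
  const urls = [];
  for (const line of rawText.split(/\r?\n/)) {
    const url = line.trim();
    if (url && !seen.has(url)) {
      seen.add(url);
      urls.push(url);
    }
  }
  const [first = null, ...rest] = urls;
  return { first, rest };
}

// Made-up output of the first mining stage, including a duplicate and a blank line.
const minedUrls = `
https://example.com/item/101
https://example.com/item/102
https://example.com/item/101

https://example.com/item/103
`;

const prepared = prepareUrlList(minedUrls);
console.log("Load in browser:", prepared.first);
console.log("Paste into URLs window:\n" + prepared.rest.join("\n"));
```

Removing duplicates here is a design choice of the sketch, so that the same details page is not mined twice.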

6. How to scrape data from a list of URLs ?

Using the Add URLs to Configuration feature you can scrape data from multiple URLs using a single configuration. This requires that all URLs belong to the same domain/website and share the same page layout. The following are the steps involved.

Only a single row of data from each URL

1. Open WebHarvy and navigate to the first URL in the list
2. Start configuration
3. Select Configuration menu > Edit > Disable pattern detection option (tick).
4. Select required data
5. Select Configuration menu > Edit > URLs
6. In the resulting window paste all the remaining URLs in the list and click Apply button
7. Stop configuration
8. Start Mine - all URLs in the list will be mined using the same configuration

Multiple rows of data from each URL, spanning multiple pages

1. Open WebHarvy and navigate to the first URL in the list
2. Start configuration
3. Select the required data, select the Next Page link, and, if required, follow links and select data from the resulting pages
4. Select Configuration menu > Edit > URLs
5. In the resulting window paste all the remaining URLs in the list and click Apply button
6. Stop configuration
7. Start Mine - all URLs in the list will be mined using the same configuration

By enabling the Tag with Category/Keyword option in Miner Settings an additional column can be added to the data table filled with the URL from which the row of data is mined.
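
Since this feature requires all URLs to belong to the same domain/website, it can be useful to check a long list before pasting it into the URLs window. The sketch below groups URLs by host name using the standard `URL` class. The function name `groupByHost` and the sample URLs are assumptions made for illustration.

```javascript
// Group URLs by host name so that entries from a different website, or
// entries that are not valid URLs, stand out before the list is used.
// `groupByHost` is a hypothetical helper, not a WebHarvy feature.
function groupByHost(urls) {
  const groups = new Map();
  for (const raw of urls) {
    let host;
    try {
      host = new URL(raw).hostname;
    } catch {
      host = "(invalid URL)"; // new URL() throws on malformed input
    }
    if (!groups.has(host)) {
      groups.set(host, []);
    }
    groups.get(host).push(raw);
  }
  return groups;
}

// Made-up list containing one URL from another site and one malformed entry.
const candidateUrls = [
  "https://example.com/product/1",
  "https://example.com/product/2",
  "https://example.org/product/3",
  "not a url",
];

const hostGroups = groupByHost(candidateUrls);
for (const [host, list] of hostGroups) {
  console.log(`${host}: ${list.length} URL(s)`);
}
```

A list that produces more than one group should be split into one list per website. Sharing a host name does not guarantee that the pages share the same layout, so that still needs to be checked by hand.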

7. How to get URLs of pages from which data is extracted ?

To get URL of the currently loaded page, during configuration, click anywhere on the page and select More Options > Add Custom Data > Page URL from the resulting Capture window.
If you are following links from the starting page of the configuration, then (before following links) the Capture target URL option can be used to extract URLs of details pages.
If you are following URLs present in HTML source of the page, then while following the same method, after applying regular expression, when the URL of the page is displayed in the preview area of Capture window, you can click on Capture HTML button to capture the URL in a separate column.
If you are scraping data from a list of URLs, then by enabling the Tag with Category/Keyword option in Category/Keyword tab in Settings window, the URLs from which each row of data is scraped can be added as a separate column.
You can also use JavaScript to scrape the page URL. Follow the steps below.
1. During configuration, after selecting all required data from the currently loaded page, to get the URL of the page, click anywhere on the page and select More Options > Run Script from the resulting Capture window.
2. Paste and run the following JavaScript code.
  document.body.innerText = document.URL;
3. Now, you should be able to see the page URL displayed in the browser area.
4. Click on it and select Capture Text option from the resulting Capture window to capture it.
5. Ideally, you should stop configuration here. If for some reason you need to load the previous page in the browser, click anywhere on the page and select More Options > Run Script from the resulting Capture window. Paste and run the following code to go back to the previous page.
  window.history.back();

How to scrape data of product variants?

WebHarvy currently does not support automatically scraping product variants (data related to various color, size combinations of the same product). So, you will need to manually make the required selections (color, size etc.) during configuration, using the page interaction functions in the Capture window (Select Dropdown, Click etc.) and then scrape the resulting data displayed.

How to select data when first listing on page does not have all required data?

When the first item on page does not have all the required data, please follow the steps below.

1. During configuration, select the name/title and other available details from the first listing.
2. Details displayed only in subsequent listings can be selected from their respective locations.
3. For following links, instead of clicking on the title/link of the first listing, click on the title/link of the second or third listing, which has all the required details and select the Follow this link option.
4. Wait for the page to load and then select all the required data.
5. In the preview pane, the details selected will be updated against the first product, potentially causing misalignment of data. But during mining all product data will be correctly mined.

How to run multiple mining tasks in parallel (simultaneously)?

1. Open multiple windows of WebHarvy by running the app multiple times (by clicking on the desktop/start-menu icon)
2. From each window, open a different configuration file and start mining

How to scrape the final redirected URL after a button/link click?

When a button or link redirects to another page upon clicking, you can follow the steps given below to scrape the final redirected URL.

1. During configuration, click on the button/link and select More Options > Click
2. Wait for the final (redirected) page to load
3. Click anywhere on the page and select More Options > Add Custom Data > Page URL.
4. Preferably, the above steps should be performed as the final steps in the configuration, allowing you to stop configuration immediately afterward. If not, you will need to return to the previous page. To do this, click anywhere on the redirected page and select More Options > Page > Go Back from the resulting Capture window.