WebHarvy Regular Expression Quick Start Tutorial

Regular Expressions (RegEx) can be used to accurately select required data from a block of text or HTML

Watch Video Tutorial: How to use Regular Expressions in WebHarvy?


  1. Introduction
  2. How to select string following another string ?
  3. How to select string following another string till a specific character ?
  4. How to select string between 2 other strings ?
  5. How to select URLs/Email addresses from HTML ?
  6. Commonly used regular expressions

For a very detailed Regular Expression Tutorial we highly recommend http://www.regular-expressions.info.

Introduction

WebHarvy allows you to apply Regular Expressions to the selected text (or HTML) before scraping it.

Regular expressions can be applied by clicking the 'More Options' button and then selecting the 'Apply Regular Expression' option.

You may then specify the RegEx string. WebHarvy will extract only those portion(s) of the main text which match the group(s) specified in the RegEx string. A group is the part of the expression enclosed in parentheses.

Click Apply. The resulting text after applying the Regular Expression will be displayed in the Capture window text box. Click the main 'Capture Text' button to capture it; only the result of matching the RegEx string is extracted.

Watch video: Selecting required portions of text using Regular Expressions (RegEx)

How to select string following another string ?

Suppose you need to extract the price in dollars from the text below.

The RegEx string to be used is :

Price: (.*)

Here the text following the heading Price: till the end of the line is selected. The extracted portion is the portion matched within the parentheses (.*). Dot (.) denotes any character except a line break and * denotes zero or more repetitions of it.
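The behaviour of this RegEx string can also be checked outside WebHarvy. The following is a minimal sketch using Python's re module purely for illustration. The sample text and the helper name first_group are assumptions made for this example, and WebHarvy's own RegEx engine may differ in details.

```python
import re

# Hypothetical sample text (an assumption; the dollar sign is taken to follow the amount).
sample = "Item: Desk Lamp\nPrice: 45$\nStock: 12"

def first_group(pattern, text):
    """Return the text matched by the first group, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# (.*) stops at the end of the line because '.' does not match a line break.
print(first_group(r"Price: (.*)", sample))  # 45$
```

Only the part inside the parentheses is returned; the heading Price: is matched but not extracted.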

How to select string following another string till a specific character ?

In the same example, if you need to extract the price excluding the dollar sign, the RegEx string to be used is :

Price: ([\d]*)

Here the captured portion is the run of digits which follows the heading 'Price:'. \d matches a single digit and [\d]* matches zero or more of them, so the capture stops at the first character which is not a digit. An alternative RegEx for the same purpose is :

Price: ([^$]*)

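Here the captured portion is the set of repeating characters which follows the heading Price:, up to but not including the first dollar sign $. Inside a character class [...], $ is treated as a literal character, so it does not need to be escaped with a \ there, although [^\$] works as well.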
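The two RegEx strings above give the same result only when the amount consists of digits alone. The sketch below, again in Python's re module for illustration only, compares them on hypothetical sample text in which the dollar sign is assumed to follow the amount.

```python
import re

# Hypothetical sample text, assuming the dollar sign follows the amount.
sample = "Price: 45$"

digits_only = re.search(r"Price: ([\d]*)", sample).group(1)
up_to_dollar = re.search(r"Price: ([^$]*)", sample).group(1)
print(digits_only)   # 45
print(up_to_dollar)  # 45

# The patterns differ once the amount contains a character which is not a digit.
decimal_sample = "Price: 45.99$"
decimal_digits = re.search(r"Price: ([\d]*)", decimal_sample).group(1)
decimal_up_to_dollar = re.search(r"Price: ([^$]*)", decimal_sample).group(1)
print(decimal_digits)        # 45
print(decimal_up_to_dollar)  # 45.99
```

For prices with a decimal part, the second form (everything up to the dollar sign) is therefore the safer choice.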

Read more about character classes

How to select string between 2 other strings ?

Suppose you need to extract the string embedded between <address> and </address> below.

The RegEx string to be used is :

<address>([\s\S]*?)</address>

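The portion ([\s\S]*?) matches all characters between <address> and </address>, including line breaks. [\s\S] matches any whitespace or non-whitespace character, and *? repeats it as few times as possible, so the match stops at the first </address>.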
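The reason for using [\s\S] instead of a dot becomes clear when the address spans more than one line. The sketch below uses Python's re module for illustration only; the HTML is a hypothetical sample.

```python
import re

# Hypothetical HTML; the address spans two lines on purpose.
html = "<div><address>12 Example Street\nSpringfield</address></div>"

address = re.search(r"<address>([\s\S]*?)</address>", html).group(1)
print(address)  # both lines of the address

# With (.*?) the match fails here, because '.' does not cross the line break.
dot_match = re.search(r"<address>(.*?)</address>", html)
print(dot_match)  # None
```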

How to select URLs/Email addresses from HTML ?

You can use the 'Capture HTML' option to get the HTML of the selected content in Capture window.

To extract the URL/website address from the captured HTML, use the following RegEx string :

href="([^"]*)

href=" denotes the heading text before the URL and ([^"]*) matches all characters till " in the HTML code.
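As a quick check of this RegEx string, the sketch below applies it with Python's re module (for illustration only) to a hypothetical HTML fragment of the kind the 'Capture HTML' option might return.

```python
import re

# Hypothetical HTML fragment (an assumption, not taken from a real page).
html = '<a class="site" href="https://www.example.com/contact" target="_blank">Website</a>'

# [^"]* takes every character up to, but not including, the closing quote.
url = re.search(r'href="([^"]*)', html).group(1)
print(url)  # https://www.example.com/contact
```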

To extract the email address from the captured HTML, use the following RegEx string :

mailto:([^?]*)

mailto: denotes the heading text before the email address and ([^?]*) matches all characters till ? .

The following RegEx string can also be used to extract email address (second occurrence in HTML) :

data-email="([^"]*)

data-email=" denotes the heading text before the email address and ([^"]*) matches all characters till " .
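Both email RegEx strings can be tried on a hypothetical HTML fragment which contains the address twice. The sketch uses Python's re module for illustration only. The last pattern, mailto:([^?"]*), is a suggested variant and not part of the original examples.

```python
import re

# Hypothetical HTML: the address appears in a mailto link and in a data-email attribute.
html = '<a href="mailto:info@example.com?subject=Hello" data-email="info@example.com">Email us</a>'

from_mailto = re.search(r"mailto:([^?]*)", html).group(1)
from_attribute = re.search(r'data-email="([^"]*)', html).group(1)
print(from_mailto)     # info@example.com
print(from_attribute)  # info@example.com

# If the link has no '?' after the address, [^?]* runs past the closing quote.
# Excluding the quote as well keeps the capture inside the attribute value.
plain_link = '<a href="mailto:info@example.com">Email us</a>'
from_plain = re.search(r'mailto:([^?"]*)', plain_link).group(1)
print(from_plain)      # info@example.com
```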

Commonly used RegEx strings and techniques in WebHarvy