WebHarvy Regular Expression Tutorial

Regular Expressions (RegEx) can be used to accurately select the required data from a block of text or HTML.

Watch Video Tutorial: How to use Regular Expressions in WebHarvy?

1. Introduction
2. How to select string following another string ?
3. How to select string following another string till a specific character ?
4. How to select string between 2 other strings ?
5. How to select URLs/Email addresses from HTML ?
6. Commonly used regular expressions

For a very detailed Regular Expression Tutorial we highly recommend https://www.regular-expressions.info.

Introduction

WebHarvy allows you to apply Regular Expressions to the selected text (or HTML) before scraping it.

Regular expressions can be applied by clicking the 'More Options' button and then selecting the 'Apply Regular Expression' option as shown below.

You may then specify the RegEx string. WebHarvy will extract only those portions of the main text that match the groups (the parts enclosed in parentheses) specified in the RegEx string.

Click Apply. The resulting text after applying the Regular Expression will be displayed in the Capture window text box. Click the main 'Capture Text' button to capture it. The result after matching the RegEx string will be extracted as shown below.

Watch video : Selecting required portions of text using Regular Expressions (RegEx)

How to select string following another string ?

Suppose you need to extract the price in dollars from the text below.

	Product Details
	Price: 99$
	This product comes with absolutely no warranty . .

The RegEx string to be used is :

Price: (.*)

Here the text following the heading Price: till the end of the line is selected. The extracted portion is the portion matched within the parentheses (.*). Dot (.) denotes any character other than a line break, and * denotes zero or more repetitions of it.
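The same pattern can be tried outside WebHarvy. The following is a minimal sketch using Python's re module, which is an assumption made for illustration (it is not the engine WebHarvy itself uses, although this pattern behaves the same way in both). The variable names are hypothetical.

```python
import re

# Sample text from the example above.
text = (
    "Product Details\n"
    "Price: 99$\n"
    "This product comes with absolutely no warranty . ."
)

# '.' does not match a line break by default, so (.*) stops at the end of the line.
match = re.search(r"Price: (.*)", text)

# group(1) is the part matched inside the parentheses.
price = match.group(1) if match else None
print(price)
```

Only the text captured by the group is kept; the heading Price: is part of the match but is not extracted.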

How to select string following another string till a specific character ?

In the same example above if you need to extract the price excluding the dollar sign, the RegEx string to be used is :

Price: ([\d]*)

Here the captured portion is the run of digits that follows the heading 'Price:'. \d matches a single digit and * repeats it, so [\d]* (equivalent to \d*) matches zero or more digits. An alternative RegEx for the same purpose is :

Price: ([^$]*)

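Here the captured portion is the run of characters that follows the heading Price: up to, but not including, the first dollar sign. [^$] matches any single character other than $. Although $ is a special character in RegEx, it is treated as a literal character inside square brackets, so it does not have to be escaped with a \ there (writing [^\$] is also accepted).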
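The two patterns can be compared side by side. This is a sketch in Python's re module (an assumption for illustration, not WebHarvy's own engine); the second sample text, which has no dollar sign, is hypothetical and is included only to show how the two patterns differ.

```python
import re

text = (
    "Product Details\n"
    "Price: 99$\n"
    "This product comes with absolutely no warranty . ."
)

# Digits only: stops at the first non-digit character.
digits = re.search(r"Price: ([\d]*)", text).group(1)

# Everything up to the first dollar sign.
not_dollar = re.search(r"Price: ([^$]*)", text).group(1)

# Hypothetical text with no dollar sign after the price.
no_sign = "Price: 99\nShipping: free"
digits_no_sign = re.search(r"Price: ([\d]*)", no_sign).group(1)
# A negated set such as [^$] also matches line breaks, so it runs on.
not_dollar_no_sign = re.search(r"Price: ([^$]*)", no_sign).group(1)

print(digits, not_dollar, digits_no_sign, repr(not_dollar_no_sign))
```

When the terminating character might be absent, the digit-based pattern is the safer choice, because a negated set keeps matching across line breaks until it finds the excluded character or reaches the end of the text.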

How to select string between 2 other strings ?

Suppose you need to extract the string embedded between <address> and </address> below.

	<address>
	356, Street Name, City, Country
	</address>

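The RegEx string to be used is :

<address>([\s\S]*?)</address>

The portion ([\s\S]*?) matches all characters, including line breaks, between <address> and </address>. The ? after * makes the match stop at the first </address> instead of the last one.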
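A minimal sketch of this technique in Python's re module follows (an assumption for illustration, not WebHarvy's own engine). It assumes the full pattern places the group ([\s\S]*?) between the literal opening and closing tags. The two-block sample and the variable names are hypothetical.

```python
import re

html = "<address>\n\t356, Street Name, City, Country\n</address>"

# [\s\S] matches any character, line breaks included; *? repeats it lazily.
match = re.search(r"<address>([\s\S]*?)</address>", html)
raw = match.group(1)       # still carries the surrounding line breaks and tab
address = raw.strip()      # trimmed here only for display

# Hypothetical input with two blocks, to show why the lazy form matters.
two_blocks = "<address>A</address><address>B</address>"
lazy = re.findall(r"<address>([\s\S]*?)</address>", two_blocks)
greedy = re.findall(r"<address>([\s\S]*)</address>", two_blocks)

print(address)
print(lazy)
print(greedy)
```

The lazy form yields one result per block, while the greedy form swallows everything from the first opening tag to the last closing tag.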

How to select URLs/Email addresses from HTML ?

You can use the 'Capture HTML' option to get the HTML of the selected content in the Capture window.

Suppose you need to extract the URL (website address) from the following HTML.

	<div class="call-to-action ">
	<a title="Website (opens in a new window)"
	class="contact contact-main contact-url " href="http://www.canberraeyelaser.com.au" target="_blank" rel="nofollow">
	<span class="glyph icon-website border border-dark-blue with-text"></span><span class="contact-text">Website</span>
	</a>
	</div>

Use the following RegEx string :

href="([^"]*)

href=" denotes the heading text before the URL and ([^"]*) matches all characters up to, but not including, the next " in the HTML code.
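The URL pattern can be checked with a short sketch in Python's re module (an assumption for illustration, not WebHarvy's own engine). The HTML is the anchor tag from the sample above; the variable names are hypothetical.

```python
import re

# The anchor tag from the sample HTML above.
html = (
    '<a title="Website (opens in a new window)"\n'
    'class="contact contact-main contact-url " '
    'href="http://www.canberraeyelaser.com.au" target="_blank" rel="nofollow">'
)

# [^"]* matches every character up to the closing double quote of the attribute.
match = re.search(r'href="([^"]*)', html)
url = match.group(1) if match else None
print(url)
```

If the HTML contains several links, this pattern returns the first href value it finds, so it works best when the captured HTML holds a single link.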

Suppose you need to extract the email address from the following HTML.

	<div class="call-to-action ">
	<a title="Email" class="contact contact-main contact-email "
	href="mailto:info@canberraeyelaser.com.au?subject=Enquiry%2C%20sent%20from%20yellowpages.com.au&
	body=%0A%0A%0A%0A%0A------------------------------------------%0AEnquiry%20via%20yellowpages.com.au%0Ahttp%3A%2F%2Fyellowpages.com.au%2Fact%2Fphillip%2Fcanberra-eye-laser-15333167-listing.html%3Fcontext%3DbusinessTypeSearch"
	rel="nofollow" data-email="info@canberraeyelaser.com.au">
	<span class="glyph icon-email border border-dark-blue with-text"></span><span class="contact-text">Email</span>
	</a>
	</div>

Use the following RegEx string :

mailto:([^?]*)

mailto: denotes the heading text before the email address and ([^?]*) matches all characters up to, but not including, the first ? that follows.

The following RegEx string can also be used to extract the email address, this time from the data-email attribute (its second occurrence in the HTML) :

data-email="([^"]*)
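Both email patterns can be compared in a short sketch using Python's re module (an assumption for illustration, not WebHarvy's own engine). The href value below is shortened from the sample HTML to keep the example readable, and the variable names are hypothetical.

```python
import re

# Anchor tag from the sample above; the long mailto query string is shortened.
html = (
    '<a title="Email" class="contact contact-main contact-email " '
    'href="mailto:info@canberraeyelaser.com.au?subject=Enquiry%2C%20sent" '
    'rel="nofollow" data-email="info@canberraeyelaser.com.au">'
)

# First occurrence: everything after 'mailto:' up to the '?' that starts the query.
from_mailto = re.search(r"mailto:([^?]*)", html).group(1)

# Second occurrence: the value of the data-email attribute.
from_attr = re.search(r'data-email="([^"]*)', html).group(1)

print(from_mailto)
print(from_attr)
```

The mailto pattern relies on a ? following the address. For links without a query string, the data-email pattern (or a pattern that stops at the closing quote) is the more dependable of the two.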