Watch Video Tutorial: How to use Regular Expressions in WebHarvy?
- 1. Introduction
- 2. How to select string following another string ?
- 3. How to select string following another string till a specific character ?
- 4. How to select string between 2 other strings ?
- 5. How to select URLs/Email addresses from HTML ?
- 6. Commonly used regular expressions
For a very detailed Regular Expression Tutorial we highly recommend https://www.regular-expressions.info.
Introduction
WebHarvy allows you to apply Regular Expressions to the selected text (or HTML) before scraping it.
Regular expressions can be applied by clicking the 'More Options' button and then selecting the 'Apply Regular Expression' option as shown below.
You may then specify the RegEx string. WebHarvy will extract only the portion(s) of the main text that match the group(s), i.e. the parts enclosed in parentheses, specified in the RegEx string.
Click Apply. The resulting text after applying the Regular Expression will be displayed in the Capture window text box. Click the main 'Capture Text' button to capture it. The result after matching the RegEx string will be extracted as shown below.
Watch video : Selecting required portions of text using Regular Expressions (RegEx)
How to select string following another string ?
Suppose you need to extract the price in dollars from the text below.
Product Details
Price: 99$
This product comes with absolutely no warranty.
The RegEx string to be used is :
Price: (.*)
Here the text following the heading Price: till the end of the line is selected. The extracted portion is the part matched within the parentheses (.*). Dot (.) matches any character except a line break, and * repeats it zero or more times, which is why the match stops at the end of the line.
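The behaviour described above can be tried outside WebHarvy. The following is a minimal sketch that uses Python's re module purely for illustration (an assumption; it is not WebHarvy's own engine, though this pattern uses syntax common to most RegEx flavors). The variable names are hypothetical.

```python
import re

# Sample text from the example above.
text = (
    "Product Details\n"
    "Price: 99$\n"
    "This product comes with absolutely no warranty."
)

# The parenthesized group (.*) is the portion that gets extracted.
match = re.search(r"Price: (.*)", text)
price = match.group(1) if match else None

print(price)  # 99$
```

Because the dot does not match a line break by default, the captured text ends at the end of the Price line.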
How to select string following another string till a specific character ?
In the same example above if you need to extract the price excluding the dollar sign, the RegEx string to be used is :
Price: ([\d]*)
Here the captured portion is the run of digits (\d, repeated by *) which follows the heading 'Price:'. An alternative RegEx for the same purpose is :
Price: ([^$]*)
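Here the captured portion is the run of characters which follows the heading Price: up to, but not including, the first dollar sign. [^$] matches any character that is not a $. Inside a character class $ is treated as a literal character, so it needs no escaping there; elsewhere in a RegEx string it is a special character and must be escaped as \$.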
Read more about character classes
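To compare the two patterns side by side, here is a small sketch, again assuming Python's re module as a stand-in for illustration. The second sample string ("Price: 1,299$") is a hypothetical input added to show where the two patterns differ.

```python
import re

text = "Product Details\nPrice: 99$\nThis product comes with absolutely no warranty."

# Digits only: stops at the first non-digit character.
digits_only = re.search(r"Price: ([\d]*)", text).group(1)

# Everything up to (not including) the first dollar sign.
up_to_dollar = re.search(r"Price: ([^$]*)", text).group(1)

print(digits_only, up_to_dollar)  # 99 99

# Hypothetical input with a thousands separator: the patterns now differ.
other = "Price: 1,299$"
digits_other = re.search(r"Price: ([\d]*)", other).group(1)   # "1"
dollar_other = re.search(r"Price: ([^$]*)", other).group(1)   # "1,299"
```

The negated character class is the safer choice when the value may contain separators or decimals, while the digit class is stricter about what it accepts.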
How to select string between 2 other strings ?
Suppose you need to extract the string embedded between <address> and </address> below.
<address>
356, Street Name, City, Country
</address>
The RegEx string to be used is :
<address>([\s\S]*?)</address>
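The portion ([\s\S]*?) matches all characters, including line breaks, between <address> and the first </address> which follows. The ? makes the match lazy, so it stops at the nearest closing tag instead of the last one.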
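A short sketch of this technique, assuming Python's re module for illustration. The second input (two address blocks on one line) is a hypothetical example added to show why the lazy form *? is used rather than the greedy *.

```python
import re

html = "<address>\n356, Street Name, City, Country\n</address>"

# [\s\S] matches any character, including the line breaks around the address.
m = re.search(r"<address>([\s\S]*?)</address>", html)
address = m.group(1).strip()  # strip() removes the surrounding line breaks
print(address)  # 356, Street Name, City, Country

# Hypothetical input with two blocks: lazy vs greedy matching.
two = "<address>A</address><address>B</address>"
lazy = re.search(r"<address>([\s\S]*?)</address>", two).group(1)
greedy = re.search(r"<address>([\s\S]*)</address>", two).group(1)
print(lazy)    # A
print(greedy)  # A</address><address>B
```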
How to select URLs/Email addresses from HTML ?
You can use the 'Capture HTML' option to get the HTML of the selected content in Capture window.
To extract the URL/website address from the following HTML.
<div class="call-to-action ">
<a title="Website (opens in a new window)"
class="contact contact-main contact-url " href="http://www.canberraeyelaser.com.au" target="_blank" rel="nofollow">
<span class="glyph icon-website border border-dark-blue with-text"></span><span class="contact-text">Website</span>
</a>
</div>
Use the following RegEx string :
href="([^"]*)
href=" denotes the heading text before the URL and ([^"]*) matches all characters till " in the HTML code.
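The same extraction can be sketched as follows, assuming Python's re module for illustration and using the HTML from the example above.

```python
import re

html = '''<div class="call-to-action ">
<a title="Website (opens in a new window)"
class="contact contact-main contact-url " href="http://www.canberraeyelaser.com.au" target="_blank" rel="nofollow">
<span class="glyph icon-website border border-dark-blue with-text"></span><span class="contact-text">Website</span>
</a>
</div>'''

# href=" anchors the match; ([^"]*) captures everything up to the closing quote.
m = re.search(r'href="([^"]*)', html)
url = m.group(1) if m else None
print(url)  # http://www.canberraeyelaser.com.au
```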
To extract the email address from the following HTML.
<div class="call-to-action ">
<a title="Email" class="contact contact-main contact-email "
href="mailto:info@canberraeyelaser.com.au?subject=Enquiry%2C%20sent%20from%20yellowpages.com.au&
body=%0A%0A%0A%0A%0A------------------------------------------%0AEnquiry%20via%20yellowpages.com.au%0Ahttp%3A%2F%2Fyellowpages.com.au%2Fact%2Fphillip%2Fcanberra-eye-laser-15333167-listing.html%3Fcontext%3DbusinessTypeSearch"
rel="nofollow" data-email="info@canberraeyelaser.com.au">
<span class="glyph icon-email border border-dark-blue with-text"></span><span class="contact-text">Email</span>
</a>
</div>
Use the following RegEx string :
mailto:([^?]*)
mailto: denotes the heading text before the email address and ([^?]*) matches all characters till ? .
The following RegEx string can also be used to extract email address (second occurrence in HTML) :
data-email="([^"]*)
data-email=" denotes the heading text before the email address and ([^"]*) matches all characters till " .
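Both email patterns can be checked with a small sketch, assuming Python's re module for illustration. The HTML below is a shortened version of the example above (the long subject and body parameters are cut down), so treat it as a hypothetical input.

```python
import re

# Shortened form of the example HTML; the query string is abbreviated.
html = (
    '<a title="Email" class="contact contact-main contact-email " '
    'href="mailto:info@canberraeyelaser.com.au?subject=Enquiry" '
    'rel="nofollow" data-email="info@canberraeyelaser.com.au">'
)

# First occurrence: text after mailto: up to the ? that starts the parameters.
from_mailto = re.search(r"mailto:([^?]*)", html).group(1)

# Second occurrence: value of the data-email attribute, up to the closing quote.
from_attr = re.search(r'data-email="([^"]*)', html).group(1)

print(from_mailto)  # info@canberraeyelaser.com.au
print(from_attr)    # info@canberraeyelaser.com.au
```

Note that mailto:([^?]*) relies on a ? following the address; when a mailto link has no parameters, a pattern that stops at the closing quote is the more dependable choice.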
Commonly used RegEx strings and techniques in WebHarvy
(.*)
Selects only the first line from a block of text or HTML.

[\s]*(.*)
Selects the first line, ignoring leading white-space (spaces, line feeds and carriage returns). [\s]* matches all white-space up to the first visible character.

href="([^"]*)
Gets the href link/URL from HTML. [^"]* matches till the next " character.

src="([^"]*)
Gets the src link/URL from HTML. This can also be modified according to requirement, as shown below.
zoom-image="([^"]*)
data-large-image="([^"]*)

mailto:([^"]*)
Gets the email address from HTML.

Alloy Wheels([\s\S]*?)<div class="icon">
Gets the string between 'Alloy Wheels' and <div class="icon">. This can be modified to match any string which is guaranteed to appear between 2 other strings in HTML or in TEXT. [\s\S]* matches everything (white-space and non white-space, i.e. all characters including line breaks).

Starting Text([\s\S]*?)Ending Text
General format of the above case. Just place ([\s\S]*?) between the starting and ending portions and the in-between text or HTML is matched and selected.

itemprop="name">([^<]*)<div class="line">
Gets the content between itemprop="name"> and <div class="line">. [^<]* matches all characters till <, so this works only when no other tag appears in between.

itemprop="name">([\s\S]*?)<div class="line">
Same as above, but also matches when other tags appear in between.

(?=[^M]*MAP)[^M]*MAP: \$(.*)|List Price: \$(.*)
Conditional regular expression. Captures the MAP price if available, else captures the List Price. RegEx special characters like $, ., ^ etc. should be escaped with \ (example: \$, \. etc).

<img src="([^"]*)
First image URL in HTML.

<img src=[\s\S]*?<img src="([^"]*)
Second image URL in HTML, i.e. the src value of the second img tag.

(In Stock)
Matches and gives the value 'In Stock' only if the selected HTML or TEXT contains the text 'In Stock'. This can be used to check whether the selected HTML or TEXT contains a specific string.

merch_name[^>]*>([^<]*)
Matches the string which comes between 2 HTML tags where the starting tag contains the text 'merch_name'. [^>]*> matches till the next > and [^<]* matches till the next <.
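A few of the techniques listed above can be tried with the sketch below. It assumes Python's re module for illustration, and the helper first_group is hypothetical: it simply returns the first group that took part in the match, which is one plausible way to read the conditional pattern and is not a statement of how WebHarvy handles multiple groups internally. All sample inputs are made up.

```python
import re


def first_group(pattern, text):
    """Return the first capture group that took part in the match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return next((g for g in m.groups() if g is not None), None)


# Conditional pattern: prefer the MAP price, otherwise fall back to List Price.
PRICE = r"(?=[^M]*MAP)[^M]*MAP: \$(.*)|List Price: \$(.*)"
with_map = first_group(PRICE, "List Price: $120\nMAP: $99")   # "99"
without_map = first_group(PRICE, "List Price: $120")          # "120"

# Second image: skip lazily past the first <img src= to reach the next one.
imgs = '<img src="a.jpg"><p>text</p><img src="b.jpg">'
second_img = first_group(r'<img src=[\s\S]*?<img src="([^"]*)', imgs)  # "b.jpg"

# Text inside the tag whose attributes mention merch_name.
tag = '<span class="merch_name bold">Acme Kettle</span>'
name = first_group(r"merch_name[^>]*>([^<]*)", tag)  # "Acme Kettle"

# Presence check: a result means the text contains 'In Stock'.
in_stock = first_group(r"(In Stock)", "Availability: In Stock")  # "In Stock"
sold_out = first_group(r"(In Stock)", "Availability: Sold Out")  # None

print(with_map, without_map, second_img, name, in_stock, sold_out)
```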