Web Scraping News Articles

WebHarvy can be used to scrape article data from various websites. For example, news articles can be scraped from websites such as Business Wire, Google News, and the Wall Street Journal. Medical and research articles can be scraped from the National Institutes of Health, Springer, ScienceDirect, Nature, and similar sites.

While scraping online articles, the following data items are most commonly selected for extraction:

1. Article Title
2. Author Name
3. Author Contact Details (Email, Phone etc.)
4. Published Date
5. Article content in plain text or HTML
6. Images embedded in articles
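
If you plan to post-process the scraped data in your own scripts, the data items listed above map naturally onto a single record per article. The following is a minimal Python sketch of such a record. The class name `ArticleRecord` and its field names are illustrative assumptions and are not part of WebHarvy or its export format.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One scraped article (hypothetical layout, not a WebHarvy format)."""
    title: str
    author_name: Optional[str] = None
    author_contact: Optional[str] = None   # email, phone etc., if present
    published_date: Optional[str] = None   # kept as text, exactly as scraped
    content: str = ""                      # plain text or HTML
    image_urls: List[str] = field(default_factory=list)


# Example record with made-up values.
record = ArticleRecord(
    title="Example headline",
    author_name="Jane Doe",
    published_date="2024-01-15",
    content="Body text of the article.",
    image_urls=["https://example.com/image1.jpg"],
)

# Convert to a plain dict, e.g. before writing a row to a file or database.
row = asdict(record)
```

Optional fields default to empty values because not every website publishes author contact details or images for every article.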

How to scrape articles using WebHarvy?

WebHarvy allows you to load any web page within its browser and select the required data using simple mouse clicks. You may download and install the free evaluation version of WebHarvy on your computer.

The general steps to scrape articles using WebHarvy are given below:

1. Open WebHarvy and load the page that lists the articles you need to scrape.
2. Start Configuration
3. Start selecting data. Details like title, author name, date, article URL etc. can be selected by clicking directly on them.
4. If the listings span multiple pages, click on the link that loads the next page and set it as the next page link.
5. Click on the title of the first article and select the Follow this link option to load the article details page.
6. Once the article page is loaded, click and select further details like article text, author contact details, images etc.
7. Stop Configuration
8. Save Configuration
9. Start Mine
10. Once mining finishes, you can save the scraped data to a file or database.
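
For readers who want to see the logic behind the steps above (listing pages, next page link, following each article link, saving the result), it can be sketched in Python as follows. This is an illustration only and is not how WebHarvy is implemented. The pages are a small in-memory stand-in for a website, and the `article`, `next` and `body` class names in the HTML are assumptions made for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical in-memory "website" (URL -> HTML). A real run would fetch pages.
PAGES = {
    "/news?page=1": '<a class="article" href="/a/1">First</a>'
                    '<a class="next" href="/news?page=2">Next</a>',
    "/news?page=2": '<a class="article" href="/a/2">Second</a>',
    "/a/1": '<h1>First story</h1><p class="body">Text of the first story.</p>',
    "/a/2": '<h1>Second story</h1><p class="body">Text of the second story.</p>',
}


class ListingParser(HTMLParser):
    """Collects article links and the next page link from a listing page."""

    def __init__(self):
        super().__init__()
        self.article_links = []
        self.next_page = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("class") == "article":
            self.article_links.append(attrs["href"])
        elif attrs.get("class") == "next":
            self.next_page = attrs["href"]


class ArticleParser(HTMLParser):
    """Extracts the title (<h1>) and body text (<p class="body">)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.body = ""
        self._target = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._target = "title"
        elif tag == "p" and dict(attrs).get("class") == "body":
            self._target = "body"

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._target = None

    def handle_data(self, data):
        if self._target == "title":
            self.title += data
        elif self._target == "body":
            self.body += data


def scrape(start_url):
    """Walk the listing pages, follow each article link, return one row per article."""
    rows, url, seen = [], start_url, set()
    while url and url not in seen:          # stop when there is no next page
        seen.add(url)
        listing = ListingParser()
        listing.feed(PAGES[url])
        for link in listing.article_links:  # the "Follow this link" step
            article = ArticleParser()
            article.feed(PAGES[link])
            rows.append({"url": link, "title": article.title, "content": article.body})
        url = listing.next_page             # the "next page link" step
    return rows


def to_csv(rows):
    """Serialize the scraped rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "title", "content"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape("/news?page=1")
csv_text = to_csv(rows)
```

The `seen` set guards against a next page link that points back to a page already visited, which would otherwise loop forever.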

Examples

Given below are demonstrations of using WebHarvy to scrape article data from various websites.

Scraping BusinessWire Articles

The following video shows how WebHarvy can be used to scrape news articles from businesswire.com by submitting a list of keywords. The Keyword Scraping feature of WebHarvy is used for this: the articles corresponding to each submitted keyword are scraped in turn. For each article, details like title, content, URL, date published etc. can be scraped.
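
The idea of scraping by a list of keywords can be sketched in Python as one search URL per keyword, each of which is then scraped like any other listing page. This is only a sketch of the concept and does not show how WebHarvy's Keyword Scraping feature works internally. The base URL `https://example.com/search` and the query parameter `q` are placeholders, not the actual search URL of any website named here.

```python
from urllib.parse import urlencode

# Placeholder search endpoint (an assumption for illustration only).
SEARCH_URL = "https://example.com/search"


def build_search_urls(keywords, base_url=SEARCH_URL):
    """Return a mapping of keyword -> search URL, skipping blank keywords."""
    urls = {}
    for keyword in keywords:
        cleaned = keyword.strip()
        if not cleaned:
            continue  # ignore empty lines in the keyword list
        # urlencode escapes spaces and special characters in the keyword.
        urls[cleaned] = f"{base_url}?{urlencode({'q': cleaned})}"
    return urls


urls = build_search_urls(["earnings report", "merger", "  "])
for keyword, url in urls.items():
    print(keyword, "->", url)
```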

Scraping Wall Street Journal Articles

The video below shows how WebHarvy can be used to scrape article data from the WSJ.com website. Details like article title, URL, author name, published date, news content text, images etc. can be scraped from WSJ.com using WebHarvy.

Scraping Google News

Google News articles can be scraped using WebHarvy as shown in the following video.

Scraping Medical & Scientific Articles

WebHarvy can scrape articles from medical and scientific research websites like the National Library of Medicine, PubMed, arXiv, Springer, ScienceDirect, Nature etc. To learn more, please follow this link.

Download & Try

If you are new to WebHarvy, we highly recommend that you refer to our Getting Started Guide. If you have any questions, please reach out to our technical support team.