XML Configuration File Format

WebHarvy configuration files are saved in XML format. This page describes that format so that advanced users can directly edit an XML configuration file created by WebHarvy.

WebHarvy allows you to change most of the details in the configuration directly from the UI. See Editing Configuration for more details. This lets you easily change the configuration parameters without manually editing the XML file.

This document provides only a high-level description of the configuration file format and is not complete. Please contact our Support if you need any further information.

Header

The header portion of the configuration file is as follows. This portion is the same for all configuration files.

	<?xml version="1.0"?>
	<MineParams xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
	<!-- Configuration Details -->
	</MineParams>

Version and Registration Info

The version and registration details in the configuration file are optional and are written by WebHarvy versions 6.3.0.189 and above.

	<VersionInfo>6.3.0.189</VersionInfo>
	<RegInfo>SysNucleus</RegInfo>

Miner Options

From version 6.3.0.189, Advanced Miner Options are saved in the configuration file. If this part is not present, the default values of these options from Settings are used.

	<MinerOptions>
	<MinLevelsUp>2</MinLevelsUp>
	<MinChildCount>10</MinChildCount>
	<SelAccuracy>2</SelAccuracy>
	</MinerOptions>

Selection Accuracy values (SelAccuracy tag): Strict (-1), Low (0), Medium (1), High (2), Highest (3)
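
As an illustration, the following Python sketch reads the MinerOptions block from a configuration string and maps the SelAccuracy number to the names listed above. It is not part of WebHarvy; the function name `read_miner_options` and the sample configuration are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Names for the SelAccuracy values, as listed in this document.
ACCURACY_NAMES = {-1: "Strict", 0: "Low", 1: "Medium", 2: "High", 3: "Highest"}


def read_miner_options(config_xml):
    """Return the MinerOptions values as a dict of ints, or None if absent."""
    root = ET.fromstring(config_xml)
    options = root.find("MinerOptions")
    if options is None:
        # No MinerOptions block: WebHarvy uses the defaults from Settings.
        return None
    return {child.tag: int(child.text) for child in options}


# Hypothetical sample built from the header and Miner Options shown above.
SAMPLE = (
    '<?xml version="1.0"?>'
    '<MineParams xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    "<MinerOptions>"
    "<MinLevelsUp>2</MinLevelsUp>"
    "<MinChildCount>10</MinChildCount>"
    "<SelAccuracy>2</SelAccuracy>"
    "</MinerOptions>"
    "</MineParams>"
)

options = read_miner_options(SAMPLE)
print(options)
print(ACCURACY_NAMES[options["SelAccuracy"]])
```

A return value of None means the configuration carries no Miner Options of its own, which matches the fallback to Settings described above.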

URL details

The StartURL tag describes the page from which data scraping starts. The url tag inside StartURL contains the URL of that web page. The StartURL tag can optionally contain headers and postdata tags if required.

	<StartURL>
	<url>http://www.yellowpages.com/san-francisco-ca/accountants?g=San+Francisco,+CA&amp;q=Accountants</url>
	</StartURL>

To edit this portion directly from the UI, see How to edit Start URL, PostData and Headers.
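
When editing the url tag by hand, keep in mind that a URL with a query string usually contains `&`, which must be written as `&amp;` inside XML. The Python sketch below is an illustration under that assumption; the helper name `build_start_url` is hypothetical. It shows a standard XML library applying the escaping automatically.

```python
import xml.etree.ElementTree as ET


def build_start_url(url):
    """Serialize a StartURL block that contains only the url tag."""
    start = ET.Element("StartURL")
    ET.SubElement(start, "url").text = url
    # ElementTree escapes '&' as '&amp;' when it serializes element text.
    return ET.tostring(start, encoding="unicode")


xml_text = build_start_url(
    "http://www.yellowpages.com/san-francisco-ca/accountants"
    "?g=San+Francisco,+CA&q=Accountants"
)
print(xml_text)
```

Parsing the result with an XML parser returns the original URL with a plain `&`.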

Field List

This section, which follows StartURL, lists the data to be extracted from the start URL. Each Data Field describes a data element to be extracted or a link to be followed.

	<FieldList>
	<DATAFIELD> Data Field #1 </DATAFIELD>
	<DATAFIELD> Data Field #2 </DATAFIELD>
	<DATAFIELD> Data Field #3 </DATAFIELD>
	<!-- More Data Fields as required -->
	</FieldList>

Data Field

Each Data Field takes the following format:

	<DATAFIELD>
	<type>Text</type>
	<name>Name</name>
	<selector>
	#hotellist_inner > DIV:nth-of-type(02) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > H3 > A > SPAN:nth-of-type(1)
	</selector>
	<heading />
	<pattern>true</pattern>
	<regex />
	</DATAFIELD>

The type tag defines the type of data. It can take the following values:

Text	Capture element's text
Text_Near_Heading	Capture Text next to the heading text
Url	Capture element's URL
Image	Download Image
Image_URL	Capture Image URL
Image_RegEx	Capture Image from URL obtained by applying RegEx on HTML
Image_RegExMulti	Capture multiple images. First image URL obtained by applying RegEx on HTML
HTML	Capture HTML code
File	Capture element's text as file
Link_Follow	Follow link
Link_RegEx	Follow link obtained by applying RegEx on HTML
Click	Click the element
Link_Back	Navigate back (after a link has been followed)
Link_NextPage	Link to load next page (for paginated lists)
Link_LoadNextPageSet	Link to load next set of pages
Link_LoadMoreContent	Link to load more content (display more results)
Auto_Scroll	Load more data by scrolling down the page
Input_Text	Enter string in input text field
Invoke_Script	Run JavaScript on page
Open_Popup	Click to open popup and extract data
Select	Select list/dropdown option
Scroll	Scroll page down slowly to load all contents
Custom	Custom data fields (Page URL, Page Screenshot, Date-Time, Text)

The name tag provides a name for the data element. For Text/URL/Image/HTML/File elements, this is the name of the corresponding column in the scraped data.

The selector tag provides the CSS selector of the data field. This is used to locate the element (and following patterns) during mining.

In WebHarvy versions before 6.0, the xpath tag is used instead of the selector tag. The xpath tag provides the XPath which denotes the data element. WebHarvy uses a customized XPath format, which is explained as follows.

The path starts with the topmost HTML tag, which is <HTML>. Each tag name is followed by two indices ([ ]). The first one denotes the index position of the current tag relative to its parent tag (the first child is at [0], the next one at [1] and so on). The second one is optional and denotes the class id of the tag (if one exists).

The heading tag is optional and, if present, contains the heading text for the Text_Near_Heading type.

The pattern tag can take the values 'true' or 'false': 'true' for repeating data (valid only in the start page), 'false' otherwise.

The regex tag is optional. If a value (a regular expression) is provided, it is matched against the captured element's text.
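
To review the fields of an existing configuration before editing it by hand, the Data Field tags described above can be read with a few lines of Python. This is only a sketch: the function name `list_fields`, the sample selectors and the sample field names are assumptions for illustration, not values produced by WebHarvy.

```python
import xml.etree.ElementTree as ET


def list_fields(config_xml):
    """Return type, name, selector and pattern flag for each DATAFIELD."""
    root = ET.fromstring(config_xml)
    fields = []
    for field in root.findall("./FieldList/DATAFIELD"):
        fields.append({
            "type": field.findtext("type"),
            "name": field.findtext("name"),
            # Selector text may be surrounded by whitespace or line breaks.
            "selector": (field.findtext("selector") or "").strip(),
            # pattern is 'true' for repeating data, 'false' otherwise.
            "repeating": field.findtext("pattern") == "true",
        })
    return fields


# Hypothetical configuration with two data fields.
SAMPLE = (
    "<MineParams><FieldList>"
    "<DATAFIELD><type>Text</type><name>Name</name>"
    "<selector> #list > DIV > H3 > A </selector>"
    "<heading /><pattern>true</pattern><regex /></DATAFIELD>"
    "<DATAFIELD><type>Link_NextPage</type><name>Next</name>"
    "<selector>A.next</selector><pattern>false</pattern></DATAFIELD>"
    "</FieldList></MineParams>"
)

fields = list_fields(SAMPLE)
for entry in fields:
    print(entry)
```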

Code for 'Add URLs to Configuration' / 'Scrape a list of similar links'

Using the 'Add URLs to Configuration' option in Edit menu > Edit Options, you can directly add URLs to an existing configuration without manually editing the configuration XML file. Also see Scrape a list of similar links.

In case you need to add a list of URLs to a configuration file, so that WebHarvy scrapes data from each URL in the list as per the configuration, add the following XML code. This code should be added towards the end of the configuration XML file, before </MineParams>.

To make this work, make sure that the first URL in the list (www.url1.com) is the same as the one provided in the <StartURL> tag (see above) near the start of the configuration file.

	<CategoryList>
	<URLDATA>
	<name>URL1</name>
	<url>http://www.url1.com</url>
	</URLDATA>
	<URLDATA>
	<name>URL2</name>
	<url>http://www.url2.com</url>
	</URLDATA>
	<URLDATA>
	<name>URL3</name>
	<url>http://www.url3.com</url>
	</URLDATA>
	<!-- More URLDATA Fields as required -->
	</CategoryList>

Code for Keyword Scraping

Using the 'Edit Keywords' option in Edit menu > Edit Options, you can directly edit keywords associated with a configuration without manually editing the configuration XML file.

The following is the format for enabling Keyword based Scraping. The first keyword provided should match the keyword used in the start URL/PostData.

	<KeywordList>
	<string>keyword1</string>
	<string>keyword2</string>
	<string>keyword3</string>
	<!-- More keyword strings as required -->
	</KeywordList>

In case you need any further information please do not hesitate to contact our support team at support@webharvy.com with the necessary details.
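
The CategoryList and KeywordList blocks described in this document can also be generated with a script instead of being typed by hand. The following Python sketch is one possible way to do it. The helper names `add_url_list` and `add_keywords` are hypothetical, and placing KeywordList at the end of the file is an assumption (the document only states the position for the URL list).

```python
import xml.etree.ElementTree as ET


def add_url_list(root, urls):
    """Append a CategoryList as the last child, i.e. before </MineParams>."""
    start_url = (root.findtext("./StartURL/url") or "").strip()
    # The first URL in the list must be the same as the StartURL url.
    if not urls or urls[0][1] != start_url:
        raise ValueError("first URL in the list must match StartURL")
    category_list = ET.SubElement(root, "CategoryList")
    for name, url in urls:
        entry = ET.SubElement(category_list, "URLDATA")
        ET.SubElement(entry, "name").text = name
        ET.SubElement(entry, "url").text = url
    return category_list


def add_keywords(root, keywords):
    """Append a KeywordList (position at the end is an assumption)."""
    keyword_list = ET.SubElement(root, "KeywordList")
    for keyword in keywords:
        ET.SubElement(keyword_list, "string").text = keyword
    return keyword_list


# Hypothetical minimal configuration containing only a StartURL.
BASE = (
    "<MineParams><StartURL><url>http://www.url1.com</url></StartURL>"
    "</MineParams>"
)

root = ET.fromstring(BASE)
category_list = add_url_list(
    root,
    [("URL1", "http://www.url1.com"), ("URL2", "http://www.url2.com")],
)
print(ET.tostring(category_list, encoding="unicode"))

keyword_list = add_keywords(ET.fromstring(BASE), ["keyword1", "keyword2"])
print(ET.tostring(keyword_list, encoding="unicode"))
```

The sketch prints only the new block so that it can be pasted into the configuration file. Re-serializing the whole file with ElementTree would drop the XML declaration and the unused xmlns:xsi and xmlns:xsd declarations from the header, so compare the header with the one shown in the Header section before loading such a file in WebHarvy.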