WebHarvy configuration files are saved in XML format. This page describes that format so that advanced users can directly edit a configuration file created with WebHarvy.
WebHarvy allows you to change most of the details in the configuration directly from the UI. See Editing Configuration for more details. This lets you easily change the configuration parameters without manually editing the XML file.
This document provides only a high-level description of the configuration file format and is not complete. Please contact our support team if you need any further information.
Header
The header portion of the configuration file is as follows. This portion is the same for all configuration files.
Version and Registration Info
The version and registration details in the configuration file are optional and are written by WebHarvy versions 6.3.0.189 and above.
Miner Options
From version 6.3.0.189, Advanced Miner Options are saved in the configuration file. If this part is not present, the default values of these options from Settings are used.
Selection Accuracy values: Strict (-1), Low (0), Medium (1), High (2), Highest (3)
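For scripts that read or write this option, the numeric codes listed above can be captured in a small lookup. The following is a minimal Python sketch; the names `SelectionAccuracy` and `accuracy_name` are illustrative and not part of WebHarvy, and only the five value/label pairs come from this page.

```python
from enum import IntEnum


class SelectionAccuracy(IntEnum):
    """Numeric codes for the Selection Accuracy option, as listed above."""
    STRICT = -1
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    HIGHEST = 3


def accuracy_name(value: int) -> str:
    """Return the label for a numeric code; raises ValueError for unknown codes."""
    return SelectionAccuracy(value).name.capitalize()


if __name__ == "__main__":
    for code in (-1, 0, 1, 2, 3):
        print(code, accuracy_name(code))
```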
URL details
The StartURL tag describes the page from which data scraping starts. The url tag inside StartURL contains the URL of that page. StartURL can optionally also contain headers and postdata tags.
To edit this portion directly from the UI, see How to edit Start URL, PostData and Headers.
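As an illustration of the nesting described above (a url tag inside StartURL), the following Python sketch reads the start URL from a configuration string using the standard library. The sample fragment is deliberately minimal and is an assumption, not a complete or authoritative configuration file: the `MineParams` root is inferred from the closing tag mentioned later on this page, the example.com address is a placeholder, and `read_start_url` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real configuration file contains more sections.
SAMPLE = """<MineParams>
  <StartURL>
    <url>https://www.example.com/products</url>
  </StartURL>
</MineParams>"""


def read_start_url(xml_text: str) -> str:
    """Return the text of the url tag found inside the first StartURL tag."""
    root = ET.fromstring(xml_text)
    # Accept either a full document or a bare StartURL fragment.
    node = root if root.tag == "StartURL" else root.find(".//StartURL")
    if node is None:
        raise ValueError("no StartURL tag found")
    url = node.find("url")
    if url is None or not (url.text or "").strip():
        raise ValueError("StartURL has no url tag with text")
    return url.text.strip()


if __name__ == "__main__":
    print(read_start_url(SAMPLE))
```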
Field List
This section, which follows StartURL, lists the data to be extracted from the start URL. Each Data Field describes a data element to be extracted or a link to be followed.
Data Field
Each Data Field takes the following format:
The type tag defines the type of data. It can take the following values :
The name tag provides a name for the data element. For Text/URL/Image/HTML/File elements this is the name of the corresponding data column in the scraped data.
The selector tag provides the CSS selector of the data field. This is used to locate the element (and following patterns) during mining.
In WebHarvy versions before 6.0, the xpath tag is used instead of the selector tag. The xpath tag provides the XPath which identifies the data element. WebHarvy uses a customized XPath format, explained as follows.
The path starts with the topmost HTML tag, which is <HTML>. Each tag name is followed by two indices in square brackets ([ ]). The first denotes the index position of the current tag relative to its parent tag (the first child is at [0], the next at [1], and so on). The second is optional and denotes the class id of the tag, if one exists.
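To make the two-index notation concrete, here is a rough Python sketch that splits such a path into steps. The exact written syntax is not shown on this page, so the sketch assumes that steps are separated by `/` and written as `TAG[index][class]`, with the second bracket empty or omitted when there is no class. Treat that syntax, and the names `Step` and `parse_path`, as assumptions made for illustration only.

```python
import re
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    tag: str                 # tag name, e.g. "DIV"
    index: int               # zero-based position relative to the parent tag
    class_id: Optional[str]  # class of the tag, or None if absent/empty


# Assumed step syntax: TAG[index] optionally followed by [class].
_STEP = re.compile(
    r"^(?P<tag>[A-Za-z][A-Za-z0-9]*)\[(?P<index>\d+)\](?:\[(?P<cls>[^\]]*)\])?$"
)


def parse_path(path: str) -> List[Step]:
    """Split a path into (tag, index, class) steps under the assumed syntax."""
    steps = []
    for part in path.strip("/").split("/"):
        match = _STEP.match(part)
        if match is None:
            raise ValueError(f"unrecognised step: {part!r}")
        steps.append(Step(match["tag"], int(match["index"]), match["cls"] or None))
    return steps


if __name__ == "__main__":
    for step in parse_path("HTML[0]/BODY[1]/DIV[2][price]"):
        print(step)
```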
The heading tag is optional and, if present, contains the heading text for fields of type Text_Near_Heading.
The pattern tag takes the value 'true' for repeating data (valid only on the start page) and 'false' otherwise.
The regex tag is optional. If a regular expression is provided, it is matched against the captured element's text.
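The effect of a regex on captured text can be previewed outside WebHarvy. The sketch below uses Python's `re` module, whose syntax may differ in details from the engine WebHarvy uses. This page does not state which part of the match is kept, so the helper (a hypothetical `apply_regex`) makes an assumption: it returns the first capture group if the pattern has one, and the whole match otherwise.

```python
import re
from typing import Optional


def apply_regex(text: str, pattern: str) -> Optional[str]:
    """Match pattern against text; return group 1 if defined, else the full match.

    Returns None when the pattern does not match at all.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    # Assumption: prefer the first capture group when the pattern defines one.
    return match.group(1) if match.groups() else match.group(0)


if __name__ == "__main__":
    print(apply_regex("Price: $24.99", r"\$(\d+\.\d+)"))
    print(apply_regex("Price: $24.99", r"\d+"))
```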
Code for 'Add URLs to Configuration' / 'Scrape a list of similar links'
Using the 'Add URLs to Configuration' option in Edit menu > Edit Options, you can directly add URLs to an existing configuration without manually editing the configuration XML file. See also Scrape a list of similar links.
If you need to add a list of URLs to a configuration file, so that WebHarvy scrapes data from each URL in the list as per the configuration, add the following XML code. This code should be added towards the end of the configuration XML file, before </MineParams>.
For this to work, make sure that the first URL in the list (www.url1.com) is the same as the one provided in the <StartURL> tag (see above) at the start of the configuration file.
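When the URL list is generated by a script, the requirement above can be checked before the list is written into the file. A minimal Python sketch follows; `first_url_matches_start` is a hypothetical helper, and comparing the two strings exactly after trimming surrounding whitespace is an assumption about what "the same" means here.

```python
from typing import Sequence


def first_url_matches_start(start_url: str, urls: Sequence[str]) -> bool:
    """Return True if the first URL in the list equals the start URL.

    Assumption: the comparison is exact apart from surrounding whitespace.
    """
    if not urls:
        return False
    return urls[0].strip() == start_url.strip()


if __name__ == "__main__":
    urls = ["www.url1.com", "www.url2.com", "www.url3.com"]
    print(first_url_matches_start("www.url1.com", urls))
```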
Code for Keyword Scraping
Using the 'Edit Keywords' option in Edit menu > Edit Options, you can directly edit keywords associated with a configuration without manually editing the configuration XML file.
The following is the format for enabling Keyword based Scraping. The first keyword provided should match the keyword used in the start URL/PostData.
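The matching requirement for the first keyword can be sanity-checked with a few lines of Python before editing the file. This is only a sketch: `keyword_in_request` is a hypothetical helper, the example.com URL is a placeholder, and the choices to URL-decode the values and to compare case-insensitively are assumptions rather than documented WebHarvy behaviour.

```python
from urllib.parse import unquote_plus


def keyword_in_request(keyword: str, start_url: str, post_data: str = "") -> bool:
    """Return True if the keyword appears in the start URL or the POST data.

    Both values are URL-decoded first ("+" and "%20" become spaces), and the
    comparison ignores case; both choices are assumptions of this sketch.
    """
    needle = keyword.strip().lower()
    if not needle:
        return False
    return any(needle in unquote_plus(value).lower() for value in (start_url, post_data))


if __name__ == "__main__":
    url = "https://www.example.com/search?q=red+shoes"
    print(keyword_in_request("red shoes", url))
    print(keyword_in_request("blue hats", url))
```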
In case you need any further information please do not hesitate to contact our support team at support@webharvy.com with the necessary details.