Crawlomatic’s [crawlomatic-scraper] Shortcode Documentation

What is new in Crawlomatic‘s [crawlomatic-scraper] shortcode?

Crawlomatic’s v2.0 update brought a new shortcode (and also Gutenberg block alternative) which will make much more easier to implement a web scraper for WordPress. This can be used to display real time data from any websites directly into your posts, pages ohttps://coderevolution.ro/knowledge-base/faq/crawlomatics-crawlomatic-scraper-shortcode-documentation/r sidebar.

Use this to include real time stock quotes, cricket or soccer scores or any other generic content. Features include:

Scrap output can be displayed thru custom template tag, shortcode in page, post and sidebar (through a text widget).
Configurable caching of scraped data. Cache timeout in minutes can be defined in minutes for every scrap.
Configurable UserAgent for your scraper can be set for every scrap.
Configurable default settings like enabling, UserAgent, timeout, caching, error handling.
Multiple ways to query content – CSS Selector, XPath or Regex.
A wide range of arguments for parsing content.
Option to pass post arguments to a URL to be scraped.
Dynamic conversion of scrap to specified character encoding to scrap data from a site using different charset.
Create scrap pages on the fly using dynamic generation of URLs to scrap or post arguments based on your page’s get or post arguments.
Callback function for advanced parsing of scraped data.

Why is web scraping needed in WordPress?

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Web scraping is closely related to web indexing, however, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.

Using Crawlomatic’s new [crawlomatic-scraper] shortcode, you can easily embed external content from websites (HTML), structured data feeds (RSS, ATOM, XML, JSON, CSV etc) with ease and mostly without the need of any coding. The possible implementations of this are limited only by your imagination.

While scraping, you should consider the copyright of the content owner. Its best to at least attribute the content owner by a linkback or better take a written permission. Apart from rights, scraping in general is a very resource intensive task. It will exhaust the bandwidth of your host as well as the host of of the content owner. Best is not to overdo it. Ideally find single pages with enough content to create your your mesh-up.

How to use the new scraper shortcode of Crawlomatic?

Crawlomatic lets you specify a URL source and a query to fetch specific content from it. Crawlomatic can be used through a shortcode (for posts, pages or sidebar) or template tag (for direct integration in your theme) for scraping and displaying web content. Here’s the actual usage detail:

In shortcode:

[crawlomatic-scraper url="https://www.yahoo.com/" query_type="cssselector" query=".stream-item" output="text"]

In template tag:

<?php echo crawlomatic_scraper_get_content("https://www.yahoo.com/", ".stream-item", array('query_type' => 'cssselector', 'output' => 'text')); ?>

The above shortcode and template tag would output the content of the CSS Selector ‘ol.trendingnow_trend-list’ from URL ‘https://www.yahoo.com/’ in your post, page or sidebar as plain text (HTML striped).

In case of template tag (crawlomatic_scraper_get_content), the first argument is URL, the second argument is query while the third argument is a associative array with all other arguments.

Crawlomatic has a host of options to control your URL request, do advanced parsing and managing output. Apart from CSS Selectors, Crawlomatic also supports XPath and Regex queries.

Full parameter list:

‘url’ => ”, //the URL to be scraped
‘urldecode’ => 0, //0 or 1 – do you want to decode the URL before sending the request?
‘get_page_using’ => ‘default’, //select the page download method – default, wp_remote_request, phantomjs, puppeteer, tor, headlessbrowserapipuppeteer, headlessbrowserapitor, headlessbrowserapiphantomjs. Phantomajs, Puppeteer or Tor need to be installed on your server and configured in plugin settings to work correctly (when you select their values). If you use HeadlessBrowserAPI (Puppeteer, Tor or PhantomJS endpoints), these are not required to be installed on your server, however, you need to have a valid subscription for the API and a valid HeadlessBrowserAPI key added in the plugin’s ‘Main Settings’ menu -> ‘HeadlessBrowserAPI Key’ settings field. You can sign up, here.
‘on_error’ => ‘error_show’, //select the behavior in case of errors
‘cache’ => ’60’, //cache duration in minutes
‘page_level_caching’ => ‘on’ or ‘off’,//if the caching is done site level or page level. Page level will save data for the same URL multiple times, on shortcodes added to different URL from your site. Site level will save it only once, all shortcodes pointing to the same scraped URL will display the same info, on the entire site
‘output’ => ‘html’, //html or text – set the output format for the shortcode
‘timeout’ => ‘3’, //timeout in seconds for requests
‘query_type’ => ‘auto’, //auto, cssselector, regex, regexmatch, xpath, class, id, iframe, full – set the query type to parse the page content
‘query’ => ”,//the query string of the above query type
‘querydecode’ => 0, //0 or 1 – do you want to decode queries?
‘remove_query_type’ => ‘none’, //strip content query type – none, xpath, regex, cssselector, class, id
‘remove_query’ => ”, //strip content query string
‘replace_query_type’ => ‘none’, //replace content query type – none, xpath, regex, cssselector, class, id
‘replace_query’ => ”, //replace content query string
‘replace_with’ => ”, //replacement string for matches
‘lazy_load_tag’ => ”,//image lazy load tag (for lazy loaded images)
‘strip_links’ => ‘0’,//strip links from imported content – 0 or 1
‘strip_internal_links’ => ‘0’,//strip internal links from imported content – 0 or 1
‘strip_scripts’ => ‘0’,//strip scripts from imported content – 0 or 1
‘strip_images’ => ‘0’,//strip images from imported content – 0 or 1
‘content_percent_to_keep’ => ”,//percentage of the content to keep. Numeric, between 1 and 100.
‘limit_word_count’ => ”,//number of words to show at max in the scraped content.
‘spin’ => ”,//Set if you wish to enable text spinning for imported content. You can set this settings field to 1 to use the credentials set for text spinning in the plugin\’s “Main Settings” menu. You can also enter a specific content spinner, with credentials, in the following format: SpinnerName:username/email:password/APIkey. For SpinnerName, you can use the following: bestspinner, wordai, spinrewriter, turkcespin, builtin, wikisynonyms, freethesaurus (username and password should be entered only for premium spinner services)
‘translate_to’ => ”,//Enter a 2 letter language code to which you wish to translate content automatically
‘translate_source’ => ”,//2 letter language code for source language
‘useragent’ => ‘Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36’,//[advanced] user agent to set for requests
‘charset’ => get_bloginfo(‘charset’),//[advanced] request charset
‘iframe_height’ => ‘800’,//[advanced] iframe height, if query_type is set to iframe
‘headers’ => ”, //[advanced] request headers, working only if query_type is set to
‘glue’ => ”, //[advanced] the glue to use when multiple matched content found
‘eq’ => ”,//[advanced] if multiple matched content found, include only the nth element from the match list
‘gt’ => ”,//[advanced] If the query set matches multiple parts of the content, using this field, you can get the matched element from the page, with the numeric index greater than the value entered here.
‘lt’ => ”,//[advanced] If the query set matches multiple parts of the content, using this field, you can get the matched element from the page, with the numeric index less than the value entered here.
‘basehref’ => 1,//[advanced] Set the base href URL to which links should be converted in imported content. Optional. By default, this is set to the URL of the crawled site, so links can be auto completed (in case they have missing parts)
‘a_target’ => ”,//[advanced] Set the target attribute for links. Example: _blank
‘callback_raw’ => ”,//[advanced] Set the raw output callback function. Optional.
‘callback’ => ”,//[advanced] Set the output callback function. Optional.
‘debug’ => 0,//[advanced] Select if you wish to enable debug mode – 0 or 1

Tutorial video

Callback Functions

Using the callback functions, you can extend Crawlomatic to do some advanced parsing. Simply put, callback functions will will parse and return your data. Callback functions can reside in functions.php of your child theme. Placing your callback functions here will make sure that these don’t get lost on upgrading your theme. On changing your theme, just change the child theme parent to the new upgraded theme.

There are two sets of callback functions:

callback_raw

The function name specified in ‘callback_raw’ argument will parse the scraped content in its most raw form. This function expects only one array (of strings) argument and is called before any parsing arguments are applied. The code within this function should process the input and return a parsed array of strings or a single string as an output.

callback

The function name specified in ‘callback’ argument will parse the scraped content in its processed form. This function expects only one string argument and is called after all parsing arguments are applied. The code within this function should process the input and return a parsed string as an output.

Dynamic URL and headers

Using this feature of dynamic URL and headers, you can dynamically create a source URL and scrap content on the fly. Basically this lets you fetch content from a single underlying source by passing multiple get arguments to it. This feature will convert specific text in your URL or headers arguments to corresponding value based on some placeholders.

For example, if you want a page to scrap symbols on reuters.com dynamically based on user input then:

url should be http://www.reuters.com/finance/stocks/overview?symbol=___symbol___
get argument for page should be `http://yourdomain.com/page/?symbol=CSCO.O (to get Cisco details)

This will replace ___symbol___ in the url with CSCO.O in realtime. You can use multiple such replacement variables in your url or postargs. Such replacement variables should be wrapped between 3 underscores. Note that field names being passed this was are case-sensitive. Having ‘FieldName’ vs. ‘fieldname’ makes a difference.

You can also use the special variable ___QUERY_STRING___ to replace the complete query string post.

Check the ‘dynamic URL and on_error handling’ example from below, where the get argument ‘symbol’ is used to dynamically build a URL and fetch content.

Query

This article specifically details usage of query which are the heart of Crawlomatic. For parsing html, the plugin three types of queries- CSS Selectors; XPath and Regular Expression. Selectors are not only used by Crawlomatic to query data from source URL, but also to remove or replace stuff

For all scraping that deals with DOM documents (XML, HTML etc) CSS Selectors and XPaths can support all possible use cases. Regular Expression is provided as a query option for extreme edge cases or non-DOM content.

CSS Selectors

CSS selectors are patterns used to select the element(s) you want to style. CSS selectors are less powerful than XPath, but far easier to write, read and understand.

Many developers — particularly web developers — are more comfortable using CSS selectors to find elements. As well as working in stylesheets, CSS selectors are used in JavaScript with thequerySelectorAll function and in popular JavaScript libraries such as jQuery, Prototype and MooTools.

The CSS Selector Reference on w3schools is recommended to get you started. You may also want to try the CSS Selector Tester from w3schools.

Internally, Crawlomatic converts the CSS Selector into an XPath expressions using Symfony’s CssSelector Component.

XPath

The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria. In popular use (though not in the official specification), an XPath expression is often referred to simply as “an XPath”.

When you’re parsing an HTML or an XML document, by far the most powerful method is XPath. XPath expressions are incredibly flexible, so there is almost always an XPath expression that will find the element you need.

XPath Syntax and XPath Examples on w3schools is a good starting point.

Internally, Crawlomatic relies on PHP DOM and uses DOMXPath::query to evaluate XPath expressions.

Regular Expression

A regular expression (abbreviated regex or regexp) and sometimes called a rational expression is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations. Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.

This Introduction to Regex is a great place to start with Regex. You can also use regexr.com to check your Regular Expressions.

Internally, Crawlomatic relies on Regular Expressions (Perl-Compatible) and uses preg_match_all to perform a global regular expression match.

Arguments API

Unless mentioned, all these arguments are available in the template tag as well as shortcode. Here’s how these arguments will be used.

In shortcode:

[crawlomatic-scraper url="https://www.yahoo.com/" query_type="cssselector" query=".stream-item" output="text"]

In template tag:

<?php echo crawlomatic_scraper_get_content("https://www.yahoo.com/", ".stream-item", array('query_type' => 'cssselector', 'output' => 'text')); ?>

For representational purposes, these arguments are categorized below as Request, Response or Parsing ones.

Request Arguments

These set of arguments deal with the way requests are made to the source URL to fetch content

url

Required. The complete URL which needs to be scraped.

cache

Timeout interval of the cached data in minutes. This is dependent on how frequently your source URL is expected to change content. If the content is not changed often, it is recommended to keep this as higher as possible to save external requests. To set an infinite caching period (never scrape the page again), please set this value to 0. If ignored, the default value specified in Crawlomatic Settings will be used. It is strongly recommended to use a Persistent Cache Plugin for better caching performance.

Default: 60

useragent

The USERAGENT header for making request. This string acts as your footprint while scraping data. If ignored, the default value specified in Crawlomatic Settings will be used.

Default: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36

timeout

Request timeout in seconds. Higher the interval, better for scraping from slow server source URLs. But this will also increase your page load time. Ideally should not exceed 2. If ignored, the default value specified in Crawlomatic Settings will be used.

Default: 3

headers

A string in query string format (like id=197&cat=5) of the post arguments that you may want to pass on to the source URL. Note that get arguments should be a part of URL itself and this argument should only be used for post arguments.

urldecode

Only available in shortcode. Handy for URLs with characters (like [ or ]) that may interfere with shortcode. Gives you an opportunity to enter a urlencoded string as URL. Values can be 1 or 0. Set to 1 to use urldecode for URLs with special characters. Set to 0 to use URL without modification.

Default: 0

querydecode

Only available in shortcode. Handy for query with characters (like [ or ]) that may interfere with shortcode. Gives you an opportunity to enter a urlencoded string as query. Strongly recommended if you are using xpath as query_type. Values can be 1 or 0. Set to 1 to use querydecode for URLs with special characters. Set to 0 to use URL without modification.

Default: 0

Response Arguments

These set of arguments deal with the way the parsed response from source URL is displayed

output

Format of output rendered by the selector. Values can be text or html. Text format strips all html tags and returns only text content. Html format retains the the html tags in output.

Default: html

on_error

Error handling options for response. Values can be error_show or error_hide or any other string. error_show displays the error; error_hide fails silently without any error display while any other string will print the string itself. For instance on_error=”screwed!” will output ‘screwed!’ if something goes wrong. If ignored, the default value specified in Crawlomatic Settings will be used.

Default: error_show

debug

Display of debug information. Values can be 1 or 0. Set to 1 to turn on debug information in form of an html comment before scrap output or set to 0 to turn it off.

Default: 0

Parsing Arguments

These set of arguments provide options for parsing the content received from the source URL

query

Query string to select the content to be scraped. The query can be of type cssselector or xpath or regex. If query is empty, complete response will be returned without any querying. Read more about this in the query documentation above.

Default: (empty string)

query_type

Type of query. Values can be cssselector, xpath, regex, regexmatch, iframe or auto. If query is blank, complete response will be returned without any querying irrespective of query_type. Read more about these query types in the query documentation below.

Default: auto

glue

String to be used to concatenate multiple results of query. For example if your (cssselector or xpath or regex) query returns 5 <p> elements, then this sting will be used to join all these 5.

Default: PHP_EOL

eq

Filter argument to reduce the set of matched elements to the one at the specified index. Values can be first or last or an integer to represent a 0 based index (similar to eq implementation of jQuery API).

If ignored: All elements are returned.

gt

Filter argument to select all elements at an index greater than index within the matched set. Value can be an integer to represent a 0 based index. All elements with indexes greater then this value are returned (similar to eq implementation of jQuery API).

If ignored: All elements are returned.

lt

Filter argument to select all elements at an index lesser than index within the matched set. Value can be an integer to represent a 0 based index. All elements with indexes lesser then this value are returned (similar to eq implementation of jQuery API).

If ignored: All elements are returned.

remove_query

Similar to query, however this query is used to remove matched content from the output. Read more about this in the query documentation from this page.

If ignored: No content is removed.

remove_query_type

Type of query. Values can be cssselector or xpath or regex or none. If remove_query is blank, complete response will be returned without removing anything. Read more about this in the query documentation from this page.

Default: none

replace_query

Similar to query, however this query is used to replace matched content with string specified in argument replace_with. Read more about this in the query documentation from this page.

If replace_query_type is regex, this parameter can also be a serialized urlencoded array created like this:

urlencode(serialize($array))

That way, you can pass an array argument to the underlying preg_replace function.

If ignored: No content is replaced.

replace_query_type

Type of query. Values can be cssselector or xpath or regex or none. If replace_query is blank, complete response will be returned without replacing anything. Read more about this in the query documentation from this page.

Default: none

replace_with

String to replace content matched by replace_query.

If replace_query_type is regex, this parameter can also be a serialized urlencoded array created like this:

urlencode(serialize($array))

That way, you can pass an array argument to the underlying preg_replace function.

If ignored: Content matched by replace_query will be replaced by empty string (will be removed)

basehref

Converts relative links from the scrap to absolute links. This can be handy to keep relative links functional. Values can be 1 or 0 or a specific URL which would be used to convert relative links to absolute links. Setting basehref to 1 will use the source URL itself intuitively to convert relative URL to absolute; setting basehref to 0 will not do any conversion; while setting basehref to a specific URL will use that URL as the base for conversion. Note that basehref needs to be a complete URL (with http, hostname, path etc).

If ignored: basehref conversion will be skipped

a_target

Sets a specified target attribute for all links (a href). This can be handy to make sure external links open in a separate window. Values can be _blank or _self or _parent or _top or your custom framename. However note that there’s no validation and the argument value provided by you is used as is.

If ignored: a target modification will be skipped

callback_raw

Callback function which will parse the scraped content in its most raw form. This callback function (if specified) is called before any of the above parsing arguments are applied. Handy to do some advanced parsing. Read more about this in the callback documentation on this page.

callback

Callback function which will parse the scraped content in its processed form. This callback function (if specified) is called after all of the above parsing arguments are applied. Handy to do some advanced parsing. Read more about this in the callback documentation on this page.

Minimum requirements & dependencies of the new shortcode

As Crawlomatic is a WordPress plugin, so over and above the minimum requirements of WordPress, you need to have a couple of things:

PHP version 5.3.3 or greater (required). This is needed by Symfony’s CssSelector Component, which is internally used to convert the CSS Selector into an XPath expressions.
Any Persistent Cache Plugin for better caching performance (strongly recommended)

That’s really it.

Crawlomatic uses native WordPress APIs wherever possible. It uses HTTP API for making HTTP requests and Transients API for caching.

How to optimize performance?

Here are some tips to help you optimize the usage:

Keep the timeout as low as possible (least is 1 second). Higher timeout might impact your page processing time if you are dealing with content on slow servers.
Strongly recommended to use a Persistent Cache Plugin and enable Disk or Memory based Object Cache for better caching performance. There can be some serious issues if you are scraping quite a lot and not using a persistent cache plugin.
If you are not using a Persistent Cache Plugin, then the underlying Transients API will fallback on WordPress options table (wp_options) to store cache. This might lead to issues if your site is located on a shared hosting. To avoid such issues, either use a Persistent Cache Plugin or delete expired transients occasionally.
If you are scraping a lot, keep a watch on your cache size too. Clear/Flush cache occasionally.
If you plan use multiple scrapers in a single page, make sure you set the cache timeout to a larger period. Possibly as long as a day (i.e. 1440 minutes) or even more. This will cache content on your server and reduce scraping.
Use fast loading pages (URL sources) as your content source. Also prefer pages low in size to optimize performance.
Keep a close watch on your scraper. If the website changes its page layout, your selector may fail to fetch the right content.

Using proxies

If you use the get_page_using parameter, at any value excepting wp_remote_request, the plugin will use the proxy set in the plugin’s ‘Main Settings’ -> ‘Web Proxy Address List’ settings field.
If you use the get_page_using parameter with the wp_remote_request value, using proxies to make requests via this plugin is same as using the HTTP API with a Proxy. The allowed Proxy settings for the wp-config.php are the following:

# HTTP Proxies
# Used for e.g. in Intranets
# Fixes Feeds as well
# Defines the proxy adresse.
define( 'WP_PROXY_HOST', '127.0.84.1' );
# Defines the proxy port.
define( 'WP_PROXY_PORT', '8080' );
# Defines the proxy username.
define( 'WP_PROXY_USERNAME', 'my_user_name' );
# Defines the proxy password.
define( 'WP_PROXY_PASSWORD', 'my_password' );
# Allows you to define some adresses which
# shouldn't be passed through a proxy.
define( 'WP_PROXY_BYPASS_HOSTS', 'localhost, www.example.com' );

Examples:

1. JSON and callback

Parse a JSON URL source (http://ip.jsontest.com/) using a custom callback function

Shortcode:

[crawlomatic-scraper url=’http://ip.jsontest.com/’ basehref=0 callback=’json_parser_callback’]

Callback function:

This is a sample callback function that needs to be placed into functions.php of your child theme:
function json_parser_callback($content){
$object = json_decode(trim($content));
return $object->ip;
}

Template tag:

<?php echo crawlomatic_scraper_get_content(‘http://ip.jsontest.com/’, ”, array(‘basehref’ => 0, ‘callback’ => ‘json_parser_callback’)); ?>

2. dynamic URL and on_error handling

Fetch stock quotes chart image from reuters.com and make sure links and image sources are not broken. Display a nice error message. (View this same page with a symbol argument for Google, Microsoft and Apple)

Shortcode:

[crawlomatic-scraper url=’https://nasdaq.com/market-activity/stocks/___symbol___’ query=’.quote-page-chart__chart-container’ a_target=’_blank’ on_error=’Stock symbol argument not passed or specified symbol not found’]

Template tag:

<?php echo crawlomatic_scraper_get_content(‘https://nasdaq.com/market-activity/stocks/’, ‘.quote-page-chart__chart-container’, array(‘a_target’ => ‘_blank’, ‘on_error’ => ‘Stock symbol argument not passed or specified symbol not found’)); ?>

3. basehref auto correction and a_target

Fetch US Market Indices from reuters.com and make sure links and image sources are not broken.

Shortcode:

[crawlomatic-scraper url=’https://nasdaq.com/market-activity/stocks’ query_type=”cssselector” query=’#tab1 table.dataTable’ a_target=’_blank’]

Template tag:

<?php echo crawlomatic_scraper_get_content(‘https://nasdaq.com/market-activity/stocks’, ‘#tab1 table.dataTable’, array(‘a_target’ => ‘_blank’)); ?>

4. xpath query on RSS feed, lt and text output

Fetch ‘Hot Google searches’ using the RSS feed (https://www.google.com/trends/hottrends/atom/feed) using an XPath query and display only the first 10 items as a list.

Shortcode:

[crawlomatic-scraper url=’https://www.google.com/trends/hottrends/atom/feed’ query_type=’xpath’ glue='<br/>’ query=’//channel/item/title’ a_target=’_blank’]

Template tag:

<?php echo crawlomatic_scraper_get_content(‘https://www.google.com/trends/hottrends/atom/feed’, ‘//channel/item/title’, array(‘query_type’ => ‘xpath’, ‘glue’ => ‘<br />’, ‘output’ => ‘text’, ‘lt’ => 10)); ?>

5. remove_query and a_target

Shortcode:

[crawlomatic-scraper url=’http://www.reuters.com/finance/markets’ query_type=”cssselector” query=’#content’ remove_query_type=”cssselector” remove_query=’#tab1 table.dataTable’ a_target=’_blank’]

Template tag:

<?php echo crawlomatic_scraper_get_content(‘http://www.reuters.com/finance/markets’, ‘#content’, array(‘query_type’ => ‘cssselector’, ‘remove_query_type’ => “cssselector”, ‘remove_query’ => ‘#tab1 table.dataTable’ , ‘a_target’ => ‘_blank’)); ?>