PHP tip: How to convert a relative URL to an absolute URL


It has been a long time since I last wrote a coding-related blog post, so this is a good opportunity to pick up the tradition again and talk about absolute and relative URLs and how to handle them in PHP.

An absolute URL is complete and ready to use to download a web file. But web pages often include incomplete relative URLs with missing parts, such as an “http” or host name, or the first part of a file path. These parts need to be filled in by copying them from a base absolute URL. This article shows how and includes code to do it.

Introduction

An absolute URL like “http://example.com/index.htm” tells a web browser what file to get (“index.htm”), where to get it (from “example.com”), and how to get it (via an HTTP web server). But a relative URL like “logo.png” is missing important parts, like “http”, the host name, and the first part of a path. Without these parts, the URL can’t be used to get a file.

To use a relative URL, you have to fill in the missing parts by copying them from another URL. This other URL, or base URL, must be absolute and it is often the URL of the web page containing the relative URL. The base URL also can be set by a <base> tag within the page, by the Content-Location field in the web server’s HTTP response header, and (rarely) by a special <meta> tag.

The URL specification defines an “absolutize” algorithm for combining an absolute base URL with a relative URL to create a new absolute URL. This article explains the algorithm’s steps and implements it in PHP.

The code below requires the split_url( ) and join_url( ) functions from a companion article:

PHP tip: How to parse and build URLs. Split URLs into their component parts, and reassemble those parts into a complete URL.
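
If you don’t have the companion functions at hand and just want to experiment with the code below, a minimal sketch of stand-ins built on PHP’s parse_url( ) could look like this. These are hypothetical replacements, not the companion article’s implementations: they skip percent-decoding and percent-encoding entirely and inherit parse_url( )’s limitations.

```php
<?php
// Hypothetical stand-in for split_url(): returns an array of URL parts
// (scheme, host, port, user, pass, path, query, fragment) or FALSE.
function split_url( $url )
{
    $parts = parse_url( $url );
    return is_array( $parts ) ? $parts : FALSE;
}

// Hypothetical stand-in for join_url(): glues the parts back together.
function join_url( $parts )
{
    $url = '';
    if ( !empty( $parts['scheme'] ) )
        $url .= $parts['scheme'] . ':';
    if ( isset( $parts['host'] ) )
    {
        $url .= '//';
        if ( isset( $parts['user'] ) )
        {
            $url .= $parts['user'];
            if ( isset( $parts['pass'] ) )
                $url .= ':' . $parts['pass'];
            $url .= '@';
        }
        $url .= $parts['host'];
        if ( isset( $parts['port'] ) )
            $url .= ':' . $parts['port'];
    }
    if ( !empty( $parts['path'] ) )
        $url .= $parts['path'];
    if ( isset( $parts['query'] ) )
        $url .= '?' . $parts['query'];
    if ( isset( $parts['fragment'] ) )
        $url .= '#' . $parts['fragment'];
    return $url;
}

// Round trip: splitting and joining should give back the same URL.
$parts = split_url( 'http://example.com:8080/products/index.htm?sku=1234#section42' );
print( join_url( $parts ) . "\n" );
```

The round trip only shows that nothing is lost for a simple URL. For URLs containing percent-encoded or non-ASCII characters, use the real functions from the companion article.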

Code

Let’s go straight to the code first. Explanations follow in the next sections.

This url_to_absolute( ) function’s arguments include an absolute base URL and a relative URL. Missing parts of the relative URL are copied from the base URL to form a new absolute URL returned by the function. FALSE is returned if either URL can’t be parsed or if the base URL isn’t absolute.


function url_to_absolute( $baseUrl, $relativeUrl )
{
    // Split the relative URL into its parts.
    $r = split_url( $relativeUrl );
    if ( $r === FALSE )
        return FALSE;

    // Already absolute: just clean up its path.
    if ( !empty( $r['scheme'] ) )
    {
        if ( !empty( $r['path'] ) && $r['path'][0] == '/' )
            $r['path'] = url_remove_dot_segments( $r['path'] );
        return join_url( $r );
    }

    // The base URL must be absolute (scheme and host present).
    $b = split_url( $baseUrl );
    if ( $b === FALSE || empty( $b['scheme'] ) || empty( $b['host'] ) )
        return FALSE;
    $r['scheme'] = $b['scheme'];

    // The relative URL has its own authority: only the scheme was missing.
    if ( isset( $r['host'] ) )
    {
        if ( !empty( $r['path'] ) )
            $r['path'] = url_remove_dot_segments( $r['path'] );
        return join_url( $r );
    }

    // Copy the authority parts from the base URL.
    unset( $r['port'] );
    unset( $r['user'] );
    unset( $r['pass'] );
    $r['host'] = $b['host'];
    if ( isset( $b['port'] ) ) $r['port'] = $b['port'];
    if ( isset( $b['user'] ) ) $r['user'] = $b['user'];
    if ( isset( $b['pass'] ) ) $r['pass'] = $b['pass'];

    // No path in the relative URL: use the base URL's path and query.
    if ( empty( $r['path'] ) )
    {
        if ( !empty( $b['path'] ) )
            $r['path'] = $b['path'];
        if ( !isset( $r['query'] ) && isset( $b['query'] ) )
            $r['query'] = $b['query'];
        return join_url( $r );
    }

    // Path doesn't start with a slash: prepend the base path's folder.
    if ( $r['path'][0] != '/' )
    {
        $basePath = isset( $b['path'] ) ? $b['path'] : '';
        $base = mb_strrchr( $basePath, '/', TRUE, 'UTF-8' );
        if ( $base === FALSE ) $base = '';
        $r['path'] = $base . '/' . $r['path'];
    }

    $r['path'] = url_remove_dot_segments( $r['path'] );
    return join_url( $r );
}

The url_remove_dot_segments( ) function is used as a last step above to filter out “.” and “..” segments from the returned URL’s path. The function’s only argument is the path to filter. The filtered path is returned.


function url_remove_dot_segments( $path )
{
    // Split the path into segments at every slash.
    $inSegs  = preg_split( '!/!u', $path );
    $outSegs = array( );

    foreach ( $inSegs as $seg )
    {
        if ( $seg == '' || $seg == '.' )
            continue;
        if ( $seg == '..' )
            array_pop( $outSegs );
        else
            array_push( $outSegs, $seg );
    }

    // Rebuild the path and restore any leading or trailing slash.
    $outPath = implode( '/', $outSegs );
    if ( $path[0] == '/' )
        $outPath = '/' . $outPath;
    // Note: the encoding is the fourth argument of mb_strrpos( ); the third is an offset.
    if ( $outPath != '/' &&
        (mb_strlen( $path, 'UTF-8' )-1) == mb_strrpos( $path, '/', 0, 'UTF-8' ) )
        $outPath .= '/';
    return $outPath;
}

Examples

Combine a base URL and a relative URL:

$newUrl = url_to_absolute(
    "http://example.com/products/index.htm",
    "./product.png" );
print( "$newUrl\n" );

Prints:

http://example.com/products/product.png

Extract URLs from a web page and convert each one to an absolute URL:


$text = file_get_contents( $baseUrl );
$groupedUrls = extract_html_urls( $text );
$pageUrls = array( );
foreach ( $groupedUrls as $element_entry )
    foreach ( $element_entry as $attribute_entry )
        $pageUrls = array_merge( $pageUrls, $attribute_entry );
$n = count( $pageUrls );
for ( $i = 0; $i < $n; $i++ )
    $pageUrls[$i] = url_to_absolute( $baseUrl, $pageUrls[$i] );

Explanation

The familiar format of a URL is officially defined by the Internet Engineering Task Force in specification RFC3986. It goes like this (the parts within brackets are optional):

[scheme “:”] scheme-specific-part [“?” query] [“#” fragment]

For example, in this URL:

http://example.com/products/index.htm?sku=1234#section42

“http” is the scheme, “//example.com/products/index.htm” is the scheme-specific-part, “sku=1234” is the query, and “section42” is the fragment.
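
To see these parts for yourself, here is a small sketch using PHP’s built-in parse_url( ). Note that parse_url( ) goes one step further than the generic format above and already splits the scheme-specific-part into a host and a path.

```php
<?php
// Split the example URL with PHP's standard parse_url().
$parts = parse_url( 'http://example.com/products/index.htm?sku=1234#section42' );

print( $parts['scheme'] . "\n" );    // http
print( $parts['host'] . "\n" );      // example.com
print( $parts['path'] . "\n" );      // /products/index.htm
print( $parts['query'] . "\n" );     // sku=1234
print( $parts['fragment'] . "\n" );  // section42
```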

The scheme is a standard name, like “http”, that indicates how to interpret and use the rest of the URL. The Internet Assigned Numbers Authority (IANA) has specifications for over 60 different URL schemes including “http”, “ftp”, “file”, “mailto”, “news”, “pop”, “snmp”, and many more.

The scheme-specific-part is just that — it’s scheme-specific. Different schemes expect different information here, such as a file path for the “http” scheme, an email address for the “mailto” scheme, or a news group name for the “news” scheme.

The query part of a URL usually contains parameters for a database query. The format is not standardized, so web sites can define queries any way they like.

The fragment at the end of a URL is usually the name of a subsection of the content. For web pages, this is the name of an anchor on the page and it’s often used to mark sections of the page.

All schemes can be divided into two types: hierarchical and nonhierarchical (also called opaque). A hierarchical scheme’s scheme-specific-part contains a path with a series of words separated by slashes. For schemes like “http”, this path often selects a file, such as “/products/index.htm”. A nonhierarchical scheme’s URL, however, contains other information, such as an email address for the “mailto” scheme.

For this article, we only care about hierarchical schemes like “http”, “ftp”, and “file”. Nonhierarchical schemes cannot have relative URLs.

So, for hierarchical schemes the scheme-specific-part contains a path, preceded by an authority like this:

[scheme “:”] [“//” authority] [path] [“?” query] [“#” fragment]

And the authority part’s format is:

[user [“:” pass] “@”] host [“:” port]

The authority includes a host name or IP address (v4 or v6). Optionally, the host may be followed by a port number (80 for a web server). Some schemes support a user name and password in front, but this is rare (and including a password in a URL is not very secure either).
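
As a quick illustration of the authority parts, here is what parse_url( ) returns for a URL that uses all of them. The URL, user name, and password are made up for the example.

```php
<?php
// A made-up FTP URL with user, password, host, and port in its authority.
$parts = parse_url( 'ftp://alice:secret@example.com:2121/pub/readme.txt' );

print( $parts['user'] . "\n" );  // alice
print( $parts['pass'] . "\n" );  // secret
print( $parts['host'] . "\n" );  // example.com
print( $parts['port'] . "\n" );  // 2121 (parse_url returns the port as an integer)
print( $parts['path'] . "\n" );  // /pub/readme.txt
```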

An absolute URL always has a scheme and it usually has an authority and path. A relative URL does not have a scheme and it may be missing some or all of the other parts of a URL. If the path of a relative URL doesn’t start with a slash, then it is also missing the first part of its path.
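
The sketch below makes the difference concrete by listing what each kind of URL is missing. missing_parts( ) is a hypothetical helper written only for this illustration, and it relies on parse_url( ), which is good enough for simple URLs like these.

```php
<?php
// Hypothetical helper: report which parts a URL would need from a base URL.
function missing_parts( $url )
{
    $parts   = parse_url( $url );
    $missing = array( );
    if ( !isset( $parts['scheme'] ) ) $missing[] = 'scheme';
    if ( !isset( $parts['host'] ) )   $missing[] = 'host';
    if ( isset( $parts['path'] ) && $parts['path'] !== '' && $parts['path'][0] != '/' )
        $missing[] = 'start of path';
    return $missing;
}

$examples = array(
    'http://example.com/logo.png',  // absolute
    '//cdn.example.com/logo.png',   // relative: no scheme
    '/images/logo.png',             // relative: no scheme or host
    'logo.png',                     // relative: no scheme, host, or start of path
);
foreach ( $examples as $url )
{
    $missing = missing_parts( $url );
    print( $url . ' is missing: ' .
        ( $missing ? implode( ', ', $missing ) : 'nothing' ) . "\n" );
}
```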

The point of this article, of course, is how to fill in the missing parts of a relative URL to make it an absolute URL.

Absolutizing a relative URL

When parts of a relative URL are missing, the RFC3986 specification explains how to copy them from an absolute base URL. Typically, the base URL is the URL of the web page containing the relative URL. There are several other ways to get a base URL:

  • In the web server’s HTTP header when downloading the page:
      • The Content-Location field contains the base URL.
  • In the web page’s <head> section:
      • The optional <base> tag’s href field contains the base URL.
      • The optional <meta> tag’s http-equiv attribute may contain a Content-Location field that contains the base URL.
  • Within the web page’s body:
      • An <applet> tag’s codebase attribute may contain the base URL for that applet only.
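
As a rough sketch of the most common case, the code below looks for a <base> tag and falls back to the page’s own URL. find_base_url( ) is a hypothetical helper, the regular expression is a deliberately crude stand-in for a real HTML parser, and it assumes the href value is already an absolute URL. The Content-Location header, <meta> tag, and <applet> cases are not handled.

```php
<?php
// Hypothetical helper: pick the base URL for a downloaded page.
// $html is the page text, $pageUrl is the URL it was downloaded from.
function find_base_url( $html, $pageUrl )
{
    // Crude match for <base ... href="..."> or <base ... href='...'>.
    $pattern = '~<base\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1~is';
    if ( preg_match( $pattern, $html, $m ) && $m[2] !== '' )
        return $m[2];
    return $pageUrl;
}

$html = '<html><head><base href="http://example.com/products/"></head></html>';
print( find_base_url( $html, 'http://example.com/index.htm' ) . "\n" );
print( find_base_url( '<html></html>', 'http://example.com/index.htm' ) . "\n" );
```

The first call prints the <base> tag’s URL, the second falls back to the page URL.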

The URL specification’s “absolutize” algorithm for relative URLs does the following:

  • Copies missing parts to the relative URL.
  • Concatenates the base and relative URL paths if needed.
  • Removes all “.” path segments.
  • Collapses all “..” path segments.

This is pretty straightforward but there are some things to watch out for:

  • To copy and merge URL parts, you first have to split apart the base and relative URLs. PHP’s standard parse_url( ) function can do this, but it has problems with complex and relative URLs.
  • After URLs are split apart, you have to expand percent-encoding into actual characters. For example, this converts a %20 into a space. This has to be done after splitting the URL so that percent-encodings of special URL characters like @ : ? # don’t confuse the URL splitting. If you use PHP’s parse_url( ) instead, you’ll need to decode the host, user, pass, path, query, and fragment parts yourself using PHP’s standard rawurldecode( ). Do not use PHP’s similar urldecode( ), which does not follow the URL specification and can garble some URLs.
  • When copying missing parts, the base URL’s fragment is never copied. The base URL’s query part is only copied when the relative URL has no path at all (which is rare).
  • A URL, after expanding percent-encoding, may contain multibyte UTF-8 characters (see RFC2718). String processing on the URL’s parts must use PHP’s multibyte character-aware functions. These all have names starting with “mb_”. For instance, this code uses mb_strrchr( ) to find the substring of the path up to the last slash. The standard preg_* functions also support UTF-8 when the “u” pattern modifier is included.
  • After you’ve got all the right URL parts, you have to percent-encode special characters. Only some parts of a URL can have percent-encoded characters, so you have to do this before reassembling the URL. Encode the host, user, pass, path, query, and fragment parts, but don’t touch the scheme and port. Also, only encode the host part if it is a name, not an IPv4 or IPv6 address. Be sure to use PHP’s rawurlencode( ) but not the incorrect urlencode( ) which does not follow the URL specification.
  • Finally, join the parts together again into a complete URL.
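
The difference between the raw and non-raw functions, and the reason for decoding only after splitting, can be seen in a short sketch. The sample strings are made up for the illustration.

```php
<?php
// The "raw" functions follow the URL specification: "+" is an ordinary
// character and a space is written as %20.
print( rawurldecode( 'a%20b+c' ) . "\n" );  // a b+c
print( urldecode( 'a%20b+c' ) . "\n" );     // a b c   ("+" wrongly becomes a space)
print( rawurlencode( 'a b+c' ) . "\n" );    // a%20b%2Bc
print( urlencode( 'a b+c' ) . "\n" );       // a+b%2Bc ("+" used for the space)

// Decode after splitting: the %3F here is a literal "?" inside the path,
// not the start of the query.
$parts = parse_url( 'http://example.com/what%3Fnow?x=1' );
$path  = rawurldecode( $parts['path'] );
print( $path . "\n" );                      // /what?now
print( $parts['query'] . "\n" );            // x=1
```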

So, here are the algorithm’s steps:

Step 1: split the relative URL into an associative array of parts.

$r = split_url( $relativeUrl );
if ( $r === FALSE )
    return FALSE;

Step 2: check if the relative URL is already absolute. If it is, update its path to remove “.” and “..” using the url_remove_dot_segments( ) function discussed later in this article. Then rebuild the URL and return it.

if ( !empty( $r['scheme'] ) )
{
    if (!empty( $r['path'] ) && $r['path'][0] == '/' )
        $r['path'] = url_remove_dot_segments( $r['path'] );
    return join_url( $r );
}

Step 3: split the base URL into its parts and make sure it is absolute (it must have at least a scheme and host). If it is absolute, copy its scheme to the relative URL.

$b = split_url( $baseUrl );
if ( $b === FALSE || empty( $b['scheme'] ) || empty( $b['host'] ) )
    return FALSE;
$r['scheme'] = $b['scheme'];

Step 4: check if the relative URL has a host part. If it does, the rest of the relative URL is complete. There’s nothing more to copy from the base URL. Update the relative URL’s path to remove “.” and “..”, then rebuild the URL and return it.

if ( isset( $r['host'] ) )
{
    if ( !empty( $r['path'] ) )
        $r['path'] = url_remove_dot_segments( $r['path'] );
    return join_url( $r );
}

Step 5: copy the missing authority parts from the base URL to the relative URL.

$r['host'] = $b['host'];
if ( isset( $b['port'] ) ) $r['port'] = $b['port'];
if ( isset( $b['user'] ) ) $r['user'] = $b['user'];
if ( isset( $b['pass'] ) ) $r['pass'] = $b['pass'];

Step 6: if the relative URL doesn’t have a path (rare), then use the base URL’s path and query. Since that base URL’s path is already absolute, it should already have “.” and “..” removed. So, rebuild the URL and return it.

if ( empty( $r['path'] ) )
{
    if ( !empty( $b['path'] ) )
        $r['path'] = $b['path'];
    if ( !isset( $r['query'] ) && isset( $b['query'] ) )
        $r['query'] = $b['query'];
    return join_url( $r );
}

Step 7: if the relative URL’s path doesn’t start with a slash, then merge the first part of the base URL’s path (up to the last slash) with the relative URL’s path. Note the use of mb_strrchr( ) for multibyte character string handling.

if ( $r['path'][0] != '/' )
{
    $basePath = isset( $b['path'] ) ? $b['path'] : '';
    $base = mb_strrchr( $basePath, '/', TRUE, 'UTF-8' );
    if ( $base === FALSE ) $base = '';
    $r['path'] = $base . '/' . $r['path'];
}
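
To make the merge in Step 7 easier to follow, here it is isolated in a small sketch. merge_paths( ) is a hypothetical helper name used only here, and the paths are made-up examples. It needs PHP’s mbstring extension, like the rest of the code in this article.

```php
<?php
// Hypothetical helper: put a relative path into the base path's folder.
function merge_paths( $basePath, $relativePath )
{
    // Everything before the last "/" of the base path (TRUE = "before needle").
    $dir = mb_strrchr( $basePath, '/', TRUE, 'UTF-8' );
    if ( $dir === FALSE ) $dir = '';
    return $dir . '/' . $relativePath;
}

print( merge_paths( '/products/index.htm', 'logo.png' ) . "\n" );  // /products/logo.png
print( merge_paths( '/products/', 'logo.png' ) . "\n" );           // /products/logo.png
print( merge_paths( '/index.htm', 'logo.png' ) . "\n" );           // /logo.png
print( merge_paths( '/café/index.htm', '../logo.png' ) . "\n" );   // /café/../logo.png
```

The last result still contains a “..” segment. That is expected at this point, since removing dot segments is the job of the next step.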

Step 8: update the path to remove “.” and “..”, rebuild it, and return.

$r['path'] = url_remove_dot_segments( $r['path'] );
return join_url( $r );

Removing dot segments

Paths are built from a series of segments (often folder names) separated by slashes. Segments named “.” and “..” have special meanings. If you think of a file path as a series of steps downward into folders, then:

  • A “.” segment means “stay here”.
  • A “..” segment means “go back one folder”.

A “.” is always redundant and can be safely removed. The path “/products/./index.htm” is the same as “/products/index.htm”.

A “..” can be removed along with the segment before it. The paths “/products/../logo.png” and “/logo.png” are equivalent.

The URL specification explains how to remove “.” and “..” by scanning the path character by character. You can do this more simply by splitting the path into segments at slashes and scanning through the path segment by segment. Afterwards, reassemble the path by adding slashes between the segments. This approach will handle paths with multiple “.” and “..” segments and even paths with too many “..” segments, like “/one/two/../../../../..”.

Dot segment removal must be careful to handle multibyte character strings in UTF-8. For instance, to split the path into an array of segments between slashes, it’s tempting to use explode( ). With UTF-8 that would actually work, because the “/” byte never appears inside a multibyte character, but this code uses preg_split( ) with the “u” pattern modifier so that the path is treated as UTF-8 throughout.

Step 1: explode the path at every “/” to create an array of path segments.

$inSegs  = preg_split( '!/!u', $path );

Step 2: loop through the segments. Push non-dot segments onto a stack. Skip “.” segments. And on a “..” segment, pop the stack to remove the previous segment.

$outSegs = array( );

foreach ( $inSegs as $seg )
{
    if ( $seg == '' || $seg == '.')
        continue;
    if ( $seg == '..' )
        array_pop( $outSegs );
    else
        array_push( $outSegs, $seg );
}

Step 3: implode the segment stack by adding a “/” between each segment to create a new path.

$outPath = implode( '/', $outSegs );

Step 4: if the original path started or ended with a “/”, put that slash back in the new path. To get the last character of a multibyte character string, we can’t safely use $path[strlen($path)-1]. Instead, search for the last “/” and see if it is at the end.

if ( $path[0] == '/' )
    $outPath = '/' . $outPath;
if ( $outPath != '/' &&
    (mb_strlen( $path, 'UTF-8' )-1) == mb_strrpos( $path, '/', 0, 'UTF-8' ) )
    $outPath .= '/';
return $outPath;
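
To try the four steps on the example paths from this section, here is a compact self-contained sketch. remove_dots_demo( ) is a hypothetical name chosen so it doesn’t clash with url_remove_dot_segments( ). It follows the same steps, but checks for the trailing slash with preg_match( ) instead of mb_strrpos( ), purely so that the sketch runs without the mbstring extension.

```php
<?php
// Hypothetical demo version of the dot-segment removal described above.
// Assumes $path is not empty.
function remove_dots_demo( $path )
{
    $outSegs = array( );
    foreach ( preg_split( '!/!u', $path ) as $seg )
    {
        if ( $seg === '' || $seg === '.' )
            continue;                    // empty or "stay here": skip
        if ( $seg === '..' )
            array_pop( $outSegs );       // "go back": drop the previous segment
        else
            $outSegs[] = $seg;           // ordinary segment: keep
    }
    $outPath = implode( '/', $outSegs );
    if ( $path[0] == '/' )
        $outPath = '/' . $outPath;       // restore the leading slash
    if ( $outPath != '/' && preg_match( '!/\z!u', $path ) )
        $outPath .= '/';                 // restore the trailing slash
    return $outPath;
}

print( remove_dots_demo( '/products/./index.htm' ) . "\n" );    // /products/index.htm
print( remove_dots_demo( '/products/../logo.png' ) . "\n" );    // /logo.png
print( remove_dots_demo( '/one/two/../../../../..' ) . "\n" );  // /
print( remove_dots_demo( '/a/b/../c/' ) . "\n" );               // /a/c/
```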

About Me:

Szabi Kisded

Hey there, I'm Szabi. At 30 years old, I quit my IT job and started my own business and became a full time WordPress plugin developer, blogger and stay-at-home dad. Here I'm documenting my journey earning an online (semi)passive income. Read more
