As you might already know, PHP is a widely used backend language that powers many popular CMSs, including WordPress. If you are getting started with WordPress or PHP development, you will find this article helpful.
You might already know how to parse HTML using Javascript or JQuery if you have ever dealt with DOM (Document Object Model) manipulation on the front-end.
Since Javascript runs on the client-side, it can interact with the browser DOM.
But what if we want to process HTML data on the server? In this post, let us look at some of the useful PHP classes that enable us to process HTML on the server side.
What is Parsing & What are its Uses?
Parsing (in this case) is the process of extracting or modifying useful information from an HTML or XML string. A parser gives us easy ways to query raw data instead of using regex.
Suppose you want to get all the links on a web page. PHP DOM parsing classes can help you.
Important DOM classes in PHP
There are around nineteen DOM-related classes in PHP. Some of the important ones are:
DOMDocument (extends DOMNode class)
DOMNode
DOMNodeList
DOMXPath
DOMElement (extends DOMNode class)
DOMDocument, Nodes & Elements
The DOMDocument is the first one to mention here. It takes HTML as input and returns an object that gives access to DOM elements. It can load HTML or XML from a string or file. The class defines several methods like getElementById which resemble the functions in Javascript.
$dom = new DOMDocument();

// examples
// methods to load HTML
$dom->loadHTML($html_string);
$dom->loadHTMLFile("path/to/htmlfile.html");

// methods to load XML
$dom->load("path/to/xmlfile.xml");
$dom->loadXML($xml_string);

// object of the DOMElement class which gives access to the document element
$documentElement = $dom->documentElement;

In this post, we will mainly focus on HTML manipulation rather than XML.
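Real-world pages are rarely perfectly valid, and loadHTML() tends to emit warnings for markup it does not recognise. The following is a minimal sketch of one way to keep those warnings out of the output, using PHP's libxml error functions. The sample markup in $html_string is a made-up assumption, and how many messages the parser reports depends on the installed libxml version.

```php
<?php
// Hypothetical sample markup; real pages often contain tags the parser complains about.
$html_string = '<html><body><main><p>Hello</p></main></body></html>';

// Collect parser messages internally instead of printing them as warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html_string);

// Read the collected messages, then clear them and restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($loaded) {
    echo "Parsed with " . count($errors) . " parser message(s)\n";
} else {
    echo "Parsing failed\n";
}
```

Restoring the previous setting keeps the change local to this snippet, which matters if other code in the same request relies on the default behaviour.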
Nodes
The DOM made from HTML is a tree-like structure made up of individual nodes. These nodes can be of any type, say an element, text, comment, attribute etc. DOMNode is the base class from which all types of node classes inherit.
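To make the idea of node types concrete, here is a small sketch that walks the children of one element and labels each node by its type. The markup and the id "box" are invented for illustration; the XML_*_NODE constants are the ones PHP defines for nodeType.

```php
<?php
$dom = new DOMDocument();
// Hypothetical markup containing a text node, a comment node and an element node.
$dom->loadHTML('<div id="box">Hello <!-- note --><b>world</b></div>');

$types = [];
foreach ($dom->getElementById('box')->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $types[] = 'element: ' . $node->nodeName;
    } elseif ($node->nodeType === XML_TEXT_NODE) {
        $types[] = 'text: ' . trim($node->nodeValue);
    } elseif ($node->nodeType === XML_COMMENT_NODE) {
        $types[] = 'comment: ' . trim($node->nodeValue);
    }
}

print_r($types);
```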
Elements
The DOMElement class extends DOMNode and represents the elements in your HTML markup. A DOMElement object can be any element, such as an image, div, span or table.
Practical Examples
Without going deeper into the theory, let us dive into some practical examples. First of all, we need some HTML data. For that, let us use one of the posts on this blog about image optimization.
We will do the following jobs with our sample HTML:
Select an element by ID
Get elements by tag name
Find elements by class
Find all links in a page
Insert an HTML element
Delete an element
Deal with attributes
Here is the curl request:
header("Content-Type:application/json");
$url = "https://www.coralnodes.com/best-wordpress-image-optimization-plugins/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$res = curl_exec($ch);
curl_close($ch);

The variable $res contains the whole HTML from the web page.
Selecting by ID
If you look at our sample page, you can see that it contains two tables. Suppose I want to find the number of rows in the first table. Using Chrome DevTools, I found that the required table has the ID tablepress-3.
$dom = new DOMDocument();
$dom->loadHTML($res);
$table = $dom->getElementById("tablepress-3"); // DOMElement
$child_elements = $table->getElementsByTagName("tr"); // DOMNodeList
$row_count = $child_elements->length - 1;
echo "No. of rows in the table is " . $row_count;

The above code gives the output:
No. of rows in the table is 10
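getElementById() returns null when no element carries the requested ID, so calling a method on the result without a check can end in a fatal error. Below is a hedged sketch of a guarded version. The $res string here is a small stand-in table I made up so the snippet runs on its own; it is not the markup of the sample page.

```php
<?php
// Hypothetical stand-in for the fetched page, with one header row and two data rows.
$res = '<table id="tablepress-3">'
     . '<tr><th>Plugin</th></tr>'
     . '<tr><td>A</td></tr>'
     . '<tr><td>B</td></tr>'
     . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($res);

$table = $dom->getElementById('tablepress-3');
if ($table === null) {
    echo "Table not found\n";
} else {
    // Subtract one row, assuming the first row is a header row.
    $row_count = $table->getElementsByTagName('tr')->length - 1;
    echo "No. of rows in the table is " . $row_count . "\n";
}
```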
Selecting a Tag by Its Name
Both the DOMDocument and DOMElement classes have the method getElementsByTagName() which allows us to select elements using the name of the tag. For example, if we have to get all the h2 headings from a page, we can use this function.
$dom = new DOMDocument();
$dom->loadHTML($res);
$h2s = $dom->getElementsByTagName("h2");
foreach ($h2s as $h2) {
    echo $h2->textContent . "\n";
}

The result:
Test Images
Results after Compression
ShortPixel
reSmush.it
Imagify
TinyPNG Compress JPEG & PNG Images
Kraken.IO
EWWW Image Optimizer
WP Smush
Do you actually need a Plugin to Optimize Images?
Consclusion
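Since DOMElement also has getElementsByTagName(), the search can be limited to one part of the page. Here is a short sketch under the assumption that the headings we care about live inside a container with the id "content"; both the id and the markup are invented for the example.

```php
<?php
// Hypothetical markup: one heading inside the container and one outside it.
$html = '<div id="content"><h2>Inside</h2></div><h2>Outside</h2>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Calling the method on the element searches only its descendants.
$content = $dom->getElementById('content');
$inside = [];
foreach ($content->getElementsByTagName('h2') as $h2) {
    $inside[] = trim($h2->textContent);
}

print_r($inside);
echo "Headings in the whole document: " . $dom->getElementsByTagName('h2')->length . "\n";
```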
Find elements with a particular class
In Javascript, the querySelectorAll() method makes it easy to select elements using a CSS selector. In PHP, it is not that straightforward. Instead, we have to use the DOMXPath class to query and traverse the DOM tree.
Example: Select all the tables with the class tablepress.
$dom = new DOMDocument();
$dom->loadHTML($res);
$xpath = new DOMXPath($dom);
$tables = $xpath->query("//table[contains(@class,'tablepress')]");
$count = $tables->length;
echo "No. of tables " . $count;

Just like getElementsByTagName(), the query() method of DOMXPath also returns a DOMNodeList. It takes an XPath expression as its argument. XPath expressions are versatile enough to express almost any kind of query.
If you are new to XPath, this cheatsheet from Devhints.io contains a wide list of CSS & JS selectors and their corresponding XPath expressions. It will help you in finding out the appropriate expression for the query you want to perform.
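One thing to watch with contains(@class, ...) is that it matches substrings, so a class like "my-tablepress-copy" would also be selected. A common XPath 1.0 idiom pads the class attribute with spaces to match a whole class name. The sketch below compares the two; the class names in the sample markup are assumptions made up for the demonstration.

```php
<?php
// Hypothetical markup: only the first table really has the class "tablepress".
$html = '<table class="tablepress tablepress-id-3"><tr><td>1</td></tr></table>'
      . '<table class="my-tablepress-copy"><tr><td>2</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring match: any class attribute containing the text "tablepress".
$loose = $xpath->query('//table[contains(@class, "tablepress")]');

// Whole-word match: pad the attribute with spaces and look for " tablepress ".
$exact = $xpath->query(
    '//table[contains(concat(" ", normalize-space(@class), " "), " tablepress ")]'
);

echo "Substring match: " . $loose->length . "\n";
echo "Whole class match: " . $exact->length . "\n";
```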
Extract links from a page
Parsing opens a number of opportunities. Extracting the links from a web-page is one such use. That’s how crawlers crawl the world wide web.
Suppose I want to find all the external links to a particular website on a web page. In our sample page, I want to find all the outbound links from the blog post to the wordpress.org website. This is how I did it.
$dom = new DOMDocument();
$dom->loadHTML($res);
$links = $dom->getElementsByTagName("a");
$urls = [];
foreach ($links as $link) {
    $url = $link->getAttribute("href");
    $parsed_url = parse_url($url);
    if (isset($parsed_url["host"]) && $parsed_url["host"] === "wordpress.org") {
        $urls[] = $url;
    }
}
var_dump($urls);
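The strict comparison above skips hosts such as www.wordpress.org. If subdomains should count too, the check can be loosened. The following is a rough sketch, not the method used in this post: links_to_host() is a hypothetical helper name, and the sample links are made up.

```php
<?php
// Hypothetical helper: collect hrefs whose host is $host or a subdomain of it.
function links_to_host(string $html, string $host): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $suffix = '.' . $host;
    $urls = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $link_host = parse_url($href, PHP_URL_HOST);
        if (!is_string($link_host)) {
            continue; // relative URL, fragment, or an href parse_url() cannot handle
        }
        $link_host = strtolower($link_host);
        // Accept the exact host, or any host ending in ".<host>".
        if ($link_host === $host || substr($link_host, -strlen($suffix)) === $suffix) {
            $urls[] = $href;
        }
    }
    return $urls;
}

// Made-up sample links for the demonstration.
$sample = '<a href="https://wordpress.org/plugins/">A</a>'
        . '<a href="https://WWW.wordpress.org/themes/">B</a>'
        . '<a href="/about/">C</a>'
        . '<a href="https://example.com/">D</a>';

print_r(links_to_host($sample, 'wordpress.org'));
```

Comparing against ".wordpress.org" rather than "wordpress.org" avoids matching unrelated hosts that merely end with the same letters.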
Modifying & Saving HTML
So far, we have seen how to extract or select the required data from HTML. Now, let us see how we can modify it by adding or deleting elements and attributes.
Inserting new HTML element into the document
In this example, we will see how to add an image with a link after the first paragraph. This is one way to insert banner ads into posts.
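$dom = new DOMDocument();
$dom->loadHTML($html);
$ps = $dom->getElementsByTagName("p");
$first_para = $ps->item(0);

// markup for an image wrapped in a link
$html_to_add = "<a href='...'><img src='...'></a>";
$dom_to_add = new DOMDocument();
$dom_to_add->loadHTML($html_to_add);

// loadHTML() wraps the fragment in <html><body>, so take the node inside <body> instead of documentElement
$new_element = $dom_to_add->getElementsByTagName("body")->item(0)->firstChild;
$imported_element = $dom->importNode($new_element, true);
$first_para->parentNode->insertBefore($imported_element, $first_para->nextSibling);

$output = $dom->saveHTML();
echo $output;

Note that the saveHTML() method returns the manipulated HTML string.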
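When the new markup is small, it can also be built node by node with createElement(), which avoids a second document and the import step. Here is a minimal sketch of that alternative; the URLs and the alt text are placeholders I made up, not values from the original example.

```php
<?php
$html = '<p>First paragraph</p><p>Second paragraph</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Hypothetical banner: the href, src and alt values are placeholders.
$link = $dom->createElement('a');
$link->setAttribute('href', 'https://example.com/offer');

$img = $dom->createElement('img');
$img->setAttribute('src', 'https://example.com/banner.png');
$img->setAttribute('alt', 'Banner');
$link->appendChild($img);

// Insert the link right after the first paragraph.
// If nextSibling is null, insertBefore() appends the node at the end of the parent.
$first_para = $dom->getElementsByTagName('p')->item(0);
$first_para->parentNode->insertBefore($link, $first_para->nextSibling);

echo $dom->saveHTML();
```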
Deleting an element from the document
To delete an element from our HTML, we can make use of the removeChild() method, which DOMElement inherits from the DOMNode class.
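$html = "<p>This is our first paragraph</p><div class='del'>Delete this</div><p>This is our second paragraph</p><p>This is our third paragraph</p><div class='del'>Delete this too</div>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$documentElement = $dom->documentElement;
echo $dom->saveHTML();

$xpath = new DOMXPath($dom);
$elems = $xpath->query("//div[@class='del']");
foreach ($elems as $elem) {
    $elem->parentNode->removeChild($elem);
}
echo "-------after deletion--------";
echo $dom->saveHTML();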
Here we have performed an XPath query to find all the elements with the class del. Then we remove each node from the document by iterating over the DOMNodeList object using a foreach loop.
This is our first paragraph
Delete this
This is our second paragraph
This is our third paragraph
Delete this too
-------after deletion--------
This is our first paragraph
This is our second paragraph
This is our third paragraph
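A related pitfall: the list returned by getElementsByTagName() is live, so removing nodes while looping over it directly can skip elements. One common workaround, sketched below with made-up markup, is to copy the list into a plain array first and then remove the nodes.

```php
<?php
// Hypothetical markup with two elements to delete.
$html = '<p>Keep</p><div class="del">Delete this</div><div class="del">Delete this too</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live DOMNodeList into an array so removals do not affect the loop.
$divs = iterator_to_array($dom->getElementsByTagName('div'));
foreach ($divs as $div) {
    if ($div->getAttribute('class') === 'del') {
        $div->parentNode->removeChild($div);
    }
}

echo "Remaining divs: " . $dom->getElementsByTagName('div')->length . "\n";
```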
Manipulating Attributes
Classes and IDs are not the only attributes we can access in PHP DOM. The DOMElement class has several methods which can get, set or remove attributes from an element. These methods look similar to those in Javascript, so you will find them easy to understand.
getAttribute($attribute_name): gets the value of an attribute
setAttribute($attribute_name, $attribute_value): sets the value of an attribute
hasAttribute($attribute_name): checks whether an element has a certain attribute and returns true or false

$html = "<span data-action='...'>Content</span>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$elem = $dom->getElementsByTagName("span")->item(0);
if ($elem->hasAttribute("data-action")) {
    echo 'attribute value is "' . $elem->getAttribute("data-action") . '"';
    $elem->setAttribute("data-action", "hide");
    echo 'updated attribute value is "' . $elem->getAttribute("data-action") . '"';
}
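Removing an attribute works the same way with removeAttribute(), and the attributes property lets us loop over whatever is left. A small sketch follows; the attribute values "show" and "Example" are assumptions chosen only for the demonstration.

```php
<?php
// Hypothetical markup with two attributes on the span.
$html = '<span data-action="show" title="Example">Content</span>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$elem = $dom->getElementsByTagName('span')->item(0);

// Remove one attribute and confirm that it is gone.
$elem->removeAttribute('title');
var_dump($elem->hasAttribute('title'));

// List the attributes that remain on the element.
foreach ($elem->attributes as $attr) {
    echo $attr->name . '=' . $attr->value . "\n";
}
```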
Conclusion
So far, we have looked into some of the important DOM APIs in PHP. I hope this helps you get started with parsing HTML and XML data. If anything is unclear, feel free to ask in the comments.