Web scraping is a way to automatically extract data from the internet. We have seen many web scraping tools so far, such as BeautifulSoup with Python, Diffbot (a GUI-based tool that requires no coding), and Puppeteer with Node.js. But is it possible to scrape data from a website using PHP?
Yes, Goutte makes it easy for developers to scrape data with PHP. Goutte was originally written by Fabien Potencier, the creator of the Symfony framework, and is now maintained by FriendsOfPHP. Goutte requires PHP 5.5+ and Guzzle 6+. Guzzle is a PHP HTTP client that Goutte uses to send HTTP requests. Some advantages of Guzzle are as follows:
Simple interface for building POST requests.
The same interface can send both synchronous and asynchronous requests.
PSR-7 interfaces for requests.
No hard dependency on cURL or PHP streams.
Read more about Guzzle here.
PHP is a server-side scripting language. It is a powerful tool for making dynamic and interactive web pages, it is widely used, and it is a strong competitor to Microsoft's ASP. Yet when we talk about data extraction from the internet, PHP is the last thing that comes to mind. Goutte is built on components of the Symfony framework.
Symfony is a set of PHP components, a web application framework, a philosophy, and a community, all working together in harmony. It is both a PHP framework and a set of reusable components/libraries. Symfony was created by SensioLabs, published as free software in 2005, and released under the MIT license.
Symfony is used by large numbers of developers and contains many great features like:
Create complex web applications.
Standalone PHP micro framework.
Very fast.
Stable framework.
Good open source community and contribution.
Read more about the Symfony framework here.
Goutte provides a decent API to crawl websites and extract data from HTML/XML documents.
That means you can log in to websites, submit forms using POST, upload files, and more, all by using Goutte on your server. You can also run it on a local computer.
Getting Started
First, let's see how to set up a PHP environment, what the requirements are, and how to install each additional library one by one.
Requirements for Goutte
As discussed above, Goutte depends on PHP 5.5+ and Guzzle 6+.
After downloading PHP, unzip it and add the extracted directory path to your PATH environment variable. For the PHP installation procedure, visit here. To check that PHP is installed properly, use the command below:
php --version
Next, install Guzzle using Composer:
composer require guzzlehttp/guzzle
Installing Goutte
Now install Goutte using Composer. This adds fabpot/goutte as a required dependency in your composer.json file:
composer require fabpot/goutte
Example
A web app that scrapes the repository list from your GitHub account using Goutte, a PHP library.
The following example is taken from here. We are going to create a script that logs in to your personal GitHub account and prints the full repository list in your browser.
Create a project folder and give it a name.
Download the Goutte library repository from GitHub using the command below. This gives you a directory named "Goutte", which we will use in the following steps.
git clone https://github.com/FriendsOfPHP/Goutte.git
Now use the composer command to set up a composer.json file in your local directory and install the Goutte dependencies in it:
composer require fabpot/goutte
Let's first create a basic interface so that anyone can extract their repository list from GitHub by entering their username and password into a form.
[Form: username and password fields with a Submit button]
This is a client-side user interface from where our scraper is going to read the username and password.
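A form like the one described above could be sketched as follows. The field names git_email and git_pwd come from the $_POST keys that the script checks; the renderLoginForm helper, the labels, and the empty action (which posts back to the same page) are assumptions made for illustration, not the original markup.

```php
<?php
// Sketch of the login form page. Only the field names (git_email, git_pwd)
// come from the article's script; the helper and layout are hypothetical.

function renderLoginForm(string $action = ''): string
{
    // Escape the action URL so it is safe inside an HTML attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<form method="post" action="{$safeAction}">
  <label>Username or email <input type="text" name="git_email" required></label>
  <label>Password <input type="password" name="git_pwd" required></label>
  <button type="submit">Submit</button>
</form>
HTML;
}

// An empty action submits the form to the page that rendered it.
echo renderLoginForm();
```

Posting back to the same page keeps the form and the scraping logic in one PHP file, which suits the single-script setup used in this example.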
Check whether the page is submitting the data. This PHP code runs first when the user hits the submit button after filling in the form:
if (isset($_POST["git_email"]) && isset($_POST["git_pwd"]) && !empty($_POST["git_email"]) && !empty($_POST["git_pwd"])) {
    // the scraping steps below go inside this block
}
Import the libraries: vendor/autoload.php, which autoloads PHP classes, and Client.php from the Goutte directory that we downloaded with the git clone command. Then create the client:
require_once("vendor/autoload.php");
require_once("Goutte/Goutte/Client.php");
// fully qualified class name, so no "use" statement is needed inside the if block
$client = new \Goutte\Client();
Inspect the GitHub login page and find the login elements: username and password.
Initialize the crawler variable with a GET request to the URL:
$crawler = $client->request("GET", "https://github.com/login");
Now that we know where the username and password fields are, fill in the form and submit it:
$form = $crawler->selectButton("Sign in")->form();
$form->setValues(["login" => $_POST["git_email"], "password" => $_POST["git_pwd"]]);
$crawler = $client->submit($form);
Check whether the login was successful by looking for the meta tag named "octolytics-actor-login":
$username = "";
// capture $username by reference so the callback can set it
$crawler->filter("meta")->each(function ($node) use (&$username) {
    if (trim((string) $node->attr("name")) == "octolytics-actor-login") {
        $username = $node->attr("content");
    }
});
$crawler = $client->request("GET", "https://github.com/" . $username . "?tab=repositories");
Let's inspect the repository page. All the repository names are inside anchor tags within list items that have the class "source". We can use the filter function to extract the text of those anchor tags with filter("li.source a"):
$crawler->filter("li.source a")->each(function ($node) {
    // skip links whose text is only a number
    if (is_numeric($node->text()) === false) {
        echo $node->text();
        echo "<br>"; // line break between names (assumed separator)
    }
});
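To see what a selector like li.source a picks out without logging in to GitHub, here is a dependency-free sketch that uses PHP's built-in DOMDocument and DOMXPath instead of Goutte. The extractRepoNames helper and the sample HTML are hypothetical and only imitate the structure described above; GitHub's real markup may differ.

```php
<?php
// Dependency-free sketch: imitates filter("li.source a") with DOMXPath.
// The sample HTML is hypothetical markup, not GitHub's actual page.

function extractRepoNames(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector "li.source a"
    $query = "//li[contains(concat(' ', normalize-space(@class), ' '), ' source ')]//a";

    $names = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        // Same rule as the article's loop: ignore purely numeric link text.
        if ($text !== '' && !is_numeric($text)) {
            $names[] = $text;
        }
    }
    return $names;
}

$sampleHtml = <<<HTML
<ul>
  <li class="public source"><a href="/user/repo-one">repo-one</a> <a href="/user/repo-one/stargazers">12</a></li>
  <li class="private source"><a href="/user/repo-two">repo-two</a></li>
  <li class="public fork"><a href="/user/forked">forked</a></li>
</ul>
HTML;

print_r(extractRepoNames($sampleHtml));
```

With the sample markup, only the links inside list items carrying the source class are kept, and the numeric link is skipped, which is the same filtering the Goutte callback performs.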
Let's run our script in the browser. To create a lightweight server for our web app, open a terminal/command prompt (CMD) in this directory and run the following command:
php -S 127.0.0.1:8000
[Image: custom GitHub login page]
The output is shown in the browser. It lists all repositories, including private ones, because we scraped the data after logging in and therefore have full access to the user's data.
[Image: output]
Conclusion
Goutte is quite fast, can imitate basic user actions, builds on Guzzle (which supports asynchronous requests), and does not require a browser.
We saw a web application that can scrape data from a user's GitHub account. Goutte, a friend of PHP, can work alongside a client-side interface, take user input, and scrape accordingly with full access to the account. Goutte has some drawbacks too: it does not support JavaScript, and it cannot take screenshots as Puppeteer can. Overall, Goutte is a lightweight wrapper on top of well-established libraries.