Web scraping is the automated extraction of data from the internet. We have already seen many web scraping tools, such as BeautifulSoup with Python, Diffbot (a GUI-based tool that requires no coding), and Puppeteer with Node.js. But is it possible to scrape data from a website using PHP?
Yes. Goutte makes it easy for developers to scrape data with PHP. Goutte was originally written by Fabien Potencier, the creator of the Symfony framework, and is now maintained by FriendsOfPHP. It is a library that requires PHP 5.5+ and Guzzle 6+. Guzzle is the PHP HTTP client that Goutte uses to send HTTP requests. Some advantages of Guzzle are as follows:
- Simple interface for building POST requests.
- The same interface can send both synchronous and asynchronous requests.
- PSR-7 interfaces for requests.
- No hard dependency on cURL or PHP streams.
Read more about Guzzle here.
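To make the synchronous and asynchronous point concrete, here is a minimal sketch of both call styles through the same Guzzle client. It assumes Guzzle was installed with Composer so that vendor/autoload.php exists; the function names fetchStatusSync and fetchStatusAsync are hypothetical helpers written for this illustration, not part of Guzzle.

```php
<?php
use GuzzleHttp\Client;
use Psr\Http\Message\ResponseInterface;

// Assumption: Guzzle was installed with Composer next to this script.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

/**
 * Hypothetical helper: fetch a URL synchronously and return the HTTP status code.
 */
function fetchStatusSync(Client $client, $url)
{
    // request() blocks until the response has arrived.
    $response = $client->request('GET', $url);

    return $response->getStatusCode();
}

/**
 * Hypothetical helper: fetch a URL asynchronously and return the HTTP status code.
 */
function fetchStatusAsync(Client $client, $url)
{
    // requestAsync() returns a promise immediately instead of a response.
    $promise = $client->requestAsync('GET', $url)->then(
        function (ResponseInterface $response) {
            return $response->getStatusCode();
        }
    );

    // wait() resolves the promise at the point where the result is needed.
    return $promise->wait();
}
```

Both helpers take the same GuzzleHttp\Client instance, for example one created with new Client(), which is what the shared-interface claim amounts to in practice.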
PHP is a server-side scripting language and a powerful tool for building dynamic, interactive web pages. It is widely used and a strong competitor to Microsoft’s ASP. Yet when we talk about data extraction from the internet, PHP is usually the last thing that comes to mind. Goutte is built on components of the Symfony framework.
Symfony is a set of PHP components, a philosophy, a web application framework, and a community, all working together in harmony. It is both a PHP framework and a set of reusable components and libraries. Symfony was created by SensioLabs, was published as free software in 2005, and is released under the MIT license.
Symfony is used by a large number of developers and offers many useful features, such as:
- Building complex web applications.
- A standalone PHP micro-framework.
- Very fast.
- A stable framework.
- A good open source community and contributions.
Read more about the Symfony framework here.
Goutte provides a decent API to crawl websites and extract data from HTML/XML documents.
That means you can log in to websites, submit forms using POST, upload files, and much more, all with just the Goutte library. You can run it on your server or on a local computer.
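As a rough illustration of crawling and extracting, the sketch below requests a page and collects the text of every node that matches a CSS selector. It assumes Goutte was installed with Composer; extractTexts is a hypothetical helper name, and the URL and selector mentioned afterwards are only example values.

```php
<?php
use Goutte\Client;

// Assumption: Goutte was installed with Composer next to this script.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

/**
 * Hypothetical helper: return the trimmed text of every node on $url
 * that matches the CSS selector $selector.
 */
function extractTexts(Client $client, $url, $selector)
{
    // request() fetches the page and returns a Symfony DomCrawler Crawler.
    $crawler = $client->request('GET', $url);

    // filter() narrows the document with a CSS selector; each() maps a
    // closure over the matched nodes and returns the results as an array.
    return $crawler->filter($selector)->each(function ($node) {
        return trim($node->text());
    });
}
```

A call such as extractTexts(new Client(), 'https://www.symfony.com/blog/', 'h2 > a') would return the matching headline texts, provided the page still uses that markup.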
First, let’s see how to set up a PHP environment: what the requirements are and how to install each additional component, one by one.
Requirements for Goutte
As discussed above, Goutte depends on PHP 5.5+ and Guzzle 6+.
After downloading PHP, unzip it and add the extracted directory path to the PATH environment variable. For the PHP installation procedure, visit here. To check that PHP is installed properly, print its version from the command line (for example, with php -v).
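The version requirement can also be checked from inside PHP itself. The following is a small sketch that compares the running interpreter against the 5.5 minimum stated above; meetsMinimumVersion is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: report whether version string $current is at least $minimum.
 */
function meetsMinimumVersion($current, $minimum)
{
    // version_compare() understands dotted version strings such as "7.4.3".
    return version_compare($current, $minimum, '>=');
}

$required = '5.5.0'; // minimum PHP version stated for Goutte in this article

if (meetsMinimumVersion(PHP_VERSION, $required)) {
    echo 'PHP ' . PHP_VERSION . ' meets the ' . $required . " minimum.\n";
} else {
    echo 'PHP ' . PHP_VERSION . ' is older than the required ' . $required . ".\n";
}
```

Saved under any file name and run with the php command-line binary, it also confirms that the binary is reachable on your path.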
Installing Goutte
Now install Goutte using Composer. This adds fabpot/goutte as a required dependency in your composer.json file:
composer require fabpot/goutte
A web app that scrapes the GitHub repository list from your account using Goutte, a PHP library
The following example is taken from here. We are going to create a script that logs in to your personal GitHub account and displays your repository list in the browser.
Create a project folder and give it a name.
Download the Goutte library repository from GitHub using the command below. The clone creates a directory named “Goutte”, which we will use in the following steps.
git clone https://github.com/FriendsOfPHP/Goutte.git
Now use Composer to initialize your local directory with a composer.json file and install the Goutte dependencies in it.
composer require fabpot/goutte
Let’s first create a basic interface so that anyone can extract their repository list from GitHub by entering their username and password into a simple form.
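The original form markup is not reproduced here, so the following is only a sketch of what such an interface might look like. The field names git_email and git_pwd match the ones the script reads from $_POST; the renderLoginForm helper and the index.php target are hypothetical.

```php
<?php
/**
 * Hypothetical helper: build a minimal HTML login form that posts
 * the fields git_email and git_pwd to $action.
 */
function renderLoginForm($action)
{
    // Escape the target so it is safe inside an HTML attribute.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return '<form method="post" action="' . $action . '">' . "\n"
        . '  <input type="text" name="git_email" placeholder="GitHub username or email">' . "\n"
        . '  <input type="password" name="git_pwd" placeholder="Password">' . "\n"
        . '  <button type="submit">List repositories</button>' . "\n"
        . '</form>' . "\n";
}

// Assumption: the page that renders the form also handles the POST request.
echo renderLoginForm('index.php');
```

Posting the form back to the same script keeps the example to a single file, which is a design choice made for this sketch rather than a requirement.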
The script first checks that both fields were submitted and are not empty:
if (isset($_POST["git_email"]) && isset($_POST["git_pwd"]) && !empty($_POST["git_email"]) && !empty($_POST["git_pwd"]))
Next, import the libraries: Client.php from the Goutte directory that we downloaded with git clone, and the vendor autoload.php, which autoloads the PHP classes.
require_once("vendor/autoload.php");
require_once("Goutte/Goutte/Client.php");
use Goutte\Client;
$client = new Client();
Inspect the GitHub login page and search for the login elements: username and password.
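Putting these steps together, the login and extraction might be sketched as follows. This is not the original article's code. It assumes that GitHub's login form has a submit button labelled "Sign in" and fields named login and password, which should be confirmed by inspecting the page; hasCredentials and scrapeAfterLogin are hypothetical helper names, and the selector for repository entries is left to the caller because it depends on GitHub's current markup.

```php
<?php
use Goutte\Client;

// Assumption: Goutte was installed with Composer next to this script.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

/**
 * Hypothetical helper: true when both form fields are present and non-empty.
 * empty() already covers the "not set" case, so isset() is not needed.
 */
function hasCredentials(array $post)
{
    return !empty($post['git_email']) && !empty($post['git_pwd']);
}

/**
 * Hypothetical helper: log in through the GitHub login page and return the
 * text of every node matching $selector on the page shown after login.
 */
function scrapeAfterLogin(Client $client, $username, $password, $selector)
{
    $crawler = $client->request('GET', 'https://github.com/login');

    // Assumption: the button label and field names below match the login page.
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array(
        'login'    => $username,
        'password' => $password,
    ));

    // $selector is supplied by the caller after inspecting the page.
    return $crawler->filter($selector)->each(function ($node) {
        return trim($node->text());
    });
}
```

In the script, the call would be guarded by hasCredentials($_POST), passing $_POST["git_email"] and $_POST["git_pwd"] as the username and password.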
Conclusion
Goutte is quite fast, can imitate basic user actions, supports async requests, and doesn’t even require a browser.