In our previous Python tutorial, we explained how to develop a User Management System with Python, Flask, and MySQL. In this tutorial, we will explain how to do web scraping using Python.
Now the question arises: what is web scraping? Web scraping is the process of extracting data from websites. Scraping software makes a request to a website or web page and extracts the underlying HTML code, along with its data, for further use.
In this tutorial, we will discuss how to perform web scraping using the requests and beautifulsoup libraries in Python.
So let’s get started with web scraping.
Application Setup
First, we will create our application directory web-scraping-python using the command below.
$ mkdir web-scraping-python
Then we move into the project directory:
$ cd web-scraping-python
Install Required Python Libraries
We need the requests and beautifulsoup libraries to do the scraping, so we need to install them.
- requests: This module provides methods to make HTTP requests (GET, POST, PUT, PATCH, or HEAD). We need it to make GET requests to another website. We will install it using the command below:
pip install requests
- beautifulsoup: This library is used for parsing HTML and XML documents. We will install it using the command below (the PyPI package for the current version is beautifulsoup4):
pip install beautifulsoup4
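To confirm both libraries installed correctly, a quick optional sanity check like the following should print their version numbers:

import requests
import bs4

# Both imports succeeding means the installs worked
print(requests.__version__)
print(bs4.__version__)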
Making an HTTP Request to a URI
We will make an HTTP GET request to the given URI. The GET method sends the encoded information along with the page request.
# Import requests library
import requests

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/")
print(req.content)
When we make a request to a URI, it returns a response object. The response object has many attributes (status_code, url, content) that give details of the request and response.
Output:
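As a minimal illustration of those response attributes, using the same URL as above, we can inspect the response before working with its content:

import requests

req = requests.get("https://www.codewithlucky.com/")

# Basic details of the request/response cycle
print(req.status_code)                  # e.g. 200 when the request succeeds
print(req.url)                          # final URL after any redirects
print(req.headers.get('Content-Type'))  # e.g. text/html; charset=UTF-8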
Scraping Information Using the BeautifulSoup Library
We now have a response object after making the HTTP request to the URI. But the response data is still not useful, as it needs to be parsed to extract the useful parts.
So now we will parse that response data using the BeautifulSoup library. We will import BeautifulSoup and parse the response HTML with it.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())
Output:
We have parsed and pretty-printed the response HTML using prettify(), but it’s still not very useful, as it displays the entire response HTML.
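Before diving into classes and ids, a small example of pulling out just one element, say the page title, shows how parsing makes the response easier to work with (assuming the page has a title tag):

import requests
from bs4 import BeautifulSoup

req = requests.get("https://www.codewithlucky.com/")
soup = BeautifulSoup(req.content, 'html.parser')

# Grab a single element instead of printing the whole document
title = soup.find('title')
if title is not None:
    print(title.text)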
1. Extracting Information By Element Class
Now we want to extract some specific HTML from the website below. We will extract all paragraph text from a specific class on the page.
We can see in the page source that the paragraphs are under <div class="entry-content">, so we will find all p tags present in that div. We will use the find() function to get the object for that specific class, then use the find_all() function to get all p tags from that object.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Find the div with class entry-content, then print every p tag inside it
entryContent = soup.find('div', class_='entry-content')
for paragraph in entryContent.find_all('p'):
    print(paragraph.text)
Output:
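One caveat worth noting: find() returns None when no matching element exists, so on a page without that div the loop above would raise an AttributeError. A defensive variant guards against that:

entryContent = soup.find('div', class_='entry-content')
if entryContent is None:
    print("No div with class 'entry-content' found on this page")
else:
    for paragraph in entryContent.find_all('p'):
        print(paragraph.text)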
2. Extracting Information By Element Id
Now we will extract all top menu text by element id. We have the following HTML source.
We will find the div object by id, then find the ul element within that object. Then we will find all li elements within that ul and get their text.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Find the wrapper div by id, then the nav list inside it
wrapper = soup.find('div', id='wrapper')
navBar = wrapper.find('ul', class_='navbar-nav')
for item in navBar.find_all('li'):  # 'item' avoids shadowing the built-in list
    print(item.text)
Output:
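As an aside, BeautifulSoup’s select() method accepts CSS selectors, so the same id-then-class lookup can be written in one line. This is an equivalent sketch reusing the soup object from the example above, not what produced the output shown:

# CSS selector equivalent of the chained find() calls above
for item in soup.select('#wrapper ul.navbar-nav li'):
    print(item.text)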
3. Extracting Links
Now we will extract all link data from a particular div.
We will find the div object with class entry-content, then find all anchor (a) tags and loop through them to get each anchor’s href and text.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Loop over every anchor tag inside the entry-content div
entryContent = soup.find('div', class_='entry-content')
for link in entryContent.find_all('a'):
    print(link.text)
    print(link.get('href'))
Output:
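Note that href values can be relative (e.g. /about/ rather than a full URL). If you need absolute URLs, Python’s standard urllib.parse.urljoin can resolve them against the page URL, as in this sketch built on the entryContent object from above:

from urllib.parse import urljoin

baseUrl = "https://www.codewithlucky.com/"
for link in entryContent.find_all('a'):
    href = link.get('href')
    if href:
        # Resolve relative links against the page URL
        print(urljoin(baseUrl, href))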
4. Saving Scraped Data to CSV
Now we will save the scraped data to a CSV file. Here we will extract the anchor details and save them into a CSV file.
We will import the csv library, get all the link data and append it to a list, then save the list data to a CSV file.
import requests
from bs4 import BeautifulSoup
import csv

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Collect the text and URL of every anchor into a list of dicts
anchorsList = []
entryContent = soup.find('div', class_='entry-content')
for link in entryContent.find_all('a'):
    anchor = {}
    anchor['Link text'] = link.text
    anchor['Link url'] = link.get('href')
    anchorsList.append(anchor)

# Write the collected rows to a CSV file with a header row
fileName = 'links.csv'
with open(fileName, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Link text', 'Link url'])
    w.writeheader()
    w.writerows(anchorsList)
Output:
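To verify the file was written as expected, we can read links.csv back with csv.DictReader; this is just a sanity check on our own output:

import csv

# Read the CSV back and print each row
with open('links.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['Link text'], '->', row['Link url'])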