In our previous Python tutorial, we explained how to develop a User Management System with Python, Flask, and MySQL. In this tutorial, we will explain how to do web scraping using Python.
Now the question arises: what is web scraping? Web scraping is the process of extracting data from websites. Scraping software makes a request to a website or web page and extracts the underlying HTML code, along with its data, for further use.
In this tutorial, we will discuss how to perform web scraping using the requests and beautifulsoup libraries in Python.
So let’s get started with web scraping.
Application Setup
First, we will create our application directory web-scraping-python using the command below.
$ mkdir web-scraping-python
Then we move into the project directory:
$ cd web-scraping-python
Install Required Python Libraries
We need the requests and beautifulsoup libraries to do the scraping, so we need to install them.
- requests: This module provides methods to make HTTP requests (GET, POST, PUT, PATCH, or HEAD). We need it to make GET requests to another website. We will install it using the command below:
pip install requests
- beautifulsoup: This library is used for parsing HTML and XML documents. We will install it using the command below (the PyPI package for the current version is beautifulsoup4):
pip install beautifulsoup4
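To confirm both libraries installed correctly, a quick optional sanity check like the following should print their version numbers:

import requests
import bs4

# Both imports succeeding means the installs worked
print(requests.__version__)
print(bs4.__version__)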
Making an HTTP Request to a URI
We will make an HTTP GET request to the given URI. The GET method sends the encoded information along with the page request.
# Import requests library
import requests

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/")
print(req.content)
When we make a request to a URI, it returns a response object. The response object has many attributes (status_code, url, content) that give details of the request and response.
Output:
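As a minimal illustration of those response attributes, using the same URL as above, we can inspect the response before working with its content:

import requests

req = requests.get("https://www.codewithlucky.com/")

# Basic details of the request/response cycle
print(req.status_code)                  # e.g. 200 when the request succeeds
print(req.url)                          # final URL after any redirects
print(req.headers.get('Content-Type'))  # e.g. text/html; charset=UTF-8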
Scraping Information Using the BeautifulSoup Library
We now have a response object after making the HTTP request to the URI. But the response data is still not useful, as it needs to be parsed to extract the useful parts.
So now we will parse that response data using the BeautifulSoup library. We will import BeautifulSoup and parse the response HTML with it.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())
Output:
We have parsed and pretty-printed the response HTML using prettify(), but it’s still not very useful, as it displays the entire response HTML.
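Before diving into classes and ids, a small example of pulling out just one element, say the page title, shows how parsing makes the response easier to work with (assuming the page has a title tag):

import requests
from bs4 import BeautifulSoup

req = requests.get("https://www.codewithlucky.com/")
soup = BeautifulSoup(req.content, 'html.parser')

# Grab a single element instead of printing the whole document
title = soup.find('title')
if title is not None:
    print(title.text)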
1. Extracting Information By Element Class
Now we want to extract some specific HTML from the website below. We will extract all paragraph text from a specific class on the page.
We can see in the page source that the paragraphs are under <div class="entry-content">, so we will find all p tags present in that div. We will use the find() function to get the object for that specific class, then use the find_all() function to get all p tags from that object.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Find the div with class entry-content, then print every p tag inside it
entryContent = soup.find('div', class_='entry-content')
for paragraph in entryContent.find_all('p'):
    print(paragraph.text)
Output:
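One caveat worth noting: find() returns None when no matching element exists, so on a page without that div the loop above would raise an AttributeError. A defensive variant guards against that:

entryContent = soup.find('div', class_='entry-content')
if entryContent is None:
    print("No div with class 'entry-content' found on this page")
else:
    for paragraph in entryContent.find_all('p'):
        print(paragraph.text)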
2. Extracting Information By Element Id
Now we will extract all top menu text by element id. We have the following HTML source.
We will find the div object by id, then find the ul element within that object. Then we will find all li elements within that ul and get their text.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Find the wrapper div by id, then the nav list inside it
wrapper = soup.find('div', id='wrapper')
navBar = wrapper.find('ul', class_='navbar-nav')
for item in navBar.find_all('li'):  # 'item' avoids shadowing the built-in list
    print(item.text)
Output:
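As an aside, BeautifulSoup’s select() method accepts CSS selectors, so the same id-then-class lookup can be written in one line. This is an equivalent sketch reusing the soup object from the example above, not what produced the output shown:

# CSS selector equivalent of the chained find() calls above
for item in soup.select('#wrapper ul.navbar-nav li'):
    print(item.text)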
3. Extracting Links
Now we will extract all link data from a particular div.
We will find the div object with class entry-content, then find all anchor (a) tags and loop through them to get each anchor’s href and text.
import requests
from bs4 import BeautifulSoup

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Loop over every anchor tag inside the entry-content div
entryContent = soup.find('div', class_='entry-content')
for link in entryContent.find_all('a'):
    print(link.text)
    print(link.get('href'))
Output:
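Note that href values can be relative (e.g. /about/ rather than a full URL). If you need absolute URLs, Python’s standard urllib.parse.urljoin can resolve them against the page URL, as in this sketch built on the entryContent object from above:

from urllib.parse import urljoin

baseUrl = "https://www.codewithlucky.com/"
for link in entryContent.find_all('a'):
    href = link.get('href')
    if href:
        # Resolve relative links against the page URL
        print(urljoin(baseUrl, href))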
4. Saving Scraped Data to CSV
Now we will save the scraped data to a CSV file. Here we will extract the anchor details and save them into a CSV file.
We will import the csv library, get all the link data and append it to a list, then save the list data to a CSV file.
import requests
from bs4 import BeautifulSoup
import csv

# Passing headers in case access is blocked by mod_security
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Making an HTTP GET request
req = requests.get("https://www.codewithlucky.com/", headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Collect the text and URL of every anchor into a list of dicts
anchorsList = []
entryContent = soup.find('div', class_='entry-content')
for link in entryContent.find_all('a'):
    anchor = {}
    anchor['Link text'] = link.text
    anchor['Link url'] = link.get('href')
    anchorsList.append(anchor)

# Write the collected rows to a CSV file with a header row
fileName = 'links.csv'
with open(fileName, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Link text', 'Link url'])
    w.writeheader()
    w.writerows(anchorsList)
Output:
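To verify the file was written as expected, we can read links.csv back with csv.DictReader; this is just a sanity check on our own output:

import csv

# Read the CSV back and print each row
with open('links.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['Link text'], '->', row['Link url'])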