Introduction
In the era of big data, web scraping has become an essential skill for extracting valuable information from online sources. This article demonstrates a Python-based approach to web scraping using BeautifulSoup, data storage using SQLite, and data analysis using Pandas. The example focuses on extracting the top 50 highly ranked films from a webpage archived by the Wayback Machine.
Prerequisites
To execute this script, ensure you have the following Python libraries installed:
- requests for fetching web pages
- BeautifulSoup from bs4 for parsing HTML content
- sqlite3 for database interactions
- pandas for data manipulation and storage
You can install the third-party libraries using the following command (sqlite3 ships with Python's standard library, so it needs no separate installation):
pip install requests beautifulsoup4 pandas
Code Breakdown
Step 1: Import Required Libraries
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
The necessary libraries are imported for handling HTTP requests, parsing HTML content, manipulating data, and storing it in a database.
Step 2: Define Key Variables
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'top_50'
csv_path = '/home/project/top_50_films.csv'
df = pd.DataFrame(columns=['Average Rank', 'Film', 'Year'])
count = 0
- The url variable stores the webpage containing the movie rankings.
- db_name and table_name define the SQLite database and table names.
- csv_path specifies the path where the data will be saved as a CSV file.
- An empty Pandas DataFrame df is initialized to store the extracted data.
- The count variable keeps track of the number of films extracted.
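Note that csv_path assumes a /home/project directory, which is common in hosted lab environments. If that folder does not exist on your machine, pointing the path at the current working directory works just as well, as in this minimal sketch:

from pathlib import Path

csv_path = Path.cwd() / 'top_50_films.csv'  # save the CSV next to the script instead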
Step 3: Fetch and Parse Web Page
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')
- The requests.get() method retrieves the HTML content of the webpage.
- BeautifulSoup parses the HTML structure to enable data extraction.
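The one-liner above assumes the request succeeds. A slightly more defensive variant (an optional addition, not part of the original script) adds a timeout and surfaces HTTP errors explicitly:

response = requests.get(url, timeout=30)  # avoid hanging indefinitely on a slow server
response.raise_for_status()               # raise an exception on a 4xx/5xx response
html_page = response.text
data = BeautifulSoup(html_page, 'html.parser')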
Step 4: Extract Movie Rankings from Table
# get the tables
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')
- The script locates all <tbody> elements, which contain tabular data.
- The first table is accessed using tables[0], and its rows are stored in rows.
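Before writing the extraction loop, it can help to confirm the column layout. A quick, illustrative check prints the cells of the first row that actually contains data:

for row in rows:
    cells = row.find_all('td')
    if cells:  # skip header rows, which use <th> instead of <td>
        print([cell.get_text(strip=True) for cell in cells])
        break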
Step 5: Process and Store Data in DataFrame
for row in rows:
    if count < 50:
        col = row.find_all('td')
        if len(col) != 0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df, df1], ignore_index=True)
            count += 1
    else:
        break
- Iterates over each row in the table to extract movie data.
- The script checks if the count is below 50 to limit data extraction.
- Extracted data (Rank, Film Name, and Year) is stored in a dictionary and added to the DataFrame.
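Calling pd.concat() once per row works, but it rebuilds the DataFrame on every iteration. An equivalent and more idiomatic sketch collects the rows in a plain list first and builds the DataFrame once at the end:

records = []
for row in rows:
    col = row.find_all('td')
    if len(col) != 0:
        records.append({"Average Rank": col[0].get_text(strip=True),
                        "Film": col[1].get_text(strip=True),
                        "Year": col[2].get_text(strip=True)})
    if len(records) == 50:  # stop once the top 50 films are collected
        break

df = pd.DataFrame(records)

Here get_text(strip=True) is used in place of contents[0]; it is more robust when a cell wraps its text in a link or other tag, since it returns the cell's plain text rather than its first child node.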
Step 6: Save Data to CSV
df.to_csv(csv_path)
- Saves the extracted movie data as a CSV file at the specified path.
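By default, to_csv() also writes the DataFrame's integer index as an extra first column. Pass index=False if you want only the three data columns in the file:

df.to_csv(csv_path, index=False)  # omit the DataFrame index from the CSV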
Step 7: Store Data in SQLite Database
conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists='replace', index=False)
conn.close()
- Establishes a connection to the SQLite database.
- Uses to_sql() to store the DataFrame in the database table.
- Closes the connection to free up resources.
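To verify that the table was written, you can reopen the connection and read the data back with pd.read_sql_query() (a quick sanity check, not part of the original script):

conn = sqlite3.connect(db_name)
check = pd.read_sql_query(f"SELECT * FROM {table_name} LIMIT 5", conn)  # preview the first five rows
print(check)
conn.close()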
Conclusion
This script effectively scrapes movie rankings from an archived webpage, stores them in a Pandas DataFrame, saves them as a CSV file, and inserts them into an SQLite database. This workflow is useful for data collection, analysis, and automation in various research and analytics applications.