Introduction
In the era of big data, web scraping has become an essential skill for extracting valuable information from online sources. This article demonstrates a Python-based approach to web scraping using BeautifulSoup, data storage using SQLite, and data analysis using Pandas. The example focuses on extracting the top 50 highly ranked films from a webpage archived by the Wayback Machine.
Prerequisites
To execute this script, ensure you have the following Python libraries installed:
- requests for fetching web pages
- BeautifulSoup from bs4 for parsing HTML content
- sqlite3 for database interactions
- pandas for data manipulation and storage
You can install the third-party libraries using the following command (sqlite3 ships with Python's standard library, so it needs no separate installation):
pip install requests beautifulsoup4 pandas
Code Breakdown
Step 1: Import Required Libraries
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
The necessary libraries are imported for handling HTTP requests, parsing HTML content, manipulating data, and storing it in a database.
Step 2: Define Key Variables
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'top_50'
csv_path = '/home/project/top_50_films.csv'
df = pd.DataFrame(columns=['Average Rank', 'Film', 'Year'])
count = 0
- The url variable stores the webpage containing the movie rankings.
- db_name and table_name define the SQLite database and table names.
- csv_path specifies the path where the data will be saved as a CSV file.
- An empty Pandas DataFrame df is initialized to store the extracted data.
- The count variable keeps track of the number of films extracted.
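Note that csv_path assumes a /home/project directory, which is common in hosted lab environments. If that folder does not exist on your machine, pointing the path at the current working directory works just as well, as in this minimal sketch:

from pathlib import Path

csv_path = Path.cwd() / 'top_50_films.csv'  # save the CSV next to the script instead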
Step 3: Fetch and Parse Web Page
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')
- The requests.get() method retrieves the HTML content of the webpage.
- BeautifulSoup parses the HTML structure to enable data extraction.
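The one-liner above assumes the request succeeds. A slightly more defensive variant (an optional addition, not part of the original script) adds a timeout and surfaces HTTP errors explicitly:

response = requests.get(url, timeout=30)  # avoid hanging indefinitely on a slow server
response.raise_for_status()               # raise an exception on a 4xx/5xx response
html_page = response.text
data = BeautifulSoup(html_page, 'html.parser')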
Step 4: Extract Movie Rankings from Table
# get the tables
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')
- The script locates all <tbody> elements, which contain tabular data.
- The first table is accessed using tables[0], and its rows are stored in rows.
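Before writing the extraction loop, it can help to confirm the column layout. A quick, illustrative check prints the cells of the first row that actually contains data:

for row in rows:
    cells = row.find_all('td')
    if cells:  # skip header rows, which use <th> instead of <td>
        print([cell.get_text(strip=True) for cell in cells])
        break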
Step 5: Process and Store Data in DataFrame
for row in rows:
    if count < 50:
        col = row.find_all('td')
        if len(col) != 0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df, df1], ignore_index=True)
            count += 1
    else:
        break
- Iterates over each row in the table to extract movie data.
- The script checks if the count is below 50 to limit data extraction.
- Extracted data (Rank, Film Name, and Year) is stored in a dictionary and added to the DataFrame.
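Calling pd.concat() once per row works, but it rebuilds the DataFrame on every iteration. An equivalent and more idiomatic sketch collects the rows in a plain list first and builds the DataFrame once at the end:

records = []
for row in rows:
    col = row.find_all('td')
    if len(col) != 0:
        records.append({"Average Rank": col[0].get_text(strip=True),
                        "Film": col[1].get_text(strip=True),
                        "Year": col[2].get_text(strip=True)})
    if len(records) == 50:  # stop once the top 50 films are collected
        break

df = pd.DataFrame(records)

Here get_text(strip=True) is used in place of contents[0]; it is more robust when a cell wraps its text in a link or other tag, since it returns the cell's plain text rather than its first child node.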
Step 6: Save Data to CSV
df.to_csv(csv_path)
- Saves the extracted movie data as a CSV file at the specified path.
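By default, to_csv() also writes the DataFrame's integer index as an extra first column. Pass index=False if you want only the three data columns in the file:

df.to_csv(csv_path, index=False)  # omit the DataFrame index from the CSV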
Step 7: Store Data in SQLite Database
conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists='replace', index=False)
conn.close()
- Establishes a connection to the SQLite database.
- Uses to_sql() to store the DataFrame in the database table.
- Closes the connection to free up resources.
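To verify that the table was written, you can reopen the connection and read the data back with pd.read_sql_query() (a quick sanity check, not part of the original script):

conn = sqlite3.connect(db_name)
check = pd.read_sql_query(f"SELECT * FROM {table_name} LIMIT 5", conn)  # preview the first five rows
print(check)
conn.close()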
Conclusion
This script effectively scrapes movie rankings from an archived webpage, stores them in a Pandas DataFrame, saves them as a CSV file, and inserts them into an SQLite database. This workflow is useful for data collection, analysis, and automation in various research and analytics applications.