Web scraping has become an indispensable tool for gathering data from across the internet, empowering data analysts, tech enthusiasts, and businesses to make informed decisions. But extracting data is just the first step. To unlock its full potential, you need to export it efficiently into the right format—whether that's a CSV file for spreadsheets, JSON for APIs, or databases for large-scale storage and analysis.
This blog will take you through the essentials of exporting web-scraped data. You’ll learn step-by-step how to work with CSV and JSON files, integrate web-scraped data with databases, and make the most of your data management practices.
Before diving into the script, let’s understand the dataset and workflow that we’ll use to demonstrate the data-saving process.
We’ll be scraping data from the website Books to Scrape, which provides a list of books along with their titles, prices, and availability.
This website is designed for practice purposes, making it an ideal choice for showcasing web scraping techniques.
Here’s the process we’ll follow:
1. Use the requests and BeautifulSoup libraries to extract the book details from the website.
2. Store the extracted details in a Pandas DataFrame.
3. Export the DataFrame to a CSV file, a JSON file, and an SQLite database.

To run the script, you’ll need the following Python libraries:

- requests
- beautifulsoup4
- pandas

Install these libraries using pip. Run the following command in your terminal:

pip install requests beautifulsoup4 pandas
Here’s the Python script to scrape the data from the website and store it in a Pandas DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scrape data from the website
def scrape_books():
    url = "https://books.toscrape.com/"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Failed to load page")

    soup = BeautifulSoup(response.content, "html.parser")
    books = []

    # Extract book data
    for article in soup.find_all("article", class_="product_pod"):
        title = article.h3.a["title"]
        price = article.find("p", class_="price_color").text.strip()
        availability = article.find("p", class_="instock availability").text.strip()
        books.append({"Title": title, "Price": price, "Availability": availability})

    # Convert to DataFrame
    books_df = pd.DataFrame(books)
    return books_df

# Main execution
if __name__ == "__main__":
    print("Scraping data...")
    books_df = scrape_books()
    print("Data scraped successfully!")
    print(books_df)
The table we will use to demonstrate the data-saving process is structured as follows:
| Title | Price | Availability |
| --- | --- | --- |
| A Light in the Attic | £51.77 | In stock |
| Tipping the Velvet | £53.74 | In stock |
| Soumission | £50.10 | In stock |
| Sharp Objects | £47.82 | In stock |
| Sapiens: A Brief History of Humankind | £54.23 | NA |
| The Requiem Red | £22.65 | In stock |
| ... | ... | ... |
Use the `to_csv` method from Pandas:
def save_to_csv(dataframe, filename="books.csv"):
    dataframe.to_csv(filename, index=False)
    print(f"Data saved to {filename}")
Code Explanation:

- `filename`: Specifies the name of the output file.
- `index=False`: Ensures the index column is not included in the CSV file.
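For a quick end-to-end check, you can call `save_to_csv` on the scraped DataFrame and read the file back with `pd.read_csv`. This is just an illustrative sketch that assumes `scrape_books` and `save_to_csv` are defined as shown above:

import pandas as pd

books_df = scrape_books()           # DataFrame produced by the scraper above
save_to_csv(books_df, "books.csv")  # writes books.csv without the index column

# Read the file back to confirm the export worked
reloaded = pd.read_csv("books.csv")
print(reloaded.head())
print(list(reloaded.columns))       # ['Title', 'Price', 'Availability']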
Use the `to_json` method from Pandas:
def save_to_json(dataframe, filename="books.json"):
    dataframe.to_json(filename, orient="records", indent=4)
    print(f"Data saved to {filename}")
Code Explanation:

- `orient="records"`: Each row in the DataFrame is converted into a JSON object.
- `indent=4`: Formats the JSON for better readability.
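To verify the JSON export, you can load the file back with Python’s built-in `json` module. This sketch assumes `books.json` was produced by `save_to_json` above:

import json

with open("books.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(len(records), "records loaded")
print(records[0])  # e.g. {'Title': 'A Light in the Attic', 'Price': '£51.77', 'Availability': 'In stock'}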
Use the `to_sql` method from Pandas with SQLite:
import sqlite3

def save_to_database(dataframe, database_name="books.db"):
    conn = sqlite3.connect(database_name)
    dataframe.to_sql("books", conn, if_exists="replace", index=False)
    conn.close()
    print(f"Data saved to {database_name} database")
Code Explanation:

- `sqlite3.connect(database_name)`: Connects to the SQLite database (creates it if it doesn’t exist).
- `to_sql("books", conn, if_exists="replace", index=False)`: Writes the DataFrame to a table named `books`, replacing the table if it already exists and omitting the DataFrame index.
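Once the data is in SQLite, you can query it back into a DataFrame with `pd.read_sql_query`. The snippet below is a minimal sketch that assumes `books.db` and the `books` table were created by `save_to_database` above:

import sqlite3
import pandas as pd

conn = sqlite3.connect("books.db")
# Filter to in-stock books directly in SQL and load the result as a DataFrame
in_stock = pd.read_sql_query(
    "SELECT Title, Price FROM books WHERE Availability = 'In stock'", conn
)
conn.close()

print(f"{len(in_stock)} books in stock")
print(in_stock.head())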
While formats like CSV or JSON work well for smaller projects, databases offer superior performance, query optimization, and data integrity when handling larger datasets. The seamless integration of Pandas with SQLite makes it simple to store, retrieve, and manipulate data efficiently. Whether you're building a data pipeline or a complete application, understanding how to leverage databases will greatly enhance your ability to work with data effectively. Start using these tools today to streamline your data workflows and unlock new possibilities!