Python Stage 2 - Introduction to Web Scraping

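🎯 This Week's Review

Network requests and web page structure basics
Introduction to HTML parsing (with BeautifulSoup)
Multi-page scraping and pagination logic
Simulated login and Session persistence
Parsing web pages with XPath (lxml + XPath)
Anti-scraping countermeasures & simulating requests with Headers / Cookies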

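🧪 Capstone Challenge

🎯 Project name: Multi-Page Quote Scraper, Plus Edition

🛠️ Goal

Using what you learned this week, build a complete scraper with the following features:

📌 Functional requirements:

Automatically paginate through quotes.toscrape.com
Support switching between BeautifulSoup and XPath parsing
Save the data as a JSON file
Log each run (time, page number, item count)
Log in, then scrape the pages available to a logged-in user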

📂 Recommended file layout

quotes_scraper/
├── main.py               # Entry point
├── bs4_parser.py         # Extraction with BeautifulSoup
├── xpath_parser.py       # Extraction with lxml + XPath
├── login_session.py      # Login module; returns the logged-in session
├── utils.py              # Shared helpers, e.g. saving JSON, printing logs
└── data/
    └── quotes.json       # Scraped results

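🚀 Bonus challenges (optional)

✨ Switch parsers with a command-line argument (e.g. --parser xpath)
✨ After scraping, show how often each author appears (using collections.Counter)
✨ Add a timestamp field to every record
✨ Automatically save a log of failed fetches (network errors, etc.)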
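
Two of the bonus items, the author statistics and the per-record timestamp, are small enough to sketch with the standard library alone. The helper names `add_timestamp` and `count_authors` are hypothetical, and the sample records are made up to match the shape of the scraper's output; treat this as one possible approach, not the reference solution.

```python
from collections import Counter
from datetime import datetime


def add_timestamp(quotes, now=None):
    """Return copies of the quote dicts, each with a 'timestamp' field."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    # Build new dicts so the caller's list is left untouched.
    return [{**q, "timestamp": stamp} for q in quotes]


def count_authors(quotes):
    """Count how many quotes each author contributed."""
    return Counter(q["author"] for q in quotes)


# Hypothetical sample data, shaped like the scraper's output.
sample_quotes = [
    {"text": "A", "author": "Albert Einstein", "tags": []},
    {"text": "B", "author": "Jane Austen", "tags": []},
    {"text": "C", "author": "Albert Einstein", "tags": []},
]

stamped = add_timestamp(sample_quotes)
for author, n in count_authors(stamped).most_common():
    print(f"{author}: {n}")
```

`Counter.most_common()` returns the authors sorted by frequency, so the same call can feed a short summary at the end of a run.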

🧪 Sample result (JSON fragment)

[
  {
    "text": "“The world as we have created it is a process of our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts"],
    "timestamp": "2025-06-16 21:30:00"
  },
  ...
]

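✅ Today's exercises

Finish the multi-page scraper (either parsing method)
Write reusable functions and modules
Log in successfully (CSRF + Cookie)
Save the JSON and print the total number of items scraped
[Optional] Flesh out command-line support & error-log output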
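
For the optional error-log task, one possible shape is a small retry wrapper that logs every failed attempt. This is a sketch under assumptions: `fetch_with_retry` is a hypothetical helper, `fetch` stands for any callable that takes a URL (for example a session's `get` method), and the broad `except Exception` is there only to keep the example free of third-party imports.

```python
import logging
import time

logger = logging.getLogger("quotes_scraper")


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, logging each failure.

    Returns the result of the first successful call, or None if every
    attempt raised an exception.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to network errors in real code
            logger.warning("attempt %d/%d failed for %s: %s",
                           attempt, retries, url, exc)
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    logger.error("giving up on %s after %d attempts", url, retries)
    return None
```

If the `logging` module is configured with a `FileHandler`, these warnings and errors end up in a log file without further changes to the wrapper.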

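Below is the complete code for Week 4, Day 7 · Project Practice: a structured, extensible multi-page quote scraper that supports:

Multi-page scraping
Switching between BeautifulSoup and XPath
Saving data as JSON
Logging
A simple simulated login (session)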

🗂 Suggested project layout (quotes_scraper)

quotes_scraper/
├── main.py               # Entry point
├── bs4_parser.py         # BeautifulSoup parser
├── xpath_parser.py       # XPath parser
├── login_session.py      # Simulated login (requests.Session)
├── utils.py              # Helpers (saving, printing, logging)
├── config.py             # Settings (start page, save path, etc.)
└── data/
    └── quotes.json       # Saved results

✅ 1. `config.py` (configuration)

# config.py
BASE_URL = "https://quotes.toscrape.com"
START_PAGE = "/page/1/"
USE_XPATH = False  # True uses lxml; otherwise BS4
SAVE_PATH = "data/quotes.json"

✅ 2. `utils.py`

# utils.py
import json
import os
import logging
from datetime import datetime


def save_to_json(data, filepath):
    # Create the parent directory only if the path actually has one;
    # os.makedirs("") would raise for a bare filename.
    parent = os.path.dirname(filepath)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print(f"✅ Data saved to {filepath}")


def log(message):
    # Print a message prefixed with the current time.
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {message}")


def setup_logging(debug=False):
    # Send log records to both logs/error.log and the console.
    os.makedirs("logs", exist_ok=True)
    level = logging.DEBUG if debug else logging.WARNING
    logging.basicConfig(
        level=level,
        format='%(asctime)s [%(levelname)s] %(message)s',
        handlers=[
            logging.FileHandler("logs/error.log", encoding='utf-8'),
            logging.StreamHandler(),
        ],
    )

✅ 3. `bs4_parser.py`

# bs4_parser.py
from bs4 import BeautifulSoup


def parse_quotes_bs4(html):
    # Extract every quote on the page using CSS selectors.
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for q in soup.select(".quote"):
        quotes.append({
            "text": q.select_one(".text").text.strip(),
            "author": q.select_one(".author").text.strip(),
            "tags": [tag.text for tag in q.select(".tags .tag")],
        })
    return quotes

✅ 4. `xpath_parser.py`

# xpath_parser.py
from lxml import etree


def parse_quotes_xpath(html):
    # Extract every quote on the page using XPath expressions.
    tree = etree.HTML(html)
    quotes = []
    quote_elements = tree.xpath('//div[@class="quote"]')
    for q in quote_elements:
        quotes.append({
            "text": q.xpath('.//span[@class="text"]/text()')[0],
            "author": q.xpath('.//small[@class="author"]/text()')[0],
            "tags": q.xpath('.//div[@class="tags"]/a[@class="tag"]/text()'),
        })
    return quotes

✅ 5. `login_session.py`

# login_session.py
import requests


def create_session():
    session = requests.Session()
    login_url = "https://quotes.toscrape.com/login"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": login_url,
    }
    # Fetch the login page first to obtain the csrf_token
    resp = session.get(login_url, headers=headers)
    token = resp.text.split('name="csrf_token" value="')[1].split('"')[0]
    login_data = {
        "csrf_token": token,
        "username": "test",
        "password": "test",
    }
    # The session keeps the cookies set by the login response
    session.post(login_url, data=login_data, headers=headers)
    return session
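
The string-splitting step in `create_session` raises an `IndexError` with no explanation if the login page ever lacks the token. A slightly more defensive variant could look like the sketch below. Both helper names are hypothetical, the regular expression assumes `name` appears before `value` (as the split-based code does), and the `/logout` check is an assumed heuristic for recognizing a logged-in page, not something the site documents.

```python
import re

# Assumes the hidden input has name="csrf_token" followed by value="...".
CSRF_RE = re.compile(r'name="csrf_token"\s+value="([^"]+)"')


def extract_csrf_token(html):
    """Return the csrf_token value, or raise ValueError if it is missing."""
    match = CSRF_RE.search(html)
    if match is None:
        raise ValueError("csrf_token not found in login page")
    return match.group(1)


def looks_logged_in(html):
    """Heuristic (assumption): a logged-in page contains a /logout link."""
    return 'href="/logout"' in html
```

Inside `create_session`, `extract_csrf_token(resp.text)` could replace the `split` chain, and `looks_logged_in` could be applied to the text of the response returned by `session.post` to confirm that the login actually worked.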

✅ 6. `main.py` (entry point)

# main.py
import argparse
import logging
import time
from login_session import create_session
from bs4_parser import parse_quotes_bs4
from xpath_parser import parse_quotes_xpath
from utils import save_to_json, log, setup_logging
import config


def parse_args():
    parser = argparse.ArgumentParser(description="Quotes Scraper - supports BS4 / XPath")
    parser.add_argument('--use-xpath', action='store_true', help="Use the XPath parser")
    parser.add_argument('--output', type=str, default="data/quotes.json", help="Output file path")
    parser.add_argument('--debug', action='store_true', help="Enable debug mode (more log output)")
    return parser.parse_args()


def scrape(use_xpath=False, save_path="data/quotes.json"):
    log("Starting scrape...")
    session = create_session()
    url = config.BASE_URL + config.START_PAGE
    all_quotes = []
    page = 1
    while True:
        try:
            log(f"Requesting page {page}: {url}")
            resp = session.get(url, headers={
                "User-Agent": "Mozilla/5.0",
                "Referer": config.BASE_URL,
            })
            if resp.status_code != 200:
                logging.error(f"Request failed: {resp.status_code}")
                break
            html = resp.text
            quotes = parse_quotes_xpath(html) if use_xpath else parse_quotes_bs4(html)
            if not quotes:
                log("No content found; stopping.")
                break
            all_quotes.extend(quotes)
            log(f"Page {page}: scraped {len(quotes)} quotes")
            # Check whether the page links to a next page
            if 'href="/page/{}/"'.format(page + 1) in html:
                page += 1
                url = f"{config.BASE_URL}/page/{page}/"
                time.sleep(1)  # be polite: pause between requests
            else:
                break
        except Exception as e:
            logging.exception(f"Error while scraping page {page}: {e}")
            break
    log(f"Scraped {len(all_quotes)} quotes in total")
    save_to_json(all_quotes, save_path)


if __name__ == "__main__":
    args = parse_args()
    setup_logging(debug=args.debug)
    scrape(use_xpath=args.use_xpath, save_path=args.output)
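
The pagination check in `scrape` works by guessing the next page's URL and searching the HTML for it. An alternative is to read the link from the pager itself. The sketch below assumes the "next" button is rendered as `<li class="next">` containing an `<a href="...">`; `find_next_url` is a hypothetical helper, and the regular expression is a stand-in for a proper selector such as `li.next a`.

```python
import re
from urllib.parse import urljoin

# Assumed markup: <li class="next"> followed by <a href="/page/N/">.
NEXT_RE = re.compile(r'<li class="next">\s*<a href="([^"]+)"')


def find_next_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    match = NEXT_RE.search(html)
    if match is None:
        return None
    # urljoin resolves relative links such as "/page/2/" against the current URL.
    return urljoin(current_url, match.group(1))
```

With a helper like this, the loop can continue while `find_next_url` returns a URL and stop when it returns `None`, without tracking the page number to build URLs.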

✅ How to run:

# Default: BeautifulSoup, saved to data/quotes.json
python main.py
# XPath with a custom output path
python main.py --use-xpath --output data/xpath_output.json
# Debug mode (prints detailed errors)
python main.py --debug
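
The bonus list suggests a `--parser xpath` style switch, whereas `main.py` uses a boolean `--use-xpath` flag. A minimal sketch of the `--parser` variant follows. The function name `build_arg_parser` and the choice names `bs4` and `xpath` are assumptions for illustration.

```python
import argparse


def build_arg_parser():
    """Build a CLI parser with a --parser option restricted to known choices."""
    parser = argparse.ArgumentParser(description="Quotes Scraper")
    parser.add_argument(
        "--parser",
        choices=["bs4", "xpath"],  # argparse rejects any other value
        default="bs4",
        help="Which HTML parser to use",
    )
    parser.add_argument("--output", default="data/quotes.json",
                        help="Output file path")
    return parser


# Parse an explicit argument list so the example runs without a real CLI.
args = build_arg_parser().parse_args(["--parser", "xpath"])
use_xpath = args.parser == "xpath"
print(use_xpath)
```

The resulting `use_xpath` value can be passed to `scrape` unchanged, so the rest of the program does not need to know which flag style was used.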

🎯 Sample runs (terminal output)

1. Running `python main.py`

python main.py

Output:

[2025-06-17 17:39:29] Starting scrape...
[2025-06-17 17:39:31] Now requesting page 1: https://quotes.toscrape.com/page/1/
[2025-06-17 17:39:31] Got 10 quotes on this page.
[2025-06-17 17:39:32] Now requesting page 2: https://quotes.toscrape.com/page/2/
[2025-06-17 17:39:33] Got 10 quotes on this page.
[2025-06-17 17:39:34] Now requesting page 3: https://quotes.toscrape.com/page/3/
[2025-06-17 17:39:34] Got 10 quotes on this page.
[2025-06-17 17:39:35] Now requesting page 4: https://quotes.toscrape.com/page/4/
[2025-06-17 17:39:35] Got 10 quotes on this page.
[2025-06-17 17:39:36] Now requesting page 5: https://quotes.toscrape.com/page/5/
[2025-06-17 17:39:37] Got 10 quotes on this page.
[2025-06-17 17:39:38] Now requesting page 6: https://quotes.toscrape.com/page/6/
[2025-06-17 17:39:38] Got 10 quotes on this page.
[2025-06-17 17:39:39] Now requesting page 7: https://quotes.toscrape.com/page/7/
[2025-06-17 17:39:39] Got 10 quotes on this page.
[2025-06-17 17:39:40] Now requesting page 8: https://quotes.toscrape.com/page/8/
[2025-06-17 17:39:41] Got 10 quotes on this page.
[2025-06-17 17:39:42] Now requesting page 9: https://quotes.toscrape.com/page/9/
[2025-06-17 17:39:42] Got 10 quotes on this page.
[2025-06-17 17:39:43] Now requesting page 10: https://quotes.toscrape.com/page/10/
[2025-06-17 17:39:43] Got 10 quotes on this page.
[2025-06-17 17:39:44] Now requesting page 11: https://quotes.toscrape.com/page/11/
[2025-06-17 17:39:45] No more data found; stopping the scrape.
[2025-06-17 17:39:45] Scraped 100 quotes in total
✅ Data saved to data/quotes.json

2. Running `python3 main.py --use-xpath --output data/xpath_output.json`

python3 main.py --use-xpath --output data/xpath_output.json

Output:

[2025-06-17 17:50:38] Starting scrape...
[2025-06-17 17:50:39] Requesting page 1: https://quotes.toscrape.com/page/1/
[2025-06-17 17:50:40] Page 1: scraped 10 quotes
[2025-06-17 17:50:41] Requesting page 2: https://quotes.toscrape.com/page/2/
[2025-06-17 17:50:41] Page 2: scraped 10 quotes
[2025-06-17 17:50:42] Requesting page 3: https://quotes.toscrape.com/page/3/
[2025-06-17 17:50:42] Page 3: scraped 10 quotes
[2025-06-17 17:50:43] Requesting page 4: https://quotes.toscrape.com/page/4/
[2025-06-17 17:50:43] Page 4: scraped 10 quotes
[2025-06-17 17:50:44] Requesting page 5: https://quotes.toscrape.com/page/5/
[2025-06-17 17:50:45] Page 5: scraped 10 quotes
[2025-06-17 17:50:46] Requesting page 6: https://quotes.toscrape.com/page/6/
[2025-06-17 17:50:46] Page 6: scraped 10 quotes
[2025-06-17 17:50:47] Requesting page 7: https://quotes.toscrape.com/page/7/
[2025-06-17 17:50:47] Page 7: scraped 10 quotes
[2025-06-17 17:50:48] Requesting page 8: https://quotes.toscrape.com/page/8/
[2025-06-17 17:50:49] Page 8: scraped 10 quotes
[2025-06-17 17:50:50] Requesting page 9: https://quotes.toscrape.com/page/9/
[2025-06-17 17:50:50] Page 9: scraped 10 quotes
[2025-06-17 17:50:51] Requesting page 10: https://quotes.toscrape.com/page/10/
[2025-06-17 17:50:51] Page 10: scraped 10 quotes
[2025-06-17 17:50:51] Scraped 100 quotes in total
✅ Data saved to data/xpath_output.json

3. Running `python3 main.py --debug`

python3 main.py --debug

Output:

[2025-06-17 17:50:09] Starting scrape...
2025-06-17 17:50:09,517 [DEBUG] Starting new HTTPS connection (1): quotes.toscrape.com:443
2025-06-17 17:50:10,583 [DEBUG] https://quotes.toscrape.com:443 "GET /login HTTP/1.1" 200 1880
2025-06-17 17:50:10,889 [DEBUG] https://quotes.toscrape.com:443 "POST /login HTTP/1.1" 302 189
2025-06-17 17:50:11,172 [DEBUG] https://quotes.toscrape.com:443 "GET / HTTP/1.1" 200 11928
[2025-06-17 17:50:11] Requesting page 1: https://quotes.toscrape.com/page/1/
2025-06-17 17:50:11,461 [DEBUG] https://quotes.toscrape.com:443 "GET /page/1/ HTTP/1.1" 200 11928
[2025-06-17 17:50:11] Page 1: scraped 10 quotes
[2025-06-17 17:50:12] Requesting page 2: https://quotes.toscrape.com/page/2/
2025-06-17 17:50:12,756 [DEBUG] https://quotes.toscrape.com:443 "GET /page/2/ HTTP/1.1" 200 14597
[2025-06-17 17:50:12] Page 2: scraped 10 quotes
[2025-06-17 17:50:13] Requesting page 3: https://quotes.toscrape.com/page/3/
2025-06-17 17:50:14,122 [DEBUG] https://quotes.toscrape.com:443 "GET /page/3/ HTTP/1.1" 200 10888
[2025-06-17 17:50:14] Page 3: scraped 10 quotes
[2025-06-17 17:50:15] Requesting page 4: https://quotes.toscrape.com/page/4/
2025-06-17 17:50:15,419 [DEBUG] https://quotes.toscrape.com:443 "GET /page/4/ HTTP/1.1" 200 11188
[2025-06-17 17:50:15] Page 4: scraped 10 quotes
[2025-06-17 17:50:16] Requesting page 5: https://quotes.toscrape.com/page/5/
2025-06-17 17:50:16,717 [DEBUG] https://quotes.toscrape.com:443 "GET /page/5/ HTTP/1.1" 200 10891
[2025-06-17 17:50:16] Page 5: scraped 10 quotes
[2025-06-17 17:50:17] Requesting page 6: https://quotes.toscrape.com/page/6/
2025-06-17 17:50:18,017 [DEBUG] https://quotes.toscrape.com:443 "GET /page/6/ HTTP/1.1" 200 11299
[2025-06-17 17:50:18] Page 6: scraped 10 quotes
[2025-06-17 17:50:19] Requesting page 7: https://quotes.toscrape.com/page/7/
2025-06-17 17:50:19,300 [DEBUG] https://quotes.toscrape.com:443 "GET /page/7/ HTTP/1.1" 200 11591
[2025-06-17 17:50:19] Page 7: scraped 10 quotes
[2025-06-17 17:50:20] Requesting page 8: https://quotes.toscrape.com/page/8/
2025-06-17 17:50:20,598 [DEBUG] https://quotes.toscrape.com:443 "GET /page/8/ HTTP/1.1" 200 12243
[2025-06-17 17:50:20] Page 8: scraped 10 quotes
[2025-06-17 17:50:21] Requesting page 9: https://quotes.toscrape.com/page/9/
2025-06-17 17:50:21,969 [DEBUG] https://quotes.toscrape.com:443 "GET /page/9/ HTTP/1.1" 200 11862
[2025-06-17 17:50:21] Page 9: scraped 10 quotes
[2025-06-17 17:50:22] Requesting page 10: https://quotes.toscrape.com/page/10/
2025-06-17 17:50:23,270 [DEBUG] https://quotes.toscrape.com:443 "GET /page/10/ HTTP/1.1" 200 10795
[2025-06-17 17:50:23] Page 10: scraped 10 quotes
[2025-06-17 17:50:23] Scraped 100 quotes in total

✅ Sample of the saved data

[
  {
    "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
  },
  ...
]