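Selenium Basics
This tutorial, written from the perspective of an experienced web-scraping engineer, covers Selenium for both automated testing and web scraping.
It is based on Python 3.12, uses uv for dependency management, and uses a mock website built with FastAPI for hands-on practice.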
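Chapter 1: Environment Setup
1.1 Install Python 3.12
First make sure Python 3.12 is installed; it can be downloaded from the official Python website.
1.2 Install the uv package manager
uv is a fast Python package manager that can replace the traditional pip:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or install it with pip
pip install uv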
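1.3 Create the project and install dependencies
# Create the project directory
mkdir selenium-tutorial && cd selenium-tutorial
# Initialize the project
uv init -p 3.12
# Create the virtual environment
uv venv .venv
# Activate the virtual environment
# Windows
.venv\Scripts\activate
# macOS | Linux
source .venv/bin/activate
# Install the required dependencies
uv add selenium fastapi uvicorn jinja2 python-multipart webdriver-manager
Alternatively:
# Create the project directory and initialize the project with a pinned Python version
uv init selenium-tutorial -p 3.12 && cd selenium-tutorial
# Create the virtual environment
uv venv .venv
# Activate the virtual environment
# Windows
.venv\Scripts\activate
# macOS | Linux
source .venv/bin/activate
# Install the required dependencies
uv add selenium fastapi uvicorn jinja2 python-multipart webdriver-manager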
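1.4 Browser driver configuration
Selenium needs a driver program that matches each browser. We use webdriver-manager to manage drivers automatically:
- Chrome: downloads the matching version of chromedriver automatically
- Firefox: downloads geckodriver automatically
- Edge: downloads msedgedriver automatically
There is no need to download drivers by hand or configure their paths; webdriver-manager takes care of both.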
Chapter 2: Building the Mock Website with FastAPI
To practice safely and legally, we build a mock website to serve as the scraping target.
app.py
"""
code: app.py
"""
import random
import string
from datetime import datetime

from fastapi import FastAPI, Form, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

# Create the FastAPI application
app = FastAPI(title="Selenium 练习网站")

# Set the template directory
templates = Jinja2Templates(directory="templates")


# Generate a random token used to demonstrate token verification
def generate_token():
    return ''.join(random.choices(string.ascii_letters + string.digits, k=16))


# Home page
@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    # Generate the page token
    token = generate_token()
    return templates.TemplateResponse(
        request=request,
        name="index.html",
        context={
            "token": token,
            "products": [
                {"id": 1, "name": "笔记本电脑", "price": 5999, "category": "电子产品"},
                {"id": 2, "name": "机械键盘", "price": 399, "category": "电脑配件"},
                {"id": 3, "name": "无线鼠标", "price": 199, "category": "电脑配件"},
                {"id": 4, "name": "蓝牙耳机", "price": 799, "category": "音频设备"},
            ],
        },
    )


# Login page
@app.get("/login", response_class=HTMLResponse)
async def login_page(request: Request):
    token = generate_token()
    return templates.TemplateResponse(
        request=request, name="login.html", context={"token": token}
    )


# Handle the login request
@app.post("/login", response_class=HTMLResponse)
async def login(
    request: Request,
    username: str = Form(...),
    password: str = Form(...),
    token: str = Form(...),
):
    # Simple credential check; the token is required but not verified in this demo
    if username == "test" and password == "password123":
        return templates.TemplateResponse(
            request=request,
            name="dashboard.html",
            context={
                "message": "登录成功",
                "username": username,
                # dashboard.html shows this value as the login time
                "now": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            },
        )
    return templates.TemplateResponse(
        request=request,
        name="login.html",
        context={"error": "用户名或密码错误", "token": generate_token()},
    )


# Dynamic content page (used to demonstrate waits)
@app.get("/dynamic", response_class=HTMLResponse)
async def dynamic_content(request: Request):
    return templates.TemplateResponse(request=request, name="dynamic.html")


# Form page
@app.get("/form", response_class=HTMLResponse)
async def form_page(request: Request):
    token = generate_token()
    return templates.TemplateResponse(
        request=request, name="form.html", context={"token": token}
    )


# File upload page
@app.get("/upload", response_class=HTMLResponse)
async def upload_page(request: Request):
    return templates.TemplateResponse(request=request, name="upload.html")


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
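The login handler above requires a token field but never checks its value. As a rough illustration of what a real site might do with such a token, here is a minimal sketch of issuing and verifying one-time tokens. It assumes an in-memory store; `TokenStore`, `issue` and `consume` are hypothetical names and are not part of app.py.

```python
import secrets


class TokenStore:
    """In-memory store of one-time tokens (hypothetical helper, not part of app.py)."""

    def __init__(self) -> None:
        self._issued: set[str] = set()

    def issue(self) -> str:
        # secrets is intended for security-sensitive randomness, unlike random
        token = secrets.token_urlsafe(16)
        self._issued.add(token)
        return token

    def consume(self, token: str) -> bool:
        # A token is valid exactly once: it is removed on first use
        if token in self._issued:
            self._issued.remove(token)
            return True
        return False
```

A handler built this way would call `issue()` when rendering a form and `consume()` when the form is posted, rejecting the request if `consume()` returns False.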
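Create the template files
Create a templates directory and add the following HTML files. The visible text in these pages is kept in Chinese because the Selenium scripts later in this tutorial match on it.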
templates/index.html
<!DOCTYPE html>
<html>
<head>
  <title>Selenium练习网站</title>
  <style>
    body { font-family: Arial, sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; }
    .product { border: 1px solid #ddd; padding: 10px; margin: 10px; display: inline-block; width: 200px; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <input type="hidden" id="csrf_token" value="{{ token }}">
  <div class="nav">
    <a href="/" id="home-link">首页</a>
    <a href="/login" class="nav-link">登录</a>
    <a href="/dynamic" class="nav-link">动态内容</a>
    <a href="/form" class="nav-link">表单</a>
    <a href="/upload" class="nav-link">文件上传</a>
  </div>
  <h1>产品列表</h1>
  <div id="products">
    {% for product in products %}
    <div class="product" data-id="{{ product.id }}">
      <h3 class="product-name">{{ product.name }}</h3>
      <p class="product-price">价格: ¥{{ product.price }}</p>
      <p class="category">{{ product.category }}</p>
      <button class="add-to-cart" data-product="{{ product.id }}">加入购物车</button>
    </div>
    {% endfor %}
  </div>
</body>
</html>
templates/login.html
<!DOCTYPE html>
<html>
<head>
  <title>登录 - Selenium练习网站</title>
  <style>
    .container { max-width: 400px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .error { color: red; }
  </style>
</head>
<body>
  <div class="container">
    <h1>用户登录</h1>
    {% if error %}
    <p class="error">{{ error }}</p>
    {% endif %}
    <form method="post">
      <input type="hidden" name="token" value="{{ token }}">
      <div>
        <label for="username">用户名:</label>
        <input type="text" id="username" name="username" required>
      </div>
      <div>
        <label for="password">密码:</label>
        <input type="password" id="password" name="password" required>
      </div>
      <button type="submit" id="submit-btn">登录</button>
    </form>
  </div>
</body>
</html>
templates/dashboard.html
<!DOCTYPE html>
<html>
<head>
  <title>用户中心 - Selenium练习网站</title>
  <style>
    .container { max-width: 800px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .success { color: green; font-size: 1.2em; }
    .user-info { margin: 20px 0; padding: 15px; background-color: #f5f5f5; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首页</a>
    <a href="/login" class="nav-link">登录</a>
    <a href="/dynamic" class="nav-link">动态内容</a>
    <a href="/form" class="nav-link">表单</a>
    <a href="/upload" class="nav-link">文件上传</a>
  </div>
  <div class="container">
    <h1>用户中心</h1>
    <p class="success">{{ message }}</p>
    <div class="user-info">
      <p>用户名: {{ username }}</p>
      <p>登录时间: {{ now }}</p>
      <p>账户状态: 正常</p>
    </div>
    <h3>最近活动</h3>
    <ul>
      <li>浏览了产品列表</li>
      <li>查看了动态内容</li>
      <li>提交了测试表单</li>
    </ul>
  </div>
</body>
</html>
templates/dynamic.html
<!DOCTYPE html>
<html>
<head>
  <title>动态内容 - Selenium练习网站</title>
  <style>
    .container { max-width: 800px; margin: 50px auto; padding: 20px; }
    .dynamic-content { margin: 20px 0; padding: 15px; border: 1px solid #ccc; display: none; }
    .visible-after-delay { margin: 20px 0; padding: 15px; background-color: #e3f2fd; display: none; }
    #delayed-button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; display: none; }
    #status-message { margin-top: 20px; padding: 10px; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首页</a>
    <a href="/login" class="nav-link">登录</a>
    <a href="/dynamic" class="nav-link">动态内容</a>
    <a href="/form" class="nav-link">表单</a>
    <a href="/upload" class="nav-link">文件上传</a>
  </div>
  <div class="container">
    <h1>动态内容演示</h1>
    <p>本页面展示各种动态加载的内容,用于测试Selenium的等待机制。</p>
    <div id="dynamic-content" class="dynamic-content">这是延迟加载的动态内容,通常通过JavaScript在页面加载后一段时间显示。</div>
    <div class="visible-after-delay">这是另一个延迟显示的内容,使用了不同的延迟时间。</div>
    <button id="delayed-button">点击我</button>
    <div id="status-message"></div>
  </div>
  <script>
    // Simulate dynamic content loading
    setTimeout(() => {
      document.getElementById('dynamic-content').style.display = 'block';
    }, 2000); // shown after 2 seconds

    // Another element that appears after a delay
    setTimeout(() => {
      document.querySelector('.visible-after-delay').style.display = 'block';
    }, 4000); // shown after 4 seconds

    // Show the button after a delay and attach a click handler
    setTimeout(() => {
      const button = document.getElementById('delayed-button');
      button.style.display = 'inline-block';
      button.addEventListener('click', () => {
        document.getElementById('status-message').textContent = '按钮已点击,操作成功!';
        document.getElementById('status-message').style.backgroundColor = '#dff0d8';
      });
    }, 6000); // button shown after 6 seconds

    // Store a CSRF token in a JavaScript variable, for demonstration
    window.csrfToken = 'dynamic_' + Math.random().toString(36).substring(2);
  </script>
</body>
</html>
templates/form.html
<!DOCTYPE html>
<html>
<head>
  <title>表单示例 - Selenium练习网站</title>
  <style>
    .container { max-width: 600px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .form-group { margin-bottom: 15px; }
    label { display: block; margin-bottom: 5px; }
    input, select, textarea { width: 100%; padding: 8px; box-sizing: border-box; }
    button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; }
    .success { color: green; margin-top: 15px; padding: 10px; background-color: #dff0d8; display: none; }
    .error { color: red; margin-top: 15px; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首页</a>
    <a href="/login" class="nav-link">登录</a>
    <a href="/dynamic" class="nav-link">动态内容</a>
    <a href="/form" class="nav-link">表单</a>
    <a href="/upload" class="nav-link">文件上传</a>
  </div>
  <div class="container">
    <h1>用户信息表单</h1>
    <form id="user-form" method="post">
      <input type="hidden" name="token" value="{{ token }}">
      <div class="form-group">
        <label for="name">姓名:</label>
        <input type="text" id="name" name="name" required>
      </div>
      <div class="form-group">
        <label for="email">邮箱:</label>
        <input type="email" id="email" name="email" required>
      </div>
      <div class="form-group">
        <label for="age">年龄:</label>
        <input type="number" id="age" name="age" min="1" max="120">
      </div>
      <div class="form-group">
        <label for="gender">性别:</label>
        <select id="gender" name="gender">
          <option value="">请选择</option>
          <option value="male">男</option>
          <option value="female">女</option>
          <option value="other">其他</option>
        </select>
      </div>
      <div class="form-group">
        <label>兴趣爱好:</label>
        <div>
          <input type="checkbox" id="hobby1" name="hobbies" value="reading">
          <label for="hobby1">阅读</label>
          <input type="checkbox" id="hobby2" name="hobbies" value="sports">
          <label for="hobby2">运动</label>
          <input type="checkbox" id="hobby3" name="hobbies" value="music">
          <label for="hobby3">音乐</label>
        </div>
      </div>
      <div class="form-group">
        <label for="message">留言:</label>
        <textarea id="message" name="message" rows="4"></textarea>
      </div>
      <button type="submit" id="submit-form">提交</button>
    </form>
    <div id="form-success" class="success">表单提交成功!</div>
    {% if error %}
    <div class="error">{{ error }}</div>
    {% endif %}
  </div>
  <script>
    // Simple form validation
    document.getElementById('user-form').addEventListener('submit', function(e) {
      const name = document.getElementById('name').value;
      const email = document.getElementById('email').value;
      if (!name || !email) {
        alert('请填写姓名和邮箱');
        e.preventDefault();
        return false;
      }
      // Show the success message just before the form is actually submitted
      document.getElementById('form-success').style.display = 'block';
      return true;
    });
  </script>
</body>
</html>
templates/upload.html
<!DOCTYPE html>
<html>
<head>
  <title>文件上传 - Selenium练习网站</title>
  <style>
    .container { max-width: 600px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .form-group { margin-bottom: 15px; }
    label { display: block; margin-bottom: 5px; }
    input[type="file"] { margin: 10px 0; }
    button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; }
    .upload-area { border: 2px dashed #ccc; padding: 30px; text-align: center; margin-bottom: 20px; }
    .upload-area.dragover { border-color: #4CAF50; background-color: #f5f5f5; }
    .message { margin-top: 20px; padding: 10px; display: none; }
    .success { background-color: #dff0d8; color: #3c763d; }
    .error { background-color: #f2dede; color: #a94442; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首页</a>
    <a href="/login" class="nav-link">登录</a>
    <a href="/dynamic" class="nav-link">动态内容</a>
    <a href="/form" class="nav-link">表单</a>
    <a href="/upload" class="nav-link">文件上传</a>
  </div>
  <div class="container">
    <h1>文件上传演示</h1>
    <p>本页面用于测试文件上传功能,可以上传图片、文档等文件。</p>
    <form id="upload-form" method="post" enctype="multipart/form-data">
      <div class="form-group">
        <label for="file-title">文件标题:</label>
        <input type="text" id="file-title" name="title" required>
      </div>
      <div class="form-group">
        <label>选择文件:</label>
        <div class="upload-area" id="upload-area">
          点击或拖拽文件到这里上传
          <input type="file" id="file-upload" name="file" multiple style="display: none;">
        </div>
        <p id="file-name" style="margin-top: 10px;"></p>
      </div>
      <div class="form-group">
        <label for="file-description">文件描述:</label>
        <textarea id="file-description" name="description" rows="3"></textarea>
      </div>
      <button type="submit" id="upload-btn">上传文件</button>
    </form>
    <div id="success-message" class="message success">文件上传成功!</div>
    <div id="error-message" class="message error">文件上传失败,请重试。</div>
  </div>
  <script>
    // Drag-and-drop upload handling
    const uploadArea = document.getElementById('upload-area');
    const fileInput = document.getElementById('file-upload');
    const fileNameDisplay = document.getElementById('file-name');

    // Clicking the upload area opens the file chooser
    uploadArea.addEventListener('click', () => {
      fileInput.click();
    });

    // Show the names of the selected files
    fileInput.addEventListener('change', (e) => {
      if (e.target.files.length > 0) {
        const fileNames = Array.from(e.target.files).map(file => file.name).join(', ');
        fileNameDisplay.textContent = `已选择: ${fileNames}`;
      }
    });

    // Drag-related events
    uploadArea.addEventListener('dragover', (e) => {
      e.preventDefault();
      uploadArea.classList.add('dragover');
    });
    uploadArea.addEventListener('dragleave', () => {
      uploadArea.classList.remove('dragover');
    });
    uploadArea.addEventListener('drop', (e) => {
      e.preventDefault();
      uploadArea.classList.remove('dragover');
      if (e.dataTransfer.files.length > 0) {
        // Simulation only; a real project needs additional handling here
        const fileNames = Array.from(e.dataTransfer.files).map(file => file.name).join(', ');
        fileNameDisplay.textContent = `已选择: ${fileNames}`;
      }
    });

    // Form submission handling
    document.getElementById('upload-form').addEventListener('submit', function(e) {
      // Simple validation
      if (!fileInput.files.length) {
        alert('请选择要上传的文件');
        e.preventDefault();
        return false;
      }
      // Show the success message (a real application would leave this to the server)
      setTimeout(() => {
        document.getElementById('success-message').style.display = 'block';
        document.getElementById('error-message').style.display = 'none';
      }, 1000);
      return true;
    });
  </script>
</body>
</html>
Start the mock website:
uv run app.py
Visit http://localhost:8000 to see the mock website we just created.
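Scripts that start right after the server is launched can fail simply because the site is not listening yet. The sketch below polls the site until it responds, using only the standard library; `wait_for_server` is a hypothetical helper name, and the default URL assumes the port used by app.py.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:8000", timeout: float = 10.0,
                    interval: float = 0.5) -> bool:
    """Poll url until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` before creating the browser driver gives a clear True/False answer instead of a browser error page.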
Chapter 3: Selenium Basics
3.0 A Quick Warm-up
# -*- coding: utf-8 -*-
"""
02_start_browser.py
Purpose: start Chrome in three different ways
"""
from time import sleep

from selenium import webdriver  # main entry point
from selenium.webdriver.chrome.service import Service  # driver service
from selenium.webdriver.chrome.options import Options  # browser options
from webdriver_manager.chrome import ChromeDriverManager

# Method 1: simplest form (driver already on PATH)
driver1 = webdriver.Chrome()
driver1.maximize_window()  # maximize the browser window
driver1.get("https://www.baidu.com")
sleep(10)
driver1.quit()

# Method 2: specify the driver path explicitly
service = Service(executable_path=r"C:\Users\李昊哲\.wdm\drivers\chromedriver\win64\140.0.7339.82\chromedriver-win32\chromedriver.exe")
driver2 = webdriver.Chrome(service=service)
driver2.maximize_window()  # maximize the browser window
driver2.get("https://www.sogou.com")
sleep(10)
driver2.quit()

# Method 3: headless mode plus common options
options = Options()
options.add_argument("--headless")  # headless mode
options.add_argument("--window-size=1920,1080")  # window size; Chrome expects width,height
driver3 = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver3.get("https://cn.bing.com/")
print("Page title:", driver3.title)
driver3.quit()
3.1 A first Selenium script
first_script.py
"""
code: first_script.py
"""
# Import the required libraries
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def main():
    # Initialize the Chrome driver
    # webdriver_manager manages the driver automatically; no manual download needed
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # Open the home page of the mock website
        driver.get("http://localhost:8000")
        # Print the current page title
        print(f"Page title: {driver.title}")
        # Print the current page URL
        print(f"Current URL: {driver.current_url}")
        # Wait 3 seconds so the effect is visible
        time.sleep(3)

        # Refresh the page
        driver.refresh()
        print("Page refreshed")
        time.sleep(2)

        # Navigate to the login page
        driver.get("http://localhost:8000/login")
        print("Navigated to the login page")
        time.sleep(2)

        # Go back to the previous page
        driver.back()
        print("Went back to the home page")
        time.sleep(2)

        # Go forward to the next page
        driver.forward()
        print("Went forward to the login page")
        time.sleep(2)
    finally:
        # Close the browser
        driver.quit()
        print("Browser closed")


if __name__ == "__main__":
    main()
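3.2 Code walkthrough
- Imports:
  - webdriver: the core Selenium module, providing the driver interfaces for the various browsers
  - Service: manages the browser driver service
  - ChromeDriverManager: installs the Chrome driver automatically and matches its version to the browser
- Initializing the browser:
  - webdriver.Chrome(): creates a Chrome browser instance
  - Service and ChromeDriverManager together handle the driver automatically
- Basic operations:
  - get(url): opens the given URL
  - title: returns the page title
  - current_url: returns the URL of the current page
  - refresh(): reloads the page
  - back(): goes back to the previous page
  - forward(): goes forward to the next page
  - quit(): closes the browser and releases its resources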
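Since quit() has to run even when a script fails halfway, the try/finally pattern from first_script.py can be wrapped once and reused. The following is a minimal sketch of that idea; `browser_session` and `FakeDriver` are hypothetical names, and the fake stands in for a real browser only to show the control flow.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(create_driver):
    """Yield a driver from create_driver() and always call quit() afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs on normal exit and on exceptions alike


class FakeDriver:
    """Browser-free stand-in, used here only to show the control flow."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with browser_session(lambda: fake) as driver:
        raise RuntimeError("simulated failure in the middle of a script")
except RuntimeError:
    pass
print(fake.closed)  # True: quit() ran despite the exception
```

With a real browser, the factory would be something like `lambda: webdriver.Chrome(service=Service(ChromeDriverManager().install()))`.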
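Chapter 4: The 8 Element Locator Strategies
Selenium provides 8 element locator strategies, and mastering them is the foundation of any automation work. In Selenium 4 they are all used through find_element(By.<STRATEGY>, value) and find_elements(...); the older find_element_by_* helper methods were removed in Selenium 4.3.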
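A locator is simply a (strategy, value) pair, so it can be stored as data and unpacked into find_element. The sketch below lists the eight strategies with the string values that the `By` constants carry in Selenium 4 (for example `By.ID == "id"`); `LOGIN_LOCATORS` and `find` are hypothetical names chosen for illustration, and the element IDs are those of the mock login page.

```python
# String values of selenium.webdriver.common.by.By in Selenium 4
LOCATOR_STRATEGIES = {
    "ID": "id",
    "NAME": "name",
    "CLASS_NAME": "class name",
    "TAG_NAME": "tag name",
    "LINK_TEXT": "link text",
    "PARTIAL_LINK_TEXT": "partial link text",
    "XPATH": "xpath",
    "CSS_SELECTOR": "css selector",
}

# Locators for the mock login page, kept in one place as (strategy, value) tuples
LOGIN_LOCATORS = {
    "username": (LOCATOR_STRATEGIES["ID"], "username"),
    "password": (LOCATOR_STRATEGIES["NAME"], "password"),
    "submit": (LOCATOR_STRATEGIES["CSS_SELECTOR"], "#submit-btn"),
}


def find(driver, name):
    """Look up a named locator and pass it to driver.find_element."""
    strategy, value = LOGIN_LOCATORS[name]
    return driver.find_element(strategy, value)
```

In real scripts it is more readable to write `By.ID` than the raw string; the dictionary is shown only to make the eight strategies and their tuple form explicit.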
4.1 Locating by ID (By.ID)
locate_by_id.py
"""
code: locate_by_id.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    # Initialize the browser driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # Open the login page
        driver.get("http://localhost:8000/login")
        time.sleep(1)

        # 1. Locate the username input by ID
        # Idea: find the element whose id is "username"; this is the most direct and reliable strategy
        username_input = driver.find_element(By.ID, "username")
        # Act on the element: type the username
        username_input.send_keys("test_user")
        time.sleep(1)

        # 2. Locate the password input by ID
        password_input = driver.find_element(By.ID, "password")
        password_input.send_keys("test_password")
        time.sleep(1)

        # 3. Locate the submit button by ID
        submit_btn = driver.find_element(By.ID, "submit-btn")
        # Act on the element: click the button
        submit_btn.click()
        time.sleep(2)
    finally:
        # Close the browser
        driver.quit()


if __name__ == "__main__":
    main()
4.2 Locating by Name (By.NAME)
locate_by_name.py
"""
code: locate_by_name.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000/login")
        time.sleep(1)

        # Locate the username input by name
        # Idea: usable whenever the element has a name attribute; well suited to form fields
        username_input = driver.find_element(By.NAME, "username")
        username_input.send_keys("test")
        time.sleep(1)

        # Locate the password input by name
        password_input = driver.find_element(By.NAME, "password")
        password_input.send_keys("password123")
        time.sleep(1)

        # Locate the token field (a hidden field) by name
        # This is useful when dealing with CSRF verification
        token_input = driver.find_element(By.NAME, "token")
        print(f"Token value read from the page: {token_input.get_attribute('value')}")

        # Submit the form
        driver.find_element(By.ID, "submit-btn").click()
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.3 Locating by Class Name (By.CLASS_NAME)
locate_by_class.py
"""
code: locate_by_class.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate a single element by class name
        # Idea: find a navigation link, whose class is "nav-link"
        first_link = driver.find_element(By.CLASS_NAME, "nav-link")
        print(f"Text of the first navigation link: {first_link.text}")
        first_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)

        # Locate multiple elements by class name
        # Idea: collect all product items, whose class is "product"
        products = driver.find_elements(By.CLASS_NAME, "product")
        print(f"Found {len(products)} products")

        # Iterate over the products and print each name
        for product in products:
            # Search for the product name inside each product element
            name = product.find_element(By.CLASS_NAME, "product-name")
            print(f"Product name: {name.text}")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.4 Locating by Tag Name (By.TAG_NAME)
locate_by_tag.py
"""
code: locate_by_tag.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate elements by tag name
        # Idea: collect every <a> link tag
        links = driver.find_elements(By.TAG_NAME, "a")
        print(f"The page contains {len(links)} links")

        # Print the text and URL of every link
        for link in links:
            print(f"Link text: {link.text}, URL: {link.get_attribute('href')}")

        # Locate the heading by tag name
        # Idea: find the first h1 tag
        heading = driver.find_element(By.TAG_NAME, "h1")
        print(f"Main page heading: {heading.text}")

        # Locate inputs in a form by tag name (more effective combined with other strategies)
        driver.get("http://localhost:8000/login")
        inputs = driver.find_elements(By.TAG_NAME, "input")
        print(f"The login form contains {len(inputs)} input elements")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.5 Locating by Link Text (By.LINK_TEXT)
locate_by_link_text.py
"""
code: locate_by_link_text.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate by the complete link text
        # Idea: match the full text of the link exactly
        login_link = driver.find_element(By.LINK_TEXT, "登录")
        print(f"Found the login link: {login_link.get_attribute('href')}")
        login_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)

        # Locate another link
        dynamic_link = driver.find_element(By.LINK_TEXT, "动态内容")
        dynamic_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.6 Locating by Partial Link Text (By.PARTIAL_LINK_TEXT)
locate_by_partial_link.py
"""
code: locate_by_partial_link.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate by part of the link text
        # Idea: only a fragment of the text has to match; useful for long or changing link text
        form_link = driver.find_element(By.PARTIAL_LINK_TEXT, "表")
        print(f"Found a link containing '表': {form_link.text}")
        form_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)

        # Another example
        upload_link = driver.find_element(By.PARTIAL_LINK_TEXT, "上传")
        upload_link.click()
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.7 Locating by XPath (By.XPATH)
XPath is a language for addressing nodes in XML documents, and it works on HTML as well. It is one of the most flexible locator strategies.
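To get a feel for XPath expressions without starting a browser, the standard library's `xml.etree.ElementTree` can evaluate a limited subset of XPath. The sketch below runs three expressions against a small, well-formed fragment shaped like the mock home page; the fragment and its English text are made up for illustration. Selenium itself uses the browser's full XPath engine, which also supports functions such as `contains()` that ElementTree does not.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment shaped like the mock home page (trimmed for illustration)
PAGE = """
<html><body>
  <div class="nav"><a id="home-link" href="/">Home</a><a class="nav-link" href="/login">Login</a></div>
  <div id="products">
    <div class="product" data-id="1"><h3 class="product-name">Laptop</h3><p class="product-price">5999</p></div>
    <div class="product" data-id="2"><h3 class="product-name">Keyboard</h3><p class="product-price">399</p></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Attribute match at any depth, like //div[@class='product'] in Selenium
products = root.findall(".//div[@class='product']")
print(len(products))  # 2

# Parent/child steps combined with attribute predicates
names = [h3.text for h3 in root.findall(".//div[@class='product']/h3[@class='product-name']")]
print(names)  # ['Laptop', 'Keyboard']

# Positional predicate: the first div child of the products container
first_price = root.find(".//div[@id='products']/div[1]/p[@class='product-price']").text
print(first_price)  # 5999
```

The leading `.` makes each path relative to the parsed root element; in Selenium the same expressions are written starting with `//`.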
locate_by_xpath.py
"""
code: locate_by_xpath.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # 1. Absolute path (not recommended; hard to maintain)
        # Idea: a complete path from the root node; any change in page structure breaks it
        home_link = driver.find_element(By.XPATH, "/html/body/div/a[1]")
        print(f"Link found via absolute path: {home_link.text}")

        # 2. Relative path
        # Idea: start from any node, which is more flexible
        products = driver.find_elements(By.XPATH, "//div[@class='product']")
        print(f"Found {len(products)} products via relative path")

        # 3. Attribute match
        # Idea: locate the element by the value of one of its attributes
        # The username field exists only on the login page, so navigate there first;
        # find_element raises NoSuchElementException when nothing matches
        driver.get("http://localhost:8000/login")
        username_input = driver.find_element(By.XPATH, "//input[@id='username']")
        username_input.send_keys("test")

        # 4. Partial attribute match
        # Idea: match part of an attribute value with contains()
        password_input = driver.find_element(By.XPATH, "//input[contains(@name, 'pass')]")
        password_input.send_keys("password123")

        # 5. Text match
        # Idea: locate the element by its text content
        submit_btn = driver.find_element(By.XPATH, "//button[text()='登录']")
        submit_btn.click()
        time.sleep(2)

        # Return to the home page (back() would only return to the login page)
        driver.get("http://localhost:8000")
        time.sleep(1)

        # 6. Hierarchical match
        # Idea: combine parent/child relationships
        first_product_price = driver.find_element(
            By.XPATH, "//div[@class='product'][1]//p[@class='product-price']"
        )
        print(f"Price of the first product: {first_product_price.text}")

        # 7. Logical operators
        # Idea: combine several conditions with and/or
        dynamic_link = driver.find_element(
            By.XPATH, "//a[@class='nav-link' and contains(text(), '动态')]"
        )
        dynamic_link.click()
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.8 Locating by CSS Selector (By.CSS_SELECTOR)
CSS selectors are another powerful locator strategy, and they are usually more concise than XPath.
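One way to see why CSS selectors tend to be shorter is to write out the XPath equivalent of the simplest ones. The sketch below handles only three forms (`#id`, `.class`, and a bare tag name); `simple_css_to_xpath` is a hypothetical helper written for this comparison, not a Selenium API.

```python
import re

# A conservative pattern for simple identifiers such as "home-link" or "h1"
_NAME = r"[A-Za-z_][\w-]*"


def simple_css_to_xpath(selector: str) -> str:
    """Translate '#id', '.class' or a bare tag name into an equivalent XPath.

    Only these three simple forms are handled; anything else raises ValueError.
    """
    if re.fullmatch(f"#{_NAME}", selector):
        return f"//*[@id='{selector[1:]}']"
    if re.fullmatch(rf"\.{_NAME}", selector):
        # @class may hold several space-separated names, so match a whole word
        name = selector[1:]
        return f"//*[contains(concat(' ', normalize-space(@class), ' '), ' {name} ')]"
    if re.fullmatch(_NAME, selector):
        return f"//{selector}"
    raise ValueError(f"unsupported selector: {selector!r}")


print(simple_css_to_xpath("#home-link"))
print(simple_css_to_xpath(".nav-link"))
```

The class case shows the difference most clearly: `.nav-link` in CSS needs a `contains(concat(...))` expression in XPath to match one class name among several.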
locate_by_css.py
"""
code: locate_by_css.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # 1. ID selector
        # Idea: the # symbol followed by the ID value
        home_link = driver.find_element(By.CSS_SELECTOR, "#home-link")
        print(f"Found via ID selector: {home_link.text}")

        # 2. Class selector
        # Idea: the . symbol followed by the class value
        nav_links = driver.find_elements(By.CSS_SELECTOR, ".nav-link")
        print(f"Found {len(nav_links)} navigation links via class selector")

        # 3. Tag selector
        # Idea: use the tag name directly
        headings = driver.find_elements(By.CSS_SELECTOR, "h1, h3")
        print(f"Found {len(headings)} heading elements")

        # 4. Attribute selector
        # Idea: locate the element by its attributes
        token_input = driver.find_element(By.CSS_SELECTOR, "input[type='hidden'][id='csrf_token']")
        print(f"CSRF token value: {token_input.get_attribute('value')}")

        # 5. Descendant selector
        # Idea: locate elements through their position in the hierarchy
        product_prices = driver.find_elements(By.CSS_SELECTOR, ".product .product-price")
        print("All product prices:")
        for price in product_prices:
            print(price.text)

        # 6. Pseudo-class selector
        # Idea: locate with a CSS pseudo-class
        first_product = driver.find_element(By.CSS_SELECTOR, ".product:first-child")
        print(f"Name of the first product: {first_product.find_element(By.CSS_SELECTOR, '.product-name').text}")

        # Navigate to the login page
        driver.get("http://localhost:8000/login")
        time.sleep(1)

        # 7. Combined selector
        # Idea: combine several conditions
        form_elements = driver.find_elements(By.CSS_SELECTOR, "form div input")
        print(f"The login form contains {len(form_elements)} input elements inside div wrappers")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
Chapter 5: Waits
In automated testing, page elements often take time to load, so choosing an appropriate wait mechanism is essential.
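Conceptually, an explicit wait is a polling loop: check a condition, and if it is not yet met, pause briefly and try again until a deadline passes. The sketch below shows that idea in plain Python; `wait_until` is a hypothetical helper that mirrors the general behaviour of Selenium's WebDriverWait, not its actual implementation.

```python
import time


def wait_until(condition, timeout: float = 10.0, poll: float = 0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

Compared with a fixed `time.sleep(n)`, this returns as soon as the condition holds and fails with a clear error when it never does.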
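selenium_waits.py
"""
code: selenium_waits.py
"""
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # 1. Implicit wait
        # Idea: set a global wait time that applies to every element lookup
        # Note: an implicit wait stays in effect for the whole lifetime of the driver
        driver.implicitly_wait(10)  # wait up to 10 seconds
        driver.get("http://localhost:8000/dynamic")
        print("Opened the dynamic content page")

        # 2. Forced wait (not recommended)
        # Idea: wait a fixed amount of time, whether or not the element has loaded
        # Drawback: wastes time when the page is fast, and still fails when it is slow
        print("Using a forced wait...")
        time.sleep(3)  # forced 3-second wait

        # 3. Explicit wait
        # Idea: set a wait condition and timeout for one specific element
        # The Selenium documentation advises against mixing implicit and explicit
        # waits, so switch the implicit wait off before using WebDriverWait
        driver.implicitly_wait(0)
        print("Using explicit waits...")
        try:
            # Wait for the dynamically loaded element to be present,
            # for at most 10 seconds, checking every 500 milliseconds
            dynamic_element = WebDriverWait(driver, 10, 0.5).until(
                EC.presence_of_element_located((By.ID, "dynamic-content"))
            )
            print(f"Found dynamic content: {dynamic_element.text}")

            # Wait for an element to become visible
            visible_element = WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located((By.CLASS_NAME, "visible-after-delay"))
            )
            print(f"Visible element content: {visible_element.text}")

            # Wait for an element to become clickable
            clickable_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.ID, "delayed-button"))
            )
            print("Clicking the delayed button")
            clickable_button.click()

            # Wait for text to appear
            WebDriverWait(driver, 10).until(
                EC.text_to_be_present_in_element((By.ID, "status-message"), "已点击")
            )
            status = driver.find_element(By.ID, "status-message")
            print(f"Status: {status.text}")
        except TimeoutException:
            print("Timeout: the element was not found within the allotted time")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()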
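Commonly used Expected Conditions
- presence_of_element_located: the element is present in the DOM
- visibility_of_element_located: the element is visible (present and displayed)
- element_to_be_clickable: the element can be clicked
- text_to_be_present_in_element: the element contains the given text
- title_contains: the page title contains the given text
- invisibility_of_element_located: the element is not visible
- frame_to_be_available_and_switch_to_it: the frame is available, and the driver switches to it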
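When none of the built-in conditions fits, a custom one can be written: WebDriverWait's until() accepts any callable that takes the driver and returns a truthy value once the wait should end. The sketch below waits for a minimum number of matching elements; `element_count_at_least` is a hypothetical name, and the locator is the usual (strategy, value) tuple.

```python
def element_count_at_least(locator, minimum):
    """Condition factory: truthy once at least `minimum` elements match `locator`."""

    def _condition(driver):
        elements = driver.find_elements(*locator)
        # Return the elements themselves so the caller can use them directly
        return elements if len(elements) >= minimum else False

    return _condition
```

Against the mock home page this could be used as `WebDriverWait(driver, 10).until(element_count_at_least((By.CLASS_NAME, "product"), 4))`, which returns the list of product elements once all four are in the DOM.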
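Chapter 6: Getting Past Token Checks
Many sites embed a token (such as a CSRF token) in their forms to verify that a request really comes from their own pages, and a script that submits a request without a valid token is rejected. The following are common ways of handling such tokens: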
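The core step in most of these approaches is reading the token out of the page before submitting. As a browser-free illustration, the sketch below extracts a hidden field's value from raw HTML with the standard library's `html.parser`; `HiddenTokenParser` and `extract_token` are hypothetical names, and the field name `token` matches the mock login form.

```python
from html.parser import HTMLParser


class HiddenTokenParser(HTMLParser):
    """Remember the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name: str = "token"):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and self.token is None and attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_token(html: str, field_name: str = "token"):
    """Return the token value found in html, or None if the field is absent."""
    parser = HiddenTokenParser(field_name)
    parser.feed(html)
    return parser.token


sample = '<form method="post"><input type="hidden" name="token" value="abc123"></form>'
print(extract_token(sample))  # abc123
```

With Selenium the same value is read via `driver.find_element(By.NAME, "token").get_attribute("value")`, as the script below does.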
bypass_token.py
"""
code: bypass_token.py
"""
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def bypass_token_method1():
    """Method 1: extract the token from the page and use it"""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000/login")

        # 1. Extract the token from the page
        # Idea: read the token value from the page first, then use it in later steps
        token_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "token"))
        )
        token_value = token_element.get_attribute("value")
        print(f"Extracted token value: {token_value}")

        # 2. Fill in the form
        driver.find_element(By.ID, "username").send_keys("test")
        driver.find_element(By.ID, "password").send_keys("password123")

        # 3. Submit the form (the token is sent along automatically)
        driver.find_element(By.ID, "submit-btn").click()

        # Check whether the login succeeded
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//*[contains(text(), '登录成功')]"))
            )
            print("Login succeeded; token check passed")
        except TimeoutException:
            print("Login failed")
        time.sleep(3)
    finally:
        driver.quit()


def bypass_token_method2():
    """Method 2: keep the token in the browser context"""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # 1. Visit the site once so the browser session holds whatever state the site sets
        # (on sites that keep the token in a cookie, it is stored at this point)
        driver.get("http://localhost:8000")
        print("First visit: initial token obtained")
        time.sleep(2)

        # 2. Navigate to another page in the same session; cookie-based tokens persist,
        # and the mock form page also embeds a fresh token in a hidden field
        driver.get("http://localhost:8000/form")
        print("Navigated to the form page within the same session")

        # 3. Fill in and submit the form; the browser sends the token automatically
        driver.find_element(By.ID, "name").send_keys("测试用户")
        driver.find_element(By.ID, "email").send_keys("test@example.com")
        driver.find_element(By.ID, "submit-form").click()

        # Check whether the form submission succeeded
        try:
            success_msg = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "form-success"))
            )
            print(f"Form submitted successfully: {success_msg.text}")
        except TimeoutException:
            print("Form submission failed")
        time.sleep(3)
    finally:
        driver.quit()


def bypass_token_method3():
    """Method 3: read or set the token directly with JavaScript"""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000/login")

        # 1. Run JavaScript to read or set the token
        # Idea: some sites keep the token in a JavaScript variable
        token = driver.execute_script("""
            // Try to read the token from a JavaScript variable
            if (window.csrfToken) {
                return window.csrfToken;
            }
            // Otherwise set the value of the token input directly
            var tokenInput = document.querySelector('input[name="token"]');
            if (tokenInput) {
                // Replace this with a valid token you have obtained
                tokenInput.value = 'override_token_value';
                return tokenInput.value;
            }
            return null;
        """)
        print(f"Token value handled via JS: {token}")

        # 2. Fill in the login details
        driver.find_element(By.ID, "username").send_keys("test")
        driver.find_element(By.ID, "password").send_keys("password123")
        driver.find_element(By.ID, "submit-btn").click()
        time.sleep(3)
    finally:
        driver.quit()


if __name__ == "__main__":
    print("=== Method 1: extract the token from the page ===")
    bypass_token_method1()
    print("\n=== Method 2: keep the token in the browser context ===")
    bypass_token_method2()
    print("\n=== Method 3: read or set the token with JavaScript ===")
    bypass_token_method3()
Chapter 7: A Practical Example
The following end-to-end example shows how to use Selenium to scrape and interact with a complete site:
selenium_practical.py
"""
code: selenium_practical.py
"""
import json
import time
from dataclasses import asdict, dataclass
from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


# Data class used to store product information
@dataclass
class Product:
    id: int
    name: str
    price: float
    category: str


def crawl_products(driver) -> List[Product]:
    """Scrape the product information"""
    products = []
    try:
        # Navigate to the product page
        driver.get("http://localhost:8000")
        # Wait for the products to finish loading
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "products"))
        )
        # Collect all product elements
        product_elements = driver.find_elements(By.CLASS_NAME, "product")
        for element in product_elements:
            try:
                # Extract the product information
                product_id = int(element.get_attribute("data-id"))
                name = element.find_element(By.CLASS_NAME, "product-name").text
                price_text = element.find_element(By.CLASS_NAME, "product-price").text
                price = float(price_text.replace("价格: ¥", ""))
                category = element.find_element(By.CLASS_NAME, "category").text

                # Create the product object
                product = Product(id=product_id, name=name, price=price, category=category)
                products.append(product)
                print(f"Scraped product: {name}")

                # Simulate clicking the "add to cart" button
                add_button = element.find_element(By.CLASS_NAME, "add-to-cart")
                add_button.click()
                time.sleep(0.5)
            except Exception as e:
                print(f"Error while scraping a single product: {e}")
        print(f"Scraped {len(products)} products in total")
        return products
    except Exception as e:
        print(f"Error while scraping the product list: {e}")
        return []


def login(driver, username: str, password: str) -> bool:
    """Log in to the site"""
    try:
        # Navigate to the login page
        driver.get("http://localhost:8000/login")
        # Wait for the page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "submit-btn"))
        )
        # Read the token that will be submitted with the form
        token = driver.find_element(By.NAME, "token").get_attribute("value")
        print(f"Token used for login: {token}")

        # Fill in the login form
        driver.find_element(By.ID, "username").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        # Submit the form
        driver.find_element(By.ID, "submit-btn").click()

        # Check whether the login succeeded
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//*[contains(text(), '登录成功')]"))
            )
            print("Login succeeded")
            return True
        except TimeoutException:
            print("Login failed")
            return False
    except Exception as e:
        print(f"Error during login: {e}")
        return False


def main():
    # Configure Chrome options
    chrome_options = webdriver.ChromeOptions()
    # Set a user agent to reduce the chance of being identified as a bot
    chrome_options.add_argument(
        "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    )
    # Disable the automation-control markers
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)

    # Initialize the driver
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options,
    )
    # Hide automation markers further
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })

    try:
        # 1. Scrape the product information
        products = crawl_products(driver)
        # Save the product information to a JSON file
        with open("products.json", "w", encoding="utf-8") as f:
            json.dump([asdict(p) for p in products], f, ensure_ascii=False, indent=2)
        print("Product information saved to products.json")

        # 2. Log in to the site
        login_success = login(driver, "test", "password123")
        if login_success:
            time.sleep(2)
            # 3. Visit another page
            driver.get("http://localhost:8000/dynamic")
            try:
                # Wait for the dynamic content to load
                dynamic_content = WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located((By.ID, "dynamic-content"))
                )
                print(f"Dynamic content: {dynamic_content.text}")
            except TimeoutException:
                print("Dynamic content failed to load")
        time.sleep(3)
    finally:
        # Close the browser
        driver.quit()
        print("Scraping finished; browser closed")


if __name__ == "__main__":
    main()
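The script above turns the price text into a number by stripping the exact prefix "价格: ¥", which breaks as soon as the label or spacing changes. A slightly more tolerant approach is to pull the first number out of the text with a regular expression, as in the sketch below; `parse_price` is a hypothetical helper, and treating commas as thousands separators is an assumption about the price format.

```python
import re


def parse_price(text: str) -> float:
    """Extract the first number from a price label such as '价格: ¥5999'."""
    # Assumption: commas are thousands separators, as in '1,299.50'
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


print(parse_price("价格: ¥5999"))  # 5999.0
```

In crawl_products this would replace the `float(price_text.replace(...))` line with `price = parse_price(price_text)`.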
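Chapter 8: Best Practices and Counter-Anti-Scraping Strategies
8.1 Avoid being identified as a bot
- Set a sensible user agent: imitate the user agent of a real browser
- Add random delays: avoid actions that are too regular
- Disable automation markers: hide the identifying traits of Selenium
- Use a real browser profile: load an actual browser profile
- Imitate human behavior: randomize click positions, add mouse movement, and so on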
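The random-delay item above can be as simple as sleeping for a randomly chosen interval between actions. A minimal sketch follows; `human_pause` is a hypothetical helper name, the default bounds are arbitrary, and the `sleep` parameter exists only so the function can be exercised without actually waiting.

```python
import random
import time


def human_pause(min_seconds: float = 0.5, max_seconds: float = 2.0, sleep=time.sleep) -> float:
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `human_pause()` between clicks or page loads replaces the fixed `time.sleep(2)` style used in the earlier scripts with an irregular rhythm.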
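8.2 Code organization and maintenance
- Wrap common operations: put repeated operations into functions or classes
- Use the Page Object pattern: encapsulate page elements and actions as objects
- Exception handling: thorough exception handling improves stability
- Logging: record key operations and error messages
- Separate configuration: keep configuration apart from code for easier maintenance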
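As an illustration of the Page Object pattern mentioned above, the sketch below wraps the mock login page in a class that owns its URL, locators and actions. `LoginPage` is a hypothetical class; the locator strings "id" equal the value of `By.ID` in Selenium 4, and the element IDs are those of templates/login.html.

```python
class LoginPage:
    """Page object for the mock login page (a sketch, not a full implementation)."""

    URL = "http://localhost:8000/login"
    # (strategy, value) tuples; "id" is the string value of By.ID
    USERNAME = ("id", "username")
    PASSWORD = ("id", "password")
    SUBMIT = ("id", "submit-btn")

    def __init__(self, driver):
        self.driver = driver

    def open(self):
        self.driver.get(self.URL)
        return self  # allows LoginPage(driver).open().login(...)

    def login(self, username: str, password: str) -> None:
        self.driver.find_element(*self.USERNAME).send_keys(username)
        self.driver.find_element(*self.PASSWORD).send_keys(password)
        self.driver.find_element(*self.SUBMIT).click()
```

A script then reads `LoginPage(driver).open().login("test", "password123")`, and a change to the page's markup only requires updating the locators in one place.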
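8.3 Performance optimization
- Cut unnecessary waiting: set wait times sensibly
- Batch operations: keep the number of round trips to the browser as low as possible
- Headless mode: use headless mode when no visible window is needed
- Restrict resources: limit the loading of non-essential resources such as images and CSS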
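The headless and resource-limiting items above usually come down to a handful of Chrome command-line arguments. The sketch below builds such a list; `build_chrome_arguments` is a hypothetical helper, and the specific switches (`--headless=new` and `--blink-settings=imagesEnabled=false`) are commonly used Chrome flags assumed here rather than taken from this tutorial, so verify them against the Chrome version in use.

```python
def build_chrome_arguments(headless: bool = True, load_images: bool = False,
                           window_size: tuple = (1920, 1080)) -> list:
    """Return Chrome command-line arguments for a lighter-weight session."""
    width, height = window_size
    arguments = [f"--window-size={width},{height}"]
    if headless:
        # Assumed flag: the newer headless mode of recent Chrome versions
        arguments.append("--headless=new")
    if not load_images:
        # Assumed flag: asks the Blink engine not to load images
        arguments.append("--blink-settings=imagesEnabled=false")
    return arguments
```

Each returned string would be passed to `Options.add_argument(...)` before the driver is created, in the same way as the `--headless` option in section 3.0.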
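Summary
This tutorial has covered how to use Selenium, from environment setup to more advanced techniques, including the 8 element locator strategies, wait mechanisms, and handling token checks.
With the mock website built on FastAPI, you can practice safely and legally.
Selenium is a powerful tool that is used not only for web scraping but also widely for automated testing. Mastering these skills will considerably strengthen your abilities in web automation.
As anti-scraping techniques keep evolving, scraping engineers also need to keep learning and adapting to new challenges. Always remember to respect a site's robots.txt and the applicable laws and regulations when scraping.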