To generate a sitemap.xml file, you need a crawler that collects every valid link on the site. Here is a complete solution:
Step 1: Install the required Python libraries
pip install requests beautifulsoup4 lxml
Step 2: Create the crawler script (sitemap_generator.py)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET
from datetime import datetime
def get_all_links(base_url):
    # Track visited URLs and the crawl frontier (breadth-first search)
    visited = set()
    queue = [base_url]
    all_links = set()
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)  # mark up front so failed URLs are not fetched again
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code != 200:
                continue
            # Keep successfully fetched pages for the sitemap
            all_links.add(url)
            print(f"Crawled: {url}")
            # Parse the HTML and collect new links
            soup = BeautifulSoup(response.text, 'lxml')
            for link in soup.find_all('a', href=True):
                href = link['href'].strip()
                full_url = urljoin(url, href)
                # Filter out invalid links
                parsed = urlparse(full_url)
                if parsed.scheme not in ('http', 'https'):
                    continue
                if not parsed.netloc.endswith('91kaiye.cn'):  # same-site links only
                    continue
                if '#' in full_url:  # strip fragment anchors
                    full_url = full_url.split('#')[0]
                # Enqueue links that have not been visited yet
                if full_url not in visited:
                    queue.append(full_url)
        except Exception as e:
            print(f"Error crawling {url}: {str(e)}")
    return all_links

def create_sitemap(links, filename='sitemap.xml'):
    root = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for link in sorted(links):
        url_elem = ET.SubElement(root, 'url')
        ET.SubElement(url_elem, 'loc').text = link
        ET.SubElement(url_elem, 'lastmod').text = datetime.now().strftime('%Y-%m-%d')
        ET.SubElement(url_elem, 'changefreq').text = 'daily'
        ET.SubElement(url_elem, 'priority').text = '0.8'
    tree = ET.ElementTree(root)
    tree.write(filename, encoding='utf-8', xml_declaration=True)
    print(f"\nSitemap generated: {filename} with {len(links)} URLs")

if __name__ == '__main__':
    base_url = 'https://www.91kaiye.cn/'
    print("Starting crawl...")
    links = get_all_links(base_url)
    create_sitemap(links)
Step 3: Run the script
python sitemap_generator.py
Execution notes:
- Crawler logic:
  - Performs a breadth-first crawl starting from the homepage https://www.91kaiye.cn/
  - Automatically filters out off-site links, fragment anchors, and invalid URLs
  - Records a last-modified date for each page (defaults to the current date)
  - Sets the change frequency to daily and the priority to 0.8
- Output file:
  - The generated sitemap.xml looks like this (a quick sanity check is shown right after the example):
<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.91kaiye.cn/page1</loc>
<lastmod>2023-10-05</lastmod>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
...
</urlset>
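As a quick sanity check (a minimal sketch, assuming the script was run in the current directory so sitemap.xml sits next to it), you can parse the generated file back with the standard library and count the URL entries:

import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

tree = ET.parse('sitemap.xml')              # raises ParseError if the XML is malformed
urls = tree.getroot().findall('sm:url', NS)
print(f"sitemap.xml contains {len(urls)} URLs")
for url in urls[:5]:                        # preview the first few <loc> entries
    print(url.find('sm:loc', NS).text)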
Notes:
- Anti-crawling measures: if the site has anti-bot protection, you may need to (see the first sketch after this list):
  - Add a time.sleep(1) delay between requests
  - Use proxy IPs
  - Send more realistic request headers
- Dynamic content: for pages rendered by JavaScript (e.g. Vue/React), switch to Selenium or Playwright (see the second sketch after this list)
- Optimization suggestions:
  - Run the script on the server on a schedule (e.g. once a week)
  - Submit the sitemap to Google Search Console
  - Add this line to robots.txt:
Sitemap: https://www.91kaiye.cn/sitemap.xml
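First sketch (anti-crawling measures): a minimal, hedged example of a polite_get helper (a hypothetical name, not part of the script above) that adds a fixed delay, a shared session with more realistic headers, and an optional proxy; it could replace the requests.get(...) call inside get_all_links:

import time
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9',
})
# Optional: route traffic through a proxy (replace with a real proxy address)
# session.proxies.update({'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'})

def polite_get(url, delay=1.0, retries=2):
    """Fetch a URL with a fixed delay between attempts and a few retries."""
    for attempt in range(retries + 1):
        time.sleep(delay)  # throttle requests to stay under rate limits
        try:
            return session.get(url, timeout=10)
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
    return None

Reusing a Session also keeps connections alive between requests via connection pooling, which is gentler on the server than opening a new connection per page.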
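Second sketch (JavaScript-rendered pages): a minimal example using Playwright (assuming it is installed with pip install playwright followed by playwright install chromium) that fetches the rendered HTML before handing it to BeautifulSoup; get_rendered_links is a hypothetical helper, not part of the script above:

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def get_rendered_links(url):
    """Load a JavaScript-rendered page in headless Chromium and extract its <a href> values."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')  # wait for client-side rendering to settle
        html = page.content()
        browser.close()
    soup = BeautifulSoup(html, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True)]

print(get_rendered_links('https://www.91kaiye.cn/'))

The wait_until='networkidle' option makes page.goto return only after network activity quiets down, which is usually enough for single-page apps to finish rendering their links.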
Alternative: use an online tool
If you prefer not to run any code, you can generate the sitemap with an online service:
- XML-Sitemaps.com
- Screaming Frog SEO Spider (desktop tool)
After generating it, upload sitemap.xml to the site's root directory and submit it through the Baidu/Google webmaster tools.