Recently I suddenly felt like picking up a novel I never finished back in high school, so on a whim I went looking for it online. Whether it was a downloaded txt file or a dedicated novel-reading app, everything either had ads or was missing chapters, and the reading experience was miserable. To fix that, I decided to roll up my sleeves and help myself, and this scraping script was born.
Key technical pieces (a minimal sketch of how they fit together follows this list):
- BeautifulSoup4: parses the HTML tags
- Requests: makes the HTTP requests
- Python 3
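Before getting to the full script, here is a minimal sketch of how the first two pieces work together; the URL is a placeholder chosen only for illustration:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder page -- any URL that returns HTML works for this illustration.
html = requests.get('https://example.com/', timeout=10).text

# BeautifulSoup turns the raw HTML into a tree that can be searched by tag and attribute.
soup = BeautifulSoup(html, 'lxml')  # 'html.parser' also works if lxml isn't installed
for link in soup.find_all('a'):
    print(link.get('href'), link.string)
```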
Steps to use the script:
- Install BeautifulSoup4: pip3 install beautifulsoup4 (the script also parses pages with lxml, so pip3 install lxml may be needed as well)
- Install requests: pip3 install requests
- Save the following code as book.py:
```python
import re
import sys
import time

import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# The requests below use verify=False, so silence the resulting TLS warnings.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


class BookSpider():
    '''Scrape novels from the Dingdian novel site (www.dingdiann.com).'''

    def __init__(self):
        self.headers = {
            'Host': 'www.dingdiann.com',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
        }
        self.chapter_url_list = list()
        self.chapter_name_list = list()
        self.book_info = dict()

    def get_book_url(self, book_name, author_name):
        '''Search the site and return the detail-page path of the requested book.'''
        url = 'https://www.dingdiann.com/searchbook.php'
        data = {'keyword': book_name}
        result = requests.get(url, headers=self.headers, params=data, verify=False).text
        soup = BeautifulSoup(result, 'lxml')
        book_name_list = soup.find_all(name='span', attrs={'class': 's2'})
        book_author_list = soup.find_all(name='span', attrs={'class': 's4'})
        # The first entry of each column is the table header, not a search result.
        book_name_list.pop(0)
        book_author_list.pop(0)
        # Group the results by author: {author: [(title, detail_path), ...]}.
        for candidate_name in book_name_list:
            name = str(candidate_name.a.string)
            book_url = str(candidate_name.a.get('href'))
            book_info_tuple = (name, book_url)
            author = str(book_author_list.pop(0).string)
            if author in self.book_info:
                self.book_info[author].append(book_info_tuple)
            else:
                self.book_info[author] = [book_info_tuple]
        if author_name in self.book_info:
            for info in self.book_info[author_name]:
                if info[0] == book_name:
                    print('Book found: ' + book_name + ' by ' + author_name)
                    print('Download will start in 3 seconds...')
                    time.sleep(3)
                    return info[1]
        print('Sorry, the book was not found. Please check the title and author name.')

    def get_book_info(self, url):
        '''Collect the chapter names and URLs from the book's index page.'''
        all_url = 'https://www.dingdiann.com' + url
        result = requests.get(all_url, headers=self.headers, verify=False).text
        soup = BeautifulSoup(result, 'lxml')
        div = soup.find_all(id='list')[0]
        chapter_list = div.dl.contents
        for text in chapter_list:
            text = str(text)
            content = re.findall('<a href="' + url + '(.*?)" style="">(.*?)</a>.*?', text)
            if content:
                self.chapter_url_list.append(all_url + content[0][0])
                self.chapter_name_list.append(content[0][1])
        # The index page lists the 12 newest chapters first; drop them so the book stays in order.
        for i in range(12):
            self.chapter_url_list.pop(0)
            self.chapter_name_list.pop(0)

    def get_chapter_content(self, name, url):
        '''Fetch a single chapter and return its cleaned-up text.'''
        try:
            result = requests.get(url, headers=self.headers, verify=False).text
        except requests.RequestException:
            print(name + ' failed to download')
            return False
        soup = BeautifulSoup(result, 'lxml')
        div = str(soup.find_all(id='content')[0])
        result = re.findall('<div id="content">(.*?)<script>', div, re.S)[0].strip()
        result = re.sub('<br/>', '\n', result)
        return result

    def save_book(self, book_name):
        '''Append every chapter to <book_name>.txt, retrying a chapter until it succeeds.'''
        for chapter_name in self.chapter_name_list:
            while True:
                chapter_content = self.get_chapter_content(chapter_name, self.chapter_url_list[0])
                if chapter_content:
                    with open(book_name + '.txt', 'a') as f:
                        f.write(chapter_name)
                        f.write('\n')
                        f.write(chapter_content)
                        f.write('\n')
                    self.chapter_url_list.pop(0)
                    print(chapter_name + ' downloaded')
                    break

    def run(self, book_name, url):
        self.get_book_info(url)
        self.save_book(book_name)


def main(book_name, author_name):
    book = BookSpider()
    url = book.get_book_url(book_name, author_name)
    book.run(book_name, url)


if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
```
- Usage: the script takes two arguments; the first is the novel title and the second is the author name. The scraped content is then saved to a local file, for example: python3 book.py 天珠变 唐家三少 (the script can also be driven from Python directly, as sketched after this list).
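If you'd rather drive it from Python instead of the command line, the same class can be used directly; this assumes book.py sits in the current directory and reuses the example title and author from above:

```python
from book import BookSpider

spider = BookSpider()
detail_url = spider.get_book_url('天珠变', '唐家三少')  # returns the book's detail-page path
if detail_url:
    spider.run('天珠变', detail_url)  # writes the chapters into 天珠变.txt
```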
Enjoy!
Disclaimer
- The novel data collected by this script comes from 顶点小说网; the script only collects data and does not sell anything.
- The data is scraped from https://www.dingdiann.com/ . Thanks to the site admin's generosity in never banning my IP despite many crawls, and please support the official, licensed releases.
I want to use Python to build a small crawler that downloads a novel as a TXT file; the target is 读书族小说网. I've seen similar crawlers in quite a few articles and tried to adapt them based on the article content, but every attempt failed. How do I:
1. Work out the novel's table-of-contents URL
2. Get the content of each chapter
3. Download it and assemble the complete novel
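A rough outline of those three steps against a generic site (the index URL, the selectors, and the content id below are placeholders of mine, since 读书族小说网's actual markup isn't shown anywhere in this post):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

HEADERS = {'User-Agent': 'Mozilla/5.0'}
INDEX_URL = 'https://example.com/book/12345/'  # placeholder: the novel's table-of-contents page

# 1. Fetch the table-of-contents page and collect the chapter links.
index_html = requests.get(INDEX_URL, headers=HEADERS, timeout=10).text
soup = BeautifulSoup(index_html, 'lxml')
# 'dd a' is an assumption; inspect the real page to find the selector matching chapter links.
chapters = [(a.get_text(strip=True), a.get('href')) for a in soup.select('dd a')]

# 2. Fetch each chapter and extract its text.
# 3. Append everything to a single TXT file to build the complete novel.
with open('novel.txt', 'a', encoding='utf-8') as f:
    for title, href in chapters:
        page = requests.get(urljoin(INDEX_URL, href), headers=HEADERS, timeout=10).text
        body = BeautifulSoup(page, 'lxml').find(id='content')  # the id is also an assumption
        if body is None:
            continue
        f.write(title + '\n' + body.get_text('\n').strip() + '\n\n')
```

The selectors are the part that has to be adapted for each site; everything else follows the same pattern as the book.py script above.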