An Automatic Novel-Scraping Script in Python

Tyrant · May 10, 2019 · Python, Free Resources
This post was last modified 220 days ago; some of its content may be out of date!

Recently I suddenly felt like going back to a novel I never finished in high school, so on a whim I went looking for it online. Everything I found, whether a downloaded txt file or a dedicated reading app, either carried ads or was missing chapters, which made for a miserable reading experience. To solve the problem the do-it-yourself way, I wrote this scraping script.

Key techniques:

  • BeautifulSoup4: parsing the downloaded HTML
  • Requests: making the HTTP requests
  • Python 3
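
If you have not combined these two libraries before, the division of labor looks roughly like this (a minimal sketch, with example.com as a placeholder URL):

import requests
from bs4 import BeautifulSoup

# Requests fetches the raw HTML.
html = requests.get('https://example.com').text

# BeautifulSoup parses it into a tree that can be searched by tag and attribute.
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)            # the page title
for link in soup.find_all('a'):     # every <a> tag on the page
    print(link.get('href'))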

Steps to use the script:

  • Install BeautifulSoup4, plus lxml (the parser this script passes to BeautifulSoup)
pip3 install beautifulsoup4 lxml
  • Install requests
pip3 install requests
  • Save the following code as book.py
import re
import sys
import time

import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Requests are made with verify=False below, so silence the TLS warning.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

class BookSpider():
    '''Scrape novels from the Dingdian novel site (www.dingdiann.com).'''
    def __init__(self):
        self.headers = {
            'Host': 'www.dingdiann.com',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
        }
        self.chapter_url_list = list()   # absolute URL of every chapter
        self.chapter_name_list = list()  # chapter titles, in the same order
        self.book_info = dict()          # author -> [(book name, detail path), ...]

    def get_book_url(self, book_name, author_name):
        '''Search the site and return the detail-page path of the requested book.'''
        url = 'https://www.dingdiann.com/searchbook.php'
        data = {
            'keyword': book_name
        }
        result = requests.get(url, headers=self.headers, params=data, verify=False).text
        soup = BeautifulSoup(result, 'lxml')
        book_name_list = soup.find_all(name='span', attrs={'class': 's2'})
        book_author_list = soup.find_all(name='span', attrs={'class': 's4'})
        # The first entry in each list is the result table's header row, not a book.
        book_name_list.pop(0)
        book_author_list.pop(0)
        # Group the search results by author.
        for candidate_name in book_name_list:
            name = str(candidate_name.a.string)
            book_url = str(candidate_name.a.get('href'))
            book_info_tuple = (name, book_url)
            author = str(book_author_list.pop(0).string)
            if author in self.book_info:
                self.book_info[author].append(book_info_tuple)
            else:
                self.book_info[author] = [book_info_tuple]
        # Using .get avoids a KeyError when no result matches the requested author.
        for info in self.book_info.get(author_name, []):
            if info[0] == book_name:
                url = info[1]
                print('Book found: ' + book_name + ' by ' + author_name)
                print('Download starts in 3 s...')
                time.sleep(3)
                return url
        print('Sorry, the book was not found. Please check the title and author name.')

    def get_book_info(self, url):
        '''Collect the chapter names and URLs from the book's detail page.'''
        all_url = 'https://www.dingdiann.com' + url
        result = requests.get(all_url, headers=self.headers, verify=False).text
        soup = BeautifulSoup(result, 'lxml')
        div = soup.find_all(id='list')[0]
        chapter_list = div.dl.contents

        for text in chapter_list:
            text = str(text)
            content = re.findall('<a href="' + url + '(.*?)" style="">(.*?)</a>.*?', text)
            if content:
                chapter_url = all_url + content[0][0]
                chapter_name = content[0][1]
                self.chapter_url_list.append(chapter_url)
                self.chapter_name_list.append(chapter_name)

        # The page shows the 12 newest chapters above the full table of contents,
        # so drop those duplicate entries from the front of both lists.
        for i in range(12):
            self.chapter_url_list.pop(0)
            self.chapter_name_list.pop(0)
    
    def get_chapter_content(self, name, url):
        '''Fetch one chapter and return its text, or False on a failed request.'''
        try:
            result = requests.get(url, headers=self.headers, verify=False).text
        except requests.RequestException:
            print(name + ' failed to download')
            return False
        else:
            soup = BeautifulSoup(result, 'lxml')
            div = str(soup.find_all(id='content')[0])
            # Keep everything between the content div and the trailing <script> tag,
            # then convert <br/> tags into newlines.
            result = re.findall('<div id="content">(.*?)<script>', div, re.S)[0].strip()
            result = re.sub('<br/>', '\n', result)
            return result
    
    def save_book(self, book_name):
        '''Append every chapter to <book_name>.txt, retrying failed downloads.'''
        for chapter_name in self.chapter_name_list:
            while True:
                chapter_content = self.get_chapter_content(chapter_name, self.chapter_url_list[0])
                if chapter_content:
                    with open(book_name + '.txt', 'a', encoding='utf-8') as f:
                        f.write(chapter_name)
                        f.write('\n')
                        f.write(chapter_content)
                        f.write('\n')
                    self.chapter_url_list.pop(0)
                    print(chapter_name + ' downloaded')
                    break
                time.sleep(1)  # brief pause before retrying a failed chapter

    def run(self, book_name, url):
        self.get_book_info(url)
        self.save_book(book_name)

def main(book_name, author_name):
    book = BookSpider()
    url = book.get_book_url(book_name, author_name)
    if url:  # only continue if the search actually found the book
        book.run(book_name, url)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
  • Usage: the script takes two arguments, the novel's title and the author's name, and then saves the scraped chapters to a local file named after the book. For example:
python3 book.py 天珠变 唐家三少
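
Since book.py exposes a plain main() function, the same download can also be driven from another Python script (a small sketch, assuming book.py is on your import path; the title and author are just the example above):

from book import main

# Equivalent to the command line above; writes 天珠变.txt to the working directory.
main('天珠变', '唐家三少')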

Enjoy!

Disclaimer

  • The novel data this script collects comes from the Dingdian novel site (顶点小说网); the script only scrapes data and does not sell anything
  • The data is scraped from https://www.dingdiann.com/. Thanks to the site admin for generously never banning my IP despite many crawls, and please support officially licensed releases
4 comments

腾讯云 · 6 months ago

Friend, want to exchange links?

Tyrant · 6 months ago
@腾讯云

Sure, what's your link?

腾讯云 · 6 months ago
@Tyrant

https://www.shimaisui.com/ (腾讯云)
I've added a link to your site on my homepage.

阿國 · 3 months ago

I want to use Python to build a small crawler that downloads novels as TXT files; my target is the 读书族 novel site. I've seen similar crawlers in many articles and tried modifying this post's code accordingly, but every attempt failed. How do I:
1. Parse the novel's table-of-contents URL
2. Fetch each chapter's content
3. Download it all and turn it into a complete novel