A Python Solution for Automated Scraping of Baidu Wenku
This article walks through a simple example of automating the scraping of Baidu Wenku documents with Python. It should be a useful reference if you are tackling the same problem. Interested readers, follow along with Wenwen from php教程!
Project Overview
The script can download doc, ppt, and pdf files. For doc documents the text downloads fine, but tables inside a doc are not captured; documents stored as page images can also be downloaded. For ppt and pdf, the page images are downloaded first and then assembled into a ppt. In short, anything that can be previewed can be downloaded.
Current Features
- Downloads previewable Word documents as .docx files; scanned documents are supported as well.
- Downloads previewable ppt and pdf files as non-editable ppt files. The page only serves images, so an editable version cannot be recovered in principle (a sketch of the image-to-slide assembly follows this list).
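To make the ppt/pdf path concrete, here is a minimal sketch of the image-to-slide assembly with python-pptx, the same library the main code below relies on. The file names and pixel size are assumptions for illustration; the real script derives the size from the downloaded images with OpenCV and uses the same 1 px = 12700 EMU convention.

from pptx import Presentation
from pptx.util import Emu

PX_TO_EMU = 12700  # the script's pixel-to-EMU conversion factor
width_px, height_px = 810, 1080  # assumed page-image size in pixels

prs = Presentation()
prs.slide_width = Emu(width_px * PX_TO_EMU)
prs.slide_height = Emu(height_px * PX_TO_EMU)
blank_layout = prs.slide_layouts[6]  # layout 6 is the blank layout

# page_0.jpg / page_1.jpg are hypothetical names for the downloaded page images
for name in ["page_0.jpg", "page_1.jpg"]:
    slide = prs.slides.add_slide(blank_layout)
    slide.shapes.add_picture(name, Emu(0), Emu(0))  # one full-page image per slide

prs.save("pages.pptx")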
Environment Setup
The commands are as follows:
pip install requests
pip install my_fake_useragent
pip install python-docx
pip install opencv-python
pip install python-pptx
pip install selenium
pip install scrapy
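Note that several of these packages install under a different import name (python-docx imports as docx, opencv-python as cv2, python-pptx as pptx). A quick optional sanity check:

# Confirm every dependency imports cleanly before running the scraper
for module in ["requests", "my_fake_useragent", "docx", "cv2", "pptx", "selenium", "scrapy"]:
    try:
        __import__(module)
        print("{} OK".format(module))
    except ImportError as err:
        print("{} MISSING: {}".format(module, err))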
Automating Baidu Wenku Scraping with Python
This project uses chromedriver to drive a Chrome browser for the scraping, so the chromedriver version must match your Chrome version (a quick version-check sketch follows the platform notes below).
For Windows users
1. If your Chrome browser version happens to be 87.0.4280, congratulations: you can skip straight to Usage, because the bundled chromedriver is that same version.
2. Otherwise, check your Chrome version and download the matching chromedriver from http://npm.taobao.org/mirrors/chromedriver/. For example, if your browser version is 87.0.4280, open the 87.0.4280.20/ link and, on Windows, download and unzip chromedriver_win32.zip. Do not download the LATEST_RELEASE_87.0.4280 link; that file is useless for this purpose, and readers have been tripped up by it before.
3. Replace the bundled chromedriver.exe with the one you unzipped, then skip to Usage.
For Ubuntu users
Honestly, if you are already running Ubuntu I will assume you know your way around: download the chromedriver (Linux build) that matches your Chrome version, point chromedriver_path in the script at the unzipped file, and skip to Usage. I will leave it at that.
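Whichever platform you are on, it may help to confirm the pairing before running the full script. Below is a minimal sketch, assuming selenium 3.x (the same API the main code uses): it starts Chrome once and prints both versions so you can see whether they match.

from selenium import webdriver

driver = webdriver.Chrome(executable_path="./chromedriver.exe")  # "./chromedriver" on Ubuntu
caps = driver.capabilities
# Newer Chrome reports "browserVersion"; some older setups used "version"
print("Chrome:      ", caps.get("browserVersion") or caps.get("version"))
print("chromedriver:", caps["chrome"]["chromedriverVersion"].split(" ")[0])
driver.quit()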
Usage:
Change the url in the code to the link of the document you want to download. The script automatically detects the document type, creates a folder in the current directory, and downloads the file into it (a sketch of the type check follows).
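For reference, the type check boils down to one XPath query over the rendered page: if the content area exposes page images, the document is treated as ppt/pdf, otherwise as doc. A minimal sketch (the html placeholder stands in for the page source selenium returns):

from scrapy import Selector

html = "<html>...</html>"  # placeholder for self.brower.page_source
xpath_content = ("//div[@class='content singlePage wk-container']/div/p/img/@data-loading-src"
                 "|//div[@class='content singlePage wk-container']/div/p/img/@data-src")
sel = Selector(text=html)
# Image urls present -> page-image preview (ppt/pdf); none -> text preview (doc)
is_image_based = bool(sel.xpath(xpath_content).extract())
print("ppt/pdf" if is_image_based else "doc")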
Main Code
The code is as follows:
import os
import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from scrapy import Selector
import requests
from my_fake_useragent import UserAgent
import docx
from docx.shared import Inches
import cv2
from pptx import Presentation
from pptx.util import Inches  # shadows docx's Inches; harmless here, both count in EMU

# chromedriver for Windows
chromedriver_path = "./chromedriver.exe"
# chromedriver for Ubuntu
# chromedriver_path = "./chromedriver"

doc_dir_path = "./doc"
ppt_dir_path = "./ppt"

# url = "https://wenku.baidu.com/view/4410199cb0717fd5370cdc2e.html?fr=search"  # doc_txt p
# url = "https://wenku.baidu.com/view/4d18916f7c21af45b307e87101f69e314332fa36.html"  # doc_txt span
# url = "https://wenku.baidu.com/view/dea519c7e53a580216fcfefa.html?fr=search"  # doc_txt span br
# url = 'https://wk.baidu.com/view/062edabeb6360b4c2e3f5727a5e9856a5712262d?pcf=2&bfetype=new'  # doc_img
# url = "https://wenku.baidu.com/view/2af6de34a7e9856a561252d380eb6294dd88228d"  # VIP-only doc
# url = "https://wenku.baidu.com/view/3de365cc6aec0975f46527d3240c844769eaa0aa.html?fr=search"  # ppt
# url = "https://wenku.baidu.com/view/18a8bc08094e767f5acfa1c7aa00b52acec79c55"  # pdf
# url = "https://wenku.baidu.com/view/bbe27bf21b5f312b3169a45177232f60dccce772"
# url = "https://wenku.baidu.com/view/5cb11d096e1aff00bed5b9f3f90f76c660374c24.html?fr=search"
# url = "https://wenku.baidu.com/view/71f9818fef06eff9aef8941ea76e58fafab045a6.html"
# url = "https://wenku.baidu.com/view/ffc6b32a68eae009581b6bd97f1922791788be69.html"
url = "https://wenku.baidu.com/view/d4d2e1e3122de2bd960590c69ec3d5bbfd0adaa6.html"


class DownloadImg():
    def __init__(self):
        self.ua = UserAgent()

    def download_one_img(self, img_url, saved_path):
        # Download a single image
        header = {
            "User-Agent": "{}".format(self.ua.random().strip()),
            'Connection': 'close'
        }
        r = requests.get(img_url, headers=header, stream=True)
        print("image request status code {}".format(r.status_code))
        if r.status_code == 200:
            # Write the image to disk
            with open(saved_path, mode="wb") as f:
                f.write(r.content)
            print("download {} success!".format(saved_path))
        del r
        return saved_path


class StartChrome():
    def __init__(self):
        mobile_emulation = {"deviceName": "Galaxy S5"}
        capabilities = DesiredCapabilities.CHROME
        capabilities['loggingPrefs'] = {'browser': 'ALL'}
        options = webdriver.ChromeOptions()
        options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.brower = webdriver.Chrome(executable_path=chromedriver_path,
                                       desired_capabilities=capabilities,
                                       chrome_options=options)
        # Start the browser and open the page to download
        self.brower.get(url)
        self.download_img = DownloadImg()

    def click_ele(self, click_xpath):
        # Click the specified element
        click_ele = self.brower.find_elements_by_xpath(click_xpath)
        if click_ele:
            # Scroll the element into view
            click_ele[0].location_once_scrolled_into_view
            # Click via JavaScript, which works even if the element is covered
            self.brower.execute_script('arguments[0].click()', click_ele[0])

    def judge_doc(self, contents):
        # Determine the content type: text paragraphs or page images
        p_list = ''.join(contents.xpath("./text()").extract())
        span_list = ''.join(contents.xpath("./span/text()").extract())
        if len(span_list) != len(p_list):
            xpath_content_one = "./br/text()|./span/text()|./text()"
        else:
            xpath_content_one = "./span/img/@src"
        return xpath_content_one

    def create_ppt_doc(self, ppt_dir_path, doc_dir_path):
        # Close the "open VIP" dialog
        xpath_close_button = "//div[@class='na-dialog-wrap show']/div/div/div[@class='btn-close']"
        self.click_ele(xpath_close_button)
        # Click "continue reading"
        xpath_continue_read_button = "//div[@class='foldpagewg-icon']"
        self.click_ele(xpath_continue_read_button)
        # Dismiss the "open in the Baidu app" prompt
        xpath_next_content_button = "//div[@class='btn-wrap']/div[@class='btn-cancel']"
        self.click_ele(xpath_next_content_button)
        # Keep clicking "load more" until the full text is shown
        click_count = 0
        while True:
            # Break out of the loop once the last page is reached
            if self.brower.find_elements_by_xpath("//div[@class='pagerwg-loadSucc hide']") or \
                    self.brower.find_elements_by_xpath("//div[@class='pagerwg-button' and @style='display: none;']"):
                break
            # Click "load more"
            xpath_loading_more_button = "//span[@class='pagerwg-arrow-lower']"
            self.click_ele(xpath_loading_more_button)
            click_count += 1
            print("clicked 'load more' {} times!".format(click_count))
            # Wait briefly for the browser to load
            time.sleep(1.5)
        # Get the html content
        sel = Selector(text=self.brower.page_source)
        # Determine the document type
        xpath_content = ("//div[@class='content singlePage wk-container']/div/p/img/@data-loading-src"
                         "|//div[@class='content singlePage wk-container']/div/p/img/@data-src")
        contents = sel.xpath(xpath_content).extract()
        if contents:  # image pages: build a ppt
            self.create_ppt(ppt_dir_path, sel)
        else:  # text pages: build a doc
            self.create_doc(doc_dir_path, sel)

    def create_ppt(self, ppt_dir_path, sel):
        # Create the output folder if it does not exist
        if not os.path.exists(ppt_dir_path):
            os.makedirs(ppt_dir_path)
        SLD_LAYOUT_TITLE_AND_CONTENT = 6  # layout 6 is the blank slide layout
        prs = Presentation()  # instantiate the ppt
        # Get the title
        xpath_title = "//div[@class='doc-title']/text()"
        title = "".join(sel.xpath(xpath_title).extract()).strip()
        # Collect the candidate image urls for every page
        xpath_content_p = "//div[@class='content singlePage wk-container']/div/p/img"
        xpath_content_p_list = sel.xpath(xpath_content_p)
        xpath_content_p_url_list = []
        for imgs in xpath_content_p_list:
            xpath_content = "./@data-loading-src|./@data-src|./@src"
            contents_list = imgs.xpath(xpath_content).extract()
            xpath_content_p_url_list.append(contents_list)
        img_path_list = []  # downloaded image paths, used later to build the ppt and clean up
        # Download the images and, per page, keep the tallest candidate
        for index, content_img_p in enumerate(xpath_content_p_url_list):
            p_img_path_list = []
            for index_1, img_one in enumerate(content_img_p):
                one_img_saved_path = os.path.join(ppt_dir_path, "{}_{}.jpg".format(index, index_1))
                self.download_img.download_one_img(img_one, one_img_saved_path)
                p_img_path_list.append(one_img_saved_path)
            p_img_max_shape = 0
            for index, p_img_path in enumerate(p_img_path_list):
                img_shape = cv2.imread(p_img_path).shape
                if p_img_max_shape < img_shape[0]:
                    p_img_max_shape = img_shape[0]
                    index_max_img = index
            img_path_list.append(p_img_path_list[index_max_img])
        print(img_path_list)
        # Find the largest size among the downloaded images
        img_shape_max = [0, 0]
        for img_path_one in img_path_list:
            img_path_one_shape = cv2.imread(img_path_one).shape
            if img_path_one_shape[0] > img_shape_max[0]:
                img_shape_max = img_path_one_shape
        # Scale every image to that largest size
        for img_path_one in img_path_list:
            cv2.imwrite(img_path_one, cv2.resize(cv2.imread(img_path_one), (img_shape_max[1], img_shape_max[0])))
        # Convert pixels to EMU, the length unit ppt uses (1 pixel = 12700 EMU here)
        prs.slide_width = img_shape_max[1] * 12700
        prs.slide_height = img_shape_max[0] * 12700
        for img_path_one in img_path_list:
            left = Inches(0)
            top = Inches(0)
            slide_layout = prs.slide_layouts[SLD_LAYOUT_TITLE_AND_CONTENT]
            slide = prs.slides.add_slide(slide_layout)
            pic = slide.shapes.add_picture(img_path_one, left, top)
            print("insert {} into pptx success!".format(img_path_one))
        # Remove the downloaded images
        for root, dirs, files in os.walk(ppt_dir_path):
            for file in files:
                if file.endswith(".jpg"):
                    img_path = os.path.join(root, file)
                    os.remove(img_path)
        prs.save(os.path.join(ppt_dir_path, title + ".pptx"))
        print("download {} success!".format(os.path.join(ppt_dir_path, title + ".pptx")))

    def create_doc(self, doc_dir_path, sel):
        # Create the output folder if it does not exist
        if not os.path.exists(doc_dir_path):
            os.makedirs(doc_dir_path)
        # Get the title
        xpath_title = "//div[@class='doc-title']/text()"
        title = "".join(sel.xpath(xpath_title).extract()).strip()
        document = docx.Document()  # create the word document
        document.add_heading(title, 0)  # add the title
        # Get the article content
        xpath_content = "//div[contains(@data-id,'div_class_')]//p"
        contents = sel.xpath(xpath_content)
        # Determine the content type
        xpath_content_one = self.judge_doc(contents)
        if xpath_content_one.endswith("text()"):
            # Text content: scrape it directly
            for content_one in contents:
                one_p_list = content_one.xpath(xpath_content_one).extract()
                p_txt = ""
                for p in one_p_list:
                    if p == " ":
                        p_txt += ('\n' + p)
                    else:
                        p_txt += p
                pp = document.add_paragraph(p_txt)
            document.save(os.path.join(doc_dir_path, '{}.docx'.format(title)))
            print("download {} success!".format(title))
        elif xpath_content_one.endswith("@src"):
            # Image content: download each page image
            for index, content_one in enumerate(contents.xpath(xpath_content_one).extract()):
                # Build the image download url
                content_img_one_url = 'https:' + content_one
                # Save the image
                saved_image_path = self.download_img.download_one_img(
                    content_img_one_url, os.path.join(doc_dir_path, "{}.jpg".format(index)))
                document.add_picture(saved_image_path, width=Inches(6))  # insert the image into the document
                os.remove(saved_image_path)  # delete the downloaded image
            document.save(os.path.join(doc_dir_path, '{}.docx'.format(title)))  # save the document
            print("download {} success!".format(title))


if __name__ == "__main__":
    start_chrome = StartChrome()
    start_chrome.create_ppt_doc(ppt_dir_path, doc_dir_path)
Project Repository
https://github.com/siyangbing/baiduwenku
That's all for the details of automating Baidu Wenku scraping with Python. For more material on scraping Baidu Wenku with Python, keep an eye on other related articles on php教程!