# ArticleSpider

**Repository Path**: jzins/ArticleSpider

## Basic Information

- **Project Name**: ArticleSpider
- **Description**: Crawls cnblogs (博客园) and writes the results to a database asynchronously (log in manually once, then reuse the saved cookies to stay logged in). Based on Scrapy 2.5.1 and Python 3.9.
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-01-18
- **Last Updated**: 2023-02-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Scrapy, Python, Spider

## README

# ArticleSpider

#### Introduction

Crawls cnblogs (博客园) and writes the results to a MySQL database asynchronously. Log in manually once, then save the cookies to keep the session alive. Built on Scrapy 2.5.1 and Python 3.9.

Requires basic knowledge of Python and web scraping.

#### Usage

Set your own MySQL credentials and USER_AGENT in settings.py (an illustrative sketch of these entries appears at the end of this README).

The table schema can be inferred from the do_insert method in pipelines.py (a sketch of the asynchronous pipeline pattern also appears at the end).

**Override the start_requests method:**

```python
# Requires at module level: import pickle, time, scrapy; from selenium import webdriver
def start_requests(self):
    # Absolute path to your downloaded chromedriver.exe (a raw string avoids
    # backslash-escape problems on Windows)
    path = r"E:\zhuomian\project\python\ArticleSpider\ArticleSpider\spiders\chromedriver.exe"
    liu = webdriver.Chrome(path)
    liu.get("https://account.cnblogs.com/signin?returnUrl=https:%2F%2Fnews.cnblogs.com%2F")
    # Once the sign-in page opens, the script sleeps for 30 seconds;
    # use that window to log in with your own account!
    time.sleep(30)
    # Grab the cookies from the logged-in browser session
    cookiese = liu.get_cookies()
    # Persist the cookies so later runs can skip the login
    pickle.dump(cookiese, open("E:/zhuomian/project/python/ArticleSpider/cookies/boke.cookie", "wb"))
    # Collect the cookies into a dict to attach to the request
    cookie_dict = {}
    # Only the name and value fields matter here
    for cookie in cookiese:
        cookie_dict[cookie["name"]] = cookie["value"]
    # Kick off the crawl with the logged-in cookies
    return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]
```

Because a request is returned here, its response is handled by the parse(self, response) callback (a hedged sketch appears at the end of this README).

Once you have logged in a single time, the cookies are already saved; subsequent runs can simply read the cookie file instead of logging in again. Updated code:

```python
def start_requests(self):
    # The Selenium login block from the previous snippet has been removed;
    # rerun it only when the saved cookies expire.
    cookiese = pickle.load(open("E:/zhuomian/project/python/ArticleSpider/cookies/boke.cookie", "rb"))
    cookie_dict = {}
    for cookie in cookiese:
        cookie_dict[cookie["name"]] = cookie["value"]
    return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]
```
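For reference, here is a minimal sketch of the parse callback mentioned above. The CSS selectors and item fields are assumptions for illustration, not the repository's actual code:

```python
def parse(self, response):
    # The request carried the saved login cookies, so this response belongs
    # to the logged-in session. Selectors and fields below are assumptions.
    for post in response.css("#news_list .news_block"):
        yield {
            "title": post.css("h2.news_entry a::text").get(),
            "url": response.urljoin(post.css("h2.news_entry a::attr(href)").get() or ""),
        }
```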
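The usage section asks you to set MySQL credentials and a USER_AGENT in settings.py. As a hedged illustration, the entries might look like the following; the MYSQL_* key names and the pipeline class name are assumptions about how pipelines.py reads its configuration:

```python
# ArticleSpider/settings.py -- illustrative values only; the MYSQL_* keys
# and pipeline class name are assumptions, not the repository's actual code.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36")

MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"

ITEM_PIPELINES = {
    "ArticleSpider.pipelines.MysqlTwistedPipeline": 300,
}
```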
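Finally, the description says items are inserted asynchronously, and the table schema can be read off do_insert in pipelines.py. Twisted's adbapi connection pool is the usual way to do non-blocking MySQL writes in Scrapy, so a sketch of that pattern is shown below; it assumes the MYSQL_* settings keys from the previous snippet and hypothetical title/url/create_date columns:

```python
# A minimal sketch of an asynchronous MySQL pipeline using Twisted's adbapi
# connection pool. Settings keys, table name, and columns are assumptions.
import pymysql
from twisted.enterprise import adbapi


class MysqlTwistedPipeline:
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=pymysql.cursors.DictCursor,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction runs do_insert on a thread from the pool,
        # so the insert does not block the crawl
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        spider.logger.error(failure)

    def do_insert(self, cursor, item):
        # Table and column names here are hypothetical; check the real
        # do_insert in pipelines.py for the actual schema
        insert_sql = """
            insert into cnblogs_article(title, url, create_date)
            values (%s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"]))
```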