Scraping Lianjia Hangzhou with Scrapy

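After brushing up on the concept of classes, I became a little more comfortable with the Scrapy crawler framework, so I updated the code I wrote a while ago that scraped Lianjia with BeautifulSoup.

This time the target is still second-hand housing on Lianjia Hangzhou, except that the for-sale section scraped last time is replaced by the sold (chengjiao) section.

To learn Scrapy, you can consult the material below and work through it alongside this post.

Reference material for the Scrapy crawler framework

Right, back to the topic.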

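The first step is to analyze the page structure. Open any page of Lianjia's second-hand housing section and count the entries: each page lists 30 second-hand homes (the screenshot shows only 4), and there are 100 pages in total.
Figure: page structure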

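From this, the crawling strategy follows naturally:

1. Scrape the URLs of the 30 second-hand homes on each page
2. Scrape the title, price, and other details from each home's URL

Note: breaking the process down further, step 1 first collects the URLs of all listing pages and then the URLs of the 30 homes on each page; step 2 parses each home URL obtained in step 1 to extract the title, price, and other details.

However, Lianjia does not expose all of its second-hand listings directly: no matter how many homes a section contains, it shows at most 100 pages of 30 listings each, i.e. 3,000 second-hand listings in total.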


Figure: filter conditions

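So the only option is to partition the listings with different filter conditions, keep the number of listings under each filter below 3,000, and then merge the results from all filters to obtain the full data set.

Here I chose to filter by total price.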
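
The 3,000 cap is just 100 pages times 30 listings. As a rough sketch of how one might check whether a set of price bands is fine-grained enough, the snippet below flags bands whose listing count exceeds the cap. The counts are invented for illustration and are not real Lianjia figures; the function name `bands_over_cap` is likewise hypothetical.

```python
PAGES_PER_FILTER = 100    # pages shown per filter (from the text above)
LISTINGS_PER_PAGE = 30    # listings per page (from the text above)
MAX_VISIBLE = PAGES_PER_FILTER * LISTINGS_PER_PAGE


def bands_over_cap(counts):
    """Return the (low, high) price bands whose listing count exceeds the cap."""
    return [band for band, n in sorted(counts.items()) if n > MAX_VISIBLE]


# Hypothetical listing counts per total-price band (units of 10,000 CNY).
example_counts = {(0, 50): 800, (50, 100): 2900, (100, 200): 7200}

print(MAX_VISIBLE)
print(bands_over_cap(example_counts))
```

A band returned by this check would need to be split into narrower bands before crawling.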

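# Under the 0-500,000 CNY filter (0-50 in units of 10,000 CNY), the URL is
# url = "https://hz.lianjia.com/chengjiao/pg1/ea10000bp0ep50/"
# where pg1 is page 1 under the current filter, bp0 is the lower bound of the total price, and ep50 is the upper bound

# 1. Set the filter conditions to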
# page_group_list = ['ea10000bp0ep50/',
# 'ea10000bp50ep100/',
# 'ea10000bp100ep120/',
# 'ea10000bp120ep140/',
# 'ea10000bp140ep160/',
# 'ea10000bp160ep180/',
# 'ea10000bp180ep200/',
# 'ea10000bp200ep250/',
# 'ea10000bp250ep300/',
# 'ea10000bp300ep10000/']

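# 2. Pages under each filter condition are iterated via the number after pg
# pg(1,2,3,4,5....)

# 3. The maximum page count under each filter condition must also be fetched, because not every condition has 100 pages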
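
The three points above can be sketched in plain Python. This is only an illustration: `listing_url` and `total_pages` are hypothetical helper names, and the `page-data` sample string is an assumed shape of the pager attribute that the spider reads `totalPage` from. The URL is concatenated exactly as the spider code in this post does it (page number immediately followed by the filter string), whereas the sample URL above shows a slash after `pg1`; which form the site expects is not verified here.

```python
import json

BASE_URL = 'https://hz.lianjia.com/chengjiao/pg'


def listing_url(page, low, high):
    """Build the listing URL for one page of one total-price band."""
    band = 'ea10000bp{}ep{}/'.format(low, high)
    return BASE_URL + str(page) + band


def total_pages(page_data):
    """Read totalPage from the JSON text stored in the pager's page-data attribute."""
    return int(json.loads(page_data)['totalPage'])


# Assumed sample value of the page-data attribute; the real one may carry other keys.
sample_page_data = '{"totalPage": 37, "curPage": 1}'

urls = [listing_url(p, 0, 50)
        for p in range(1, total_pages(sample_page_data) + 1)]
print(urls[0])
print(len(urls))
```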

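With the URL analysis done, it is time to write the code. The Scrapy project written here consists of roughly four parts: items, pipelines, settings, and spiders. items defines the fields to scrape, pipelines handles the output of the scraped items, settings tunes the crawler's parameters, and spiders is the core, where the actual crawling is implemented.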

a. Define items
import scrapy


class LianjiaItem(scrapy.Item):
    # Listing title
    housename = scrapy.Field()
    # Property-right term
    propertylimit = scrapy.Field()
    # Link
    houselink = scrapy.Field()
    # Listed total price
    totalprice = scrapy.Field()
    # Unit price
    unitprice = scrapy.Field()
    # Layout
    housetype = scrapy.Field()
    # Gross floor area
    constructarea = scrapy.Field()
    # Interior floor area
    housearea = scrapy.Field()
    # Floor
    housefloor = scrapy.Field()
    # Property use
    house_use = scrapy.Field()
    # Transaction type
    tradeproperty = scrapy.Field()
    # Number of followers
    guanzhu = scrapy.Field()
    # Number of viewings
    daikan = scrapy.Field()
    # Administrative district
    district = scrapy.Field()
    # Sold total price
    selltotalprice = scrapy.Field()
    # Sold unit price
    sellunitprice = scrapy.Field()
    # Sale date
    selltime = scrapy.Field()
    # Time on market
    sellperiod = scrapy.Field()
    # Community average price
    villageunitprice = scrapy.Field()
    # Year the community was built
    villagetime = scrapy.Field()
b. Define spiders
# -*- coding: utf-8 -*-
import scrapy
import requests
from lxml import etree
import json
from Lianjia.items import LianjiaItem
import re


class ChengjiaoSpider(scrapy.Spider):
    name = 'chengjiao'
    # allowed_domains = ['lianjia.com']
    baseURL = 'https://hz.lianjia.com/chengjiao/pg'
    offset_page = 1
    offset_list = 0
    page_group_list = ['ea10000bp0ep50/',
                       'ea10000bp50ep100/',
                       'ea10000bp100ep120/',
                       'ea10000bp120ep140/',
                       'ea10000bp140ep160/',
                       'ea10000bp160ep180/',
                       'ea10000bp180ep200/',
                       'ea10000bp200ep250/',
                       'ea10000bp250ep300/',
                       'ea10000bp300ep10000/']

    url = baseURL + str(offset_page) + page_group_list[offset_list]

    start_urls = [url]

    # Get the maximum number of pages under the current filter condition
    def getmax(self, url):
        requ = requests.get(url, allow_redirects=False)
        if requ.status_code == 200:
            resp = requ.text
            tree = etree.HTML(resp)
            str_max = tree.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            dic_max = json.loads(str_max)
            maxnum = dic_max['totalPage']
            return maxnum
        else:
            print('Open Page Error')
            # Return 0 so the "<" comparison in parse moves on to the next filter
            return 0

    # Collect the second-hand home URLs on a listing page.
    # The callback argument names the method that handles the response;
    # the meta argument passes the item along to that method.
    def parse(self, response):
        node_list = response.xpath("//div[@class='info']/div[@class='title']/a")
        for node in node_list:
            item = LianjiaItem()
            item['houselink'] = node.xpath("./@href").extract()[0]
            yield scrapy.Request(item['houselink'], callback=self.parse_content, meta={'key': item})
        # If the current page number is below the maximum for this filter, increment it and crawl the next page.
        # Once the page number reaches the maximum, every page under this filter has been crawled,
        # so reset the page number to 1 and switch to the next filter condition.
        if self.offset_page < self.getmax(response.url):
            self.offset_page += 1
            nexturl = self.baseURL + str(self.offset_page) + self.page_group_list[self.offset_list]
            yield scrapy.Request(nexturl, callback=self.parse)
        else:
            if self.offset_list < len(self.page_group_list) - 1:
                self.offset_page = 1
                self.offset_list += 1
                nexturl = self.baseURL + str(self.offset_page) + self.page_group_list[self.offset_list]
                yield scrapy.Request(nexturl, callback=self.parse)

    # Scrape the details of one home.
    # The item passed by the previous method is received through meta.
    def parse_content(self, response):
        item = response.meta['key']
        # Listing title
        try:
            item['housename'] = response.xpath("//div[@class='house-title']/div[@class='wrapper']/h1/text()").extract()[0].strip()
        except IndexError:
            item['housename'] = 'None'
        # Property-right term
        try:
            item['propertylimit'] = response.xpath("//div[@class='content']/ul/li[13]/text()").extract()[0].strip()
        except IndexError:
            item['propertylimit'] = 'None'
        # Listed total price
        try:
            item['totalprice'] = response.xpath("//div[@class='msg']/span[1]/label/text()").extract()[0].strip()
        except IndexError:
            item['totalprice'] = 'None'
        # Layout
        try:
            item['housetype'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[1]/text()").extract()[0].strip()
        except IndexError:
            item['housetype'] = 'None'
        # Gross floor area
        try:
            item['constructarea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[3]/text()").extract()[0].strip()
        except IndexError:
            item['constructarea'] = 'None'
        # Interior floor area
        try:
            item['housearea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[5]/text()").extract()[0].strip()
        except IndexError:
            item['housearea'] = 'None'
        # Property use
        try:
            item['house_use'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[4]/text()").extract()[0].strip()
        except IndexError:
            item['house_use'] = 'None'
        # Transaction type
        try:
            item['tradeproperty'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[2]/text()").extract()[0].strip()
        except IndexError:
            item['tradeproperty'] = 'None'
        # Number of followers
        try:
            item['guanzhu'] = response.xpath("//div[@class='msg']/span[5]/label/text()").extract()[0].strip()
        except IndexError:
            item['guanzhu'] = 'None'
        # Number of viewings
        try:
            item['daikan'] = response.xpath("//div[@class='msg']/span[4]/label/text()").extract()[0].strip()
        except IndexError:
            item['daikan'] = 'None'
        # Administrative district
        try:
            pre_district = response.xpath("//section[@class='wrapper']/div[@class='deal-bread']/a[3]/text()").extract()[0].strip()
            pattern = u'(.*?)二手房成交价格'
            item['district'] = re.search(pattern, pre_district).group(1)
        except (IndexError, AttributeError):
            item['district'] = 'None'
        # Sold total price
        try:
            item['selltotalprice'] = response.xpath("//span[@class='dealTotalPrice']/i/text()").extract()[0].strip()
        except IndexError:
            item['selltotalprice'] = 'None'
        # Sold unit price
        try:
            item['sellunitprice'] = response.xpath("//div[@class='price']/b/text()").extract()[0].strip()
        except IndexError:
            item['sellunitprice'] = 'None'
        # Sale date
        try:
            item['selltime'] = response.xpath("//div[@id='chengjiao_record']/ul[@class='record_list']/li/p[@class='record_detail']/text()").extract()[0].split(u',')[-1]
        except IndexError:
            item['selltime'] = 'None'

        yield item
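
Every field in parse_content repeats the same pattern: take the first extracted string, strip it, and fall back to 'None' when nothing matched. Purely as an illustration, that pattern could be folded into one small function that works on the list returned by `.extract()`. The name `first_or_default` is hypothetical and not part of the original spider, and the sample strings are made up.

```python
def first_or_default(values, default='None'):
    """Return the first string in values, stripped, or default if values is empty."""
    return values[0].strip() if values else default


# A non-empty result, as .extract() would return when the XPath matched.
print(first_or_default(['  3室2厅1厨1卫  ']))
# An empty result, as .extract() would return when the XPath matched nothing.
print(first_or_default([]))
```

Scrapy's selector lists also provide `extract_first(default=...)`, which handles the empty case directly; the helper above additionally strips whitespace.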
c. Define settings
# -*- coding: utf-8 -*-

# Scrapy settings for Lianjia project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Lianjia'

SPIDER_MODULES = ['Lianjia.spiders']
NEWSPIDER_MODULE = 'Lianjia.spiders'

#LOG_FILE = r"C:\test\CHENGJ_pro.doc"
#LOG_LEVEL = 'INFO'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Lianjia (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
   'Accept-Language': 'zh-CN,zh;q=0.9',
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Lianjia.middlewares.LianjiaSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Lianjia.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Lianjia.pipelines.LianjiaPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
d. Define pipelines
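import json


class LianjiaPipeline(object):
    def __init__(self):
        # Python 3: open in text mode with an explicit encoding so Chinese text is written as UTF-8
        self.f = open('c:\\test\\ceshi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese characters readable
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()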
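
To see what the pipeline's output looks like, here is a small sketch that writes two records to an in-memory buffer instead of a file, one JSON object per line, and reads them back. The record values are made up for illustration, and Python 3 is assumed.

```python
import io
import json

# Made-up records standing in for scraped items.
items = [{'district': '西湖', 'selltotalprice': '210'},
         {'district': '拱墅', 'selltotalprice': '185'}]

buf = io.StringIO()  # in-memory stand-in for the output file
for item in items:
    # Same serialization as the pipeline: readable Chinese, one object per line
    buf.write(json.dumps(item, ensure_ascii=False) + '\n')

lines = buf.getvalue().splitlines()
records = [json.loads(line) for line in lines]

print(lines[0])
print(records == items)
```

Without `ensure_ascii=False`, the Chinese characters would be written as `\uXXXX` escapes, which still load correctly but are hard to read in the file.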
Supplement: converting the JSON to a CSV file for Excel
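import json
import pandas as pd

path = r"C:\test\ceshi.json"

# Each line of the file is one JSON record written by the pipeline
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
df = pd.DataFrame(records)

# Write the table as CSV in gb18030 encoding
df.to_csv(r"C:\test\chengjiao.csv", encoding='gb18030')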

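After reading the Jingmi (静觅) tutorial, I updated the spiders code; the other parts are unchanged. Overall the code is clearer, with far fewer conditionals and iterations.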

import scrapy
import json
from Lianjia.items import LianjiaItem
import re
from scrapy.http import Request


class ChengjiaoSpider(scrapy.Spider):
    name = 'chengjiao_pro'
    baseURL = 'https://hz.lianjia.com/chengjiao/pg'
    offset_page = 1
    page_group_list = ['ea10000bp0ep50/',
                       'ea10000bp50ep100/',
                       'ea10000bp100ep120/',
                       'ea10000bp120ep140/',
                       'ea10000bp140ep160/',
                       'ea10000bp160ep180/',
                       'ea10000bp180ep200/',
                       'ea10000bp200ep250/',
                       'ea10000bp250ep300/',
                       'ea10000bp300ep10000/']

    def start_requests(self):
        # Request page 1 of every price band
        for i in self.page_group_list:
            url = self.baseURL + str(self.offset_page) + i
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Read the page count for this band from the pager's page-data attribute
        maxnum_dict = json.loads(response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").extract()[0])
        maxnum = int(maxnum_dict['totalPage'])
        for num in range(1, maxnum + 1):
            # item = LianjiaItem()
            split_str = self.baseURL + str(num)
            url = split_str + response.url.split(self.baseURL + str(self.offset_page))[1]
            # dont_filter=True because page 1 has already been requested once
            yield Request(url, self.get_link, dont_filter=True)
            # item['iurl'] = url
            # item['resurl'] = response.url
            # yield item

    def get_link(self, response):
        # Collect the second-hand home URLs on a listing page
        node_list = response.xpath("//div[@class='info']/div[@class='title']/a")
        for node in node_list:
            item = LianjiaItem()
            item['houselink'] = node.xpath("./@href").extract()[0]
            yield scrapy.Request(item['houselink'], callback=self.parse_content, meta={'key': item})

    def parse_content(self, response):
        item = response.meta['key']
        # Listing title
        try:
            item['housename'] = response.xpath("//div[@class='house-title']/div[@class='wrapper']/h1/text()").extract()[0].strip()
        except IndexError:
            item['housename'] = 'None'
        # Property-right term
        try:
            item['propertylimit'] = response.xpath("//div[@class='content']/ul/li[13]/text()").extract()[0].strip()
        except IndexError:
            item['propertylimit'] = 'None'
        # Listed total price
        try:
            item['totalprice'] = response.xpath("//div[@class='msg']/span[1]/label/text()").extract()[0].strip()
        except IndexError:
            item['totalprice'] = 'None'
        # Layout
        try:
            item['housetype'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[1]/text()").extract()[0].strip()
        except IndexError:
            item['housetype'] = 'None'
        # Gross floor area
        try:
            item['constructarea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[3]/text()").extract()[0].strip()
        except IndexError:
            item['constructarea'] = 'None'
        # Interior floor area
        try:
            item['housearea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[5]/text()").extract()[0].strip()
        except IndexError:
            item['housearea'] = 'None'
        # Property use
        try:
            item['house_use'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[4]/text()").extract()[0].strip()
        except IndexError:
            item['house_use'] = 'None'
        # Transaction type
        try:
            item['tradeproperty'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[2]/text()").extract()[0].strip()
        except IndexError:
            item['tradeproperty'] = 'None'
        # Number of followers
        try:
            item['guanzhu'] = response.xpath("//div[@class='msg']/span[5]/label/text()").extract()[0].strip()
        except IndexError:
            item['guanzhu'] = 'None'
        # Number of viewings
        try:
            item['daikan'] = response.xpath("//div[@class='msg']/span[4]/label/text()").extract()[0].strip()
        except IndexError:
            item['daikan'] = 'None'
        # Administrative district
        try:
            pre_district = response.xpath("//section[@class='wrapper']/div[@class='deal-bread']/a[3]/text()").extract()[0].strip()
            pattern = u'(.*?)二手房成交价格'
            item['district'] = re.search(pattern, pre_district).group(1)
        except (IndexError, AttributeError):
            item['district'] = 'None'
        # Sold total price
        try:
            item['selltotalprice'] = response.xpath("//span[@class='dealTotalPrice']/i/text()").extract()[0].strip()
        except IndexError:
            item['selltotalprice'] = 'None'
        # Sold unit price
        try:
            item['sellunitprice'] = response.xpath("//div[@class='price']/b/text()").extract()[0].strip()
        except IndexError:
            item['sellunitprice'] = 'None'
        # Sale date
        try:
            item['selltime'] = response.xpath("//div[@id='chengjiao_record']/ul[@class='record_list']/li/p[@class='record_detail']/text()").extract()[0].split(u',')[-1]
        except IndexError:
            item['selltime'] = 'None'
        yield item
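
The parse method of the updated spider rebuilds every page URL from the first page's URL: it cuts off the prefix ending in `pg1` and re-attaches the same filter suffix to each new page number. A minimal sketch of that string manipulation in plain Python, assuming the first-page URL has the shape the spider itself generates (the function name `page_urls` is hypothetical):

```python
BASE_URL = 'https://hz.lianjia.com/chengjiao/pg'
FIRST_PAGE = 1


def page_urls(first_page_url, max_page):
    """Rebuild the URL of every page of a filter from its first-page URL."""
    # Everything after ".../pg1" is the filter suffix, e.g. "ea10000bp0ep50/"
    suffix = first_page_url.split(BASE_URL + str(FIRST_PAGE))[1]
    return [BASE_URL + str(n) + suffix for n in range(1, max_page + 1)]


urls = page_urls('https://hz.lianjia.com/chengjiao/pg1ea10000bp0ep50/', 3)
for u in urls:
    print(u)
```

Because the list starts again at page 1, the first-page URL is requested a second time, which is why the spider passes `dont_filter=True` for these requests.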