
Scrapy Documentation Series: Downloading and Processing Images (scrapy image download)



In this article you will learn about the Scrapy documentation on downloading and processing images, together with notes on downloading images with Scrapy. The article also collects several related pieces: 4. Web crawling with the scrapy module, downloading images with tag selectors and regex-matching tags; downloading and processing images with angular.js + node.js; a translation of the HAProxy documentation, Chapter 3, Global parameters (1), with the English original attached; and practical tips on iOS image upload, compression, and processing.

In this article:

Scrapy Documentation Series: Downloading and Processing Images (scrapy image download)


Scrapy provides an item pipeline for downloading images attached to a particular item, for example, when you scrape products and also want to download their images locally.

This pipeline, called the Images Pipeline and implemented in the ImagesPipeline class, provides a convenient way for downloading and storing images locally with some additional features:

  • Convert all downloaded images to a common format (JPG) and mode (RGB)

  • Avoid re-downloading images which were downloaded recently

  • Thumbnail generation

  • Check images width/height to make sure they meet a minimum constraint

This pipeline also keeps an internal queue of those images which are currently being scheduled for download, and connects those items that arrive containing the same image, to that queue. This avoids downloading the same image more than once when it’s shared by several items.

The Python Imaging Library is used for thumbnailing and normalizing images to JPEG/RGB format, so you need to install that library in order to use the images pipeline.

Using the Images Pipeline

The typical workflow, when using the ImagesPipeline, goes like this:

  1. In a Spider, you scrape an item and put the URLs of its images into an image_urls field (a minimal spider sketch follows this list).

  2. The item is returned from the spider and goes to the item pipeline.

  3. When the item reaches the ImagesPipeline, the URLs in the image_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the images have finished downloading (or fail for some reason).

  4. When the images are downloaded, another field (images) will be populated with the results. This field will contain a list of dicts with information about the images downloaded, such as the downloaded path, the original scraped url (taken from the image_urls field), and the image checksum. The images in the list of the images field will retain the same order as the original image_urls field. If some image fails to download, an error will be logged and the image won't be present in the images field.
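
For reference, here is a minimal spider sketch for step 1. The project module (myproject.items) and the start URL are placeholders, not something defined in this article; it simply fills the image_urls field of the MyItem item shown in the next section:

import scrapy
from myproject.items import MyItem   # hypothetical module holding the MyItem item below

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['http://www.example.com/products']   # placeholder start page

    def parse(self, response):
        item = MyItem()
        # Step 1: put the image URLs into the image_urls field.
        item['image_urls'] = response.xpath('//img/@src').extract()
        # Step 2: return the item so it enters the item pipeline.
        return item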

Usage example

In order to use the image pipeline you just need to enable it and define an item with the image_urls and images fields:

from scrapy.item import Item, Field

class MyItem(Item):

    # ... other item fields ...
    image_urls = Field()
    images = Field()

If you need something more complex and want to override the custom images pipeline behaviour, see Implementing your custom Images Pipeline.

Enabling your Images Pipeline

To enable your images pipeline you must first add it to your project ITEM_PIPELINES setting:

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']

And set the IMAGES_STORE setting to a valid directory that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

For example:

IMAGES_STORE = '/path/to/valid/dir'

Images Storage

File system is currently the only officially supported storage, but there is also (undocumented) support for Amazon S3.

File system storage

The images are stored in files (one per image), using a SHA1 hash of their URLs for the file names.

For example, the following image URL:

http://www.example.com/image.jpg

Whose SHA1 hash is:

3afec3b4765f8f0a07b78f98c07b83f013567a0a

Will be downloaded and stored in the following file:

<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

Where:

  • <IMAGES_STORE> is the directory defined in IMAGES_STORE setting

  • full is a sub-directory to separate full images from thumbnails (if used). For more info see Thumbnail generation.
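
To illustrate this naming scheme, here is a short sketch (not part of the original document) that reproduces the path shown above:

import hashlib
import os

IMAGES_STORE = '/path/to/valid/dir'
url = 'http://www.example.com/image.jpg'

# The file name is the SHA1 hex digest of the image URL, stored under the "full" sub-directory.
image_id = hashlib.sha1(url.encode('utf-8')).hexdigest()
print(os.path.join(IMAGES_STORE, 'full', image_id + '.jpg'))
# -> /path/to/valid/dir/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg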

Additional features

Image expiration

The Image Pipeline avoids downloading images that were downloaded recently. To adjust this retention delay use the IMAGES_EXPIRES setting, which specifies the delay in number of days:

# 90 days of delay for image expiration
IMAGES_EXPIRES = 90

Thumbnail generation

The Images Pipeline can automatically create thumbnails of the downloaded images.

In order to use this feature, you must set IMAGES_THUMBS to a dictionary where the keys are the thumbnail names and the values are their dimensions.

For example:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

When you use this feature, the Images Pipeline will create thumbnails of each specified size with this format:

<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg

Where:

  • <size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc)

  • <image_id> is the SHA1 hash of the image url

Example of image files stored using small and big thumbnail names:

<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg

The first one is the full image, as downloaded from the site.

Filtering out small images

You can drop images which are too small, by specifying the minimum allowed size in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings.

For example:

IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

Note: these size constraints don’t affect thumbnail generation at all.

By default, there are no size constraints, so all images are processed.
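
Putting the settings from this section together, a sample settings.py fragment might look like the following; the values are the illustrative ones used in the examples above:

# settings.py (illustrative values taken from the examples in this article)
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']

IMAGES_STORE = '/path/to/valid/dir'   # where downloaded images are written
IMAGES_EXPIRES = 90                   # skip re-downloading images fetched within the last 90 days
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110               # drop images smaller than 110x110 pixels
IMAGES_MIN_WIDTH = 110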

Implementing your custom Images Pipeline

Here are the methods that you should override in your custom Images Pipeline:

  • class scrapy.contrib.pipeline.images.ImagesPipeline


    • get_media_requests(item, info)

    • As seen in the workflow, the pipeline will get the URLs of the images to download from the item. In order to do this, you must override the get_media_requests() method and return a Request for each image URL:

      def get_media_requests(self, item, info):
          for image_url in item['image_urls']:
              yield Request(image_url)

      Those requests will be processed by the pipeline and, when they have finished downloading, the results will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain (success, image_info_or_failure) where:

    • success is a boolean which is True if the image was downloaded successfully or False if it failed for some reason

    • image_info_or_error is a dict containing the following keys (if success is True) or a Twisted Failure if there was a problem.

    • url - the url where the image was downloaded from. This is the url of the request returned from the get_media_requests() method.

    • path - the path (relative to IMAGES_STORE) where the image was stored

    • checksum - an MD5 hash of the image contents

      The list of tuples received by item_completed() is guaranteed to retain the same order as the requests returned from the get_media_requests() method.

      Here's a typical value of the results argument:

      [(True,
        {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
         'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
         'url': 'http://www.example.com/images/product1.jpg'}),
       (True,
        {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
         'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
         'url': 'http://www.example.com/images/product2.jpg'}),
       (False,
        Failure(...))]

      By default the get_media_requests() method returns None, which means there are no images to download for the item.

    • item_completed(results, item, info)

    • The ImagesPipeline.item_completed() method is called when all image requests for a single item have completed (either finished downloading, or failed for some reason). The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.

      Here is an example of the item_completed() method where we store the downloaded image paths (passed in results) in the image_paths item field, and we drop the item if it doesn't contain any images:

      from scrapy.exceptions import DropItem
      
      def item_completed(self, results, item, info):
          image_paths = [x['path'] for ok, x in results if ok]
          if not image_paths:
              raise DropItem("Item contains no images")
          item['image_paths'] = image_paths
          return item

      By default, the item_completed() method returns the item.

Custom Images pipeline example

Here is a full example of the Images Pipeline whose methods are exemplified above:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
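
To activate this subclass instead of the stock pipeline, point ITEM_PIPELINES at it. The module path below (myproject.pipelines) is a hypothetical location, not one given in the original text:

# settings.py -- enable the customized pipeline (hypothetical module path)
ITEM_PIPELINES = ['myproject.pipelines.MyImagesPipeline']
IMAGES_STORE = '/path/to/valid/dir'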


4. Web crawlers: downloading images with the scrapy module's tag selectors, and regex-matching tags



Tag selector objects

HtmlXPathSelector() creates a tag selector object; it takes the html response object passed to the callback.
It requires the import: from scrapy.selector import HtmlXPathSelector

select() is the tag selector method of HtmlXPathSelector; it takes a selector rule and returns a list whose elements are tag objects.

extract() returns the content filtered out by the selector, as a list whose elements are the contents.

Selector rules (a short sketch using these rules follows the list)

  //x       searches any number of levels down for the given tag, e.g. //div finds all div tags
  /x        searches one level down for the given tag
  /@x       selects the given attribute; can be chained, e.g. @id, @src
  [@x="y"]  selects tags whose attribute x equals the value y, e.g. tags whose class equals a given name; can be chained
  /text()   gets the text content of a tag
  [x]       picks a single element of the result set by index
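
As a quick illustration of these rules (a sketch against a made-up HTML fragment, not code from the original tutorial), using Scrapy's Selector class introduced later in this article:

from scrapy.selector import Selector

# A made-up HTML fragment, just to exercise the rules above.
html = '<div class="showlist"><li><img src="/a.jpg" alt="first"></li>' \
       '<li><img src="/b.jpg" alt="second"></li></div>'
sel = Selector(text=html)

print(sel.xpath('//div[@class="showlist"]/li').extract())               # //x plus [@x="y"]: the li tags under the matching div
print(sel.xpath('//div[@class="showlist"]/li[1]//img/@alt').extract())  # [x] index plus /@x: alt of the image in the first li
print(sel.xpath('//li//img/@src').extract())                            # /@x: the src attributes of all images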

Getting the specified tag objects

# -*- coding: utf-8 -*-
import scrapy                                   # import the spider module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from urllib import request                      # import the request module
import os

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)               # create an HtmlXPathSelector object from the response

        items = hxs.select('//div[@class="showlist"]/li')  # tag selector: every li under divs whose class equals "showlist"
        print(items)                                        # prints the tag objects


Loop over every li tag to get its child tags and their attributes or text:


# -*- coding: utf-8 -*-
import scrapy                                   # import the spider module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from urllib import request                      # import the request module
import os

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)               # create an HtmlXPathSelector object from the response

        items = hxs.select('//div[@class="showlist"]/li')  # tag selector: every li under divs whose class equals "showlist"
        # print(items)                                      # the tag objects
        for i in range(len(items)):                         # loop as many times as there are li tags
            # XPath indexes are 1-based, so use i + 1 to address the current li tag
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % (i + 1)).extract()  # alt attribute of the img inside the current li
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % (i + 1)).extract()    # src attribute of the img inside the current li
            if title and src:
                print(title, src)   # lists of strings


Saving the scraped images to the local disk

urlretrieve() saves a file locally; argument 1 is the src of the file to save, argument 2 is the save path.
urlretrieve is a method of the request module under urllib, so it needs: from urllib import request

# -*- coding: utf-8 -*-
import scrapy                                   # import the spider module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from urllib import request                      # import the request module
import os

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)               # create an HtmlXPathSelector object from the response

        items = hxs.select('//div[@class="showlist"]/li')  # tag selector: every li under divs whose class equals "showlist"
        # print(items)                                      # the tag objects
        for i in range(len(items)):                         # loop as many times as there are li tags
            # XPath indexes are 1-based, so use i + 1 to address the current li tag
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % (i + 1)).extract()  # alt attribute of the img inside the current li
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % (i + 1)).extract()    # src attribute of the img inside the current li
            if title and src:
                # print(title[0], src[0])                                # index into the lists to get the strings
                img_dir = os.path.join(os.getcwd(), 'img')               # directory for the saved images
                os.makedirs(img_dir, exist_ok=True)                      # make sure it exists before saving
                file_path = os.path.join(img_dir, title[0] + '.jpg')     # build the save path for this image
                request.urlretrieve(src[0], file_path)                   # save the image: arg 1 is the src, arg 2 the save path


xpath() is the tag selector method of the Selector class; its argument is a selector rule [recommended]

The selector rules are the same as above.

Selector() creates the selector object; it needs to receive the html response object.
It requires the import: from scrapy.selector import Selector

# -*- coding: utf-8 -*-
import scrapy                                   # import the spider module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from scrapy.selector import Selector

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li').extract()
        # print(items)                                     # the tag objects
        for i in range(len(items)):
            # XPath indexes are 1-based, so use i + 1 to address the current li tag
            title = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@alt' % (i + 1)).extract()
            src = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@src' % (i + 1)).extract()
            print(title, src)

Using regular expressions

Regular expressions complement the selectors; they are used when a selector rule alone cannot express the required filtering.

There are two ways to use them:

  1. Run a regular expression over the result already filtered out by the selector rule

  2. Apply a regular expression inside the selector rule itself

1. Run a regular expression over the result filtered out by the selector rule, and use the regex to take the final content

Append .re('regex') at the end

# -*- coding: utf-8 -*-
import scrapy                                   # import the spider module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from scrapy.selector import Selector

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].extract()
        print(items)                                     # the tag as a string
        items2 = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].re('alt="(\w+)')
        print(items2)

# <img src="http://www.shaimn.com/uploads/170724/1-1FH4221056141.jpg" alt="人体艺术mmSunny前凸后翘性感诱惑写真">
# ['人体艺术mmSunny前凸后翘性感诱惑写真']

2. Apply a regular expression inside the selector rule itself

[re:regex]

# -*- coding: utf-8 -*-
import scrapy                                   # import the spider module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from scrapy.selector import Selector

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        items = Selector(response=response).xpath('//div').extract()
        # print(items)                                     # the tag objects
        items2 = Selector(response=response).xpath('//div[re:test(@class, "showlist")]').extract()  # regex: divs whose class matches "showlist"
        print(items2)

[Reposted from: http://www.leiqiankun.com/?id=47]

Downloading and processing images with angular.js + node.js, explained


Preface

This article explains how to download and process images with angular.js + node.js. There are two approaches; the details are below.

Approach 1:

Do not send the full path. Send a GET request with just the file name, let the server assemble the path itself, and use Express's res.download to serve the download:

Express:

var filePath = path.join(savePath, file[0].name);
console.log('Download file: ' + filePath);
res.download(filePath);

angular:

$http.get(url).success(function (data) {
    var bin = new $window.Blob([data]);

    deferred.resolve(data);

    // Using file-saver library to handle saving work.
    saveAs(bin, toFilename);
});

This approach suits the case where only a file name travels between the server and the user: the browser builds a REST API URL such as /api/download/xxxxx.png, and the server finds the full path itself and sends the file to the user.

Approach 2:

Write no server-side handler at all; instead, send the file to the user through Express's static file mechanism.

Express:

app.use('/ocr/uploads', express.static('/data/ocr_img/dev', { maxAge: 86400000 }));

Angular:

$http.get(url, {responseType: 'arraybuffer'}).success(function (data) {

    var bin = new Blob([data], { "type" : "image/png" });

    deferred.resolve({status: '200'});

    saveAs(bin, toFilename);
});

This approach suits the case where the user knows the server has static file serving enabled, and therefore builds the full relative path. For example, if the current page's URL is /ocr, a URL of uploads/xxx.png will make Express map it to /data/ocr_img/dev/xxx.png and send the file back.

Note: when the image is sent back, the server defaults to text/plain, while an image needs binary data. So set {responseType: 'arraybuffer'}, and when the blob data arrives, specify its type with new Blob([data], { "type" : "image/png" }); the same applies to other file types such as pdf, jpg, bmp, and tiff.

Downloading an image is really just downloading a binary file; see the relevant MDN documentation for the underlying details.

Summary

That is all for this article. Hopefully the content helps with your study or work; if you have questions, feel free to leave a comment.

HAProxy documentation translation (Chapter 3): Global parameters (1), with the English original attached


3. Global parameters

Parameters in the "global" section are process-wide and often OS-specific. They are generally set once for all and do not need to be changed once correct. Some of them have command-line equivalents.

The following keywords are supported in the "global" section:

* Process management and security
- ca-base
- chroot
- crt-base
- cpu-map
- daemon
- description
- deviceatlas-json-file
- deviceatlas-log-level
- deviceatlas-separator
- deviceatlas-properties-cookie
- external-check
- gid
- group
- hard-stop-after
- log
- log-tag
- log-send-hostname
- lua-load
- nbproc
- nbthread
- node
- pidfile
- presetenv
- resetenv
- uid
- ulimit-n
- user
- setenv
- stats
- ssl-default-bind-ciphers
- ssl-default-bind-ciphersuites
- ssl-default-bind-options
- ssl-default-server-ciphers
- ssl-default-server-ciphersuites
- ssl-default-server-options
- ssl-dh-param-file
- ssl-server-verify
- unix-bind
- unsetenv
- 51degrees-data-file
- 51degrees-property-name-list
- 51degrees-property-separator
- 51degrees-cache-size
- wurfl-data-file
- wurfl-information-list
- wurfl-information-list-separator
- wurfl-engine-mode
- wurfl-cache-size
- wurfl-useragent-priority

* Performance tuning

- max-spread-checks
- maxconn
- maxconnrate
- maxcomprate
- maxcompcpuusage
- maxpipes
- maxsessrate
- maxsslconn
- maxsslrate
- maxzlibmem
- noepoll
- nokqueue
- nopoll
- nosplice
- nogetaddrinfo
- noreuseport
- profiling.tasks
- spread-checks
- server-state-base
- server-state-file
- ssl-engine
- ssl-mode-async
- tune.buffers.limit
- tune.buffers.reserve
- tune.bufsize
- tune.chksize
- tune.comp.maxlevel
- tune.h2.header-table-size
- tune.h2.initial-window-size
- tune.h2.max-concurrent-streams
- tune.http.cookielen
- tune.http.logurilen
- tune.http.maxhdr
- tune.idletimer
- tune.lua.forced-yield
- tune.lua.maxmem
- tune.lua.session-timeout
- tune.lua.task-timeout
- tune.lua.service-timeout
- tune.maxaccept
- tune.maxpollevents
- tune.maxrewrite
- tune.pattern.cache-size
- tune.pipesize
- tune.rcvbuf.client
- tune.rcvbuf.server
- tune.recv_enough
- tune.runqueue-depth
- tune.sndbuf.client
- tune.sndbuf.server
- tune.ssl.cachesize
- tune.ssl.lifetime
- tune.ssl.force-private-cache
- tune.ssl.maxrecord
- tune.ssl.default-dh-param
- tune.ssl.ssl-ctx-cache-size
- tune.ssl.capture-cipherlist-size
- tune.vars.global-max-size
- tune.vars.proc-max-size
- tune.vars.reqres-max-size
- tune.vars.sess-max-size
- tune.vars.txn-max-size
- tune.zlib.memlevel
- tune.zlib.windowsize

* Debugging

- debug
- quiet

3.1. Process management and security

ca-base <dir>

Assigns a default directory from which to fetch SSL CA certificates and CRLs (certificate revocation lists) when a relative path is used with the "ca-file" or "crl-file" directives. Absolute locations specified in "ca-file" and "crl-file" prevail and ignore "ca-base".

chroot <jail dir>

Changes the current directory to <jail dir> and performs a chroot() there before dropping privileges. This raises the security level in case an unknown vulnerability is exploited, since it makes it very hard for an attacker to compromise the whole system. This only works when the process is started with superuser privileges. Make sure that <jail_dir> is both empty and non-writable to anyone.

cpu-map [auto:]<process-set>[/<thread-set>] <cpu-set>...

On Linux 2.6 and above, it is possible to bind a process or a thread to a specific CPU set; the process or thread will then never run on other CPUs. The "cpu-map" directive specifies CPU sets for process or thread sets. The first argument is a process set, optionally followed by a thread set. These sets have the following format:

all | odd | even | number[-[number]]

<number> must be a number between 1 and 32 or 64, depending on the machine's word size. Any process IDs above nbproc and any thread IDs above nbthread are ignored. A range can be specified with two such numbers delimited by a dash ("-"). It is also possible to specify all processes at once with "all", only odd numbers with "odd", or even numbers with "even", just like with the "bind-process" directive. The second and following arguments are CPU sets. Each CPU set is either a unique number between 0 and 31 or 63, or a range of two such numbers delimited by a dash. Multiple "cpu-map" directives may be specified, and each one replaces the previous ones where they overlap. A thread is bound on the intersection of its own mapping and the mapping of the process it is attached to; if the intersection is empty, no specific binding is set for the thread.

Ranges can be partially defined: the upper bound can be omitted, in which case it is replaced by the corresponding maximum value, 32 or 64 depending on the machine's word size.

The "auto:" prefix can be added before the process set to let HAProxy automatically bind a process or a thread to a CPU by incrementing the process/thread and CPU sets. To be valid, both sets must have the same size. Regardless of the declaration order of the CPU sets, binding proceeds from the lowest to the highest bound. Using the "auto:" prefix with both a process range and a thread range is not supported; only one range is supported, and the other one must be a fixed number.

Examples:
cpu-map 1-4 0-3   # bind processes 1 to 4 to the first 4 CPUs

cpu-map 1/all 0-3 # bind all threads of the first process to the first 4 CPUs

cpu-map 1- 0-     # will be replaced by "cpu-map 1-64 0-63"
                  # or "cpu-map 1-32 0-31" depending on the machine's word size

# all these lines bind process 1 to cpu 0, process 2 to cpu 1, and so on.
cpu-map auto:1-4 0-3
cpu-map auto:1-4 0-1 2-3
cpu-map auto:1-4 3 2 1 0

# all these lines bind thread 1 to cpu 0, thread 2 to cpu 1, and so on.
cpu-map auto:1/1-4 0-3
cpu-map auto:1/1-4 0-1 2-3
cpu-map auto:1/1-4 3 2 1 0

# bind each process to exactly one CPU using the all/odd/even keywords
cpu-map auto:all 0-63
cpu-map auto:even 0-31
cpu-map auto:odd 32-63

# invalid cpu-map settings, because the process and CPU sets have different sizes
cpu-map auto:1-4 0 # invalid
cpu-map auto:1 0-3 # invalid

# invalid cpu-map settings, because automatic binding is used with both a process range
# and a thread range
cpu-map auto:all/all 0 # invalid
cpu-map auto:all/1-4 0 # invalid
cpu-map auto:1-4/all 0 # invalid

crt-base <dir>

Assigns a default directory from which to fetch SSL certificates when a relative path is used with "crtfile" directives. Absolute locations specified after "crtfile" prevail and ignore "crt-base".

daemon

Makes the process fork into the background. This is the recommended mode of operation. It is equivalent to the command-line "-D" argument and can be disabled with the "-db" argument. This option is ignored in systemd mode.

deviceatlas-json-file <path>

Sets the path of the DeviceAtlas JSON data file to be loaded by the API. The path must be a valid JSON data file and accessible by the HAProxy process.

deviceatlas-log-level <value>

Sets the level of information returned by the API. This directive is optional and defaults to 0 if not set.

deviceatlas-separator <char>

Sets the character separator for the API properties results. This directive is optional and defaults to | if not set.

deviceatlas-properties-cookie <name>

Sets the name of the client cookie used to detect whether the DeviceAtlas client-side component was used during the request. This directive is optional and defaults to DAPROPS if not set.

external-check

Allows the use of an external agent to perform health checks. This is disabled by default as a security precaution. See "option external-check".

gid <number>

Changes the process's group ID to <number>. It is recommended that the group ID be dedicated to HAProxy or to a small set of similar daemons. HAProxy must be started with a user belonging to this group, or with superuser privileges. Note that if haproxy is started from a user having supplementary groups, it will only be able to drop those groups if started with superuser privileges. See also "group" and "uid".

hard-stop-after <time>

Defines the maximum time allowed to perform a clean soft-stop.

Arguments:
<time> is the maximum time (by default in milliseconds) for which the instance will remain alive after a soft-stop is received via the SIGUSR1 signal.

This may be used to ensure that the instance will quit even if connections remain open during a soft-stop (for example with long timeouts for a proxy in tcp mode). It applies in both TCP and HTTP mode.

Example:
global
hard-stop-after 30s

group <group name>

Similar to "gid", but uses the GID of the group named <group name> from /etc/group. See also "gid" and "user".

To be continued; this chapter is long and will be split over several posts.

------------------------------ The English original follows -------------------------------

3. Global parameters

Parameters in the "global" section are process-wide and often OS-specific. They
are generally set once for all and do not need being changed once correct. Some
of them have command-line equivalents.

The following keywords are supported in the "global" section :

 * Process management and security
   - ca-base
   - chroot
   - crt-base
   - cpu-map
   - daemon
   - description
   - deviceatlas-json-file
   - deviceatlas-log-level
   - deviceatlas-separator
   - deviceatlas-properties-cookie
   - external-check
   - gid
   - group
   - hard-stop-after
   - log
   - log-tag
   - log-send-hostname
   - lua-load
   - nbproc
   - nbthread
   - node
   - pidfile
   - presetenv
   - resetenv
   - uid
   - ulimit-n
   - user
   - setenv
   - stats
   - ssl-default-bind-ciphers
   - ssl-default-bind-ciphersuites
   - ssl-default-bind-options
   - ssl-default-server-ciphers
   - ssl-default-server-ciphersuites
   - ssl-default-server-options
   - ssl-dh-param-file
   - ssl-server-verify
   - unix-bind
   - unsetenv
   - 51degrees-data-file
   - 51degrees-property-name-list
   - 51degrees-property-separator
   - 51degrees-cache-size
   - wurfl-data-file
   - wurfl-information-list
   - wurfl-information-list-separator
   - wurfl-engine-mode
   - wurfl-cache-size
   - wurfl-useragent-priority

 * Performance tuning
   - max-spread-checks
   - maxconn
   - maxconnrate
   - maxcomprate
   - maxcompcpuusage
   - maxpipes
   - maxsessrate
   - maxsslconn
   - maxsslrate
   - maxzlibmem
   - noepoll
   - nokqueue
   - nopoll
   - nosplice
   - nogetaddrinfo
   - noreuseport
   - profiling.tasks
   - spread-checks
   - server-state-base
   - server-state-file
   - ssl-engine
   - ssl-mode-async
   - tune.buffers.limit
   - tune.buffers.reserve
   - tune.bufsize
   - tune.chksize
   - tune.comp.maxlevel
   - tune.h2.header-table-size
   - tune.h2.initial-window-size
   - tune.h2.max-concurrent-streams
   - tune.http.cookielen
   - tune.http.logurilen
   - tune.http.maxhdr
   - tune.idletimer
   - tune.lua.forced-yield
   - tune.lua.maxmem
   - tune.lua.session-timeout
   - tune.lua.task-timeout
   - tune.lua.service-timeout
   - tune.maxaccept
   - tune.maxpollevents
   - tune.maxrewrite
   - tune.pattern.cache-size
   - tune.pipesize
   - tune.rcvbuf.client
   - tune.rcvbuf.server
   - tune.recv_enough
   - tune.runqueue-depth
   - tune.sndbuf.client
   - tune.sndbuf.server
   - tune.ssl.cachesize
   - tune.ssl.lifetime
   - tune.ssl.force-private-cache
   - tune.ssl.maxrecord
   - tune.ssl.default-dh-param
   - tune.ssl.ssl-ctx-cache-size
   - tune.ssl.capture-cipherlist-size
   - tune.vars.global-max-size
   - tune.vars.proc-max-size
   - tune.vars.reqres-max-size
   - tune.vars.sess-max-size
   - tune.vars.txn-max-size
   - tune.zlib.memlevel
   - tune.zlib.windowsize

 * Debugging
   - debug
   - quiet

3.1. Process management and security

ca-base <dir>
Assigns a default directory to fetch SSL CA certificates and CRLs from when a
relative path is used with "ca-file" or "crl-file" directives. Absolute
locations specified in "ca-file" and "crl-file" prevail and ignore "ca-base".
chroot <jail dir>
Changes current directory to <jail dir> and performs a chroot() there before
dropping privileges. This increases the security level in case an unknown
vulnerability would be exploited, since it would make it very hard for the
attacker to exploit the system. This only works when the process is started
with superuser privileges. It is important to ensure that <jail_dir> is both
empty and non-writable to anyone.
cpu-map [auto:]<process-set>[/<thread-set>] <cpu-set>...
On Linux 2.6 and above, it is possible to bind a process or a thread to a
specific CPU set. This means that the process or the thread will never run on
other CPUs. The "cpu-map" directive specifies CPU sets for process or thread
sets. The first argument is a process set, eventually followed by a thread
set. These sets have the format

    all | odd | even | number[-[number]]

<number> must be a number between 1 and 32 or 64, depending on the machine's
word size. Any process IDs above nbproc and any thread IDs above nbthread are
ignored. It is possible to specify a range with two such numbers delimited by
a dash ('-'). It also is possible to specify all processes at once using
"all", only odd numbers using "odd" or even numbers using "even", just like
with the "bind-process" directive. The second and forthcoming arguments are
CPU sets. Each CPU set is either a unique number between 0 and 31 or 63 or a
range with two such numbers delimited by a dash ('-'). Multiple CPU numbers
or ranges may be specified, and the processes or threads will be allowed to
bind to all of them. Obviously, multiple "cpu-map" directives may be
specified. Each "cpu-map" directive will replace the previous ones when they
overlap. A thread will be bound on the intersection of its mapping and the
one of the process on which it is attached. If the intersection is null, no
specific binding will be set for the thread.

Ranges can be partially defined. The higher bound can be omitted. In such
case, it is replaced by the corresponding maximum value, 32 or 64 depending
on the machine's word size.

The prefix "auto:" can be added before the process set to let HAProxy
automatically bind a process or a thread to a CPU by incrementing
process/thread and CPU sets. To be valid, both sets must have the same
size. No matter the declaration order of the CPU sets, it will be bound from
the lowest to the highest bound. Having a process and a thread range with the
"auto:" prefix is not supported. Only one range is supported, the other one
must be a fixed number.
Examples:
cpu-map 1-4 0-3   # bind processes 1 to 4 on the first 4 CPUs

cpu-map 1/all 0-3 # bind all threads of the first process on the
                  # first 4 CPUs

cpu-map 1- 0-     # will be replaced by "cpu-map 1-64 0-63"
                  # or "cpu-map 1-32 0-31" depending on the machine's
                  # word size.

# all these lines bind the process 1 to the cpu 0, the process 2 to cpu 1
# and so on.
cpu-map auto:1-4 0-3
cpu-map auto:1-4 0-1 2-3
cpu-map auto:1-4 3 2 1 0

# all these lines bind the thread 1 to the cpu 0, the thread 2 to cpu 1
# and so on.
cpu-map auto:1/1-4 0-3
cpu-map auto:1/1-4 0-1 2-3
cpu-map auto:1/1-4 3 2 1 0

# bind each process to exactly one CPU using all/odd/even keyword
cpu-map auto:all 0-63
cpu-map auto:even 0-31
cpu-map auto:odd 32-63

# invalid cpu-map because process and CPU sets have different sizes.
cpu-map auto:1-4 0 # invalid
cpu-map auto:1 0-3 # invalid

# invalid cpu-map because automatic binding is used with a process range
# and a thread range.
cpu-map auto:all/all 0 # invalid
cpu-map auto:all/1-4 0 # invalid
cpu-map auto:1-4/all 0 # invalid
crt-base <dir>
Assigns a default directory to fetch SSL certificates from when a relative
path is used with "crtfile" directives. Absolute locations specified after
"crtfile" prevail and ignore "crt-base".
daemon
Makes the process fork into background. This is the recommended mode of
operation. It is equivalent to the command line "-D" argument. It can be
disabled by the command line "-db" argument. This option is ignored in
systemd mode.
deviceatlas-json-file <path>
Sets the path of the DeviceAtlas JSON data file to be loaded by the API.
The path must be a valid JSON data file and accessible by HAProxy process.
deviceatlas-log-level <value>
Sets the level of information returned by the API. This directive is
optional and set to 0 by default if not set.
deviceatlas-separator <char>
Sets the character separator for the API properties results. This directive
is optional and set to | by default if not set.
deviceatlas-properties-cookie <name>
Sets the client cookie's name used for the detection if the DeviceAtlas
Client-side component was used during the request. This directive is optional
and set to DAPROPS by default if not set.
external-check
Allows the use of an external agent to perform health checks.
This is disabled by default as a security precaution.
See "option external-check".
gid <number>
Changes the process' group ID to <number>. It is recommended that the group
ID is dedicated to HAProxy or to a small set of similar daemons. HAProxy must
be started with a user belonging to this group, or with superuser privileges.
Note that if haproxy is started from a user having supplementary groups, it
will only be able to drop these groups if started with superuser privileges.
See also "group" and "uid".
hard-stop-after <time>
Defines the maximum time allowed to perform a clean soft-stop.
Arguments :
<time>  is the maximum time (by default in milliseconds) for which the
        instance will remain alive when a soft-stop is received via the
        SIGUSR1 signal.
This may be used to ensure that the instance will quit even if connections
remain opened during a soft-stop (for example with long timeouts for a proxy
in tcp mode). It applies both in TCP and HTTP mode.
Example:
global
  hard-stop-after 30s
group <group name>
Similar to "gid" but uses the GID of group name <group name> from /etc/group.
See also "gid" and "user".

 

iOS image upload handling: image compression and image processing



Getting an image from the camera or the photo album is a user-facing operation: the user browses and selects the picture the app will use. For this we interact with the user through the UIImagePickerController class.

To interact with the user via UIImagePickerController, we need to implement two protocols: <UIImagePickerControllerDelegate, UINavigationControllerDelegate>.


The code is as follows:

#pragma mark Pick an image from the user's photo album
- (void)pickImageFromAlbum
{
    imagePicker = [[UIImagePickerController alloc] init];
    imagePicker.delegate = self;
    imagePicker.sourceType = UIImagePickerControllerSourceTypePhotoLibrary;
    imagePicker.modalTransitionStyle = UIModalTransitionStyleCoverVertical;
    imagePicker.allowsEditing = YES;

    [self presentModalViewController:imagePicker animated:YES];
}

Looking at the album-picking code above: we first instantiate a UIImagePickerController, set its delegate to the current object, and set the picker's image source to UIImagePickerControllerSourceTypePhotoLibrary to indicate the image comes from the photo album. We can also choose whether the user is allowed to edit the picked image.


The code is as follows:

#pragma mark Pick an image from the camera
- (void)pickImageFromCamera
{
    imagePicker = [[UIImagePickerController alloc] init];
    imagePicker.delegate = self;
    imagePicker.sourceType = UIImagePickerControllerSourceTypeCamera;
    imagePicker.modalTransitionStyle = UIModalTransitionStyleCoverVertical;
    imagePicker.allowsEditing = YES;

    [self presentModalViewController:imagePicker animated:YES];
}

// Open the camera
- (IBAction)touch_photo:(id)sender {
    // for iPhone
    UIImagePickerController *pickerImage = [[UIImagePickerController alloc] init];
    if ([UIImagePickerController isSourceTypeAvailable:UIImagePickerControllerSourceTypeCamera]) {
        pickerImage.sourceType = UIImagePickerControllerSourceTypeCamera;
        pickerImage.mediaTypes = [UIImagePickerController availableMediaTypesForSourceType:pickerImage.sourceType];
    }
    pickerImage.delegate = self;
    pickerImage.allowsEditing = YES;    // allow the user to crop/edit the photo
    [self presentViewController:pickerImage animated:YES completion:nil];
}

The code above gets an image from the camera; it differs from the album case only in the image source, which is UIImagePickerControllerSourceTypeCamera.

After the interaction, once the user has chosen an image, the picker calls back the method that signals the selection has finished.

- (void)imagePickerController:(UIImagePickerController *)picker didFinishPickingMediaWithInfo:(NSDictionary *)info
{
    // imageNew is the image obtained from the camera/album
    UIImage *imageNew = [info objectForKey:@"UIImagePickerControllerOriginalImage"];
    // Set the target size for the image
    CGSize imagesize = imageNew.size;
    imagesize.height = 626;
    imagesize.width = 413;
    // Scale the image down to the target size
    imageNew = [self imageWithImage:imageNew scaledToSize:imagesize];
    NSData *imageData = UIImageJPEGRepresentation(imageNew, 0.00001);
    if (m_selectimage == nil)
    {
        m_selectimage = [UIImage imageWithData:imageData];
        NSLog(@"m_selectimage:%@", m_selectimage);
        [self.TakePhotoBtn setImage:m_selectimage forState:UIControlStateNormal];
        [picker dismissModalViewControllerAnimated:YES];
        return;
    }
    [picker release];
}

// Scale an image to a new size
- (UIImage *)imageWithImage:(UIImage *)image scaledToSize:(CGSize)newSize
{
    // Create a graphics image context
    UIGraphicsBeginImageContext(newSize);

    // Tell the old image to draw in this new context, with the desired new size
    [image drawInRect:CGRectMake(0, 0, newSize.width, newSize.height)];

    // Get the new image from the context
    UIImage *newImage = UIGraphicsGetImageFromCurrentImageContext();

    // End the context
    UIGraphicsEndImageContext();

    // Return the new image.
    return newImage;
}



