2016-07-14 · 52 views

Scrapy gets a NoneType error when using a Privoxy proxy for Tor

I am using Ubuntu 14.04 LTS.

I first tried Polipo, but it kept refusing Firefox's connections even after I added myself as a permitted client, and hours of searching turned up no solution. So I installed Privoxy instead and verified that it works with Firefox by visiting the Tor check website, which reported that the browser is configured to use Tor. That confirms I should be able to scrape websites through Tor.

However, when I use Scrapy, I get an error that nobody else seems to have had:

2016-07-14 02:43:34 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'myProject.middlewares.RandomUserAgentMiddleware', 
'myProject.middlewares.ProxyMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-07-14 02:43:34 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-07-14 02:43:34 [scrapy] INFO: Enabled item pipelines: 
['myProject.pipelines.MysqlPipeline'] 
2016-07-14 02:43:34 [scrapy] INFO: Spider opened 
2016-07-14 02:43:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-07-14 02:43:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-07-14 02:43:34 [Tor] DEBUG: User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10 <GET http://thehiddenwiki.org> 
2016-07-14 02:43:34 [scrapy] ERROR: Error downloading <GET http://thehiddenwiki.org> 
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks 
    result = result.throwExceptionIntoGenerator(g) 
    File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator 
    return g.throw(self.type, self.value, self.tb) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request 
    defer.returnValue((yield download_func(request=request,spider=spider))) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred 
    result = f(*args, **kw) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request 
    return handler.download_request(request, spider) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 60, in download_request 
    return agent.download_request(request) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 259, in download_request 
    agent = self._get_agent(request, timeout) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 239, in _get_agent 
    _, _, proxyHost, proxyPort, proxyParams = _parse(proxy) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/webclient.py", line 37, in _parse 
    return _parsed_url_args(parsed) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args 
    host = b(parsed.hostname) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda> 
    b = lambda s: to_bytes(s, encoding='ascii') 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 117, in to_bytes 
    'object, got %s' % type(text).__name__) 
TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType 

I searched for this "to_bytes" error, but all I found was Scrapy's own source code.

I know this code works without the proxy, because it has scraped my localhost site and other websites — but obviously not Tor, since the proxy is needed to reach onion sites.

What is going on?

Middlewares.py

import random

from scrapy import log
from scrapy.conf import settings  # Scrapy 1.x-era imports, as used by this project


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # log which user agent was picked for this request
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')

Settings.py

USER_AGENT_LIST = [ 
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7', 
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0', 
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10' 
] 

DOWNLOADER_MIDDLEWARES = { 
    'myProject.middlewares.RandomUserAgentMiddleware': 400, 
    'myProject.middlewares.ProxyMiddleware': 410, 
    #'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None 
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None 
    # Disable compression middleware, so the actual HTML pages are cached 
} 

HTTP_PROXY = 'localhost:8118' 

Use `HTTP_PROXY = 'http://localhost:8118'` –


And that, sir, was the answer. Feel free to post it and I'll mark it as the solution! – Arrow


For the record, I have created https://github.com/scrapy/scrapy/issues/2127 –

Answer


Internally, Scrapy uses urllib(2)'s _parse_proxy to detect proxy settings. From the urllib docs:

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy or ftp_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.

% http_proxy="http://www.someproxy.com:3128" 
% export http_proxy 
% python 
... 

And when using the proxy meta key, Scrapy expects the same syntax, i.e. the value must include the scheme, e.g. 'http://localhost:8118'.
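The failure can be reproduced without Scrapy at all: the `_parse` helper in `webclient.py` ultimately leans on `urlparse`, and a scheme-less `host:port` string yields no hostname. A minimal sketch (using Python 3's `urllib.parse` here; the Python 2 `urlparse` in the traceback likewise returns a `None` hostname for this input):

```python
from urllib.parse import urlparse

# Without a scheme, "localhost" is not recognized as a network location,
# so .hostname is None -- exactly the value that to_bytes() then rejects.
no_scheme = urlparse('localhost:8118')
print(no_scheme.hostname)      # None

# With the scheme, the netloc is parsed and a hostname/port are available.
with_scheme = urlparse('http://localhost:8118')
print(with_scheme.hostname)    # localhost
print(with_scheme.port)        # 8118
```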

This is in the docs, although somewhat buried:

You can also set the proxy meta key per-request, to a value like http://some_proxy_server:port.
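Putting it together, the fix is simply to include the scheme in the proxy URL. A minimal sketch, assuming Privoxy is listening on its default port 8118 as in the question:

```python
# Settings.py -- the scheme prefix is what Scrapy's proxy parsing needs.
HTTP_PROXY = 'http://localhost:8118'  # not 'localhost:8118'

# Per-request alternative (the documented meta key), inside a spider callback:
#   yield scrapy.Request(url, meta={'proxy': HTTP_PROXY})
```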