Sto provando a racimolare thesession.org per creare una tabella di quante volte ogni melodia è stata aggiunta ai libri di testo del memeber in modo da trovare alcuni brani popolari da imparare. Ho iniziato con il tutorial scrapy here e sto cercando di modificarlo per soddisfare i miei scopi. Il problema è che sebbene il sito web thesession.org abbia circa 10.390 brani, il mio scraper restituisce solo i dati su 10 di essi (solo quelli su http://www.thesession.org/tunes/index.php). Come posso ottenere i dati su tutti i brani (o sui cento brani classificati in alto)? Qualsiasi consiglio sarebbe molto apprezzato.scarabeo di python non sembra ricevere dati da tutti gli URL disponibili
Ecco quello che ho finora:
items.py
from scrapy.item import Item, Field
class tuneItem(Item):
url = Field()
name1 = Field()
name2 = Field()
key = Field()
count = Field()
pass
tune_spider.py
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings
class tunesSpider(CrawlSpider):
name = "irishtunes"
allowed_domains = ["thesession.org"]
start_urls = ["http://www.thesession.org/tunes"]
rules = [Rule(SgmlLinkExtractor(allow=['/display/\d+'], deny=['/members/','/recordings/','/index/','/display/\d+/.']), 'parse_tune')]
def parse_tune(self, response):
x = HtmlXPathSelector(response)
tune = tuneItem()
tune['url'] = response.url
tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
tune['key'] = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re('\d+')
return tune
corro il raschietto aprendo la mia console, andare alla directory contenente file cfg dell'esercitazione e in esecuzione scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv
Ecco ciò che ottengo:
C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://ww
w.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
{'count': [u'1'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Brendan Begley's"],
'name2': [u'polka'],
'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
{'count': [u'3'],
'key': [u'Key signature: Amajor'],
'name1': [u'Carleton County Breakdown'],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
{'count': [u'3'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Kasper's Rant"],
'name2': [u'hornpipe'],
'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
{'count': [u'5'],
'key': [u'Key signature: Gmajor'],
'name1': [u'The Full Of The Bag'],
'name2': [u'hornpipe'],
'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
{'count': [u'1'],
'key': [u'Key signature: Adorian'],
'name1': [u'The New Steamboat'],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
{'count': [u'4'],
'key': [u'Key signature: Gmajor'],
'name1': [u"Galen's Arrival"],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
{'count': [u'2'],
'key': [u'Key signature: Amixolydian'],
'name1': [u'Culloden Day'],
'name2': [u'strathspey'],
'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
{'count': [u'2'],
'key': [u'Key signature: Aminor'],
'name1': [u'Miss Sine Flemington'],
'name2': [u'barndance'],
'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
{'count': [u'2'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Joan Martin's"],
'name2': [u'polka'],
'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
{'count': [u'2'],
'key': [u'Key signature: Gmajor'],
'name1': [u'My Time Inside 2005'],
'name2': [u'waltz'],
'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
{'downloader/request_bytes': 3655,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 31620,
'downloader/response_count': 12,
'downloader/response_status_count/200': 11,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
'item_scraped_count': 10,
'request_depth_max': 1,
'scheduler/memory_enqueued': 12,
'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
{}
EDIT: La risposta da @reclosedev mi ha fatto sulla strada. Per chiunque chiedendo circa l'esito, ecco una fotografia ...
(1) La maggior parte dei brani sono a meno di 10 tunebooks dei membri
(2) La popolarità di tutti i 10.379 brani che ho potuto raschiare dal sito (come misurato dal numero di tunebooks sono in) segue una distribuzione di legge di potenza
(3) Ed ecco i brani che sono> 1000 tu nebooks sul sito, che mostra i nomi dei brani top-ranked e quanti tunebooks sono in
Risultati interessanti, ma potresti aver risparmiato il problema: http://www.irishtune.info/session/tunes.php – alanng