2013-01-23

Python: how do I extract the URL from an HTML page using BeautifulSoup?

I have an HTML page with multiple divs like:

<div class="article-additional-info"> 
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t... 
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"> 
<span class="arrows">»</span> 
</a> 
</div> 

<div class="article-additional-info"> 
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe... 
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"> 
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments"> 
</div> 

and I need to get the <a href=> value for every div with class article-additional-info. I am new to BeautifulSoup,

so these are the URLs I need:

"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece" 
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece" 

What is the best way to achieve this?

Answers


Based on your criteria it returns three URLs (not two). Do you want to filter out the third?

The basic idea is to walk the HTML, pull out only the elements with that class, then loop over all the links inside each one and extract the actual URLs:

In [1]: from bs4 import BeautifulSoup 

In [2]: html = # your HTML 

In [3]: soup = BeautifulSoup(html, 'html.parser') 

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}): 
   ...:     for link in item.find_all('a'): 
   ...:         print(link.get('href')) 
   ...: 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 

This limits the search to only those elements with the article-additional-info class; within each one it finds all anchor (a) tags and grabs their corresponding href link.
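If you only want the article links and not the third "#comments" URL, one option is to match only the anchors whose class is `more`. A minimal sketch, using a cut-down sample with made-up example.com URLs standing in for the real markup:

```python
from bs4 import BeautifulSoup

# Cut-down sample mirroring the question's structure; example.com
# stands in for the real thehindu.com URLs
html = """
<div class="article-additional-info">
  <a class="more" href="http://example.com/article1.ece"></a>
  <a class="commentsCount" href="http://example.com/article1.ece#comments"></a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# CSS selector: anchors with class "more" inside the target divs,
# which skips the commentsCount links entirely
urls = [a['href'] for a in soup.select('div.article-additional-info a.more')]
print(urls)
```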

from bs4 import BeautifulSoup as BS 
html = # Your HTML 
soup = BS(html, 'html.parser') 
for text in soup.find_all('div', class_='article-additional-info'): 
    for links in text.find_all('a'): 
        print(links.get('href')) 

which prints:

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 

After working through the documentation I did it the following way. Thanks everyone for your answers, I appreciate them.

>>> import urllib2 
>>> from bs4 import BeautifulSoup 
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews') 
>>> soup = BeautifulSoup(f) 
>>> for link in soup.select('.article-additional-info'): 
...     print(link.find('a').attrs['href']) 
... 
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece 
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece 
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece 
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece 
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article.ece 
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece 
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece 
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece 
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece 
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece 
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece 
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece 
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece 
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece 
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece 
>>> 
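Note that `urllib2` only exists on Python 2; on Python 3 its `urlopen` lives in `urllib.request`. A sketch of the same approach, with the parsing factored into a function so it can be tried on any HTML string (the live page's markup may well have changed since 2013, so the fetch is left commented out):

```python
from urllib.request import urlopen  # Python 3 home of urllib2's urlopen
from bs4 import BeautifulSoup

def first_links(html):
    """Return the first href inside each article-additional-info block."""
    soup = BeautifulSoup(html, 'html.parser')
    return [div.find('a')['href']
            for div in soup.select('.article-additional-info')
            if div.find('a') is not None]

# Against the live page (markup may no longer match):
# html = urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews').read()
# for url in first_links(html):
#     print(url)
```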
Please don't link back to your own site again; that is [**spam**](http://stackoverflow.com/help/promotion) on [so]. –