reddit_scraper.py
Learning about web scraping, having issues with my scraper: I am currently following a tutorial and writing it in Python 3 using Jupyter. I have no idea what the printed error means. Any suggestions?
(My scrape code)
import urllib.request
from bs4 import BeautifulSoup

url = "https://old.reddit.com/top/"

# download the URL and extract the content to the variable html
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()

# pass the HTML to BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# get the HTML of the div with id "siteTable" where all the links are displayed
main_table = soup.find("div", attrs={'id': 'siteTable'})

# now go into main_table and get every <a> element in it that has the class "title"
links = main_table.find_all("a", class_="title")

# from each link, extract the link text and the link itself;
# store a dict of the extracted data for each one
extracted_records = []
for link in links:
    title = link.text
    url = link['href']
    record = {
        'title': title,
        'url': url
    }
    extracted_records.append(record)

print(extracted_records)
(The error it returns)
HTTPError                                 Traceback (most recent call last)
<ipython-input-5-ad957fafb6e1> in <module>()
      4 # download the URL and extract the content to the variable html
      5 request = urllib.request.Request(url)
----> 6 html = urllib.request.urlopen(request).read()
      7 # pass the HTML to BeautifulSoup
      8 soup = BeautifulSoup(html, 'html.parser')

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

~\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642
    643         return response

~\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570
    571     # XXX probably also want an abstract factory that knows when it makes

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 429: Too Many Requests
hint: sites usually do not like scrapers... HTTP 429 means "Too Many Requests", i.e. Reddit is rate-limiting you, and urllib's default User-Agent ("Python-urllib/3.x") tends to get blocked very quickly.
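A minimal sketch of the usual workaround, assuming old.reddit.com still serves plain HTML to polite clients: send a browser-like User-Agent header (the header string below is only an example, not something Reddit specifically requires) and pause between requests. Untested against the live site, so treat it as a starting point rather than a guaranteed fix.

import time
import urllib.request
from bs4 import BeautifulSoup

url = "https://old.reddit.com/top/"

# identify the request with a browser-like User-Agent; urllib's default
# one is an easy target for rate limiting and 429/403 responses
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

request = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

# be polite: wait a couple of seconds before making the next request
time.sleep(2)

If you keep hitting 429 even with a custom User-Agent, you are simply requesting too often and need to slow down further or use Reddit's official API instead.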
@dloser I know it probably doesn't mean much to you code veterans, but I have finally coded my first successful e-mail scraper, and my god it feels bloody marvellous lol. Wanna see?