reddit_scraper.py
Learning about web scraping, having issues with my scraper: I am currently following a tutorial and writing it in Python 3 using Jupyter. I have no idea what the printed error means. Any suggestions?
(My scrape code)
import urllib.request
from bs4 import BeautifulSoup

url = "https://old.reddit.com/top/"

# download the URL and extract the content to the variable html
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()

# pass the HTML to BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# get the HTML of the div with id "siteTable" where all the links are displayed
main_table = soup.find("div", attrs={'id': 'siteTable'})

# now go into main_table and get every <a> element in it that has the class "title"
links = main_table.find_all("a", class_="title")

# from each link, extract the link text and the link itself;
# store a dict of the extracted data for each one
extracted_records = []
for link in links:
    title = link.text
    url = link['href']
    record = {
        'title': title,
        'url': url
    }
    extracted_records.append(record)

print(extracted_records)
(The error it returns)
HTTPError                                 Traceback (most recent call last)
<ipython-input-5-ad957fafb6e1> in <module>()
      4 # download the URL and extract the content to the variable html
      5 request = urllib.request.Request(url)
----> 6 html = urllib.request.urlopen(request).read()
      7 # pass the HTML to BeautifulSoup
      8 soup = BeautifulSoup(html, 'html.parser')

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

~\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642
    643         return response

~\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570
    571     # XXX probably also want an abstract factory that knows when it makes

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 429: Too Many Requests
hint: sites usually do not like scrapers... HTTP 429 means "Too Many Requests", i.e. Reddit is rate-limiting you, and urllib's default User-Agent ("Python-urllib/3.x") tends to get blocked very quickly.
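A minimal sketch of the usual workaround, assuming old.reddit.com still serves plain HTML to polite clients: send a browser-like User-Agent header (the header string below is only an example, not something Reddit specifically requires) and pause between requests. Untested against the live site, so treat it as a starting point rather than a guaranteed fix.

import time
import urllib.request
from bs4 import BeautifulSoup

url = "https://old.reddit.com/top/"

# identify the request with a browser-like User-Agent; urllib's default
# one is an easy target for rate limiting and 429/403 responses
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

request = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

# be polite: wait a couple of seconds before making the next request
time.sleep(2)

If you keep hitting 429 even with a custom User-Agent, you are simply requesting too often and need to slow down further or use Reddit's official API instead.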
@dloser I know it probably doesn't mean much to you code veterans, but I have finally coded my first successful e-mail scraper, and my god it feels bloody marvellous lol. Wanna see?