reddit_scraper.py

NamasteMan
6 years ago

0

Learning about web scraping and having issues with my scraper: I am currently following a tutorial and writing it in Python 3 using Jupyter. I have no idea what the printed error means. Any suggestions?

(My scrape code)

import urllib.request
from bs4 import BeautifulSoup

url = "https://old.reddit.com/top/"

# download the URL and extract the content to the variable html
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()

# pass the HTML to BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# get the HTML of the container with id "siteTable" where all the links are displayed
main_table = soup.find("div", attrs={'id': 'siteTable'})

# now go into main_table and get every <a> element in it which has the class "title"
links = main_table.find_all("a", class_="title")

# from each link, extract the text of the link and the link itself,
# and append a dict of the extracted data to a list
extracted_records = []
for link in links:
    title = link.text
    url = link['href']
    record = {
        'title': title,
        'url': url
    }
    extracted_records.append(record)
print(extracted_records)

(The return)

HTTPError                                 Traceback (most recent call last)
<ipython-input-5-ad957fafb6e1> in <module>()
      4 # download the URL and extract the content to the variable html
      5 request = urllib.request.Request(url)
----> 6 html = urllib.request.urlopen(request).read()
      7 # pass the HTML to BeautifulSoup
      8 soup = BeautifulSoup(html, 'html.parser')

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

~\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642
    643         return response

~\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570
    571     # XXX probably also want an abstract factory that knows when it makes

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 429: Too Many Requests

dloser
6 years ago

3

http://lmgtfy.com/?q=http+429

hint: sites usually do not like scrapers...
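In short: HTTP 429 means "Too Many Requests", i.e. the server is rate-limiting you. old.reddit.com is also known to reject the default "Python-urllib" User-Agent outright. A minimal sketch of a workaround, keeping your urllib approach (the User-Agent string below is made up for the example):

import time
import urllib.request
from bs4 import BeautifulSoup

url = "https://old.reddit.com/top/"

# identify the script instead of sending the default "Python-urllib"
# User-Agent, which Reddit tends to answer with 429
request = urllib.request.Request(
    url,
    headers={"User-Agent": "learning-web-scraping tutorial script"}
)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, "html.parser")

# be polite: pause between requests if you go on to fetch more pages
time.sleep(2)

If you still get 429 after that, back off and retry later; hammering the site again right away will only keep you blocked.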

NamasteMan
6 years ago

0

@dloser I know it probably doesn't mean much to you code veterans, but I have finally coded my first successful e-mail scraper, and my god it feels bloody marvellous lol. Wanna see?
