
Web scraping is a fundamental technique in programming and web development. Let's take a simple approach to web scraping using the Python programming language.
Let's dive into a scenario: we want to know the five most influential physicists in history. If we google this, even the most accurate source will give us a list of 50 to 100 names, so we have to extract only the part of the information we need.
This is where Python and web scraping come in. Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.
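To make the "select, then pass along" idea concrete, here is a minimal sketch. It assumes the page has already been downloaded, so a made-up HTML string stands in for it, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs before anything is installed. The names SAMPLE_HTML, ListItemCollector and select_names are hypothetical, and the sample list is illustrative, not a real ranking.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <ol>
    <li>Isaac Newton</li>
    <li>Albert Einstein</li>
    <li>Niels Bohr</li>
  </ol>
</body></html>
"""


class ListItemCollector(HTMLParser):
    """Collects the text found inside every <li> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Only keep text that appears inside a list item.
        if self.in_item:
            self.items[-1] += data


def select_names(html, limit):
    """Select step: keep only the first `limit` list items."""
    collector = ListItemCollector()
    collector.feed(html)
    return [item.strip() for item in collector.items][:limit]


# Pass-along step: hand the selected data to whatever comes next (here, print).
names = select_names(SAMPLE_HTML, limit=2)
print(names)
```

The selected list could just as easily be written to a file or handed to another function. The point is only that the program keeps the slice of the page it needs and discards the rest.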
So, now we have to create the web scraping program in an IDE.
Step 1 – Setting up the environment
To write our program, we need an IDE. Here I am using VS Code by Microsoft. You can download the latest version from the official VS Code website.
Python 3 also needs to be installed on the computer.
We also need to install two libraries, requests and beautifulsoup4.
Run the following command in the Windows terminal.
pip install requests BeautifulSoup4
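If you want to confirm that the installation worked, a small check like the one below may help. It is only a sketch: check_installed is a hypothetical helper name, and it relies on the fact that the beautifulsoup4 package is imported under the module name bs4.

```python
from importlib.util import find_spec


def check_installed(modules):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in modules}


# beautifulsoup4 is installed under the import name "bs4".
status = check_installed(["requests", "bs4"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

If either library is reported as missing, running the pip command again in the same Python environment that VS Code uses is usually the first thing to try.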
Step 2 – Importing the libraries
Now we have set up the environment for the program.
First, we will import the installed libraries.
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
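Of these imports, closing is probably the least familiar. It wraps any object that has a close() method so that the object is closed when the with block ends, even if an error occurs inside it. The sketch below shows this with a hypothetical FakeResponse class standing in for a real network response.

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a network response with a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


resp = FakeResponse()
with closing(resp) as r:
    # Inside the block the object is still open.
    print(r.closed)

# After the block, closing() has called resp.close() for us.
print(resp.closed)
```

In the scraper, the same pattern releases the network connection as soon as the response has been read.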
Step 3 – Making web requests
Now we have to create a function that makes web requests; it will be used to download web pages. We can use the get() function from the requests library.
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() with a default so that a missing Content-Type header
    # does not raise a KeyError.
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and content_type.find('html') > -1)

def log_error(e):
    """
    It is always a good idea to log errors.
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
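To see how the content-type check behaves without touching the network, here is a small sketch. It restates the check using headers.get() so that a missing header does not raise an error, and it builds stand-in responses with a hypothetical fake_response helper. One assumption to note: the stand-ins use a plain dict for the headers, which is case-sensitive, whereas the headers on a real requests response are case-insensitive.

```python
from types import SimpleNamespace


def is_good_response(resp):
    """Returns True if the response seems to be HTML, False otherwise."""
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def fake_response(status_code, headers):
    """Hypothetical helper: builds an object with the two attributes we check."""
    return SimpleNamespace(status_code=status_code, headers=headers)


samples = {
    'html page': fake_response(200, {'Content-Type': 'text/html; charset=UTF-8'}),
    'json body': fake_response(200, {'Content-Type': 'application/json'}),
    'not found': fake_response(404, {'Content-Type': 'text/html'}),
    'no header': fake_response(200, {}),
}

results = {name: is_good_response(resp) for name, resp in samples.items()}
for name, ok in results.items():
    print(name, ok)
```

Only the first sample passes: the scraper accepts a page when the status code is 200 and the content type mentions HTML, and it returns None for everything else.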
This article focused on building the basic environment for a web scraping bot.
In the next article, we will look further into this system's performance and add advanced features.