What is web scraping?
Web scraping is a technique used to extract large amounts of data from websites and store it in a local file or a database table.
Advantages of web scraping
The main aim of web scraping is collecting data from different websites. We can use the data in various ways, such as:
- lead generation
- SEO for organic traffic
- Building e-commerce sites
- Personal data collection
Web scraping allows us to extract data from websites, but how you utilize that data is up to you. You can even sell the data (Caution: not my suggestion, for knowledge purposes only).
There are many programming languages and tools for web scraping. Here we use Python and its libraries.
Web scraping with Python
Web scraping with Python is easy and simple, and I think it is the best way to learn how to extract data from different websites. There are many Python libraries for web scraping, such as Scrapy, Beautiful Soup, urllib, requests, etc.
BeautifulSoup: a Python library that allows us to get data out of HTML pages. We can access each element of an HTML page and also modify it on our end.
requests: a Python package that allows us to make HTTP requests and manage sessions. It is one of the most downloaded Python packages.
Note: about 11,000,000 downloads every month.
BeautifulSoup and requests
The best thing about Python is its community support: there are hundreds of thousands of Python packages, and we can install them easily with a single command. All of these packages are managed by PyPI:
pip install <package name>
Now let’s begin our show by installing Beautiful Soup and requests.
pip install bs4
pip install requests
Scraping a website
Let’s start our episode by hitting a website and getting the data from it. Open your editor, create a new file, and write the lines below.
import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.devpyjp.com/')  # you can use your own URL
print(res)
Note: If you get any errors when fetching a URL, try changing https to http.
Let me explain the above code in just seconds: we import the necessary Python packages for web scraping, then create a res variable to store the response.
When you print the variable res, you will get the output below.
<Response [200]>  # 200 is the success code among HTTP response codes
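As an aside: the 200 here is defined by the HTTP standard, and Python's standard library ships the full list of codes. A minimal offline sketch (no third-party packages needed):

```python
from http import HTTPStatus

# A few common HTTP status codes you may meet while scraping.
for status in (HTTPStatus.OK, HTTPStatus.NOT_FOUND, HTTPStatus.FORBIDDEN):
    print(status.value, status.phrase)
# prints: 200 OK / 404 Not Found / 403 Forbidden
```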
If you want to get the full content of the URL mentioned above, just change the above lines as follows.
res = requests.get('http://www.devpyjp.com/').content
print(res)
.content puts all the HTML and script code into the res variable; now if we print res, we get the entire HTML content of that page.
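One detail worth knowing: .content gives you raw bytes, while res.text gives you a decoded string. A tiny offline sketch of the difference (the sample bytes below are made up and stand in for res.content):

```python
# Stand-in for res.content: requests returns the body as raw bytes.
raw = b'<html><body>Hello</body></html>'

print(type(raw).__name__)   # bytes
print(raw.decode('utf-8'))  # the decoded HTML string, like res.text
```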
Let’s pass it into BeautifulSoup to access the HTML elements. A BeautifulSoup object allows us to access all the content of the HTML page, which we stored in res.
soup = BeautifulSoup(res, 'lxml')
print(soup)
# BeautifulSoup(response_content, parser); parser can be 'lxml' or 'html.parser'
We passed the requests response content to BeautifulSoup and created a soup object. When you print the soup, you will get the entire HTML content of the URL’s page.
I know, it was a little clumsy, so we need to prettify it, and that can help us find the HTML elements we need. Let’s modify it like below.
soup = BeautifulSoup(res, 'lxml')
print(soup.prettify())
The prettify() method gives the output a pretty look when you print the soup object. Now you can observe the difference between the previous and present outputs.
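To see what prettify() does, and to get a small taste of accessing elements, here is a self-contained sketch on a made-up inline HTML string (it uses the built-in 'html.parser', so it runs without lxml or a network connection):

```python
from bs4 import BeautifulSoup

# A tiny made-up page standing in for the response content above.
html = '<html><head><title>Demo</title></head><body><p>Hi there</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())  # each tag indented on its own line
print(soup.title.text)  # Demo
print(soup.p.text)      # Hi there
```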
OK, I am stopping here for now; please read the next articles for a detailed explanation. Thank you!
If you like our explanation, please appreciate us or subscribe to our newsletter below to get notified about our new Python web scraping articles.