Web Scraping Using BeautifulSoup in Python



Summary: Web scraping is the process of extracting data from the internet. It is also known as web harvesting or web data extraction. Python allows us to perform web scraping using automated techniques. BeautifulSoup is a Python library used to parse data (structured data) from HTML and XML documents.


The internet is an enormous wealth of data. Whether you are a data scientist, a business person, a student, or a professional, you have scraped data from the internet. Yes, that’s right! I repeat – you have already scraped data if you have used the internet for your work or even entertainment. So what does web scraping mean? It is simply the act of extracting data from a website. Even copying and pasting data from the internet is web scraping. So if you have downloaded your favorite song from the internet or copied your favorite quote from the web, it means you have already scraped data from the internet.


In this article, we are going to explore some of the most frequently asked questions regarding web scraping, and then we shall go through the entire process of creating a web scraper and see how we can automate the task of web scraping! So without further delay, let us begin our journey with web scraping.

What is Web Scraping?

Web scraping is the process of extracting data from the internet. It is also known as web harvesting or web data extraction. Python allows us to perform web scraping using automated techniques.

Some of the most commonly used libraries in Python for web scraping are:

  • The requests library.
  • The Beautiful Soup 4 library.
  • Selenium.
  • Scrapy.

In this article, we are going to explore the BeautifulSoup library and the requests library to scrape data from a website.

Why Do We Scrape Data From The Internet?

Web scraping, if performed following the proper guidelines, can prove to be extremely useful and can make our lives easier by automating the everyday tasks that we perform repeatedly over the internet.

  • If you are a data analyst and you need to extract data from the internet on a day-to-day basis, then creating an automated web crawler is the solution to reduce your burden of extracting data manually every day.
  • You can use web scrapers to extract information about products from online shopping websites and compare product prices and specifications.
  • You can use web scraping for content marketing and social media promotions.
  • As a student or a researcher, you can use web scraping to extract data for your research/project from the web.

The bottom-line is, “Automated web scraping allows you to work smart!”

Is Web Scraping Legal?

Now, this is a very important question but unfortunately, there is no specific answer for this. There are some websites that don’t mind if you scrape content from their webpage while there are others that prohibit content scraping. Therefore it is absolutely necessary that you follow the guidelines and do not violate the website’s policies while scraping content from their webpage.

Let us have a look at a few important guidelines that we must keep in mind while scraping content over the internet.

Remember:

  • Read the website’s terms of service and respect its robots.txt file before scraping.
  • Throttle your requests so that you do not overload the web server.
  • Prefer the website’s official API whenever one is available.
  • Do not reuse or republish scraped content in a way that violates copyright or privacy.

Before we dive into web scraping, it is important that we understand how the web works and what Hypertext Markup Language (HTML) is, because that is what we are going to extract our data from. Hence, let us have a brief discussion on the HTTP request/response model and HTML.

The HTTP Request/Response Model

The entire working principle of how the web works can be quite complicated, but let us try to understand things at a simple level that gives us an idea of how we are going to approach web scraping.

In simple words, the HTTP request/response model is a communication model used by HTTP and other protocols extended from HTTP, in which a client (a web browser) sends a request for a resource or a service to the server. If the request is processed successfully, the server sends back a response corresponding to the resource; otherwise, the server responds with an error message.

There are numerous HTTP methods used to interact with the web server, but the most commonly used ones are GET and POST.

  • GET : used to request data from a specific resource in the web server.
  • POST : used to send data to a server to create/update a resource.

Other HTTP methods are:

  • PUT
  • HEAD
  • DELETE
  • PATCH
  • OPTIONS

Note: To scrape data from a website we will send a request to the web server using the requests library along with the get() method.

HTML – Hypertext Markup Language

Though HTML is a topic of discussion in itself and a full treatment is beyond the scope of this article, you must be aware of the basic structure of HTML. Do not worry, you do not need to learn how to design a webpage using HTML and CSS, but you must be aware of some of the key elements/tags used while creating a webpage with HTML.

💡 HTML has a hierarchical/tree structure. This property enables us to access elements of the HTML document while scraping the webpage based on their parent and child relationships, as the small example below shows.
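Here is a minimal sketch of navigating this tree with BeautifulSoup; the tiny HTML string is invented purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny document: <html> is the root, <body> its child,
# <p> a child of <body>, and <b> a child of <p>.
html = '<html><body><p>Hello, <b>world</b>!</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')
print(p.parent.name)  # body  -> the parent of <p> in the tree
print(p.b.text)       # world -> <b> is a child of <p>
```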

There are plenty of free tutorials available online if you want to further explore and learn about how HTML works.

Creating The Web Scraper

Now let us begin creating our web scraper. The website that we are going to scrape is a job dashboard which lists the most recent Python jobs. In this walkthrough we shall be scraping:

  • The Job Title
  • The Location Of the Job
  • The Name Of the Organization

Website to be scraped: The Free Python Job Board (http://pythonjobs.github.io/)

Step 1: Navigate and Inspect The Website/Webpage

The first and foremost task while scraping data from any webpage is to open up the webpage from which we are scraping the data and inspect the website using developer tools. You may also view the page source.

To navigate using developer tools:

  1. Right-click on the webpage.
  2. Select Inspect.

Note: Inspect Element is a developer tool built into most web browsers, including Google Chrome, Firefox, Safari, and Internet Explorer. It allows us to view and edit the HTML and CSS source code at the backend. The changes made to the code are reflected in real time in your browser window. The best part is that you don’t have to worry about breaking the page while you play around with the code, because your changes only take effect for the duration of your session and are only reflected on your screen. In other words, Inspect Element gives us a sort of ‘what if’ experience without affecting the content for any other user.

To view page source:

  1. Right-click on the webpage.
  2. Select View page source.

Therefore, initially, we need to drill down into the HTML source code and identify the elements that we have to focus on while scraping the contents. For this job board, these are the section element with the class job_list, the div elements with the class job nested inside it, and the span elements with the class info that hold each posting’s details.

Step 2: Create The User-Agent

A user agent is a client (typically a web browser) that sends requests to the web server on behalf of the user. When a web server receives automated requests again and again from the same machine/system, it might guess that the requests are being sent by a bot, and it blocks them. Therefore, we can use a user-agent string to simulate a browser visit to a particular webpage, which makes the server believe that the request came from a genuine user and not a bot.

Syntax:
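A minimal sketch of a headers dictionary; the user-agent string below is just one example of a browser identifier, and any realistic browser string will do:

```python
# A headers dictionary carrying a browser-like user-agent string.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.110 Safari/537.36'
}
```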

Step 3: Import The Requests Library

✨ The Requests Library

The requests library allows us to send a GET request to the web server.

Here’s how this works:

  • Import the Python library requests that handles the details of requesting the websites from the server in an easy-to-process format.
  • Use the requests.get(...) method to access the website and pass the URL 'http://pythonjobs.github.io/' as an argument so that the function knows which location to access.
  • Access the actual body of the response (the return value of requests.get() is a response object that also contains useful meta information like the content type) and store it in a variable using the .content attribute.

Syntax:
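A minimal sketch, reusing the browser-like headers from the previous step:

```python
import requests

# Reuse the browser-like headers defined in Step 2.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Tell requests.get() which location to access.
url = 'http://pythonjobs.github.io/'
response = requests.get(url, headers=headers)

# .content holds the raw bytes of the response body (the page's HTML).
webpage = response.content
```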

✨ Checking The Status Code

Once the HTTP request is processed by the server, it sends back a response that contains a status code. The status code indicates whether the request was successfully processed or not.

There are five major categories of status codes:

  • 1xx – Informational: the request was received and is being processed.
  • 2xx – Success: the request was successfully received, understood, and accepted.
  • 3xx – Redirection: further action is needed to complete the request.
  • 4xx – Client Error: the request is invalid or cannot be fulfilled (e.g., 404 Not Found).
  • 5xx – Server Error: the server failed to fulfill a valid request.

Syntax:
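A minimal sketch, continuing with the response object from the previous step; a status code of 200 means the request was processed successfully:

```python
# Check that the request succeeded before parsing anything.
print(response.status_code)  # 200 indicates success

if response.status_code != 200:
    raise RuntimeError(f'Request failed with status {response.status_code}')
```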

Step 4: Parse HTML using BeautifulSoup Library

✨ The BeautifulSoup Library

BeautifulSoup is a Python library used to parse data (structured data) from HTML and XML documents.

  • Import the BeautifulSoup Library.
  • Create the BeautifulSoup Object. The first parameter represents the HTML data while the second parameter is the parser.

Syntax:
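A minimal sketch, continuing with the webpage content fetched in Step 3:

```python
from bs4 import BeautifulSoup

# First parameter: the HTML data; second parameter: the parser to use.
soup = BeautifulSoup(webpage, 'html.parser')
```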

Once we have created the BeautifulSoup object, we need to use different options provided to us by the BeautifulSoup library to navigate and find elements within the HTML document and scrape data from it.

💡 Attention:

In case you want to understand how to navigate through the HTML document using the components of the BeautifulSoup library, please refer to our tutorial to learn about the various options provided by BeautifulSoup to parse an HTML document.

Let us have a look at the code and then we will understand the working principle/logic behind it.
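Here is a sketch of the scraper’s core loop, reconstructed from the walkthrough below; treat the selectors, the h1 lookup, and the list indices as assumptions to verify against the live page, since the board’s markup may change over time:

```python
for job in soup.find_all('section', class_='job_list'):
    # Each <div class="job"> inside the section holds one job posting.
    title = [a for a in job.find_all('div', class_='job')]
    for n, tag in enumerate(job.find_all('div', class_='job')):
        # The <span class="info"> children hold the location, dates,
        # company name, and other metadata of the posting.
        company_element = [x for x in tag.find_all('span', class_='info')]
        # Assumption: the job title sits in an <h1> inside each job div.
        print('Job Title:', title[n].h1.text.strip())
        print('Location:', company_element[0].text.strip())
        print('Company:', company_element[3].text.strip())
        print()
```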

  • In the outer loop, i.e. for job in soup.find_all('section', class_='job_list'), we find the parent element, which in this case is the section tag having the HTML class job_list, and then iterate over it.
  • The title variable represents a list comprehension and is used to store the job postings. In other words, the call job.find_all('div', class_='job') searches all div tags having the class name job and stores them in the list title.
  • The inner loop, i.e. for n, tag in enumerate(job.find_all('div', class_='job')), has a couple of functionalities:
    1. Iterate over all the div elements with the class job.
    2. Keep count of each iteration with the help of the enumerate function.
  • Inside the inner loop, the list comprehension company_element stores all the contents that are within span tags with the class info.
  • Finally, with the help of the counter n of the enumerate function, we extract the job title from each element of the list title by its index. The location and the company name are extracted from the 0th and 3rd index of the list company_element.

The Final Solution

Now let us consolidate all the steps to reach the final solution/code as shown below:
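Putting the pieces together yields a runnable sketch; as before, the selectors and list indices are assumptions based on the walkthrough above:

```python
import requests
from bs4 import BeautifulSoup

# Simulate a browser visit so the server does not block the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.110 Safari/537.36'
}

url = 'http://pythonjobs.github.io/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

for job in soup.find_all('section', class_='job_list'):
    title = [a for a in job.find_all('div', class_='job')]
    for n, tag in enumerate(job.find_all('div', class_='job')):
        company_element = [x for x in tag.find_all('span', class_='info')]
        print('Job Title:', title[n].h1.text.strip())
        print('Location:', company_element[0].text.strip())
        print('Company:', company_element[3].text.strip())
        print()
```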

Output:

The script prints the job title, location, and company name of every posting currently listed on the board; the exact listings depend on when you run the script.

Hurrah! We have successfully created our first web scraper script.

Examples

As the saying goes – “Practice makes a man perfect!” Therefore, please have a look at the follow-up article that walks through the process of web scraping with the help of five examples, and practice them to master the skill of web scraping using Python’s BeautifulSoup library.

Conclusion

I hope that after reading the entire article you can scrape data from webpages with ease! Please read the supporting articles in order to get a stronger grip on the mentioned concepts.

Please subscribe and stay tuned for more interesting articles in the future.

Where to Go From Here?

Enough theory, let’s get some practice!

To become successful in coding, you need to get out there and solve real problems for real people. That’s how you can become a six-figure earner easily. And that’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?

Practice projects are how you sharpen your saw in coding!

Do you want to become a code master by focusing on practical code projects that actually earn you money and solve problems for people?


Then become a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.

Join my free webinar “How to Build Your High-Income Skill Python” and watch how I grew my coding business online and how you can, too—from the comfort of your own home.

I am a professional Python Blogger and Content creator. I have published numerous articles and created courses over a period of time. Presently I am working as a full-time freelancer and I have experience in domains like Python, AWS, DevOps, and Networking.

You can contact me @:

Related Posts

APIs are not always available. Sometimes you have to scrape data from a webpage yourself. Luckily, the modules Pandas and BeautifulSoup can help!


Web scraping

Pandas has a neat concept known as a DataFrame. A DataFrame can hold data and be easily manipulated. We can combine Pandas with BeautifulSoup to quickly get data from a webpage.

Suppose you find a simple data table on a web page, with a header row followed by rows of values.

We can convert it to JSON with:
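A minimal sketch: pd.read_html parses every &lt;table&gt; on a page into a list of DataFrames, and to_json serializes one of them. The inline HTML table below is invented for the example; in practice you would pass the page URL or the scraped HTML instead:

```python
from io import StringIO

import pandas as pd

# A small stand-in for a table found on a web page.
html = '''
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
'''

# read_html returns one DataFrame per <table> found in the document.
df = pd.read_html(StringIO(html))[0]

# Serialize the table to JSON, one object per row.
print(df.to_json(orient='records'))
# [{"Name":"Alice","Age":30},{"Name":"Bob","Age":25}]
```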

The result is a compact JSON string that you could, for example, serve from a web endpoint and view in a browser.

Converting to lists

The rows of the table can also be converted to Python lists with BeautifulSoup, and from there to a DataFrame, using just a few lines:
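A minimal sketch, reusing the sample html string from the JSON example above:

```python
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')

# One list per <tr>, holding the text of its <th>/<td> cells.
rows = [[cell.text for cell in tr.find_all(['th', 'td'])]
        for tr in soup.find_all('tr')]

# The first row holds the headers; the rest are data rows.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```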


Pretty-print a Pandas DataFrame


You can convert it to an ASCII table with the tabulate module. The code below converts the scraped rows into an ASCII table and prints it to the terminal.
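A minimal sketch, reusing the rows list built in the previous example:

```python
from tabulate import tabulate

# Render the data rows as an ASCII grid with the header row on top.
print(tabulate(rows[1:], headers=rows[0], tablefmt='grid'))
```

With the sample data, this prints a bordered grid with a Name and an Age column; other tablefmt values such as 'simple' or 'pipe' change the border style.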