Web Scraping with Python and Beautiful Soup
For any machine learning task, the first thing we need to do is collect data. There are a number of different ways to collect it. Let's look at some of them.
1. APIs: The preferred way to collect data is by consuming APIs, because APIs are well structured and consuming one is very simple.
2. Public data sets: This is the second-best option; plenty of people have collected data and made it available to the public. Sites like Kaggle and the University of California, Irvine's Machine Learning Repository are places where you can find public data sets.
If you cannot find data through the above two methods, the last option is web scraping: choose a website that provides the data you need, scrape its content, and prepare your own data set. In this post I'm going to show you an example of scraping blog posts from the Cricbuzz website.
Let's understand the scraping procedure. It's very simple: Python libraries let us fetch the HTML source of a page; once we have the HTML, we find where the required data sits among all the other junk, then extract the content with regular expressions or some other technique.
As I said, we could use regular expressions for the extraction part, but that would make our task a little hard, so in this example I'm going to use a third-party library called Beautiful Soup to extract the data from the source HTML. That's enough theory; let's do something practical.
We will start simple by scraping a single page and then enhance the code to automate the scraping procedure.
The post I am going to scrape: http://www.cricbuzz.com/cricket-news/100707/destinys-child-zimbabwes-middle-order-batsman-sikandar-raza-treats-triumphs-and-failures-the-same.
Let's start.
Step 1 - Read the HTML source of the page.
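A minimal sketch of this step, assuming Python 3's urllib.request (the variable names are illustrative):

```python
# Sketch of Step 1, assuming Python 3's urllib.request is available.
from urllib.request import urlopen

url = ("http://www.cricbuzz.com/cricket-news/100707/destinys-child-"
       "zimbabwes-middle-order-batsman-sikandar-raza-treats-triumphs-"
       "and-failures-the-same")

# Download the page; read() returns the raw HTML (bytes in Python 3).
html = urlopen(url).read()
print(html)
```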
I used urlopen to get the HTML; if you print the html variable, you will see the HTML source of the page.
Step 2 - Create a Beautiful Soup object for the html source, so we can use Beautiful Soup functions and attributes.
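A minimal sketch of this step, assuming the built-in html.parser:

```python
# Create a Beautiful Soup object from the HTML fetched in Step 1.
# "html.parser" is Python's built-in parser; lxml or html5lib
# would build slightly different trees.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
```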
As you can see in the code above, the BeautifulSoup object takes the source HTML and a parser type as arguments. Beautiful Soup creates a tree representation of the source HTML, and different parser types will create different trees. To know more about this, see the Beautiful Soup documentation.
Before continuing to the next step, it's better to know some very basics of Beautiful Soup (a short demo follows this list).
1. Accessing an element with Beautiful Soup (it's as simple as BeautifulSoupObject.elementName)
2. Getting the text inside an element (BeautifulSoupObject.Element.getText())
3. Getting a list of all elements of an element type (BSObject.find_all(ElementType))
4. Filtering the list with attributes (BSObject.find_all('element', attribute_type='attribute_value'))
5. Getting the value of an attribute (BSObject.Element.get('attributeName'))
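A quick demo of those five basics, using the soup object from Step 2 (the tags picked here are just illustrative):

```python
print(soup.h1)                    # 1. access an element by name
print(soup.h1.getText())          # 2. text inside an element
sections = soup.find_all("section")                      # 3. all elements of a type
body = soup.find_all("section", itemprop="articleBody")  # 4. filter by attribute
print(soup.a.get("href"))         # 5. value of an attribute
```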
That's all we need to know about Beautiful Soup; let's continue with the web scraping. Before moving on to the code, we should determine how to extract the required data from the web page; for that we need to find the HTML element which contains the data.
First, let's see how we can extract the title of the post. We need to find the containing element (Chrome Developer Tools' Inspect Element will help with that).
The title of the post is in an element like this,
<h1 class="nws-dtl-hdln" itemprop="headline"> ...... </h1>
What about the content? Each paragraph in the article is a section element.
<section class="cb-nws-dtl-itms" itemprop="articleBody"></section>
Step 3 - Scraping the content
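A sketch of the extraction, using the two elements identified above (note that bs4 uses the class_ keyword because class is reserved in Python):

```python
# Extract the headline from the <h1 class="nws-dtl-hdln"> element.
title = soup.find("h1", class_="nws-dtl-hdln").getText()

# Each paragraph of the article lives in its own <section>,
# so join the text of all of them to get the full content.
content = "\n".join(
    section.getText()
    for section in soup.find_all("section", class_="cb-nws-dtl-itms")
)

print(title)
print(content)
```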
That's it; if you print the content, you will see the scraped data.
Let's enhance this a little further. Rather than manually inputting the URL, let's scrape the post URLs from the home page and then scrape each corresponding page. Here's how: first we find the section which contains the list of posts, then we extract the anchor tags and get their href attribute values. Once we have the links, we scrape each one.
I have attached the final code below. For reusability, I have divided the code into a few methods. That's it.
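A sketch of how that final code can be organized; the helper names and the "/cricket-news/" link filter are my own illustrative assumptions about the site's URL scheme, not fixed APIs:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

BASE_URL = "http://www.cricbuzz.com"


def get_soup(url):
    # Fetch a page and wrap its HTML in a Beautiful Soup object.
    return BeautifulSoup(urlopen(url).read(), "html.parser")


def get_post_links(home_url):
    # Collect every anchor on the home page that points to a news post.
    # Filtering on "/cricket-news/" in the href is an assumption.
    soup = get_soup(home_url)
    links = set()
    for anchor in soup.find_all("a", href=True):
        href = anchor.get("href")
        if "/cricket-news/" in href:
            links.add(href if href.startswith("http") else BASE_URL + href)
    return links


def scrape_post(post_url):
    # Scrape one post: return its title and body text, using the same
    # elements identified in Steps 2 and 3.
    soup = get_soup(post_url)
    title = soup.find("h1", class_="nws-dtl-hdln").getText().strip()
    body = "\n".join(
        section.getText().strip()
        for section in soup.find_all("section", class_="cb-nws-dtl-itms")
    )
    return title, body


if __name__ == "__main__":
    for link in get_post_links(BASE_URL + "/cricket-news"):
        title, body = scrape_post(link)
        print(title)
```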
Happy coding 😉.