NLP for learners: Scraping news sites with BeautifulSoup and saving the text to the hard drive

This time, we will extract article text from a news site.

Getting the URLs in a chain

We will extract the article text from the BBC News site.

First, we fetch the HTML of the top page and extract the URLs that contain the string /news/.

Then, new URLs are extracted from the HTML of the pages at those URLs. In this way, the article URLs are obtained in a chain.

pre_extracted_urls = []
pre_extracted_urls.append('https://www.bbc.com/news')

Store the top page URL in the list pre_extracted_urls.

for depth in range(2):

The article URLs are extracted in a chain by this outer loop. depth takes the values 0 and 1. If the range is made much larger, the process can take a very long time to complete.

    extracted_urls = []
    for i in range(len(pre_extracted_urls)):

The list extracted_urls collects the URLs extracted at the current depth.

At first, pre_extracted_urls contains only the URL of the top page, so the inner loop runs just once.

At the next depth, however, many URLs have been extracted from the top page, so the loop is repeated once for each of them.

        try:
            res = requests.get(pre_extracted_urls[i], timeout=3.0)
        except Timeout:
            print('Connection timeout')
            continue

The try: block runs the request once. If an exception occurs, the except: block is executed.

requests.get() fetches the HTML from the specified URL.

If the server does not respond, the program could hang without ever proceeding to the next step, so we set a timeout and handle the exception.

except Timeout: is executed when the server does not respond within the timeout. It prints a message, and continue returns to the beginning of the loop to process the next URL.
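Note that requests can raise other exceptions besides Timeout, for example a connection error, and any of those would still stop the program here. One possible variation (not part of the original code) is to catch the broader RequestException, which covers all of them:

        try:
            res = requests.get(pre_extracted_urls[i], timeout=3.0)
        except requests.exceptions.RequestException as e:   # Timeout, ConnectionError, and others
            print('Request failed: ' + str(e))
            continue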

        soup = BeautifulSoup(res.text, 'html.parser')
        elems = soup.find_all(href=re.compile("/news/"))
        print(str(len(elems))+' URLs extracted'+'('+str(i+1)+'/'+str(len(pre_extracted_urls))+')')

The fetched HTML is stored in res.text, which is passed to BeautifulSoup. .find_all() extracts the elements whose href attribute contains /news/ and stores them in the list elems.

        for j in range(len(elems)):
            url = elems[j].attrs['href']
            if not 'http' in url:
                extracted_urls.append('https://www.bbc.com'+url)
            else:
                extracted_urls.append(url)

Since the elements stored in elems are whole tags, the URL is extracted with .attrs['href'] and stored in the string url.

If the URL is relative (it does not contain the domain name), the domain name is prepended before it is added to the list extracted_urls; otherwise the URL is added as it is.
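Here is a small self-contained illustration of these two steps; the HTML snippet is made up for demonstration and is not taken from the BBC site:

from bs4 import BeautifulSoup
import re

sample_html = '<a href="/news/world-12345">headline</a> <a href="https://www.bbc.com/news/uk-678">other</a> <a href="/sport/1">sport</a>'
soup = BeautifulSoup(sample_html, 'html.parser')
for elem in soup.find_all(href=re.compile("/news/")):   # only links whose href contains /news/
    url = elem.attrs['href']
    if not 'http' in url:
        print('https://www.bbc.com' + url)   # relative URL: prepend the domain
    else:
        print(url)                           # absolute URL: use as is
# https://www.bbc.com/news/world-12345
# https://www.bbc.com/news/uk-678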

    extracted_urls = np.unique(extracted_urls).tolist()

The extracted list contains duplicate URLs. np.unique() removes the duplicates. The operation returns a NumPy array, so .tolist() converts it back to a normal Python list, which is stored in extracted_urls.
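A quick check of what this step does on a tiny made-up list, together with a pure-Python alternative that does not need NumPy:

import numpy as np

dupes = ['/news/a', '/news/b', '/news/a']
print(np.unique(dupes).tolist())    # ['/news/a', '/news/b'] -- duplicates removed, result sorted
print(list(dict.fromkeys(dupes)))   # ['/news/a', '/news/b'] -- same effect, original order kept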

    pre_extracted_urls = extracted_urls
    urls.extend(extracted_urls)
    urls = np.unique(urls).tolist()

The extracted URLs are stored again in pre_extracted_urls.

The next iteration of the process extracts new URLs based on the stored URLs.

At the same time, the extracted URLs are added to the list urls (initialized as an empty list before the loop; see the whole code below).

To add a single element to a list, we use .append(); to add multiple elements at once, such as another list, we use .extend().
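A quick illustration of the difference (the list name items is made up for the example):

items = ['a']
items.append(['b', 'c'])   # ['a', ['b', 'c']] -- the whole list becomes a single element
items = ['a']
items.extend(['b', 'c'])   # ['a', 'b', 'c']   -- each element is added individually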

Furthermore, we use np.unique() to remove duplicate URLs.

Extracting the text from a list of URLs

for i in range(len(urls)):
    try:
        res = requests.get(urls[i], timeout=3.0)
    except Timeout:
        print('Connection timeout')
        continue
    soup = BeautifulSoup(res.text, "html.parser")

Iterate according to the number of elements in the list urls.

As mentioned above, requests.get() fetches the HTML of each article URL, and the result is passed to BeautifulSoup().

    elems = soup.select('#page > div > div.container > div > div.column--primary > div.story-body > div.story-body__inner > p')

You can get the selector with Chrome. Right-click the text displayed in the browser and choose Inspect. The HTML appears on the right side of the screen; right-click the highlighted element, then click Copy -> Copy selector to copy the selector to the clipboard. The copied string may contain :nth-child(), but you don't need it.

soup.select() extracts the elements that match the specified selector and stores them in the list elems.
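Keep in mind that such a long selector depends on the site's markup at the time of writing and breaks easily when the page layout changes. A looser alternative, shown here only as an assumption about typical article markup rather than the BBC's actual structure, is to select every paragraph inside the article element:

    elems = soup.select('article p')   # assumption: the body text sits in <p> tags inside an <article> tag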

    if not len(elems) == 0:
        for j in range(len(elems)):
            texts.append(str(elems[j]))

If the HTML does not contain the body text, the list elems is empty. If it is not empty, each element is converted to a string and appended to the list texts, one element at a time.
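As a side note, .get_text() returns only the text inside a tag, without the surrounding markup, so this variation would make the tag-removal step below unnecessary:

    if not len(elems) == 0:
        for j in range(len(elems)):
            texts.append(elems[j].get_text())   # text only, no <p> tags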

text = ' '.join(texts)
p = re.compile(r"<[^>]*?>")
text = p.sub("", text)
text = re.sub('["“”,—]','', text)
text = text.lower()
text = text.replace('. ','\n')

' '.join() combines the elements of the list texts into a single string text, separated by spaces.

re.compile() defines a regular expression pattern, and .sub() replaces everything that matches it; here it removes the tags from the string. Double quotation marks, commas, and dashes are removed in the same way.

.lower() converts the string to lowercase, and .replace() turns each period followed by a space into a newline, so that each sentence ends up on its own line.
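Applied to a small made-up sample, the cleaning steps work like this:

import re

sample = '<p>The cat, "Tom", sat. </p><p>It slept. </p>'
sample = re.compile(r"<[^>]*?>").sub("", sample)   # remove the tags
sample = re.sub('["“”,—]', '', sample)             # remove quotation marks, commas and dashes
sample = sample.lower().replace('. ', '\n')        # lowercase, one sentence per line
print(sample)
# the cat tom sat
# it slept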

with io.open('articles_bbc.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Save the text to the text file articles_bbc.txt.

When we run the program, it extracts 2.59 MB of article text.
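If you want to check the size of the saved file yourself (the number will differ from run to run, since the site's content changes):

import os

print(str(os.path.getsize('articles_bbc.txt') / 1000000) + ' MB')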

Here is the whole code.

import numpy as np
import requests
from requests.exceptions import Timeout
from bs4 import BeautifulSoup
import io
import re
urls = []
pre_extracted_urls = []
pre_extracted_urls.append('https://www.bbc.com/news')
for depth in range(2):
    print('start depth: '+str(depth)+'.........................')
    extracted_urls = []
    for i in range(len(pre_extracted_urls)):
        try:
            res = requests.get(pre_extracted_urls[i], timeout=3.0)
        except Timeout:
            print('Connection timeout')
            continue
        soup = BeautifulSoup(res.text, 'html.parser')
        elems = soup.find_all(href=re.compile("/news/"))
        print(str(len(elems))+' URLs extracted'+'('+str(i+1)+'/'+str(len(pre_extracted_urls))+')')
        for j in range(len(elems)):
            url = elems[j].attrs['href']
            if not 'http' in url:
                extracted_urls.append('https://www.bbc.com'+url)
            else:
                extracted_urls.append(url)
    extracted_urls = np.unique(extracted_urls).tolist()
    pre_extracted_urls = extracted_urls
    urls.extend(extracted_urls)
    urls = np.unique(urls).tolist()
    print('total: '+str(len(urls))+' URLs')
print('start extracting html....')
texts = []
for i in range(len(urls)):
    try:
        res = requests.get(urls[i], timeout=3.0)
    except Timeout:
        print('Connection timeout')
        continue
    soup = BeautifulSoup(res.text, "html.parser")
    elems = soup.select('#page > div > div.container > div > div.column--primary > div.story-body > div.story-body__inner > p')
    if not len(elems) == 0:
        for j in range(len(elems)):
            texts.append(str(elems[j]))
    print(str(i+1)+' / '+str(len(urls))+' finished')
text = ' '.join(texts)
p = re.compile(r"<[^>]*?>")
text = p.sub("", text)
text = re.sub('["“”,—]','', text)
text = text.lower()
text = text.replace('. ','\n')
with io.open('articles_bbc.txt', 'w', encoding='utf-8') as f:
    f.write(text)