In this series of articles, I'll introduce you to the process of natural language processing using deep learning. I am still a beginner in Python myself, and this article is aimed at people who have worked with a few other languages but know little or nothing about Python.
You might think deep learning is difficult, but it is easy enough to write and run the code. The goal here is not to understand the intricacies of deep learning, but rather to use it in a practical way to get results.
As an example, we are going to create an AI that generates English sentences. For the training, we will use articles from the American VOA Learning English site.
You need to check the copyright laws of your country before you copy the articles to your computer's hard drive.
Scraping is the process of extracting the necessary data from a website. The scraping process varies from website to website. Here, I’m going to extract the articles from VOA Learning English.
From the top page, go to the “As it is” category page.
We can find several articles. Let’s try to get the URLs of each article from here.
import numpy as np
import requests
from bs4 import BeautifulSoup
import io
import re
import loads an extension (a library).
numpy is for efficient numerical computation. Importing it as np means we can refer to it by the short name np in the lines that follow.
requests is used to fetch the HTML of a website; it appears in the following lines as requests.get().
BeautifulSoup is used to parse the HTML and extract only the required parts.
These three extensions have to be installed beforehand (installation instructions omitted).
io is used for file input and output. re is used for regular expressions.
url = 'https://learningenglish.voanews.com/z/3521'
res = requests.get(url)
Here the requests extension is used.
url contains the address of the page whose HTML is to be retrieved, and requests.get(url) retrieves it.
res is not a string but an object. We can imagine an object as a box containing a variety of information. After the line above runs, that information, including the HTML string, is placed in res.
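To picture the idea that res is a box of information rather than a plain string, here is a minimal sketch with a made-up FakeResponse class (the real object is built by requests; this class exists only as an analogy):

```python
# A toy stand-in for the response object: a box holding several pieces of info.
# (FakeResponse is made up for illustration; requests builds the real one.)
class FakeResponse:
    def __init__(self, status_code, text):
        self.status_code = status_code  # e.g. 200 means the request succeeded
        self.text = text                # the HTML as one string

res = FakeResponse(200, '<html><body>Hello</body></html>')

# Individual pieces are pulled out of the box by attribute name.
print(res.status_code)  # 200
print(res.text)         # <html><body>Hello</body></html>
```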
soup = BeautifulSoup(res.text, 'html.parser')
Here the BeautifulSoup extension is used.
res.text refers to the piece of information labeled text inside the res object. Imagine the HTML written on a piece of paper inside the box, with a sticky note reading text stuck on it: res.text contains that HTML string.
We give this string to BeautifulSoup and store the result in an object called soup. This transfers the information into a special box designed for parsing HTML.
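To see what the soup box lets us do, here is a minimal sketch that parses a made-up one-line HTML string in place of the real res.text:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML string standing in for res.text.
html = "<html><body><a href='/a/12345.html'>Story</a></body></html>"

soup = BeautifulSoup(html, 'html.parser')

# Because soup is a parsed tree, we can search it instead of scanning raw text.
link = soup.find('a')
print(link['href'])  # /a/12345.html
print(link.text)     # Story
```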
soup.find_all() extracts the article links from the soup object.
In HTML, each link to an article is written as <a href='article URL'> </a>.
When you write

elems = soup.find_all(href=re.compile("/a/"))

every <a href='URL'> </a> part is extracted from the HTML. Furthermore, only URLs containing the string /a/ are kept, because the URLs of the articles on the VOA Learning English site all begin with /a/. The matching tags are stored in a list called elems.
Since elems contains the whole tags, the URLs have to be extracted from the list in a further step.
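The filtering step can be tried on a made-up two-link snippet; only the href containing /a/ survives:

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML with one article link (/a/...) and one unrelated link.
html = ("<a href='/a/5538093.html'>article</a>"
        "<a href='/p/about.html'>about</a>")
soup = BeautifulSoup(html, 'html.parser')

# Keep only the tags whose href attribute contains "/a/".
elems = soup.find_all(href=re.compile("/a/"))
print(len(elems))              # 1
print(elems[0].attrs['href'])  # /a/5538093.html
```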
links = []
for i in range(len(elems)):
    links.append('https://learningenglish.voanews.com'+elems[i].attrs['href'])
links = [] indicates that links starts as an empty list.
The for statement repeats a process.
range() specifies the range of the iteration. For example, in for i in range(10):, i varies from 0 to 9.
len(elems) is the number of elements in elems, so the loop repeats once per extracted address. For example, if there are 5 extracted addresses, len(elems) = 5 and i ranges from 0 to 4.
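The behavior of range() and len() can be checked with a made-up five-element list:

```python
elems = ['a', 'b', 'c', 'd', 'e']  # made-up list with 5 elements

print(len(elems))      # 5
print(list(range(5)))  # [0, 1, 2, 3, 4]

# So "for i in range(len(elems)):" visits i = 0, 1, 2, 3, 4.
indices = []
for i in range(len(elems)):
    indices.append(i)
print(indices)  # [0, 1, 2, 3, 4]
```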
elems[i].attrs['href'] extracts only the URL from the i-th element of elems, which is a string containing the whole tag. For example:

elems[i] = '<a href="/a/new-us-citizens-look-forward-to-voting/5538093.html">'
elems[i].attrs['href'] = '/a/new-us-citizens-look-forward-to-voting/5538093.html'
These URLs do not include the domain name, so 'https://learningenglish.voanews.com' has to be prepended.
links.append() adds an element to the links list. Here, links stores the extracted URLs one by one.
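Prepending the domain and appending to the list can be sketched with a couple of made-up relative URLs:

```python
# Made-up relative URLs, like the href values pulled out of the tags.
hrefs = ['/a/5526946.html', '/a/5538093.html']

links = []  # start with an empty list
for h in hrefs:
    # prepend the domain, then add the full URL to the list
    links.append('https://learningenglish.voanews.com' + h)

print(links[0])    # https://learningenglish.voanews.com/a/5526946.html
print(len(links))  # 2
```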
Run the code to check the contents of the links.
https://learningenglish.voanews.com/a/5526946.html
https://learningenglish.voanews.com/a/5526946.html
https://learningenglish.voanews.com/a/new-us-citizens-look-forward-to-voting/5538093.html
https://learningenglish.voanews.com/a/new-us-citizens-look-forward-to-voting/5538093.html
https://learningenglish.voanews.com/a/coronavirus-stops-starts-testing-europeans-patience/5543763.html
https://learningenglish.voanews.com/a/coronavirus-stops-starts-testing-europeans-patience/5543763.html
....
Duplicate URLs seem to have been extracted.
links = np.unique(links)
Here we use the numpy numerical library mentioned earlier. When we imported it, we declared that numpy is represented by the name np, so we write np.unique().
unique() removes duplicate data. As you can see, numpy makes it easy to organize your data. Here, the duplicated data is deleted and the result is stored in links again.
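A minimal sketch of the deduplication, using made-up URLs (note that np.unique also sorts the result):

```python
import numpy as np

# Made-up list with a duplicate, like the scraped URLs.
links = ['https://example.com/a/2.html',
         'https://example.com/a/1.html',
         'https://example.com/a/1.html']

links = np.unique(links)  # removes duplicates and sorts
print(len(links))  # 2
print(links[0])    # https://example.com/a/1.html
```

Plain Python's sorted(set(links)) would give the same result without numpy.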
The strings stored in links are joined with .join(). If you write '\n'.join(), a line feed code is inserted between the joined strings. In this way, we get a single string with one URL per line.

text = '\n'.join(links)
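The join step can be sketched with two made-up URLs:

```python
links = ['https://example.com/a/1.html', 'https://example.com/a/2.html']

text = '\n'.join(links)  # insert a newline between the strings
print(text)
# https://example.com/a/1.html
# https://example.com/a/2.html
```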
with io.open('article-url.txt', 'w', encoding='utf-8') as f:
    f.write(text)
Finally, the program writes the resulting string to the file article-url.txt.
When you open the newly created text file, you will see
https://learningenglish.voanews.com/a/5543923.html
https://learningenglish.voanews.com/a/after-multiple-crises-this-time-lebanese-feel-broken-/5542477.html
https://learningenglish.voanews.com/a/coronavirus-stops-starts-testing-europeans-patience/5543763.html
....
We can confirm that the URLs of the articles have been extracted.
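The write-and-read round trip can also be checked in code, using a made-up file name (demo-urls.txt here, so as not to touch article-url.txt):

```python
import io

text = 'https://example.com/a/1.html\nhttps://example.com/a/2.html'

# Write the string out...
with io.open('demo-urls.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# ...and read it back to confirm it round-trips unchanged.
with io.open('demo-urls.txt', 'r', encoding='utf-8') as f:
    restored = f.read()

print(restored == text)  # True
```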
Here is the whole code.
import numpy as np
import requests
from bs4 import BeautifulSoup
import io
import re

url = 'https://learningenglish.voanews.com/z/3521'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
elems = soup.find_all(href=re.compile("/a/"))

links = []
for i in range(len(elems)):
    links.append('https://learningenglish.voanews.com'+elems[i].attrs['href'])

links = np.unique(links)
text = '\n'.join(links)

with io.open('article-url.txt', 'w', encoding='utf-8') as f:
    f.write(text)