How to add part-of-speech (POS) information to your full-text search site with stanza

Published: 2020/10/26

Last updated: 2023/01/06

If you have some text and want to search it using linguistic methods, one option is to build stand-alone search software.

Another option is to build a search site that runs in the browser, using a language such as PHP.

The advantage of a browser-based tool is that many people can share it with little effort.

This article shows, step by step, how to search sentences using part-of-speech information.

The languages used are PHP and Python, together with the stanza library. Please install stanza beforehand (for example, with pip install stanza).

Extracting Part-of-Speech (POS) Information with Python and stanza

We will use the following five sentences as the text to search.

Although I spent most days alone, occasionally I met some interesting people during this trip.
There was Sam, the postal worker, who was walking the long hike in shorter sections.
He could only hike during his vacation when he often hiked 100 to 200 kilometers.
He had five more years to go to complete the entire trail/
And there was a happy couple, Suzanne and David, who were hiking as part of their honeymoon.

Save the text as text.txt.

Next, extract the words and their parts of speech from the text with stanza.

import stanza
stanza.download('en')
nlp = stanza.Pipeline(lang='en')

This is the Python code. stanza.download('en') downloads the models that stanza needs to analyze English. The first run downloads a large amount of data, but the data is stored on your PC, so the download happens only once. stanza.Pipeline(lang='en') builds the English analysis pipeline; put simply, nlp is a box containing a robot that analyzes the language.

with io.open('text.txt', encoding='utf-8') as f:
    texts = f.read().splitlines()

This reads the text and splits it into lines, one sentence per line.

for line in range(len(texts)):

Iterate over the five sentences.

    doc = nlp(texts[line])

The annotation for each sentence is stored in doc.

    for word in doc.sentences[0].words:
        char.append(word.text)
        xpos.append(word.xpos)
    char_all.append(char)
    xpos_all.append(xpos)

word.text holds each word in a sentence, and word.xpos holds its part-of-speech (POS) tag (for English, a Penn Treebank tag such as VBD or NN).

char_all =
[['Although', 'I', 'spent', 'most', 'days', 'alone', ',', 'occasionally', 'I', 'met', 'some', 'interesting', 'people', 'during', 'this', 'trip', '.'],
 ['There', 'was', 'Sam', ',', 'the', 'postal', 'worker', ',', 'who', 'was', 'walking', 'the', 'long', 'hike', 'in', 'shorter', 'sections', '.'],
 ['He', 'could', 'only', 'hike', 'during', 'his', 'vacation', 'when', 'he', 'often', 'hiked', '100', 'to', '200', 'kilometers', '.'],
 ['He', 'had', 'five', 'more', 'years', 'to', 'go', 'to', 'complete', 'the', 'entire', 'trail', '/'],
 ['And', 'there', 'was', 'a', 'happy', 'couple', ',', 'Suzanne', 'and', 'David', ',', 'who', 'were', 'hiking', 'as', 'part', 'of', 'their', 'honeymoon', '.']]

xpos_all =

[['IN', 'PRP', 'VBD', 'JJS', 'NNS', 'RB', ',', 'RB', 'PRP', 'VBD', 'DT', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.'],
 ['EX', 'VBD', 'NNP', ',', 'DT', 'JJ', 'NN', ',', 'WP', 'VBD', 'VBG', 'DT', 'JJ', 'NN', 'IN', 'JJR', 'NNS', '.'],
 ['PRP', 'MD', 'RB', 'VB', 'IN', 'PRP$', 'NN', 'WRB', 'PRP', 'RB', 'VBD', 'CD', 'IN', 'CD', 'NNS', '.'],
 ['PRP', 'VBD', 'CD', 'JJR', 'NNS', 'TO', 'VB', 'TO', 'VB', 'DT', 'JJ', 'NN', ','],
 ['CC', 'EX', 'VBD', 'DT', 'JJ', 'NN', ',', 'NNP', 'CC', 'NNP', ',', 'WP', 'VBD', 'VBG', 'IN', 'NN', 'IN', 'PRP$', 'NN', '.']]
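Because char_all and xpos_all are parallel lists, the tag at a given position belongs to the word at the same position. As a quick check, here is a minimal Python sketch that pairs the two for the fifth sentence. The helper name tag_pairs is made up for this illustration and is not part of stanza.

```python
# Fifth sentence, copied from char_all and xpos_all above.
words = ['And', 'there', 'was', 'a', 'happy', 'couple', ',', 'Suzanne', 'and',
         'David', ',', 'who', 'were', 'hiking', 'as', 'part', 'of', 'their',
         'honeymoon', '.']
tags = ['CC', 'EX', 'VBD', 'DT', 'JJ', 'NN', ',', 'NNP', 'CC', 'NNP', ',',
        'WP', 'VBD', 'VBG', 'IN', 'NN', 'IN', 'PRP$', 'NN', '.']


def tag_pairs(words, tags):
    """Return 'word/TAG' strings; the two lists must have equal length."""
    if len(words) != len(tags):
        raise ValueError("words and tags must be the same length")
    return [f"{w}/{t}" for w, t in zip(words, tags)]


pairs = tag_pairs(words, tags)
print(" ".join(pairs[11:14]))  # who/WP were/VBD hiking/VBG
```

The printed slice shows that "were" sits at index 12 and the VBG tag at index 13, which are the positions the search relies on.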

Here is the entire code.

import io
import json
import stanza
# Download the English models (only needed once) and build the pipeline
stanza.download('en')
nlp = stanza.Pipeline(lang='en')
# Read the text, one sentence per line
texts = []
with io.open('text.txt', encoding='utf-8') as f:
    texts = f.read().splitlines()
char_all = []
xpos_all = []
for line in range(len(texts)):
    char = []
    xpos = []
    doc = nlp(texts[line])
    # Each line holds one sentence, so only the first parsed sentence is used
    for word in doc.sentences[0].words:
        char.append(word.text)
        xpos.append(word.xpos)
    char_all.append(char)
    xpos_all.append(xpos)
    # Show progress on a single console line
    print(line+1,'/',len(texts),end='\r')
# Save the sentences, words and POS tags as JSON for the PHP side
with open('sentences.json','w') as f:
    json.dump(texts,f)
with open('char.json','w') as f:
    json.dump(char_all,f)
with open('xpos.json','w') as f:
    json.dump(xpos_all,f)
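Before moving on to the search side, it can be worth confirming that the saved files still line up, since the search assumes one tag per word. The sketch below is only an illustration: it writes the first words of two of the sentences to a temporary directory rather than to the real char.json and xpos.json, then reads them back and compares lengths.

```python
import json
import os
import tempfile

# The first words of two sentences from the output above, standing in for the full data.
char_all = [['He', 'had', 'five', 'more', 'years'],
            ['And', 'there', 'was', 'a', 'happy', 'couple']]
xpos_all = [['PRP', 'VBD', 'CD', 'JJR', 'NNS'],
            ['CC', 'EX', 'VBD', 'DT', 'JJ', 'NN']]

with tempfile.TemporaryDirectory() as tmp:
    char_path = os.path.join(tmp, 'char.json')
    xpos_path = os.path.join(tmp, 'xpos.json')
    with open(char_path, 'w', encoding='utf-8') as f:
        json.dump(char_all, f)
    with open(xpos_path, 'w', encoding='utf-8') as f:
        json.dump(xpos_all, f)
    # Read the files back, as the search side will do later.
    with open(char_path, encoding='utf-8') as f:
        loaded_chars = json.load(f)
    with open(xpos_path, encoding='utf-8') as f:
        loaded_xpos = json.load(f)

# Every sentence must have exactly one tag per word.
aligned = all(len(w) == len(t) for w, t in zip(loaded_chars, loaded_xpos))
print(aligned)  # True
```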

Overview of Search Algorithms

Here is an overview of the search algorithm. It is kept deliberately simple for learners.

If you type “were [VBG]”, the search engine first searches for “were”.

$chars[4] = ['And', 'there', 'was', 'a', 'happy', 'couple', ',', 'Suzanne', 'and', 'David', ',', 'who', 'were', 'hiking', 'as', 'part', 'of', 'their', 'honeymoon', '.']

“were” is found at $chars[4][12]. Once the word is found, the remaining words in the sentence are not checked for it.

In addition, 100 points are added to this sentence's score.

Next, “[VBG]” is searched for. “[VBG]” stands for a verb in the -ing form (gerund or present participle).

The search starts from the position after “were”, not from the beginning of the sentence.

$xposes[4] = ['CC', 'EX', 'VBD', 'DT', 'JJ', 'NN', ',', 'NNP', 'CC', 'NNP', ',', 'WP', 'VBD', 'VBG', 'IN', 'NN', 'IN', 'PRP$', 'NN', '.']

“VBG” is found in $xposes[4][13]. A score of 100 points is also added to this sentence.

This sentence scored a total of 200 points.

If two query words are found far apart in the sentence, the score is reduced in proportion to the distance between them.

Finally, the sentences are sorted by score and displayed as the search results.
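The steps above can be sketched in a few lines of Python. This is a minimal illustration of the overview, not the PHP implementation itself: the function name score_sentence is hypothetical, and the 100-point and 5-point constants are taken from the description in this article.

```python
def score_sentence(query_tokens, words, tags):
    """Score one sentence against a query such as ['were', '[VBG]']."""
    score = 0
    seq = -1  # position of the last matched plain word; -1 means none yet
    for i, token in enumerate(query_tokens):
        is_pos = token.startswith('[') and token.endswith(']')
        # Only positions after the last matched word are examined.
        for j in range(seq + 1, len(words)):
            if is_pos:
                if token == f"[{tags[j]}]":
                    score += 100
                    break
            elif token == words[j]:
                # The first query word earns the full 100 points; later
                # words lose 5 points per position of distance.
                score += 100 if i == 0 else 100 - (j - seq) * 5
                seq = j
                break
    return score


# Second and fifth sentences, copied from char_all and xpos_all.
words2 = ['There', 'was', 'Sam', ',', 'the', 'postal', 'worker', ',', 'who',
          'was', 'walking', 'the', 'long', 'hike', 'in', 'shorter', 'sections', '.']
tags2 = ['EX', 'VBD', 'NNP', ',', 'DT', 'JJ', 'NN', ',', 'WP', 'VBD', 'VBG',
         'DT', 'JJ', 'NN', 'IN', 'JJR', 'NNS', '.']
words5 = ['And', 'there', 'was', 'a', 'happy', 'couple', ',', 'Suzanne', 'and',
          'David', ',', 'who', 'were', 'hiking', 'as', 'part', 'of', 'their',
          'honeymoon', '.']
tags5 = ['CC', 'EX', 'VBD', 'DT', 'JJ', 'NN', ',', 'NNP', 'CC', 'NNP', ',',
         'WP', 'VBD', 'VBG', 'IN', 'NN', 'IN', 'PRP$', 'NN', '.']

query = ['were', '[VBG]']
print(score_sentence(query, words5, tags5))  # 200
print(score_sentence(query, words2, tags2))  # 100
```

The fifth sentence matches both the word and the tag, so it scores 200. The second sentence contains a VBG ("walking") but no "were", so it scores 100 and ranks lower.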

Full-text search by part-of-speech information with PHP

Create a search site using the sentences, words and parts of speech (POS) saved in JSON format.

$json = file_get_contents('./sentences.json');
$json = mb_convert_encoding($json, 'UTF8');
$text = json_decode($json,true);

This reads the five sentences and stores them in $text.

$json = file_get_contents('./char.json');
$json = mb_convert_encoding($json, 'UTF8');
$chars = json_decode($json,true);

This reads the words and stores them in $chars, a two-dimensional array (one row of words per sentence).

$json = file_get_contents('./xpos.json');
$json = mb_convert_encoding($json, 'UTF8');
$xposes = json_decode($json,true);

This reads the parts of speech and stores them in $xposes, also a two-dimensional array.

$query = htmlspecialchars($_GET['search'] ?? '', ENT_QUOTES, 'UTF-8');

The words entered into the form are stored in $query.

$q_chars = explode(" ",$query);
$q_len = count($q_chars);

explode() splits a string on a delimiter. If you type “were [VBG]”, $q_chars = ["were","[VBG]"].

count() returns the number of elements in an array. If the input contains two words, $q_len = 2.
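The same query handling can be illustrated in Python. The sketch below is an assumption-laden illustration, not a port of the PHP code: parse_query and POS_PATTERN are hypothetical names, and each token is labeled up front as either a plain word or a bracketed POS tag.

```python
import re

# A token such as "[VBG]" is treated as a POS tag; anything else is a word.
POS_PATTERN = re.compile(r'^\[(.+)\]$')


def parse_query(query):
    """Split a query on whitespace and label each token."""
    tokens = []
    for part in query.split():
        match = POS_PATTERN.match(part)
        if match:
            tokens.append(('pos', match.group(1)))
        else:
            tokens.append(('word', part))
    return tokens


print(parse_query('were [VBG]'))  # [('word', 'were'), ('pos', 'VBG')]
```

One difference is worth knowing: Python's split() with no argument collapses repeated spaces, whereas explode(" ", ...) in PHP returns an empty string for each extra space, so a query typed with two spaces would contain an empty token on the PHP side.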

  for ($id = 0; $id < count($text); $id++) {

There are five sentences, so we iterate five times.

    $chars_len = count($chars[$id]);
    $seq = -1;
    $score = 0;

$chars_len stores the number of words in the current sentence. $seq stores the position of the most recent match for a query word; its initial value is -1, meaning no match yet. $score stores the score for the current sentence.

    for ($i = 0; $i < $q_len; $i++) {

Iterate over the words in the query.

      $q_char = $q_chars[$i];

$q_char stores the query word currently being searched for.

      for ($j = 0; $j < $chars_len; $j++) {

The words in the sentence are scanned from the beginning.

        if ($seq == -1 || $j > $seq) {
          if(preg_match('/\[.*\]/u',$q_char)) {
            $char = $xposes[$id][$j];
            if(strcmp($q_char, '['.$char.']') == 0) {
              $score += 100;
              break;
            }

As in the overview, suppose the query is “were [VBG]”. While “were” is being searched for, $seq = -1, so every word is checked. Once “were” has been found at position 12, the search for [VBG] skips $xposes[$id][0] through $xposes[$id][12].

preg_match() tests the query word against a regular expression. If the word is enclosed in [ ], it is treated as a part-of-speech tag and compared with $xposes instead of $chars.

strcmp() returns 0 when the two strings are identical. In that case, 100 points are added to the score.

When a match is found, break ends the scan of the current sentence for this query word.

          } else {
            $char = $chars[$id][$j];
            if (strcmp($q_char, $char) == 0 && $i == 0) {
              $score += 100;
              $seq = $j;
              break;
            }
            if (strcmp($q_char, $char) == 0 && $i != 0) {
              $score += 100 - ($j - $seq)*5;
              $seq = $j;
              break;
            }
          }

If the query word is not a part of speech, it is compared with the words in $chars.

When “were” is searched for, $i = 0. In that case, 100 points are added to the score and the position of the word is stored in $seq.

When a later query word is found, its 100 points are reduced by 5 for each position between it and the previous match.

The number of points added or subtracted should be tuned until the ranking matches your expectations.
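To get a feel for how these constants behave before tuning them, the arithmetic can be tried in isolation. The Python sketch below uses the same 100 and 5 as the PHP code; the names BASE, PENALTY and later_word_score are made up for this example.

```python
BASE = 100    # points for a match, as in the PHP code
PENALTY = 5   # points lost per position of distance


def later_word_score(j, seq, base=BASE, penalty=PENALTY):
    """Points for a word found at index j when the previous match was at seq."""
    return base - (j - seq) * penalty


# "was hiking" in the fifth sentence: 'was' is at index 2, 'hiking' at 13.
print(BASE + later_word_score(13, 2))   # 145

# Adjacent words, such as 'were' (12) and 'hiking' (13), lose only 5 points.
print(BASE + later_word_score(13, 12))  # 195

# With these constants, a gap of more than 20 positions makes the
# contribution of the later word negative.
print(later_word_score(25, 2))          # -15
```

If negative contributions are not wanted, one possible adjustment is a smaller penalty or a floor of zero; which is better depends on how long the sentences are.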

    $scores[$id] = $score;

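$score is stored in the array $scores under the sentence index.

  arsort($scores);

arsort() sorts the array in descending order of value while preserving the keys, so each score stays attached to its sentence index.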

  foreach($scores as $key => $value) {
    echo "<p>";
    echo $scores[$key].': ';
    echo $text[$key];
    echo "</p>";
  }

foreach processes the elements of the sorted array one by one.

Scores and sentences are displayed in order from highest to lowest.
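For comparison, the same sort-and-display step can be sketched in Python. The scores below are the ones the walkthrough logic gives the five sentences for the query “were [VBG]”; the variable names are chosen for this illustration only.

```python
# Sentence index -> score for the query "were [VBG]".
scores = {0: 0, 1: 100, 2: 0, 3: 0, 4: 200}

# Sort by score, highest first, keeping each score paired with its index.
# sorted() is stable, so sentences with equal scores keep their original order.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for sentence_id, score in ranked:
    print(f"{score}: sentence {sentence_id + 1}")
```

The fifth sentence is listed first with 200 points, followed by the second with 100, and then the three sentences that matched nothing.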

Here is the entire code.

<html lang="en">
<head>
  <title>POS Search</title>
</head>
<body>
<?php
$json = file_get_contents('./sentences.json');
$json = mb_convert_encoding($json, 'UTF8');
$text = json_decode($json,true);
$json = file_get_contents('./char.json');
$json = mb_convert_encoding($json, 'UTF8');
$chars = json_decode($json,true);
$json = file_get_contents('./xpos.json');
$json = mb_convert_encoding($json, 'UTF8');
$xposes = json_decode($json,true);
$query = htmlspecialchars($_GET['search'] ?? '', ENT_QUOTES, 'UTF-8');
$q_chars = explode(" ",$query);
$q_len = count($q_chars);
function search_text() {
  global $text, $chars, $xposes;
  global $query, $q_chars, $q_len;
  for ($id = 0; $id < count($text); $id++) {
    $chars_len = count($chars[$id]);
    $seq = -1;
    $score = 0;
    for ($i = 0; $i < $q_len; $i++) {
      $q_char = $q_chars[$i];
      for ($j = 0; $j < $chars_len; $j++) {
        if ($seq == -1 || $j > $seq) {
          if(preg_match('/\[.*\]/u',$q_char)) {
            $char = $xposes[$id][$j];
            if(strcmp($q_char, '['.$char.']') == 0) {
              $score += 100;
              break;
            }
          } else {
            $char = $chars[$id][$j];
            if (strcmp($q_char, $char) == 0 && $i == 0) {
              $score += 100;
              $seq = $j;
              break;
            }
            if (strcmp($q_char, $char) == 0 && $i != 0) {
              $score += 100 - ($j - $seq)*5;
              $seq = $j;
              break;
            }
          }
        }
      }
    }
    $scores[$id] = $score;
  }
  arsort($scores);
  foreach($scores as $key => $value) {
    echo "<p>";
    echo $scores[$key].': ';
    echo $text[$key];
    echo "</p>";
  }
}
?>
<div>
  <form method="get">
    <div>
      <div>
        <?php
        echo '<input type="text" id="search_form" name="search" size="33" maxlength="50"';
        echo 'value="'.htmlspecialchars($_GET['search'] ?? '', ENT_QUOTES, 'UTF-8').'" autofocus>';
        ?>
      </div>
      <div>
        <input type="submit" value="search">
      </div>
    </div>
  </form>
  <div id="result">
    <?php
      if (@$_GET['search'] != null) {
        search_text();
      }
    ?>
  </div>
</div>
</body>
</html>
