Top 10 most frequent 6-grams

This notebook demonstrates how to use the SIA API to find the top 10 most frequent 6-grams in a set of Shakespeare plays.

This experiment explores the question of how much authors repeat themselves. They use very common words like “the” over and over again, obviously, but what about combinations of words, that is, phrases? If we focus on longer phrases, say sequences of six words, how often do these reappear within and between different Shakespeare plays (to take one famous author)? (We can also look between authors and ask how often different authors share the same longer sequences. Surprisingly, there is very little crossover. Authors repeat sequences across their own works, but the sequences are genuinely rare and don’t generally appear elsewhere – see The Arden Research Handbook of Shakespeare and Textual Studies (2021), ed. Erne, pp. 226-229.)

Some of these 6-grams are repeated because they belong to the chorus of a song and the song itself appears in different plays. Others are repeated (one might suspect) because they come naturally to the author, perhaps unconsciously – as with “I could find in my heart”, which appears twice in Much Ado about Nothing, and once each in Henry IV Part 1, Comedy of Errors, and The Tempest.

First, we import the modules this notebook depends on:

  • requests: used to send HTTP requests to the SIA API.

  • matplotlib.pyplot: used to create a chart from the results returned by the SIA API.

  • IPython.display: used to render the results from the SIA API as HTML tables.

import requests
import matplotlib.pyplot as plt
from IPython.display import display, HTML

Next, we set up the API endpoint. In this example, we use the Word Frequencies API.

request_url = "https://sia.ardc-hdcl-sia-iaw.cloud.edu.au/api/v1/word-frequencies"

In order to use the API, an API key is required to authenticate the requests. The API key must be specified in a custom HTTP header X-API-KEY and sent along with every request.

You should use your own API keys for your own notebooks and always keep your keys confidential. Read more about how to create API keys in SIA.

api_key = "255446bcdde7ca9fe776258d09e8411bbb8d1cade2ebd6aba440f80f6817c3fd"
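Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As a safer alternative, you could read the key from an environment variable; the variable name SIA_API_KEY below is an illustrative choice, not something SIA prescribes.

import os

# Prefer an environment variable over a hardcoded key when sharing notebooks.
# "SIA_API_KEY" is an illustrative name, not one mandated by SIA.
api_key = os.environ.get("SIA_API_KEY", api_key)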

Then, we prepare the request data to send to the Word Frequencies API. In this example, we use a text set containing 20 Shakespeare plays that has already been uploaded to the SIA platform. Instead of passing the actual text contents to the API, we can tell the API to use one of the texts or text sets from SIA by specifying its ID.

The URL of a text/text set page in the SIA application indicates the ID of that text/text set. For example:

https://sia.ardc-hdcl-sia-iaw.cloud.edu.au/text-sets/86

In this case, the ID of the “20 Shakespeare plays” text set is 86.
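If you prefer not to copy the ID by hand, the trailing path segment can be extracted programmatically. This is just a convenience sketch, and it assumes the URL always ends with the numeric ID:

from urllib.parse import urlparse

text_set_url = "https://sia.ardc-hdcl-sia-iaw.cloud.edu.au/text-sets/86"
# Take the last path segment of the URL as the text set ID.
text_set_id = int(urlparse(text_set_url).path.rstrip("/").rsplit("/", 1)[-1])
print(text_set_id)  # 86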

We will also pass several word frequencies options to the Word Frequencies API. These options are:

  • blockMethod: We set the block method to 0 (by text), which treats each text in the text set as a single segment.

  • numberOfNGrams: We set this to 6, as we want the frequencies of sequences of 6 adjacent words (6-grams).

  • outputSize: We set this to 10, as we are only interested in the top 10 most frequent 6-grams.

  • excludeWords: We exclude common punctuation marks from the analysis.

To view more details about the options of the Word Frequencies API, read the API documentation.


request_data = {
    'textSet': 86,
    'option':{
        'blockMethod': 0,       # Segment by text
        'numberOfNGrams' : 6,
        'outputSize': 10,
        'excludeWords': ["[","\\", "]", "_", "`", "!", "\"", "#", "%", "'", "(", ")", "+", ",", "-", "–", ".", "/", ":", ";", "{", "|", "}", "=", "~", "?" ],
    }
}

The SIA API accepts JSON as the request data. Here we have constructed a Python dictionary object with the text set identifier and the word frequencies options. Next, we put everything together and use the Requests module to send the request to the SIA API.
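As an aside, you can preview the JSON body that requests will serialise from this dictionary; this is just a sanity check, not a required step:

import json

# requests serialises the dictionary passed via the json= keyword for us;
# this preview shows what will actually be sent as the request body.
print(json.dumps(request_data, indent=2))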

# Make the API Request
response = requests.post(request_url, json=request_data, headers={"X-API-KEY": api_key}, timeout=1200)

We have sent the request with the API endpoint, the request data defined earlier, and the X-API-KEY HTTP header, and received the response. Please note that the API call can take several minutes to finish, depending on the size of the text or text set. We have therefore set the request timeout to 1200 seconds.

Before we start unpacking the response data, we want to make sure the API call was successful by checking the HTTP response code. Read the API documentation for all error codes.

print(f"{response.status_code} {response.reason}")
assert response.status_code == 200
response_data = response.json()
200 OK
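In a longer-running notebook you might prefer a friendlier failure than a bare assert. One option (a sketch, not part of the SIA examples) is to surface the error body returned by the API:

# Sketch: fail with the API's error message instead of a bare assert.
if response.status_code != 200:
    raise RuntimeError(
        f"SIA API request failed ({response.status_code} {response.reason}): {response.text}"
    )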

Now we have the response data from the Word Frequencies API. First, we want to display the most frequent 6-grams in a table so that we can see what they are and their frequencies in each text of the text set. To do this, we unpack the response data into a tabular format. Note that the words returned from the Word Frequencies API are sorted from highest to lowest frequency by default, and that the order of words returned is consistent across blocks. Read more about the response data of the Word Frequencies API.
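For orientation, the sketch below shows the response shape the unpacking code assumes. The field names are taken from the code and the table output; the values shown are from the actual results, and the ellipses stand for omitted entries.

# Illustrative shape of the response data (reconstructed, not exhaustive).
example_response = {
    "blocks": [
        {
            "name": "1 Henry IV Q1 (1)",
            "frequencies": [
                {"word": "give.me.a.cup.of.sack", "value": 4},
                # ... nine more entries, one per 6-gram
            ],
            "uniqueWordCount": 20652,
            "size": 20692,
        },
        # ... one block per text in the text set
    ]
}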

table_headers = []
table_rows = []
for block in response_data['blocks']:
    # Add table header row.
    if len(table_headers) == 0:
        table_headers.append(['Word'])
    table_headers[0].append(block['name'])

    # Add data rows.
    for i in range(len(block['frequencies'])):
        frequency = block['frequencies'][i]
        # Check whether the row has been created. If not, initialise the row with the word text.
        if i > len(table_rows) - 1:
            table_rows.append([frequency['word']])
        # Append the word frequency to its corresponding row.
        table_rows[i].append(frequency['value'])

    # Add the "Word Types" row.
    i += 1
    if i > len(table_rows) - 1:
        table_rows.append(['Word Types'])
    table_rows[i].append(block['uniqueWordCount'])

    # Add the "Size" row.
    i += 1
    if i > len(table_rows) - 1:
        table_rows.append(['Size'])
    table_rows[i].append(block['size'])

We have created two 2-dimensional lists. The table_headers list contains a single table header row of the block names. The table_rows list contains the frequency rows for the 6-grams, followed by the “Word Types” and “Size” rows. The next step is to generate HTML markup from this tabular data and render it.

# Start with the opening tags of container and table elements.
html = '<div style="overflow-x: auto; margin-top: 40px;"><table border="1">'

# Append table headers.
for table_row in table_headers:
    html += '<tr>'
    for table_cell in table_row:
        html += f'<th style="white-space: nowrap;">{table_cell}</th>'
    html += '</tr>'

# Append HTML for table rows.
for table_row in table_rows:
    html += '<tr>'
    for table_cell in table_row:
        html += f'<td>{table_cell}</td>'
    html += '</tr>'

# Close the table and container elements.
html += '</table></div>'

# Render the HTML.
display(HTML(html))
| Word | 1 Henry IV Q1 (1) | Antony and Cleopatra (1) | Comedy of Errors (1) | Coriolanus (1) | Hamlet Q2 (1) | Henry V F (1) | Julius Caesar (1) | King Lear Q1 (1) | Love's Labor's Lost Q1 (1) | Macbeth (1) | Merchant of Venice Q1 (1) | Midsummer Night's Dream Q1 (1) | Much Ado About Nothing Q1 (1) | Othello Q1 (1) | Richard II Q1 (1) | Richard III Q1 (1) | Romeo and Juliet Q2 (1) | Tempest (1) | Twelfth Night (1) | Winter's Tale (1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hey.ho.the.wind.and.the | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |
| ho.the.wind.and.the.rain | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |
| for.the.rain.it.raineth.every | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| i.could.find.in.my.heart | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| the.rain.it.raineth.every.day | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| with.hey.ho.the.wind.and | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| chooseth.me.shall.get.as.much | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| get.as.much.as.he.deserves | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| give.me.a.cup.of.sack | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| how.now.what.is.the.matter | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Word Types | 20652 | 18922 | 11606 | 22193 | 24180 | 22265 | 15732 | 19937 | 16224 | 13638 | 17985 | 13925 | 16424 | 19539 | 18105 | 21799 | 20236 | 13374 | 15245 | 21399 |
| Size | 20692 | 18928 | 11610 | 22209 | 24192 | 22299 | 15762 | 19944 | 16310 | 13659 | 18033 | 13939 | 16453 | 19547 | 18132 | 21865 | 20276 | 13381 | 15319 | 21411 |

Now we have our table rendered. From the table, we can see that two 6-grams (“hey ho the wind and the” and “ho the wind and the rain”) each appear 5 times in a single play, Twelfth Night; one 6-gram (“I could find in my heart”) appears in 4 different plays. The 6-grams can be overlapping (“hey ho the wind and the”, “ho the wind and the rain”), as the short sketch below illustrates.
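Any line of seven or more words contributes several consecutive, overlapping 6-grams. A quick standalone illustration (not part of the SIA workflow):

# Every window of 6 consecutive words in a longer line is its own 6-gram.
line = "with hey ho the wind and the rain".split()
for start in range(len(line) - 5):
    print(".".join(line[start:start + 6]))

Running this prints three of the 6-grams from the table: with.hey.ho.the.wind.and, hey.ho.the.wind.and.the, and ho.the.wind.and.the.rain.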


Next, we create a bar chart showing the overall top 10 most frequent 6-grams with their frequencies in the whole text set. Because the word frequencies returned from the API are per block, we first need to sum the frequencies across all blocks.

# Create a dictionary to hold word frequencies
word_frequency_map = {}
for block in response_data["blocks"]:
    for freq in block['frequencies']:
        word = freq['word']
        value = freq['value']
        word_frequency_map[word] = word_frequency_map.get(word, 0) + value

We have created a dictionary with the 6-grams as keys and their total frequencies in the text set as values. We then use the Matplotlib library to visualise this, with the 6-grams on the x-axis and their frequencies on the y-axis.

# Plotting
plt.figure(figsize=(12, 6))
plt.bar(list(word_frequency_map.keys()), list(word_frequency_map.values()), color='blue')
plt.xlabel('6-grams')
plt.ylabel('Frequency')
plt.title('Top 10 6-grams in the 20 plays by frequency')
plt.xticks(rotation=70)
plt.tight_layout()  # keep the rotated 6-gram labels from being clipped
plt.show()
[Figure: bar chart of the top 10 6-grams in the 20 plays by frequency]

Summing across the text set, the song 6-grams shared by King Lear and Twelfth Night top the chart at 6 occurrences each, and every 6-gram in the top 10 appears at least 4 times across the 20 plays.
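One refinement you might consider (not in the original notebook): after summing across blocks, the dictionary keeps insertion order rather than total-frequency order, so the bars are not necessarily tallest-first. Sorting the aggregated counts before plotting fixes that:

# Sketch: sort the aggregated counts so the bars appear tallest-first.
sorted_items = sorted(word_frequency_map.items(), key=lambda item: item[1], reverse=True)
words = [word for word, _ in sorted_items]
counts = [count for _, count in sorted_items]

plt.figure(figsize=(12, 6))
plt.bar(words, counts, color='blue')
plt.xlabel('6-grams')
plt.ylabel('Frequency')
plt.title('Top 10 6-grams in the 20 plays by frequency (sorted)')
plt.xticks(rotation=70)
plt.tight_layout()
plt.show()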