A dot (.) marks the beginning and the end of a word. The number under each bigram AB is the probability, in percent, that the letter A is followed by the letter B.
Read the diagram row by row. The probabilities in each row sum to 100%. The first row gives the probability of each character starting a word; the second row gives the probability of each character following "а".
The first column gives the probability that the word ends once we are on that row's character. It is not the probability that a word ends with that character, so the first column does not sum to 100%.
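To make the row/column distinction concrete, here is a minimal sketch on a toy three-character alphabet. The names `toy_itos`, `toy_words`, `counts` and `probs` are made up for illustration and are not part of the notebook, and plain Python lists stand in for the tensor. Each row is normalized on its own, so the rows sum to 1 while the first column does not.

```python
# Toy alphabet and word list (hypothetical, for illustration only).
toy_itos = ".аб"
toy_words = ["аб", "а", "ба"]
toy_stoi = {s: i for i, s in enumerate(toy_itos)}
n = len(toy_itos)

# Count bigrams, padding each word with the '.' boundary marker.
counts = [[0] * n for _ in range(n)]
for w in toy_words:
    chrs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chrs, chrs[1:]):
        counts[toy_stoi[c1]][toy_stoi[c2]] += 1

# Normalize each row by its own total: probs[i][j] = P(next = j | current = i).
probs = [[c / sum(row) for c in row] for row in counts]

row_sums = [sum(row) for row in probs]            # one value per row, each 1 up to rounding
first_col_sum = sum(row[0] for row in probs)      # P(end | '.') + P(end | 'а') + P(end | 'б')

print(row_sums)
print(first_col_sum)
```

In this toy data the first column holds 0, 2/3 and 1/2, which add up to 7/6: each entry is conditioned on a different current character, so there is no reason for the column to sum to 1.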
words = open('/usr/share/dict/ukrainian').read().splitlines() # needs package wukrainian to be installed
itos = ".абвгґдеєжзиіїйклмнопрстуфхцчшщьюя'-"
stoi = {s: i for i, s in enumerate(itos)}
nchars = len(itos)
import torch
import random
N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    # pad each word with '.' so that word starts and ends are counted as bigrams
    chrs = ['.'] + list(w.lower()) + ['.']
    for c1, c2 in zip(chrs, chrs[1:]):
        i1 = stoi[c1]
        i2 = stoi[c2]
        N[i1, i2] += 1
P = N.float()
P = P / P.sum(1, keepdim=True)
import matplotlib.pyplot as plt
%matplotlib inline
# plt.imshow(N)
fig = plt.figure(figsize=(16, 16))
plt.imshow(P, cmap='Blues')
for i in range(nchars):
    for j in range(nchars):
        chstr = itos[i] + itos[j]
        # bigram label above, probability in percent below
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, '%.1f' % (P[i, j].item()*100.0), ha="center", va="top", color='gray')
plt.axis('off')
fig.savefig('uk_digrams.png', bbox_inches='tight')