November 2, 2019

Language detection: Part one

Mastering language is arguably Homo sapiens’ greatest cognitive ability. It’s, in fact, the centerpiece of the Turing test. Work in natural language understanding has been going on for decades. While machines are yet to truly understand human language, current efforts like OpenAI’s GPT-2 and Google’s BERT demonstrate that machines are on the edge of this computational accomplishment. However, a lot of these efforts are biased against the majority of African languages.

In this series, we seek to demonstrate how a machine can identify a language when fed a phrase. We are going to focus our efforts on Kenyan languages, namely Kikuyu, Luo, Kalenjin and Kiswahili. To validate our efforts, we will also include English as a control.

We will be using the latest version of TensorFlow as our framework of choice. We don’t assume any previous experience with TensorFlow or deep learning. The intent is to present this process in an intuitive flow, so it is broken into two parts.

Part one

Part two

Now that we have set our objectives, let’s begin.

PART ONE

Fetching and preprocessing the data

Fetching and preprocessing the data is arguably the most important step in any machine learning effort. Corpora for African languages are scarce and, where they exist, usually too small. For this effort, we will use the Bible as our corpus. Collecting it involves a separate web scraping process that is out of the scope of this article.

We have already done the work of scraping the Bible, so you can access the data here.

Let’s begin by importing libraries that we need.

import tensorflow as tf
import pandas as pd
import unicodedata
import re
import numpy as np
import os
import io
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
print(tf.__version__)
print("GPU available {}".format(tf.test.is_gpu_available()))

If everything is good, a quick run of the above should show the version of TensorFlow we are using and whether GPU is enabled.

While GPU support is not required, it comes in handy to speed up training. It is one of those things that’s nice to have.

2.0.0
GPU available True

We will define helper functions for the preprocessing step.

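def unicode_to_ascii(s):
  # Decompose accented characters (NFD) and drop the combining marks
  return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())
  # Put spaces around sentence punctuation so it separates from the words
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  # Collapse runs of spaces (and double quotes) into a single space
  w = re.sub(r'[" "]+', " ", w)
  # Replace everything except letters and basic punctuation with a space
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
  # Drop the remaining punctuation, leaving only words
  pattern = r"[{}]".format(string.punctuation)
  w = re.sub(pattern, '', w).strip().lower()
  return w

def create_words_set(path):
    # Read the file, split it into lines, then flatten the lines into one list of words
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    words = [ w for l in lines for w in preprocess_sentence(l).split() ]
    return words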

The most important bit to understand is this line

words = [ w for l in lines for w in preprocess_sentence(l).split() ]

Essentially, what we’ll be doing is creating a word-based dictionary. This strategy has a downstream effect on the performance of the model.
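
To make the word-based approach concrete, here is a minimal, self-contained sketch that runs the same helpers on two in-memory lines instead of a file. The sample lines are illustrative assumptions, not rows taken from the dataset.

```python
import re
import string
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks (category "Mn")
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    pattern = r"[{}]".format(string.punctuation)
    w = re.sub(pattern, '', w).strip().lower()
    return w

# Hypothetical sample lines, used only for illustration
sample_lines = [
    "Hapo mwanzo, Mungu aliumba mbingu na dunia.",
    "Kĩambĩrĩria-inĩ kĩa maũndũ mothe!",
]

# Same flattening as in create_words_set: every line contributes its words
words = [w for l in sample_lines for w in preprocess_sentence(l).split()]
print(words)
```

Every line is reduced to lowercase ASCII words, and the sentence boundaries disappear: the model will only ever see one word at a time.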

We will now fetch the data using TensorFlow’s utility function in the keras package.

path_to_kikuyu_file = tf.keras.utils.get_file('kikuyu.txt', origin='https://gitlab.com/daviddexter/ml-datasets/raw/master/asili/kikuyu.txt')
path_to_english_file = tf.keras.utils.get_file('english.txt', origin='https://gitlab.com/daviddexter/ml-datasets/raw/master/asili/english.txt')
path_to_kiswahili_file = tf.keras.utils.get_file('kiswahili.txt', origin='https://gitlab.com/daviddexter/ml-datasets/raw/master/asili/kiswahili.txt')
path_to_luo_file = tf.keras.utils.get_file('luo.txt', origin='https://gitlab.com/daviddexter/ml-datasets/raw/master/asili/luo.txt')
path_to_kalenjin_file = tf.keras.utils.get_file('kalenjin.txt', origin='https://gitlab.com/daviddexter/ml-datasets/raw/master/asili/kalenjin.txt')
kikuyu_words = create_words_set(path_to_kikuyu_file)
english_words = create_words_set(path_to_english_file)
kiswahili_words = create_words_set(path_to_kiswahili_file)
luo_words = create_words_set(path_to_luo_file)
kalenjin_words = create_words_set(path_to_kalenjin_file)

The get_file function caches the downloaded file locally, so running the cell again does not download the dataset a second time. Let’s take a quick peek at the dataset by printing the first 10 words from each language.

print(kikuyu_words[:10])
print(english_words[:10])
print(kiswahili_words[:10])
print(luo_words[:10])
print(kalenjin_words[:10])
['kiambiriria', 'ini', 'kia', 'maundu', 'mothe', 'ngai', 'niombire', 'iguru', 'na', 'thi']
['when', 'god', 'began', 'to', 'create', 'the', 'heavens', 'and', 'the', 'earth']
['hapo', 'mwanzo', 'mungu', 'aliumba', 'mbingu', 'na', 'dunia', 'dunia', 'ilikuwa', 'bila']
['kar', 'chakruok', 'nyasaye', 'nochueyo', 'polo', 'gi', 'piny', 'piny', 'ne', 'onge']
['eng', 'taunet', 'ko', 'ki', 'toi', 'kamuktaindet', 'koyai', 'kipsengwet', 'ak', 'ng']

The data looks pretty clean at first glance. Notice that punctuation has been removed, leaving only words. One thing to note is that African languages, especially those of Bantu origin, use special annotations to mark accent, and our unicode_to_ascii helper strips them. Since our goal is solely language identification, we expect this not to affect accuracy. In the event it does, we will figure out ways to address it.

Neural networks only digest numbers. Therefore, we turn each word into an integer so that we can feed it to our model. In this case, we will use the Keras Tokenizer class to help us do this. Let’s create a function that will house the tokenization logic.

def tokenize(lang):
  tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  tokenizer.fit_on_texts(lang)
  tensor = tokenizer.texts_to_sequences(lang)
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,padding='post')
  return tensor, tokenizer

In a nutshell, the function takes a list of words and converts each one into a sequence of integer ids. Finally, it pads the sequences at the end so that they all have the same length. Since every input here is a single word, each sequence ends up with a length of one.
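
The idea behind the tokenizer can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: it assumes that ids are handed out from 1 in order of decreasing frequency and that 0 is left free for padding. The names fit_word_index and words_to_sequences are hypothetical.

```python
from collections import Counter

def fit_word_index(words):
    # Most frequent word gets id 1; id 0 is kept free for padding
    counts = Counter(words)
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def words_to_sequences(words, word_index):
    # Each word becomes a sequence holding exactly one id
    return [[word_index[w]] for w in words if w in word_index]

sample_words = ['na', 'dunia', 'dunia', 'ilikuwa', 'na', 'dunia']
word_index = fit_word_index(sample_words)
sequences = words_to_sequences(sample_words, word_index)
vocab_count = len(word_index) + 1  # +1 accounts for the reserved padding id

print(word_index)
print(sequences)
print(vocab_count)
```

The same +1 shows up later when the vocabulary size is computed for the embedding layer.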

What we are trying to do is a classic supervised classification problem. Hence our features need to have their corresponding labels. In this case, Kikuyu will have a label of 0, English a label of 1, Kiswahili a label of 2, Luo a label of 3 and Kalenjin a label of 4.

A dict will be useful in this definition.

LANG_MAP = {"kikuyu":0,"english":1,"kiswahili":2,"luo":3,"kalenjin":4}
REV_LANG_MAP = { v:k for k,v in LANG_MAP.items()} # to be used during inference
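
As a small sketch of how the reverse map could be used at inference time, the snippet below turns a vector of class probabilities into a language name. The probabilities are made up for illustration and label_to_language is a hypothetical helper.

```python
LANG_MAP = {"kikuyu": 0, "english": 1, "kiswahili": 2, "luo": 3, "kalenjin": 4}
REV_LANG_MAP = {v: k for k, v in LANG_MAP.items()}

def label_to_language(probabilities):
    # Pick the index with the highest probability, then look up its name
    best_label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return REV_LANG_MAP[best_label]

example_probs = [0.02, 0.03, 0.05, 0.85, 0.05]  # made-up softmax output
predicted = label_to_language(example_probs)
print(predicted)
```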

Now we will create a function that takes a list of words of a language and maps each word to its corresponding label.

def lang_dataset(lang_words,label):
    lang_array = np.array(lang_words)
    lang_labels = np.full_like(lang_array,label,dtype=np.int32)
    lang_dataset = np.append(lang_array.reshape(-1,1),lang_labels.reshape(-1,1),axis=1)
    return lang_dataset

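Essentially, the function creates a NumPy array from the list of words and appends a label against each word. Because a NumPy array holds a single data type, the labels are stored as strings next to the words, which is why we cast them back to integers later. We then pass in the words and create a final data array which contains features in the first column and labels in the second column.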

kik_dataset = lang_dataset(kikuyu_words,LANG_MAP["kikuyu"])
eng_dataset = lang_dataset(english_words,LANG_MAP["english"])
kis_dataset = lang_dataset(kiswahili_words,LANG_MAP["kiswahili"])
luo_dataset = lang_dataset(luo_words,LANG_MAP["luo"])
kalenjin_dataset = lang_dataset(kalenjin_words,LANG_MAP["kalenjin"])
data = np.vstack((kik_dataset,eng_dataset,kis_dataset,luo_dataset,kalenjin_dataset))

Let’s have a look at what our data now looks like.

print(data.shape)
dd = {'word':data[:,0],"lang":data[:,1]}
d = pd.DataFrame(data=dd)
d.head(10)

The output is

(2806831, 2)
	word 	lang
0 	kiambiriria 	0
1 	ini 	0
2 	kia 	0
3 	maundu 	0
4 	mothe 	0
5 	ngai 	0
6 	niombire 	0
7 	iguru 	0
8 	na 	0
9 	thi 	0

Looks good. Now we need to separate the features from the labels so that we can pass the features through the tokenization function we defined above. Once the features have been tokenized, we split the features and labels into a train set, a validation set and a test set. We use the TensorFlow Dataset API to create the final datasets.

features,labels = data[:,0].reshape(-1,1) , data[:,1].reshape(-1,1).astype(np.int32)
features,tokenizer =  tokenize(features.reshape(1,-1).tolist()[0])
vocab_count = len(tokenizer.word_index) + 1
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.35, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.35, random_state=42)
train_dataset = tf.data.Dataset.from_tensor_slices((X_train,y_train)).shuffle(buffer_size=100).batch(1000)
val_dataset = tf.data.Dataset.from_tensor_slices((X_val,y_val)).shuffle(buffer_size=100).batch(500)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test,y_test)).batch(500)
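
The two chained splits are easy to misread, so here is the arithmetic as a short sketch. It assumes the held-out size is rounded up, which is our understanding of how train_test_split treats a fractional test_size; the row count is the one reported by data.shape.

```python
import math

n_samples = 2806831                      # rows in the combined dataset

# First split: 35% is held out, the rest is the train set
holdout = math.ceil(0.35 * n_samples)
n_train = n_samples - holdout

# Second split: 35% of the held-out rows become the test set, the rest validation
n_test = math.ceil(0.35 * holdout)
n_val = holdout - n_test

# Number of batches when the test set is batched by 500
test_batches = math.ceil(n_test / 500)

print(n_train, n_val, n_test, test_batches)
```

In other words, roughly 65% of the words go to training, about 23% to validation and about 12% to testing. The 688 test batches agree with the step count in the evaluation log further down.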

As you may have noticed, a lot is going on here, enough to warrant moving the above into a function of its own.

def create_dataset():
  def lang_dataset(lang_words,label):
    lang_array = np.array(lang_words)
    lang_labels = np.full_like(lang_array,label,dtype=np.int32)
    lang_dataset = np.append(lang_array.reshape(-1,1),lang_labels.reshape(-1,1),axis=1)
    return lang_dataset
  kik_dataset = lang_dataset(kikuyu_words,LANG_MAP["kikuyu"])
  eng_dataset = lang_dataset(english_words,LANG_MAP["english"])
  kis_dataset = lang_dataset(kiswahili_words,LANG_MAP["kiswahili"])
  luo_dataset = lang_dataset(luo_words,LANG_MAP["luo"])
  kalenjin_dataset = lang_dataset(kalenjin_words,LANG_MAP["kalenjin"])

  data = np.vstack((kik_dataset,eng_dataset,kis_dataset,luo_dataset,kalenjin_dataset))
  features,labels = data[:,0].reshape(-1,1) , data[:,1].reshape(-1,1).astype(np.int32)
  # tokenize feature words dictionary
  features,tokenizer =  tokenize(features.reshape(1,-1).tolist()[0])
  vocab_count = len(tokenizer.word_index) + 1
  X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.35, random_state=42)
  X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.35, random_state=42)
  train_dataset = tf.data.Dataset.from_tensor_slices((X_train,y_train)).shuffle(buffer_size=100).batch(1000)
  val_dataset = tf.data.Dataset.from_tensor_slices((X_val,y_val)).shuffle(buffer_size=100).batch(500)
  test_dataset = tf.data.Dataset.from_tensor_slices((X_test,y_test)).batch(500)
  return train_dataset,val_dataset,test_dataset,vocab_count,data,tokenizer

With the logic set up, let’s now create our datasets.

train_dataset, val_dataset, test_dataset, vocab, data, app_tokenizer = create_dataset()

We return vocab which is the size of the vocabulary and app_tokenizer. Vocabulary size will be used later when we define the embedding layer while the tokenizer will be used when performing inference.

We expect our features and labels to be of the same shape. Let’s see if that’s correct.

for s_f,s_l in train_dataset.take(1):
  print(s_f.numpy().shape)
  print(s_l.numpy().shape)

Indeed it is, as shown below. Good. We are on the right track.

(1000, 1)
(1000, 1)

Build, evaluate and tune the base neural network

With our data set up, we can now begin defining our model. Machine learning, or rather deep learning for that matter, is more of an art than a science. Many experiments have to be run to achieve an acceptable accuracy. We will first define our base model, which will be a standard feed-forward network.

def build_base_model():
  """
  Fully connected network
  """
  model_input = tf.keras.layers.Input(shape=(1,))
  x = tf.keras.layers.Embedding(vocab,16)(model_input)
  x = tf.keras.layers.GlobalAveragePooling1D()(x)
  x = tf.keras.layers.Dense(128,kernel_initializer="he_normal",use_bias=False,kernel_regularizer=tf.keras.regularizers.l2())(x)
  x = tf.keras.layers.Activation("relu")(x)
  x = tf.keras.layers.Dense(64,kernel_initializer="he_normal",use_bias=False,kernel_regularizer=tf.keras.regularizers.l2())(x)
  x = tf.keras.layers.Activation("relu")(x)
  pred = tf.keras.layers.Dense(5,activation="softmax")(x)
  model = tf.keras.Model(model_input,outputs=[pred])
  model.compile(optimizer=tf.keras.optimizers.Adam() ,loss='sparse_categorical_crossentropy',
                metrics=['sparse_categorical_accuracy'])
  return model

Here is a fully connected network, built with the Keras functional API, with an embedding layer and two dense layers. We use the embedding to convert each feature into a dense vector in a space of fixed dimension. A similar vectorization technique is one-hot encoding, but the challenge with it is that the vector dimension grows with the vocabulary: every word needs a vector as long as the vocabulary itself. To learn more about embedding, follow this link.
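
The relationship between the two techniques can be shown with a toy example. The sketch below uses a made-up vocabulary of 5 words and 3 embedding dimensions, with random numbers standing in for learned weights. It shows that multiplying a one-hot vector by the embedding table gives the same result as simply looking up a row, which is what an embedding layer does.

```python
import random

random.seed(0)
vocab_size, embed_dim = 5, 3   # toy sizes, chosen only for illustration

# One row of embed_dim numbers per word id (random stand-ins for learned weights)
table = [[random.uniform(-1, 1) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def one_hot(index, size):
    # A vector as long as the vocabulary with a single 1 in it
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(index):
    # One-hot vector times the table: touches every row to pick out one
    vec = one_hot(index, vocab_size)
    return [sum(vec[r] * table[r][c] for r in range(vocab_size))
            for c in range(embed_dim)]

def embed_by_lookup(index):
    # What an embedding layer does: index straight into the table
    return table[index]

print(embed_by_matmul(2))
print(embed_by_lookup(2))
```

Both calls return the same 3 numbers, but the lookup never builds the vocabulary-sized vector.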

We have chosen to have Adam as our optimizer and sparse_categorical_crossentropy as our loss. While the loss will not be changed since our labels are integers, we can swap in any optimizer. We will do just that as we evaluate our model.
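
To see why integer labels pair with the sparse variant of the loss, here is a hand-rolled sketch for a single example. The function names and the probabilities are illustrative assumptions; the point is that the sparse form reads the integer label directly, while the non-sparse form needs a one-hot target and arrives at the same number.

```python
import math

def sparse_cce(probs, label):
    # Integer label: the loss is minus the log-probability of the true class
    return -math.log(probs[label])

def cce(probs, one_hot_target):
    # One-hot target: same quantity, written as a sum over all classes
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

probs = [0.05, 0.05, 0.80, 0.05, 0.05]   # made-up softmax output
label = 2                                 # kiswahili in LANG_MAP
one_hot_target = [0.0, 0.0, 1.0, 0.0, 0.0]

print(sparse_cce(probs, label))
print(cce(probs, one_hot_target))
```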

Let’s build the model and see its summary information.

model = build_base_model()
model.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 1)]               0
_________________________________________________________________
embedding (Embedding)        (None, 1, 16)             2338704
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0
_________________________________________________________________
dense (Dense)                (None, 128)               2048
_________________________________________________________________
activation (Activation)      (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8192
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 325
=================================================================
Total params: 2,349,269
Trainable params: 2,349,269
Non-trainable params: 0
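
The parameter counts in the summary can be checked by hand. In the sketch below, the vocabulary size of 146,169 is not printed anywhere in the article; it is derived by dividing the embedding parameter count by the 16 embedding dimensions.

```python
vocab_size = 2338704 // 16        # derived from the summary: 146,169 word ids
embed_dim = 16

embedding = vocab_size * embed_dim   # one 16-dimensional vector per word id
dense = 16 * 128                     # use_bias=False, so weights only
dense_1 = 128 * 64                   # use_bias=False, so weights only
dense_2 = 64 * 5 + 5                 # output layer keeps its 5 biases

total = embedding + dense + dense_1 + dense_2
print(vocab_size, embedding, dense, dense_1, dense_2, total)
```

Almost all of the parameters sit in the embedding table; the dense layers account for only about ten thousand.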

Let’s define a few callbacks.

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_sparse_categorical_accuracy', patience=3)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_sparse_categorical_accuracy', factor=0.2,patience=1, min_lr=0.00001,verbose=1)

The first callback stops training in case the validation accuracy does not improve for 3 epochs. The second one reduces the learning rate, multiplying it by 0.2 whenever the validation accuracy fails to improve for one epoch, down to a minimum of 0.00001. This is a very useful trick since the tweaking of the learning rate happens during training. Finally, we’ll train the model.

model.fit(train_dataset,validation_data=val_dataset, epochs=50,
                    callbacks=[reduce_lr,early_stopping],verbose=0)
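
To get a feel for how the two callbacks interact, here is a simplified simulation in plain Python. It is a sketch under assumptions, not the Keras logic: it ignores details such as min_delta and cooldown, the validation accuracies are made up, and simulate_callbacks is a hypothetical helper. The starting learning rate of 0.001 is the Adam default.

```python
def simulate_callbacks(val_acc, lr=0.001, factor=0.2, lr_patience=1,
                       min_lr=0.00001, stop_patience=3):
    best = float("-inf")
    lr_wait = stop_wait = 0
    history = []
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:
            # Improvement: remember it and reset both counters
            best = acc
            lr_wait = stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
        if lr_wait >= lr_patience:
            # Plateau: shrink the learning rate, but never below min_lr
            lr = max(lr * factor, min_lr)
            lr_wait = 0
        # Record the learning rate in effect after this epoch
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            # No improvement for stop_patience epochs: stop training
            break
    return history

made_up_val_acc = [0.80, 0.85, 0.88, 0.88, 0.87, 0.88, 0.90]
history = simulate_callbacks(made_up_val_acc)
for epoch, lr in history:
    print(epoch, lr)
```

With these numbers the learning rate is cut three times and training stops after epoch 6, so the better score at epoch 7 is never reached. That is the trade-off of a short patience.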

The training takes a few minutes, stopping after 9 epochs.

results = model.evaluate(test_dataset)
print('test loss, test acc:', results)
688/688 [==============================] - 3s 5ms/step - loss: 0.2381 - sparse_categorical_accuracy: 0.8939
test loss, test acc: [0.23811545959392258, 0.89394975]

89% accuracy. Not a bad start. At this point, we can comfortably start tuning a few parameters to see if the accuracy increases. Let’s change the optimizer from Adam to SGD.

model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True,momentum=0.9),loss='sparse_categorical_crossentropy',
                metrics=['sparse_categorical_accuracy'])

With SGD, the loss and accuracy curves are smooth and roughly exponential in shape. Even though the accuracy dropped to 88%, it seems that SGD may be the best optimizer for this problem.
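
For readers curious about what nesterov=True and momentum=0.9 do, here is a minimal sketch of the update rule as we understand it from the Keras documentation, applied to a toy one-dimensional loss f(w) = w². The toy loss, the starting point and the function names are assumptions for illustration; the learning rate of 0.01 is the SGD default.

```python
def sgd_nesterov_step(w, velocity, grad, lr=0.01, momentum=0.9):
    # The velocity accumulates past gradients; the weight update then
    # applies the momentum term once more on top of the fresh velocity
    velocity = momentum * velocity - lr * grad
    w = w + momentum * velocity - lr * grad
    return w, velocity

def minimise(steps=200, w=5.0):
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of the toy loss f(w) = w**2
        w, velocity = sgd_nesterov_step(w, velocity, grad)
    return w

first_w, first_velocity = sgd_nesterov_step(5.0, 0.0, 10.0)
final_w = minimise()
print(first_w, final_w)
```

The weight moves steadily toward the minimum at 0 instead of jumping around.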

Test different architectures against the base neural network

We’ll define 3 more architectures and try them individually.

def build_bilstm():
  """
  Bidirectional LSTM
  """
  model_input = tf.keras.layers.Input(shape=(1,))
  x = tf.keras.layers.Embedding(vocab,8)(model_input)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128,return_sequences=True))(x)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,return_sequences=True))(x)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
  x = tf.keras.layers.Dropout(0.2)(x)
  pred = tf.keras.layers.Dense(5,activation="softmax")(x)
  model = tf.keras.Model(model_input,outputs=[pred])
  model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True,momentum=0.9),loss='sparse_categorical_crossentropy',
                metrics=['sparse_categorical_accuracy'])
  return model

def build_conv1d():
  """
  CONV1D network
  """
  model_input = tf.keras.layers.Input(shape=(1,))
  x = tf.keras.layers.Embedding(vocab,16)(model_input)
  x = tf.keras.layers.Conv1D(100,3,activation='relu',padding='same')(x)
  x = tf.keras.layers.Conv1D(100,3,activation='relu',padding='same')(x)
  x = tf.keras.layers.MaxPooling1D(3,padding='same')(x)
  x = tf.keras.layers.GlobalAveragePooling1D()(x)
  pred = tf.keras.layers.Dense(5,activation="softmax")(x)
  model = tf.keras.Model(model_input,outputs=[pred])
  model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True,momentum=0.9),
                loss='sparse_categorical_crossentropy',metrics=['sparse_categorical_accuracy'])
  return model
def build_ensemble():
  """
  FNN and Conv1D ensemble
  """
  model_input = tf.keras.layers.Input(shape=(1,))
  x = tf.keras.layers.Embedding(vocab,16)(model_input)
  x1 = tf.keras.layers.GlobalAveragePooling1D()(x)
  x1 = tf.keras.layers.Dense(128,kernel_initializer="he_normal",use_bias=False,kernel_regularizer=tf.keras.regularizers.l2())(x1)
  x1 = tf.keras.layers.Activation("relu")(x1)
  x1 = tf.keras.layers.Dense(64,kernel_initializer="he_normal",use_bias=False,kernel_regularizer=tf.keras.regularizers.l2())(x1)
  x2 = tf.keras.layers.Conv1D(100,3,activation='relu',padding='same')(x)
  x2 = tf.keras.layers.Conv1D(100,3,activation='relu',padding='same')(x2)
  x2 = tf.keras.layers.MaxPooling1D(3,padding='same')(x2)
  x2 = tf.keras.layers.GlobalAveragePooling1D()(x2)
  x = tf.keras.layers.concatenate([x1,x2])
  pred = tf.keras.layers.Dense(5,activation="softmax")(x)
  model = tf.keras.Model(model_input,outputs=[pred])
  model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True,momentum=0.9),
                loss='sparse_categorical_crossentropy',metrics=['sparse_categorical_accuracy'])
  return model

After running all the different architectures, the accuracy ranges from 88.20% to 89.50%, evidence that perhaps the data, and not the architectures, may be at fault. However, before ruling out the architectures, let’s add masking to the embedding layer of the base model and evaluate it.

x = tf.keras.layers.Embedding(vocab,16, mask_zero=True)(model_input)

The accuracy doesn’t increase significantly even after adding a mask to the embedding. At this point, it’s clear that our data is the culprit. When we take a close look at our data, we realize that the minimum word length is 1. Such words are just single characters that add noise to the data. We will, therefore, remove very short words, keeping only those longer than two characters, and evaluate the performance. In our create_words_set, let’s add the logic to do just that.

def create_words_set(path):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    words = [ w for l in lines for w in preprocess_sentence(l).split() ]
    words = [ w for w in words if len(w) > 2 ] # ==> keep only words longer than two characters
    return words
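
As a quick check of what the filter does, the sketch below applies the same condition to the ten Kalenjin words printed earlier in the article.

```python
kalenjin_sample = ['eng', 'taunet', 'ko', 'ki', 'toi', 'kamuktaindet',
                   'koyai', 'kipsengwet', 'ak', 'ng']

# Same condition as in create_words_set: keep words longer than two characters
kept = [w for w in kalenjin_sample if len(w) > 2]
dropped = [w for w in kalenjin_sample if len(w) <= 2]

print(kept)
print(dropped)
```

Note that len(w) > 2 drops two-letter words as well as single characters, so four of the ten sample words disappear.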

Settle on a model that achieves optimal evaluation score

With short words removed, the accuracy increases to 92.35%. This is good progress. We can test the other architectures on the now refined dataset and see which model achieves the highest accuracy. After testing, all models evaluate to the same accuracy range. It appears that all models have saturated their learning and very little can be done to improve them. We can, therefore, settle on the base model with SGD as the optimizer and either relu or elu as the activation function. In the second part, we will change gears and rethink our strategy.

Final thoughts

As you may have learned by now, training a deep learning model is more a process of experimentation than a logical flow. There are no silver bullets. Some tweaks may increase or decrease the evaluation metrics of your model. Thank you for taking the time to read this. Critique and comments are welcome. See you in the second part.


© Mwangi Kariuki 2019-2024