In the first part of this series, we laid the foundation and developed a model that could identify a language with an error rate of 8%. In this second part, we will change our strategy and develop a superior model that achieves 99% accuracy.
I encourage you to read the first part so that you get the intuition behind why we are changing strategy.
Let’s begin.
Part two
Change data and architecture strategy
In the first part, we built our language dictionary by labeling individual words with their corresponding languages. That strategy had a few weaknesses:
- Words from different languages may have the same spelling, letter-by-letter, but have different meanings.
- Many African languages share historical origins, so their word structures resemble one another. The differences that set them apart are lost when words are analyzed in isolation.
- With the word-based approach, short words (fewer than 2 characters) were removed to improve accuracy. This reduced the dataset size and perhaps discarded meaningful information that only the neural network could see.
With these weaknesses in mind, our new strategy will use a sentence-based dictionary.
Let’s change the create_words_set function to return full-length sentences rather than words.
import io

def create_words_set(path):
    # Despite the name, this now returns one preprocessed sentence per line.
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')
    return [preprocess_sentence(line) for line in lines]
With this new strategy, we may find ourselves with a relatively imbalanced dataset. A quick examination of the data shows this to be true.
print(len(kikuyu_words))
print(len(english_words))
print(len(kiswahili_words))
print(len(luo_words))
print(len(kalenjin_words))
29079
29079
31020
35132
31028
Clearly, Kikuyu and English are somewhat under-represented, though not severely. The difference may not harm performance, but that has to be proved by experiment. For now, we will limit ourselves to a balanced dataset, so every language will be trimmed down to the size of Kikuyu and English (29,079 sentences).
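The trimming step itself is not shown in this post, so here is a minimal sketch of one way it could be done. The helper name trim_to_smallest and the placeholder sentences are my own assumptions; only the per-language counts come from the output above. Taking the first n sentences is the simplest choice, and shuffling before trimming may be preferable if the source files are ordered in some way.

```python
def trim_to_smallest(corpora):
    """Cut every language's sentence list down to the size of the smallest one."""
    smallest = min(len(sentences) for sentences in corpora.values())
    return {lang: sentences[:smallest] for lang, sentences in corpora.items()}

# Placeholder data: only the sizes match the counts printed in this post.
sizes = {
    "kikuyu": 29079,
    "english": 29079,
    "kiswahili": 31020,
    "luo": 35132,
    "kalenjin": 31028,
}
corpora = {
    lang: [f"{lang} sentence {i}" for i in range(n)] for lang, n in sizes.items()
}

balanced = trim_to_smallest(corpora)
for lang, sentences in balanced.items():
    print(lang, len(sentences))
```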
Build new network architecture and evaluate
The rest of the functions remain the same except for the Input layer of the model, particularly its shape. Let’s take a look.
for s_f, s_l in train_dataset.take(1):
    print(s_f.numpy().shape)
    print(s_l.numpy().shape)
(1000, 237)
(1000, 1)
Each sample now has 237 features, so there is a mismatch between the shape of the dataset and what the model from part one expects. The model won’t begin training if the data shapes do not match. That’s easily solved by changing the shape in the Input layer definition.
model_input = tf.keras.layers.Input(shape=(237,))
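Where does 237 come from? The same value is passed as maxlen to pad_sequences in the prediction helper later in this post, so it is presumably the padded sentence length. As a rough illustration, the sketch below is a plain-Python stand-in for post-padding. pad_post is a hypothetical helper, not the Keras function, and the token ids are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each token-id sequence with `value` at the end, up to `maxlen`.

    Sequences longer than `maxlen` keep their last `maxlen` tokens, which
    mirrors the default truncating='pre' behaviour of Keras pad_sequences.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Two made-up tokenized sentences of different lengths.
batch = pad_post([[5, 3], [7, 1, 4, 2]], maxlen=237)
print(len(batch), len(batch[0]), len(batch[1]))
```

Every row ends up with the same length, which is why the model can declare a fixed input shape of (237,).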
Here is how our final model definitions look.
def build_fnn():
    """
    Fully connected network
    """
    model_input = tf.keras.layers.Input(shape=(237,))
    x = tf.keras.layers.Embedding(vocab, 16, mask_zero=True)(model_input)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dense(128, kernel_initializer="he_normal", use_bias=False,
                              kernel_regularizer=tf.keras.regularizers.l2())(x)
    x = tf.keras.layers.Activation("elu")(x)
    x = tf.keras.layers.Dense(64, kernel_initializer="he_normal", use_bias=False,
                              kernel_regularizer=tf.keras.regularizers.l2())(x)
    x = tf.keras.layers.Activation("elu")(x)
    pred = tf.keras.layers.Dense(5, activation="softmax")(x)
    model = tf.keras.Model(model_input, outputs=[pred])
    model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True, momentum=0.99),
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model
def build_bilstm():
    """
    Bidirectional LSTM
    """
    model_input = tf.keras.layers.Input(shape=(237,))
    x = tf.keras.layers.Embedding(vocab, 16)(model_input)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    pred = tf.keras.layers.Dense(5, activation="softmax")(x)
    model = tf.keras.Model(model_input, outputs=[pred])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model
Our base model, in this case, will be built by the function build_bilstm. It returns a model with powerful capabilities for processing sequences.
To add to the flavor, the processing is bidirectional, so each layer learns both forward and backward relationships. These kinds of layers have been building blocks of state-of-the-art machine translation systems used around the world.
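To make "bidirectional" a little more concrete, here is a toy sketch in plain Python. It is not how Keras implements the layer. simple_rnn is a made-up, heavily simplified recurrent layer standing in for an LSTM, and the weights are arbitrary. The point is only the wrapping idea: run the sequence forward, run it again in reverse, realign the reversed outputs, and concatenate the two per time step.

```python
import math

def simple_rnn(sequence, units, weight=0.5):
    """Toy recurrent layer: the state at each step depends on all earlier inputs."""
    state = [0.0] * units
    outputs = []
    for x in sequence:
        state = [math.tanh(weight * x + weight * h) for h in state]
        outputs.append(state)
    return outputs

def bidirectional(sequence, units):
    """Concatenate a forward pass with a time-realigned backward pass."""
    forward = simple_rnn(sequence, units)
    # Process the reversed sequence, then flip the outputs back so that
    # position t again lines up with input t.
    backward = simple_rnn(sequence[::-1], units)[::-1]
    return [f + b for f, b in zip(forward, backward)]

outputs = bidirectional([1.0, 2.0, 3.0], units=4)
print(len(outputs), len(outputs[0]))
```

Each time step carries twice as many features as a single direction would. The same doubling is why wrapping LSTM(128) in Bidirectional gives an output width of 256.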
We’ll now build the model and look at its structure.
model = build_bilstm()
model.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 237)]             0
_________________________________________________________________
embedding (Embedding)        (None, 237, 16)           2240016
_________________________________________________________________
bidirectional (Bidirectional (None, 237, 256)          148480
_________________________________________________________________
bidirectional_1 (Bidirection (None, 237, 128)          164352
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64)                41216
_________________________________________________________________
dropout (Dropout)            (None, 64)                0
_________________________________________________________________
dense (Dense)                (None, 5)                 325
=================================================================
Total params: 2,594,389
Trainable params: 2,594,389
Non-trainable params: 0
The first noticeable thing is that the number of trainable params has increased significantly. Part of this can be attributed to the Bidirectional LSTM layers, which together add 354,048 params, although the Embedding layer still accounts for the bulk of the total (2,240,016).
Often, you’ll want fewer trainable params to reduce the risk of overfitting. In this case, however, the number looks reasonable, so it should not be much of a worry.
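If you want to check where those numbers come from, the counts in the summary can be reproduced with the standard parameter formulas. The sketch below uses plain arithmetic and no TensorFlow. The function names are mine, and the vocabulary size of 140,001 is not stated anywhere in this post; it is simply what the Embedding row implies (2,240,016 / 16).

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * ((input_dim + units) * units + units)

def bidirectional_lstm_params(input_dim, units):
    # A forward and a backward LSTM with separate weights.
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

counts = {
    "embedding": embedding_params(140001, 16),           # vocab size inferred, see above
    "bidirectional": bidirectional_lstm_params(16, 128),
    "bidirectional_1": bidirectional_lstm_params(256, 64),
    "bidirectional_2": bidirectional_lstm_params(128, 32),
    "dense": dense_params(64, 5),
}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name:16s} {count:>9,d}")
print(f"{'total':16s} {total:>9,d}")
```

Note that the input width of each Bidirectional layer is the doubled output of the previous one (256, then 128).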
Let’s now train our new model.
model.fit(train_dataset,validation_data=val_dataset, epochs=50,
callbacks=[PrintDot(),reduce_lr,early_stopping],verbose=0)
The training does not last long, stopping after 9 epochs. Even so, the model achieves a superior accuracy of over 99% on the test set.
results = model.evaluate(test_dataset)
print('test loss, test acc:', results)
36/36 [==============================] - 6s 176ms/step - loss: 0.0169 - sparse_categorical_accuracy: 0.9961
test loss, test acc: [0.016938598677774683, 0.9960701]
Test the model with random phrases
Let’s put the model to the test with random phrases sourced from the web.
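# helper function
def predict_sample(sample):
    # Tokenize, then pad to the 237-token input length the model expects
    sample = app_tokenizer.texts_to_sequences(sample)
    sample = tf.keras.preprocessing.sequence.pad_sequences(sample, padding='post', maxlen=237)
    pred = model.predict(sample, verbose=0)
    # Map the highest-probability class index back to a language name
    lang = REV_LANG_MAP[np.argmax(pred[0])]
    print(lang)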
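The last step of the helper, turning a softmax output into a language name, can be illustrated without a trained model. In the sketch below, the label order in HYPOTHETICAL_REV_LANG_MAP and the probabilities are invented for illustration. The real REV_LANG_MAP comes from part one and may use a different order. Returning the probability alongside the label is an optional extra that can help spot low-confidence predictions.

```python
# Assumption: index-to-language order is made up for this example.
HYPOTHETICAL_REV_LANG_MAP = {
    0: "english",
    1: "kikuyu",
    2: "kiswahili",
    3: "luo",
    4: "kalenjin",
}

def decode_prediction(probabilities, rev_lang_map):
    """Return the most likely language and its probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return rev_lang_map[best], probabilities[best]

# A made-up softmax output for one sentence.
fake_softmax = [0.01, 0.02, 0.90, 0.03, 0.04]
language, confidence = decode_prediction(fake_softmax, HYPOTHETICAL_REV_LANG_MAP)
print(language, confidence)
```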
Test one - Expects english
predict_sample(['yesterday l was at the people park, Uhuru Park. Its a space where most Nairobians who live in the slums and concrete jungle get a chance to relax and step on real grass.']) # english
english
Test two - Expects kikuyu
predict_sample(['Iguru na thi nicikuui Ngai waragia ma cikainanaina']) #kikuyu
kikuyu
Test three - Expects luo
predict_sample(['Tem ane nyisa ni anyalo yudo thumbegi koso anyalo horoni pesa gi yoo mane jatelo.']) # luo
luo
Test four - Expects kiswahili
predict_sample(['mabaya na aendelee kutenda mabaya, na aliye mchafu aendelee kuwa mchafu. mwenye kutenda mema na azidi kutenda mema, na aliye mtakatifu na azidi kuwa mtakatifu.']) # kiswahili
kiswahili
Test five - Expects kalenjin
predict_sample(['ki ikwa irimennyi, ak ki ek kakwautigap euut pogol ak artam ak ang kakwaetap chiito, nooto ko ne po malaikaiyat']) # kalenjin
kalenjin
As you can see, the model correctly identifies each phrase as English, Kikuyu, Luo, Kiswahili or Kalenjin. For fun’s sake, let’s train the fully connected model and see how it fares.
model = build_fnn()
history = model.fit(train_dataset,validation_data=val_dataset, epochs=50,
callbacks=[PrintDot(),reduce_lr,early_stopping],verbose=0)
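The model fails to converge. It stops after only 4 epochs, having reached the lowest learning rate, with an accuracy of about 19.6%. That is roughly chance level for five balanced classes, and the nan loss in the evaluation output indicates that training diverged.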
..
Epoch 00002: ReduceLROnPlateau reducing learning rate to 0.0019999999552965165.
.
Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.0003999999724328518.
.
Epoch 00004: ReduceLROnPlateau reducing learning rate to 7.999999215826393e-05.
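The learning rates in this log follow a simple geometric pattern. The callback configuration lives in part one and is not repeated here, so the numbers below are assumptions: a starting rate of 0.01 (the Keras default for SGD, which build_fnn does not override) and a reduction factor of 0.2 (inferred from the ratio between consecutive logged values). Under those assumptions, this sketch reproduces the logged rates up to float32 rounding.

```python
import math

INITIAL_LR = 0.01  # assumption: Keras SGD default learning rate
FACTOR = 0.2       # assumption: inferred from the logged values

def plateau_schedule(initial_lr, factor, reductions):
    """Learning rate after each successive reduction."""
    rates = []
    lr = initial_lr
    for _ in range(reductions):
        lr *= factor
        rates.append(lr)
    return rates

# Values copied from the training log above.
logged = [0.0019999999552965165, 0.0003999999724328518, 7.999999215826393e-05]
predicted = plateau_schedule(INITIAL_LR, FACTOR, len(logged))
matches = [math.isclose(p, l, rel_tol=1e-6) for p, l in zip(predicted, logged)]
print(predicted, matches)
```

A reduction at every epoch suggests the monitored metric never improved, which fits a run that was not learning.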
results = model.evaluate(test_dataset)
print('test loss, test acc:', results)
36/36 [==============================] - 0s 9ms/step - loss: nan - sparse_categorical_accuracy: 0.1962
test loss, test acc: [nan, 0.1961599]
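A quick sanity check helps put that result in context. The helper below is a hypothetical convenience, not part of the original code. It flags a nan loss and compares the accuracy against the chance level for balanced classes, which is 1 divided by the number of classes. The tolerance of 0.02 is an arbitrary choice.

```python
import math

def diagnose(loss, accuracy, num_classes, tolerance=0.02):
    """Flag a diverged loss and an accuracy that is close to random guessing."""
    chance = 1.0 / num_classes
    return {
        "chance": chance,
        "diverged": math.isnan(loss),
        "near_chance": abs(accuracy - chance) <= tolerance,
    }

# Values from the evaluation output above; 5 is the number of languages.
report = diagnose(float("nan"), 0.1961599, num_classes=5)
print(report)
```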
Of the two, the Bidirectional LSTM is clearly the better model for this task.
Final thoughts
Language identification is an important step in a much larger NLP system. As stated at the very beginning of this series, African languages are very much under-represented. More data and research are needed if we are to build systems that understand our own native languages. I look forward to the day when an 80-year-old grandmother in Turkana can operate a computer or a smartphone just by using her native language.
Thank you for taking the time to read this piece.
Future effort
Data is perhaps the most limiting factor in NLP for African languages. A critical look, however, reveals that data is abundant, only in audio form rather than text. In Kenya, for instance, most vernacular languages have radio stations that broadcast in them. That in itself is a rich source of data. How viable is it? In a future series, we will demonstrate how what we have accomplished today can be achieved with audio rather than text. Stay tuned.