November 2, 2019

Language detection: Part two

In the first part of this series, we laid the foundation and developed a model that could identify a language with an error rate of 8%. In this second part, we will change our strategy and develop an even better model that achieves 99% accuracy.

I encourage you to read the first part so that you understand the intuition behind why we are changing strategy.

Let’s begin

Part two

Change data and architecture strategy

In the first part, we built our language dictionary by labeling individual words with their corresponding languages. That strategy had a few weaknesses.

With those weaknesses in mind, our new strategy will use a sentence-based dictionary.

Let’s change the create_words_set function to return full-length sentences rather than words.

import io

def create_words_set(path):
    # Each line of the corpus file is now one training example, so the function
    # returns preprocessed sentences rather than individual words.
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    sentences = [preprocess_sentence(l) for l in lines]
    return sentences
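The preprocess_sentence helper carries over from part one. For readers jumping in here, a minimal sketch of what such a function might do (the exact cleaning rules are my assumption, not the article's code):

import re

def preprocess_sentence(s):
    # Hypothetical reconstruction: lowercase, strip punctuation and digits,
    # and collapse repeated whitespace. Accented letters are kept since
    # several of these languages use them.
    s = s.lower().strip()
    s = re.sub(r"[^\w\s']", ' ', s)    # keep letters, digits, spaces and apostrophes
    s = re.sub(r'[\d_]+', ' ', s)      # then drop digits and underscores
    return re.sub(r'\s+', ' ', s).strip()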

With this new strategy, we may find ourselves with a relatively imbalanced dataset. A quick examination of the data shows this to be true.

print(len(kikuyu_words))
print(len(english_words))
print(len(kiswahili_words))
print(len(luo_words))
print(len(kalenjin_words))
29079
29079
31020
35132
31028

Clearly, Kikuyu and English are somewhat under-represented, though not severely. The difference may not hurt performance, but that would have to be proved through experimentation. For now, we will stick to a balanced dataset, so every language will be trimmed down to the size of the Kikuyu and English sets.
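One straightforward way to do the trimming (a minimal sketch; the slicing is my own, the list names are those printed above) is to cut every list down to the smallest count:

# Balance the dataset by trimming every language to the smallest sentence count
min_len = min(len(kikuyu_words), len(english_words), len(kiswahili_words),
              len(luo_words), len(kalenjin_words))
kikuyu_words    = kikuyu_words[:min_len]
english_words   = english_words[:min_len]
kiswahili_words = kiswahili_words[:min_len]
luo_words       = luo_words[:min_len]
kalenjin_words  = kalenjin_words[:min_len]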

Build new network architecture and evaluate

The rest of the functions remain the same, except for the model's Input layer, particularly its shape. Let's take a look.

for s_f,s_l in train_dataset.take(1):
  print(s_f.numpy().shape)
  print(s_l.numpy().shape)
(1000, 237)
(1000, 1)

As you can see, there is a mismatch between the shape of the data and what the model expects. The model won't even begin training if the shapes do not match. That's easily solved by changing the shape in the input layer definition.

model_input = tf.keras.layers.Input(shape=(237,))
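If you would rather not hard-code the padded length, one variant (my own suggestion, not part of the original notebook) is to accept variable-length sequences, since the Embedding, pooling and LSTM layers used here do not require a fixed number of timesteps:

# Hypothetical alternative: None accepts any sequence length
model_input = tf.keras.layers.Input(shape=(None,))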

Here is how our final model definitions look.

def build_fnn():
  """
  Fully connected network
  """
  model_input = tf.keras.layers.Input(shape=(237,))
  x = tf.keras.layers.Embedding(vocab,16,mask_zero=True)(model_input)
  x = tf.keras.layers.GlobalAveragePooling1D()(x)
  x = tf.keras.layers.Dense(128,kernel_initializer="he_normal",use_bias=False,kernel_regularizer=tf.keras.regularizers.l2())(x)
  x = tf.keras.layers.Activation("elu")(x)
  x = tf.keras.layers.Dense(64,kernel_initializer="he_normal",use_bias=False,kernel_regularizer=tf.keras.regularizers.l2())(x)
  x = tf.keras.layers.Activation("elu")(x)
  pred = tf.keras.layers.Dense(5,activation="softmax")(x)
  model = tf.keras.Model(model_input,outputs=[pred])
  model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True,momentum=0.99) ,loss='sparse_categorical_crossentropy',
                metrics=['sparse_categorical_accuracy'])
  return model

def build_bilstm():
  """
  Bidirectional LSTM
  """
  model_input = tf.keras.layers.Input(shape=(237,))
  x = tf.keras.layers.Embedding(vocab,16)(model_input)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128,return_sequences=True))(x)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,return_sequences=True))(x)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
  x = tf.keras.layers.Dropout(0.2)(x)
  pred = tf.keras.layers.Dense(5,activation="softmax")(x)
  model = tf.keras.Model(model_input,outputs=[pred])
  model.compile(optimizer=tf.keras.optimizers.Adam(),loss='sparse_categorical_crossentropy',
                metrics=['sparse_categorical_accuracy'])
  return model

Our base model, in this case, will be built by the build_bilstm function. It returns a model with powerful sequence-processing capabilities. To add to the flavor, the processing is bidirectional, so the layers learn both forward and backward relationships. Layers like these are the building blocks of the state-of-the-art machine translation systems in use around the world today.
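As a quick standalone illustration (not part of the original notebook), wrapping an LSTM in Bidirectional runs one pass forward and one backward over the sequence and concatenates the two, which is why the feature dimension doubles from 128 to 256 in the summary below:

import tensorflow as tf

layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))
x = tf.random.uniform((1, 237, 16))   # (batch, timesteps, embedding_dim)
print(layer(x).shape)                 # (1, 237, 256): 128 forward + 128 backward features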

We'll now build the model and take a look at its summary.

model = build_bilstm()
model.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 237)]             0
_________________________________________________________________
embedding (Embedding)        (None, 237, 16)           2240016
_________________________________________________________________
bidirectional (Bidirectional (None, 237, 256)          148480
_________________________________________________________________
bidirectional_1 (Bidirection (None, 237, 128)          164352
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64)                41216
_________________________________________________________________
dropout (Dropout)            (None, 64)                0
_________________________________________________________________
dense (Dense)                (None, 5)                 325
=================================================================
Total params: 2,594,389
Trainable params: 2,594,389
Non-trainable params: 0
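Before discussing those numbers, here is a quick sanity check (my own arithmetic, not from the article) that the first Bidirectional layer's parameter count matches the standard LSTM formula, for two LSTMs of 128 units reading 16-dimensional embeddings:

units, input_dim = 128, 16
per_lstm = 4 * (units * (input_dim + units) + units)  # 4 gates, each with kernel, recurrent kernel and bias
print(2 * per_lstm)  # 148480 -> forward + backward LSTM, matching the summary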

The first noticeable thing is that the number of trainable parameters has increased significantly. This can be attributed to the Bidirectional LSTM layers. Often, you'll want to keep the number of trainable parameters small to avoid overfitting. In this case, however, the number looks reasonable, so it should not be of much worry. Let's now train our new model.

model.fit(train_dataset,validation_data=val_dataset, epochs=50,
                    callbacks=[PrintDot(),reduce_lr,early_stopping],verbose=0)
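The PrintDot, reduce_lr and early_stopping callbacks carry over from part one of the series. A plausible reconstruction (the exact factor and patience values are assumptions) looks like this:

# Assumed callback definitions (the 0.2 factor is consistent with the learning-rate drops logged further below)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=1)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                                  restore_best_weights=True)

class PrintDot(tf.keras.callbacks.Callback):
  """Print a dot at the end of every epoch to keep the training log compact."""
  def on_epoch_end(self, epoch, logs=None):
    print('.', end='')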

The training does not last long, stopping after 9 epochs. Regardless, the model manages to achieve a superior accuracy of over 99%.

results = model.evaluate(test_dataset)
print('test loss, test acc:', results)
36/36 [==============================] - 6s 176ms/step - loss: 0.0169 - sparse_categorical_accuracy: 0.9961
test loss, test acc: [0.016938598677774683, 0.9960701]

Test the model with random phrases

Let's put the model to the test with random phrases sourced from the web.

# helper function
def predict_sample(sample):
  # app_tokenizer and REV_LANG_MAP come from the earlier data-preparation steps
  sample = app_tokenizer.texts_to_sequences(sample)
  sample = tf.keras.preprocessing.sequence.pad_sequences(sample, padding='post', maxlen=237)
  pred = model.predict(sample, verbose=0)
  lang = REV_LANG_MAP[np.argmax(pred[0])]
  print(lang)
Test one - Expects english
predict_sample(['yesterday l was at the people park, Uhuru Park. Its a space where most Nairobians who live in the slums and concrete jungle get a chance to relax and step on real grass.']) # english
english
Test two - Expects kikuyu
predict_sample(['Iguru na thi nicikuui Ngai waragia ma cikainanaina']) #kikuyu
kikuyu
Test three - Expects luo
predict_sample(['Tem ane nyisa ni anyalo yudo thumbegi koso anyalo horoni pesa gi yoo mane jatelo.']) # luo
luo
Test four - Expects kiswahili
predict_sample(['mabaya na aendelee kutenda mabaya, na aliye mchafu aendelee kuwa mchafu. mwenye kutenda mema na azidi kutenda mema, na aliye mtakatifu na azidi kuwa mtakatifu.']) # kiswahili
kiswahili
Test five - Expects kalenjin
predict_sample(['ki ikwa irimennyi, ak ki ek kakwautigap euut pogol ak artam ak ang kakwaetap chiito, nooto ko ne po malaikaiyat']) # kalenjin
kalenjin

As you can see, the model correctly identifies a phrase as English, Kikuyu, Luo, Kiswahili or Kalenjin. For fun's sake, let's train the fully connected model (build_fnn) and see how it fares.

model = build_fnn()
history = model.fit(train_dataset,validation_data=val_dataset, epochs=50,
                    callbacks=[PrintDot(),reduce_lr,early_stopping],verbose=0)

The model barely converges. It stops quickly after 4 epochs, reaching the lowest learning rate with an accuracy of only 19% and a loss of nan.

..
Epoch 00002: ReduceLROnPlateau reducing learning rate to 0.0019999999552965165.
.
Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.0003999999724328518.
.
Epoch 00004: ReduceLROnPlateau reducing learning rate to 7.999999215826393e-05.

results = model.evaluate(test_dataset)
print('test loss, test acc:', results)
36/36 [==============================] - 0s 9ms/step - loss: nan - sparse_categorical_accuracy: 0.1962
test loss, test acc: [nan, 0.1961599]

The Bidirectional LSTM is clearly the better model for this task.

Final thoughts

Language identification is an important step in many larger NLP systems. As stated at the very beginning of this series, African languages are very much under-represented. More data and research are needed if we are to build systems that understand our native languages. I look forward to the day when an 80-year-old grandmother in Turkana can operate a computer or a smartphone using only her native language.

Thank you for taking the time to read this piece.

Future effort

Data is perhaps the most limiting factor in NLP for African languages. A critical look, however, reveals that there is an abundance of data, just not in text form but rather in audio. In Kenya, for instance, most vernacular languages have radio stations that broadcast in them, which is itself a rich source of data. How viable is it? In a future series, we will demonstrate how what we have accomplished today can be achieved with audio rather than text. Stay tuned.

