Implementing BERT with Keras

Toshiba Kamruzzaman
2 min read · Sep 20, 2020


BERT’s Architecture

BERT, or Bidirectional Encoder Representations from Transformers, improves on the standard Transformer by removing the unidirectionality constraint. It does so with a masked language model (MLM) pre-training objective, which lets the representation of each token draw on both its left and its right context. The BERT architecture is built on top of the Transformer. Two variants are currently available:

  • BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  • BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
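
As a sanity check on these sizes, the parameter counts can be reproduced with a little arithmetic. The sketch below assumes the configuration of the published English uncased checkpoints, none of which is stated above: hidden sizes of 768 (Base) and 1024 (Large), a feed-forward width of four times the hidden size, a 30,522-token WordPiece vocabulary, 512 positions and 2 segment types. The function name is made up for illustration.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (assumed configuration)."""
    ffn = 4 * hidden
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] vector
    return embeddings + num_layers * per_layer + pooler

base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT Base:  {base / 1e6:.1f}M parameters")
print(f"BERT Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals come out near 110 million and 335 million, so the 340 million quoted for BERT Large is a rounded figure. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding weights.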

There are two steps in BERT: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.
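
To make the pre-training step more concrete, the following sketch shows how an input could be corrupted for the masked language model objective. It follows the recipe described in the BERT paper (about 15% of tokens are selected; of those, 80% become [MASK], 10% become a random token and 10% are left unchanged). It is a simplification: each token is selected independently, and the function name, toy vocabulary and seed are illustrative rather than part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels); a label is None where nothing must be predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        # Special tokens are never selected for prediction.
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model has to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(token)
    return corrupted, labels

tokens = ["[CLS]", "this", "is", "a", "nice", "sentence", ".", "[SEP]"]
toy_vocab = ["this", "is", "a", "nice", "sentence", ".", "cat", "dog"]
# A high mask_prob is used only so that the short example shows some masking.
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=0)
print(corrupted)
print(labels)
```

Because the model cannot tell which visible tokens were swapped or kept, it has to build a useful representation for every position, using context from both directions.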

Model Implementation

The tokenizer comes from the BERT Python module (bert-for-tf2), while the model itself is loaded through TensorFlow Hub.

import tensorflow_hub as hub
import tensorflow as tf
import bert
FullTokenizer = bert.bert_tokenization.FullTokenizer
from tensorflow.keras.models import Model # Keras is the new high level API for TensorFlow
import math

The Model

The goal of this model is to use the pre-trained BERT to generate the embedding vectors. Therefore, we need only the required inputs for the BERT layer and the model has only the BERT layer as a hidden layer. Of course, inside the BERT layer, there is a more complex architecture.

max_seq_length = 128  # Your choice here.
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32)
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32)
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32)
# The TF Hub handle is empty in the source; put the URL of a BERT module here.
bert_layer = hub.KerasLayer("")
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])


The BERT layer requires three input sequences:

  • Token ids: one id per token in the sentence, looked up in the BERT vocabulary.
  • Mask ids: 1 for every real token and 0 for every padding position, so that tokens added only to bring all sequences to the same length are masked out.
  • Segment ids: 0 for every token of a single-sentence sequence. When the sequence holds two sentences, the tokens of the second one get 1.

The helper functions below build the mask, segment and token-id sequences.

def get_masks(tokens, max_seq_length):
    """Mask for padding: 1 for real tokens, 0 for padding positions."""
    if len(tokens) > max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))

def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second."""
    if len(tokens) > max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        # Record the segment of this token; the original loop never appended anything.
        segments.append(current_segment_id)
        if token == "[SEP]":
            # Tokens after the first [SEP] belong to the second sentence.
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))

def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from the tokenizer vocab, padded with 0 up to max_seq_length."""
    if len(tokens) > max_seq_length:
        raise IndexError("Token length more than max seq length!")
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length - len(token_ids))
    return input_ids
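
To see how the three sequences line up for a sentence pair, here is a small self-contained sketch. It uses a hypothetical toy vocabulary and a made-up `encode` helper in place of the real tokenizer, so the ids are illustrative only; the mask and segment logic follows the same idea as the helpers above.

```python
# Hypothetical toy vocabulary; a real BERT vocabulary has about 30,000 entries.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "how": 3, "are": 4,
             "you": 5, "fine": 6, "thanks": 7}

def encode(tokens, max_seq_length):
    """Build padded token ids, mask ids and segment ids for a list of tokens."""
    if len(tokens) > max_seq_length:
        raise IndexError("Token length more than max seq length!")
    padding = [0] * (max_seq_length - len(tokens))
    ids = [TOY_VOCAB[t] for t in tokens]
    masks = [1] * len(tokens)
    segments, segment_id = [], 0
    for token in tokens:
        segments.append(segment_id)
        if token == "[SEP]":
            segment_id = 1  # everything after the first [SEP] is sentence two
    return ids + padding, masks + padding, segments + padding

pair = ["[CLS]", "how", "are", "you", "[SEP]", "fine", "thanks", "[SEP]"]
ids, masks, segments = encode(pair, max_seq_length=10)
print(ids)       # [1, 3, 4, 5, 2, 6, 7, 2, 0, 0]
print(masks)     # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

Note that the first [SEP] still belongs to segment 0, the closing [SEP] belongs to segment 1, and padding positions get 0 in all three sequences.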


With these steps, we can generate BERT contextualised embedding vectors for our sentences! Don’t forget to add the [CLS] token at the start and the [SEP] separator token at the end to keep the original input format!

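# Build the tokenizer from the vocabulary shipped with the hub module.
# Assumes the module exposes vocab_file and do_lower_case, as the TF Hub BERT models do.
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

s = "This is a nice sentence."
stokens = tokenizer.tokenize(s)
stokens = ["[CLS]"] + stokens + ["[SEP]"]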

input_ids = get_ids(stokens, tokenizer, max_seq_length)
input_masks = get_masks(stokens, max_seq_length)
input_segments = get_segments(stokens, max_seq_length)

pool_embs, all_embs = model.predict([[input_ids],[input_masks],[input_segments]])
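
The two outputs play different roles: pool_embs holds one vector per input sequence, while all_embs holds one vector per token position, padding included. One common way to turn the per-token vectors into a single sentence vector is to average them over the real tokens only. This is a general technique and an assumption about how you might use the output, not something the model does for you. The sketch below uses plain Python lists with made-up 3-dimensional vectors so that it runs without TensorFlow; with the real model, the rows would come from all_embs[0] and the mask would be input_masks.

```python
def masked_mean(token_vectors, mask):
    """Average the vectors whose mask entry is 1, ignoring padding positions."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Made-up stand-in for all_embs[0]: 4 positions, 3 dimensions, last position is padding.
toy_embs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
toy_mask = [1, 1, 1, 0]
print(masked_mean(toy_embs, toy_mask))  # [2.0, 2.0, 2.0]
```

The padding row is deliberately given large values here to show that it has no effect on the result once the mask is applied.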