Implementing BERT with Keras
BERT’s Architecture
BERT, or Bidirectional Encoder Representations from Transformers, improves on the standard Transformer by removing the unidirectionality constraint through a masked language model (MLM) pre-training objective. The BERT architecture builds on top of the Transformer encoder. Two variants are currently available (their TF Hub handles are sketched after the list):
- BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
- BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
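For orientation, the Hub module names encode these sizes: L is the number of layers, H the hidden size, and A the number of attention heads. The Base handle is the one used later in this post; the Large handle is given for reference and is an assumption based on the same naming convention.
# TF Hub handles for the two variants (Large handle assumed from the L/H/A naming scheme).
BERT_BASE_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"
BERT_LARGE_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"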
There are two steps in BERT: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.
Model Implementation
The BERT tokenizer still comes from the bert Python module (bert-for-tf2); we construct it after loading the BERT layer from TensorFlow Hub below.
import tensorflow_hub as hub
import tensorflow as tf
import bert
FullTokenizer = bert.bert_tokenization.FullTokenizer
from tensorflow.keras.models import Model # Keras is the new high level API for TensorFlow
import math
The Model
The goal of this model is to use the pre-trained BERT to generate embedding vectors. Therefore, the model needs only the inputs required by the BERT layer, and the BERT layer itself is its only hidden layer. Of course, inside the BERT layer, there is a much more complex architecture.
max_seq_length = 128  # Your choice here.
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])
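The tokenizer used in the preprocessing and prediction steps below still needs to be constructed. A minimal sketch, assuming this Hub module exposes its vocabulary file and lower-casing flag through resolved_object:
# Build the FullTokenizer from the vocabulary shipped with the TF Hub BERT layer.
# (Assumption: vocab_file and do_lower_case are available via resolved_object.)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)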
Preprocessing
The BERT layer requires three input sequences:
- Token ids: one for every token in the sentence, looked up in the BERT vocabulary.
- Mask ids: 1 for every real token and 0 for the tokens used only for padding (so every sequence has the same length).
- Segment ids: 0 for a single-sentence sequence; if two sentences are packed into the sequence, the tokens of the second one get 1.
def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens) > max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens) > max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length - len(token_ids))
    return input_ids
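As a quick sanity check, here is what the mask and segment helpers return for a toy two-sentence token list (illustrative values only, with a short max_seq_length of 8):
toy_tokens = ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
print(get_masks(toy_tokens, 8))     # [1, 1, 1, 1, 1, 0, 0, 0]
print(get_segments(toy_tokens, 8))  # [0, 0, 0, 1, 1, 0, 0, 0]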
Prediction
With these steps, we can generate BERT contextualised embedding vectors for our sentences! Don’t forget to add the [CLS] and [SEP] separator tokens to keep the original format!
s = "This is a nice sentence."
stokens = tokenizer.tokenize(s)
stokens = ["[CLS]"] + stokens + ["[SEP]"]
input_ids = get_ids(stokens, tokenizer, max_seq_length)
input_masks = get_masks(stokens, max_seq_length)
input_segments = get_segments(stokens, max_seq_length)
pool_embs, all_embs = model.predict([[input_ids], [input_masks], [input_segments]])
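The first output is the pooled [CLS] vector and the second holds the per-token embeddings; with the Base model (hidden size 768) and max_seq_length = 128, the shapes should be (1, 768) and (1, 128, 768). Below is a small sketch of one way to use the pooled vectors; comparing sentences via cosine similarity is an assumption about your downstream use, not part of the original recipe.
import numpy as np

print(pool_embs.shape)  # (1, 768)      -- pooled [CLS] embedding
print(all_embs.shape)   # (1, 128, 768) -- one 768-d vector per token position

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare the pooled embeddings of two sentences:
# cosine_similarity(pool_embs[0], other_pool_embs[0])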