Learning word representations with sequential modelling

Word representations have typically been learned by matrix factorisation methods or by methods that optimise similar objectives (Levy et al. 2015). These methods are usually so computationally efficient that they can be trained on huge corpora containing billions of words, but they exploit only co-occurrence statistics or bag-of-words features. As in other bag-of-words models, the structure of language plays no part in the models' inferences. We hypothesise that with a structural architecture, specifically a recurrent neural network, the derived word representations will carry properties learned from the preserved sequential order of the input text. Meanwhile, the latest sequential models focus on learning sentence representations rather than word representations (Kiros et al. 2015, Gan et al. 2017). Additionally, sequential models like skip-thought (Kiros et al. 2015) or seq2seq (Sutskever et al. 2014) are trained to predict only the next word, conditioned on the current word and the previous sentence representation. We hypothesise that this limits the capability of these models to learn word and sentence semantics, since matrix factorisation methods have shown that predicting words within a limited distance improves the learned word representations.


Similar to other sequential models, our model first summarises the content of the centre sentence into a fixed-size vector using a simple single-layer forward LSTM. Unlike other methods, we use this vector to predict the words appearing in a window of n surrounding sentences (including the centre sentence). Additionally, we do not predict words in sequential order but only whether they appear in the context. The intuition is that a good sentence representation should capture not only the future but also the past and the present. In fact, Gan et al. (2017) show that a composite model that decodes both the current and the next sentence beats the future model that decodes only the next sentence. Similarly, in learning word embeddings (word2vec, GloVe, SVD), predicting contexts that include words on both sides improves the quality of the learned embeddings. To predict word appearances, we feed the last state of the LSTM for a given sentence to a linear hidden layer. We then compute the sigmoid of the dot product between the output of this hidden layer and the predicted word's embedding, which gives a binary classification. To reduce the computational cost, we use negative sampling, drawing negative labels from a uniform distribution.
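The prediction step described above can be illustrated with a minimal pure-Python sketch. It is not our training code: the function names (`linear`, `context_loss`), the use of plain lists for vectors, and the choice to draw the negatives once per sentence are assumptions made for illustration, and the LSTM itself is left out (its last state is simply passed in as `h_last`).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(W, b, h):
    """Linear hidden layer: W is a list of rows, h is the LSTM's last state."""
    return [dot(row, h) + bias for row, bias in zip(W, b)]

def context_loss(h_last, W, b, embeddings, context_ids, num_negatives, rng):
    """Binary cross-entropy over context words (label 1) and uniformly
    sampled negative words (label 0).

    Assumption: negatives are drawn once per sentence, and a negative that
    happens to be a true context word is not filtered out.
    """
    z = linear(W, b, h_last)
    vocab_size = len(embeddings)
    loss = 0.0
    for word_id in context_ids:
        # Positive example: the word appears in the surrounding window.
        loss -= math.log(sigmoid(dot(z, embeddings[word_id])))
    for _ in range(num_negatives):
        # Negative example: word id drawn uniformly from the vocabulary.
        neg_id = rng.randrange(vocab_size)
        # log(1 - sigmoid(s)) is written as log(sigmoid(-s)).
        loss -= math.log(sigmoid(-dot(z, embeddings[neg_id])))
    return loss
```

Because words are predicted as a set rather than a sequence, the loss is a sum of independent binary terms and does not depend on word order within the window.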

Stateful LSTM

We make a small modification to the traditional recurrent layer, where the initial states of the recurrent units are either randomly initialised or set to zeros. Our LSTM's initial states are instead set to the last state values from the previous sentence. The purpose of this modification is to make the LSTM layer take the representation of the previous sentence into account when computing the current sentence's representation. To achieve this, however, we have to give up shuffling the training sentences, which is what would normally make the training data approximately independent and identically distributed. Our optimisation is therefore no longer standard Stochastic Gradient Descent. Nevertheless, the empirical results show good convergence of the loss function and good word embedding quality on the validation set. It is worth noting that in training, the LSTM is not unfolded beyond the first word of each sentence, so the error gradient is not back-propagated to previous sentences.
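The state carry-over can be sketched with a toy example. To stay self-contained, the sketch replaces the LSTM with a scalar tanh recurrent cell and treats each word as a single number; the cell, its weights and the function names are all hypothetical stand-ins, and only the way the initial state is chosen reflects the description above.

```python
import math

def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """One step of a toy scalar tanh recurrent cell (a stand-in for an LSTM)."""
    return math.tanh(w_state * state + w_input * x)

def encode(sentence, initial_state):
    """Run the cell over one sentence and return its last state."""
    state = initial_state
    for x in sentence:
        state = rnn_step(state, x)
    return state

def encode_stateful(sentences):
    """Encode consecutive sentences, seeding each with the previous last state."""
    state = 0.0  # only the very first sentence starts from zeros
    last_states = []
    for sentence in sentences:
        # The carried-over state is used as a plain number here. In an
        # autodiff framework it would be treated as a constant, so that
        # gradients stop at the first word of the current sentence.
        state = encode(sentence, initial_state=state)
        last_states.append(state)
    return last_states
```

With this setup the encoding of a sentence depends on what came before it, which is exactly what a zero-initialised layer cannot express.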

Data Interleaving

In order to set the last states of the previous sentence as the initial states of the next sentence, we first read a large batch of K consecutive sentences. Each epoch of training iterates through all mini-batches created from the large batches. Each mini-batch contains 25 sentences, so each large batch yields K/25 mini-batches. Mini-batch i of a large batch is formed by the sentences with indices i, K/25 + i, 2K/25 + i, …, 24K/25 + i from that large batch. Because the mini-batches are interleaved in this way, the sentence at a given position in mini-batch i + 1 directly follows the sentence at the same position in mini-batch i. Iterating through the mini-batches in order therefore lets the last states of the LSTM from the previous sentence serve as the initial states of the LSTM for the next sentence.
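The indexing scheme above can be written down in a few lines. This is a minimal sketch, assuming 0-based indices and a K that is an exact multiple of the mini-batch size; the function name `interleaved_minibatches` is hypothetical.

```python
def interleaved_minibatches(large_batch, batch_size=25):
    """Split K consecutive sentences into K / batch_size mini-batches.

    Row j of mini-batch i holds sentence j * (K / batch_size) + i, so row j
    of mini-batch i + 1 is the sentence directly after row j of mini-batch i.
    """
    K = len(large_batch)
    if K % batch_size != 0:
        raise ValueError("this sketch assumes K is a multiple of batch_size")
    stride = K // batch_size  # number of mini-batches per large batch
    return [
        [large_batch[j * stride + i] for j in range(batch_size)]
        for i in range(stride)
    ]
```

For example, with K = 100 the stride is 4, so mini-batch 0 holds sentences 0, 4, 8, …, 96 and mini-batch 1 holds sentences 1, 5, 9, …, 97. Each row of the mini-batch therefore walks through its own run of 4 consecutive sentences, and the recurrent state kept for that row always comes from the preceding sentence.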


In this experiment, we evaluate the capability of our approach to learn word representations. We train the model on the Brown corpus, as it is small enough for quick iterations. We set K = 10,000, the number of negative samples to 50, the maximum sentence length to 50 (shorter sentences are padded with a special token), the number of predicted sentences to 11 (5 on either side plus the centre sentence), the sentence embedding size to 300 (the same as in word2vec) and the vocabulary size to 50,000 (selected from the most frequent words in the corpus; other tokens are replaced by UNKNOWN). All words are lower-cased and tokenised with NLTK. The training algorithm is Adagrad. We train the model for 100 epochs. We report the Spearman's rank correlations between the word similarities given by our trained word representations (and by other methods) and human-annotated scores on different data sets:

Word embeddings trained with our method outperform Skip-gram and other methods on the WordSim353 and WordSim353 Similarity tasks by significant margins. We hypothesise that, because Skip-gram is based purely on word co-occurrence statistics, it does better with word relatedness but does not distinguish between relatedness and similarity. Our method, in contrast, uses the whole sentence representation for prediction, so the underlying semantics of a word is inferred from the other words in the same sentence and used for better prediction. This makes our method more efficient at learning word semantics and similarities.
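The evaluation protocol used above, Spearman's rank correlation between model similarities and human scores, can be sketched as follows. This is an illustrative version rather than our evaluation script: using cosine similarity as the model score and skipping pairs with an out-of-vocabulary word are assumptions, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(embeddings, scored_pairs):
    """scored_pairs holds (word_a, word_b, human_score) triples.

    Assumption: pairs containing an out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for word_a, word_b, human in scored_pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)
```

Since only ranks are compared, the correlation rewards a model for ordering word pairs the way annotators do, regardless of the absolute scale of its similarity scores.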


Our experiment shows promising results for training word representations with our method. However, the experiment is limited to the Brown corpus, which is very small compared to other available corpora. In the future, we would like to improve the training speed of our method and to compare the word and sentence representations it learns when trained on much bigger corpora.



Hill, F., Cho, K., & Korhonen, A. (2016). Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1367–1377). San Diego, California: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/N16-1162

Gan, Z., Pu, Y., Henao, R., Li, C., He, X., & Carin, L. (2017). Learning Generic Sentence Representations Using Convolutional Neural Networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2380–2390). Copenhagen, Denmark: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D17-1253

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-Thought Vectors. In NIPS. Montreal, Canada.

Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS), 3104–3112.

About Trung Huynh

I am a part-time PhD student at the Knowledge Media Institute, Open University, under the supervision of Prof Stefan Rueger. In my full-time job, I work as a software engineer at Google.
