Jihua Lu and Youcheng Zhang
(1. School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China; 2. Department of Electronic Engineering, Tsinghua University, Beijing 100091, China)
Abstract: Two learning models, Zolu-continuous bag-of-words (ZL-CBOW) and Zolu-skip-gram (ZL-SG), based on the Zolu function are proposed. The Zolu function changes the slope of the ReLU used in word2vec. The proposed models can process extremely large data sets, as word2vec does, without increasing complexity. The models also outperform several word embedding methods in both word similarity and syntactic accuracy. ZL-CBOW outperforms CBOW in accuracy by 8.43% on the capital-world training set and by 1.24% on the plural-verbs training set. Moreover, experimental simulations on word similarity and syntactic accuracy show that ZL-CBOW and ZL-SG are superior to LL-CBOW and LL-SG, respectively.
Key words: Zolu function; word embedding; continuous bag-of-words; word similarity; accuracy
Word embedding is one of the hot issues in natural language processing (NLP)[1-4]. Word2vec is a learning model comprising one shallow recursive neural network with only one projection layer[4-5]. Through training, a word or a context can be transformed into a K-dimensional vector that captures its semantic similarity to all the other words in the corpus. The similarity in vector space can then be used to represent text semantics[6]. Therefore, word2vec outputs word vectors for NLP-related tasks such as clustering, classification, dimension reduction, natural driving, and finding synonyms[7-11]. Among these, the bag-of-visual-and-depth-words (BoVDW) model for gesture recognition is an extension of the bag-of-visual-words (BoVW) model used to describe human gestures continuously with "bags"[10]. DW2V was proposed to capture the structural similarity between natural driving behavior and natural language in environments with large data sizes, an infinite diversity of driving behaviors, and contextual dependencies among driving behaviors[8]. A novel word2vec model that integrates contextual word learning from words and their respective context windows is proposed in Ref. [11].
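To make the notion of vector-space similarity concrete, the short sketch below computes the cosine similarity between two K-dimensional word vectors, the measure commonly used to compare embeddings. The vectors here are random placeholders rather than outputs of a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two K-dimensional word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

K = 100                                  # embedding dimension (illustrative)
rng = np.random.default_rng(0)
v_meat = rng.standard_normal(K)          # placeholder vectors, not trained embeddings
v_pork = rng.standard_normal(K)
print(cosine_similarity(v_meat, v_pork))
```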
The original natural-language translation problem is: how can we obtain the existence probability of a context of T words, or find the most appropriate translation among all possible answers? Neither the N-gram nor the n-pos model considers contextual relationships, although both avoid the zero-probability problem of one-hot vectors[12]. The neural network language model (NNLM) of Bengio was followed by the log-linear, log-bilinear, and hierarchical log-bilinear models[13]. Mikolov proposed two new log-linear models, the continuous bag-of-words (CBOW) model and the skip-gram (SG) model. CBOW changed the distributed word representation into a continuous context representation and applied two hierarchical log-bilinear training schemes, hierarchical softmax and negative sampling[5-8].
We propose two learning models that change the activation functions of CBOW and SG to obtain higher-quality embedding vectors. We strive to find a more efficient learning model that retains both advantages of state-of-the-art word embedding, i.e., accuracy and similarity, and modifying the activation functions is a natural first choice. From the derivations of the weight updates and the word vectors, our models have a complexity similar to that of CBOW. Simulations reveal that the proposed ZL-CBOW and ZL-SG improve both semantic similarity and accuracy without increasing complexity.
We propose two learning models, ZL-CBOW and ZL-SG, illustrated in Fig. 1, which are built upon CBOW and SG. Unlike LL-CBOW and LL-SG, which add weights to the continuous word representation[7,12], we apply Zolu to the sum of word vectors in the projection layer to obtain the word embedding, as depicted in Fig. 1.

Fig. 1 Basic ideas of ZL-CBOW and ZL-SG models
The left and right panels of Fig. 1 illustrate the processes of ZL-CBOW and ZL-SG, respectively. The projection layers of ZL-CBOW and ZL-SG can be the same as those of CBOW and SG, or weighted like those of LL-CBOW and LL-SG in Refs. [2-3] and Ref. [7]. The main difference between ZL-CBOW and CBOW or LL-CBOW is that the Zolu function, rather than the sigmoid, is applied as the non-linear activation function.
For ZL-CBOW and ZL-SG, we adopt the Zolu activation function, defined as[12]


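As a rough sketch of the idea in Fig. 1, the snippet below sums the context word vectors in the projection layer and passes the sum through the activation. Because the printed definition of Zolu is not reproduced above, `zolu` is left as a placeholder argument supplied by the caller rather than a concrete formula; in plain CBOW the sum would be passed on unchanged, and in LL-CBOW the summands would first be weighted.

```python
import numpy as np

def zl_cbow_hidden(context_ids, W_in, zolu):
    """Projection-layer output of ZL-CBOW (sketch).

    context_ids : indices of the context words
    W_in        : input embedding matrix of shape (vocab_size, K)
    zolu        : the Zolu activation, passed in by the caller because the
                  closed-form definition above is not reproduced here
    """
    h = W_in[context_ids].sum(axis=0)    # sum of context vectors, as in CBOW
    return zolu(h)                       # ZL-CBOW applies Zolu to this sum
```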
Hierarchical softmax (HS) and negative sampling (NS) are the two main training schemes of the word embeddings proposed by Mikolov in Ref. [3]. Either SG or CBOW can apply HS and NS through similar steps. Therefore, we detail NS in ZL-CBOW and HS in ZL-SG, together with their counterparts in the CBOW and SG of word2vec.
For CBOW and ZL-CBOW, the training objective is to predict one target word vector, representing one or several words, for a given context. By exploiting NS, the output probability to be maximized can be expressed as



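Since the maximized probability is not reproduced above, the sketch below writes out the standard word2vec negative-sampling objective of Mikolov et al. for a CBOW-style projection vector; in ZL-CBOW, the vector h would be the Zolu-activated context sum from the previous sketch. This is a minimal illustration of the standard formulation, not the paper's exact equation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_objective(h, target_id, negative_ids, W_out):
    """Standard word2vec negative-sampling objective (to be maximized).

    h            : projection-layer vector; in ZL-CBOW this is the
                   Zolu-activated context sum, in CBOW the plain sum
    target_id    : index of the true target word
    negative_ids : indices of the k sampled negative words
    W_out        : output embedding matrix of shape (vocab_size, K)
    """
    positive = np.log(sigmoid(W_out[target_id] @ h))
    negative = sum(np.log(sigmoid(-W_out[j] @ h)) for j in negative_ids)
    return positive + negative
```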
The training objective of SG is the opposite of that of CBOW: it predicts the most similar, or nearest, word context from the whole dictionary. However, the probabilities they maximize take a similar form, as briefly illustrated below.
For SG, by exploiting the HS of Ref. [6], we obtain the probability to be maximized as


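For reference, since the maximized probability is again not reproduced above, the sketch below evaluates the standard word2vec hierarchical-softmax probability along a Huffman-tree path. The inner-node indices and binary codes are assumed to come from a pre-built Huffman tree; this illustrates the standard formulation rather than the paper's exact equation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(v_in, path_nodes, codes, W_inner):
    """Standard word2vec hierarchical-softmax probability p(w | w_I).

    v_in       : input vector of the centre word w_I
    path_nodes : inner-node indices on the Huffman path from the root to w
    codes      : Huffman code of w (one 0/1 bit per inner node on the path)
    W_inner    : parameter vectors of the inner nodes, shape (n_nodes, K)
    """
    p = 1.0
    for node, code in zip(path_nodes, codes):
        sign = 1.0 if code == 0 else -1.0     # branch direction fixes the sign
        p *= sigmoid(sign * (W_inner[node] @ v_in))
    return p
```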
In this section, we evaluate the quality differences using data sets from "code.google.com/p/word2vec/source/". First, for similarity, we use the word "meat" to test the nearest words under different learning methods; i.e., we find the nearest neighbors of "meat" for three different word embedding models, as shown in Tab. 1.

Tab. 1 Results of nearest words
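A nearest-neighbor query of this kind can be reproduced over any trained embedding matrix with a few lines. The sketch below assumes a `vocab` list and a `vectors` matrix loaded from the trained model (hypothetical names) and ranks words by cosine similarity.

```python
import numpy as np

def nearest_words(query, vocab, vectors, top_k=5):
    """Rank the vocabulary by cosine similarity to `query` and return the top_k words.

    vocab   : list of vocabulary words
    vectors : embedding matrix of shape (len(vocab), K), row i for vocab[i]
    """
    q = vectors[vocab.index(query)]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != query][:top_k]
```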
We observe that the proposed ZL-CBOW outputs five kinds of meat, whereas the fourth and fifth words returned by LL-CBOW and the fifth word returned by CBOW are not kinds of meat. In addition, the accuracies of ZL-CBOW and CBOW in different application environments over different word sets are listed in Tab. 2. For each data set, the accuracies of ZL-CBOW and of the CBOW of word2vec are reported with window sizes varying from 2 to 9 words.

Tab. 2 Improvements of accuracy over all data sets
From Tab. 2, we observe that the proposed ZL-CBOW outperforms CBOW on all data sets, with increases ranging from 1.24% to 8.43%. The accuracies of ZL-CBOW and CBOW are listed as the middle two rows of Tab. 2. Based on these statistics, we conclude that the proposed ZL-CBOW outperforms CBOW by an average of (5.74+8.43+7.5+7.3+4.94+5.52+5.21+3.6+2.63+1.92+1.71+1.47+1.68+1.24)/14% ≈ 4.21%. Moreover, accuracy versus running time and batch size is compared below to further test the performance of the proposed methods.
From Tab. 3, we observe that ZL-CBOW has a complexity comparable to that of CBOW. For word.sh and analogy.sh, the running-time overheads of ZL-CBOW are 7.05%, equal to (17.3−16.16)/16.16, and 10.08%, equal to (18.34−16.66)/16.66, respectively.

Tab. 3 Running time comparisons of ZL-CBOW and CBOW over word.sh and analogy.sh
From Tab. 4, we conclude that the proposed ZL-CBOW outperforms CBOW at all the listed batch sizes of 100, 200, and 300 words, with accuracy gains of 4.75% (86.17%−81.42%), 3.86% (86.36%−82.50%), and 3.75% (84.98%−81.23%), respectively.

Tab. 4 Accuracies of ZL-CBOW and CBOW versus batch size
First, we studied in depth the principles of word2vec, focusing on the accuracy and similarity of word embeddings. Then, aiming to further improve the efficiency of word embedding, we proposed two learning models, ZL-CBOW and ZL-SG, which modify the update process for maximizing the probability with the Zolu function. The essential ideas of the proposed methods are to further increase the slope of the activation function and to zero the update slope when the child-node result of the Huffman tree is false. The proposed ZL-CBOW outperforms CBOW with an average accuracy enhancement of 7.61% for window sizes ranging from 2 to 9. For ZL-SG, the accuracy improvements range from 1.24% on plural-verbs to 8.43% on the capital-world data set of word2vec. Moreover, different batch sizes were compared for ZL-CBOW and CBOW. Integrating ZL-SG and ZL-CBOW with their counterparts in Ref. [13], i.e., LL-SG and LL-CBOW, could further enhance the efficiency of word embedding. Our work can also be extended to the embeddings of sentences and documents.