Paraphrase Generation With Deep RL

Seq2Seq baseline models with the Adam optimizer for a fair comparison. In supervised pre-training, we set the learning rate to 0.1 and the initial accumulator to 0.1. The maximum gradient norm is set to 2. During RL training, the learning rate decreases to 1e-5 and the Monte-Carlo sample size is 4. To make the training more stable, we also use the ground truth as a training sample, with a reward of 0.1.
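
For concreteness, the following is a minimal sketch (not the authors' released code) of one RL fine-tuning step under these settings; reward shaping and rescaling from Section 3.3 are omitted for brevity, and generator.sample, generator.log_prob, and evaluator.score are hypothetical interfaces:

    import torch
    from torch.nn.utils import clip_grad_norm_

    def rl_step(generator, evaluator, optimizer, src_batch, ref_batch,
                n_samples=4, gt_reward=0.1, max_norm=2.0):
        # One REINFORCE-style update. `generator.sample`, `generator.log_prob`,
        # and `evaluator.score` are assumed (hypothetical) interfaces.
        terms = []
        for src, ref in zip(src_batch, ref_batch):
            for _ in range(n_samples):  # 4 Monte-Carlo samples per input
                hyp = generator.sample(src)
                r = evaluator.score(src, hyp)  # reward given by the evaluator
                terms.append(-r * generator.log_prob(src, hyp))
            # Stability trick: treat the ground truth as a sample with reward 0.1.
            terms.append(-gt_reward * generator.log_prob(src, ref))
        loss = torch.stack(terms).mean()
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(generator.parameters(), max_norm)  # max gradient norm = 2
        optimizer.step()
        return loss.item()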

Evaluator We use the pretrained GoogleNews 300-dimension word vectors (https://code.google.com/archive/p/word2vec/) on the Quora dataset and the 200-dimension GloVe word vectors (https://nlp.stanford.edu/projects/glove/) on the Twitter corpus. Other model settings are the same as in Parikh et al. (2016). For the evaluator in RbM-SL, we set the learning rate to 0.05 and the batch size to 32. For the evaluator Mφ in RbM-IRL, the learning rate decreases to 1e-2, and we use a batch size of 80.

We use the technique of reward rescaling mentioned in Section 3.3 when training RbM-SL and RbM-IRL. In RbM-SL, we set δ1 to 12 and δ2 to 1. In RbM-IRL, we keep δ2 at 1 throughout and decrease δ1 from 12 to 3 and δ3 from 15 to 8 during curriculum learning. In ROUGE-RL, we take the exponential moving average of historical rewards as the baseline reward to stabilize the training:

    b_m = λ Q_{m−1} + (1 − λ) b_{m−1},   b_1 = 0,

where b_m is the baseline at iteration m, Q_{m−1} is the mean value of the reward at iteration m−1, and we set λ to 0.1 by grid search.
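
In code, this baseline update is straightforward; the following sketch (plain Python; variable names are ours) tracks b_m across iterations:

    def ema_baseline(mean_rewards, lam=0.1):
        # b_1 = 0; b_m = lam * Q_{m-1} + (1 - lam) * b_{m-1} for m > 1.
        # `mean_rewards[m-1]` holds Q_m, the mean reward at iteration m.
        baselines = [0.0]
        for q in mean_rewards[:-1]:
            baselines.append(lam * q + (1.0 - lam) * baselines[-1])
        return baselines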

4.4 Results and Analysis

Automatic evaluation Table 2 shows the performance of the models on the Quora datasets. In both settings, we find that the proposed RbM-SL and RbM-IRL models outperform the baseline models in terms of all the evaluation measures. Particularly in Quora-II, RbM-SL and RbM-IRL make significant improvements over the baselines, which demonstrates their higher ability in learning for paraphrase generation. On the Quora dataset, RbM-SL is consistently better than RbM-IRL on all the automatic measures, which is reasonable because RbM-SL makes use of additional labeled data to train the evaluator. The Quora datasets contain a large number of high-quality non-paraphrases, i.e., pairs that are literally similar but semantically different, for instance "are analogue clocks better than digital" and "is analogue better than digital". Trained with such data, the evaluator tends to become more capable at paraphrase identification. In an additional evaluation on the Quora data, the evaluator used in RbM-SL achieves an accuracy of 87% in identifying positive and negative pairs of paraphrases.

Table 3 shows the performance on the Twitter corpus. Our models again outperform the baselines in terms of all the evaluation measures. Note that RbM-IRL performs better than RbM-SL in this case. The reason might be that the evaluator of RbM-SL cannot be effectively trained with the relatively small dataset, while RbM-IRL can leverage its advantage of learning the evaluator with less data.

In our experiments, we find that the training techniques proposed in Section 3.3 are all necessary and effective. Reward shaping is employed by default in all the RL-based models. Reward rescaling works particularly well for the RbM models, where the reward functions are learned from data. Without reward rescaling, RbM-SL can still outperform the baselines, but with smaller margins. For RbM-IRL, curriculum learning is necessary for its best performance; without curriculum learning, RbM-IRL only achieves performance comparable to ROUGE-RL.

Human evaluation We randomly select 300 sentences from the test data as input and generate paraphrases using the different models. The pairs of paraphrases are then aggregated and partitioned into seven random buckets for seven human assessors to evaluate. The assessors are asked to rate each sentence pair according to two criteria: relevance (the paraphrase is semantically close to the original sentence) and fluency (the paraphrase is fluent as a natural-language sentence, and its grammar is correct). Hence each assessor gives two scores to each paraphrase, both ranging from 1 to 5. To reduce evaluation variance, a detailed evaluation guideline for the assessors is given in Appendix B. Each paraphrase is rated by two assessors, and the scores are averaged as the final judgement. The agreement between assessors is moderate (kappa = 0.44).
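
As a sketch of how such numbers can be computed (the exact aggregation procedure and kappa variant are not specified here), using scikit-learn and hypothetical rating lists from the two assessors of each item:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 1-5 ratings from the two assessors of each paraphrase.
    ratings_a = [5, 4, 3, 4, 2]
    ratings_b = [4, 4, 3, 5, 2]

    # Final judgement: the average of the two assessors' scores.
    final_scores = [(a + b) / 2.0 for a, b in zip(ratings_a, ratings_b)]

    # Inter-assessor agreement; an unweighted Cohen's kappa is one common choice.
    kappa = cohen_kappa_score(ratings_a, ratings_b)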

Table 4 shows the average ratings for each model, including the ground-truth references. Our RbM-SL and RbM-IRL models get better scores in terms of relevance and fluency than the baseline models, and the differences are statistically significant (paired t-test, p-value < 0.01). We note that in the human evaluation, RbM-SL achieves the best relevance score while RbM-IRL achieves the best fluency score.

Case study Figure 3 gives some examples of paraphrases generated by the models on Quora-II for illustration. The first and second examples show the superior performance of RbM-SL and RbM-IRL over the other models. In the third example, both RbM-SL and RbM-IRL capture accurate paraphrasing patterns, while the other models wrongly segment and copy words from the input sentence. Compared to RbM-SL, which makes the error of repeating the word scripting, RbM-IRL generates a more fluent paraphrase; the reason is that the evaluator in RbM-IRL is more capable of measuring the fluency of a sentence. In the fourth example, RL-ROUGE generates a totally nonsensical sentence, and pointer-generator and RbM-IRL cover only half of the content of the original sentence, while RbM-SL successfully rephrases and preserves all the meaning. All of the models fail on the last example, because the word ducking is a rare word that never appears in the training data. Pointer-generator and RL-ROUGE generate totally irrelevant words, such as the UNK token or victory, while RbM-SL and RbM-IRL still generate topic-relevant words.
Figure 3: Examples of the generated paraphrases by different models on Quora-II.

5 Related Work

Neural paraphrase generation has recently drawn attention in different application scenarios. The task is often formalized as a sequence-to-sequence (Seq2Seq) learning problem. Prakash et al. (2016) employ a stacked residual LSTM network in the Seq2Seq model to enlarge the model capacity. Cao et al. (2017) utilize an additional vocabulary to restrict word candidates during generation. Gupta et al. (2018) use a variational auto-encoder framework to generate more diverse paraphrases. Ma et al. (2018) utilize an attention layer instead of a linear mapping in the decoder to pick word candidates. Iyyer et al. (2018) harness syntactic information for controllable paraphrase generation. Zhang and Lapata (2017) tackle the similar task of sentence simplification with a Seq2Seq model coupled with deep reinforcement learning, in which the reward function is manually defined for the task. Similar to these works, we also pre-train the paraphrase generator within the Seq2Seq framework. The main difference lies in that we use another trainable neural network, referred to as the evaluator, to guide the training of the generator through reinforcement learning.

There is also work on paraphrase generation in different settings. For example, Mallinson et al. (2017) leverage bilingual data to produce paraphrases by pivoting over a shared translation in another language. Wieting et al. (2017) and Wieting and Gimpel (2018) use neural machine translation to generate paraphrases via back-translation of bilingual sentence pairs. Buck et al. (2018) and Dong et al. (2017) tackle the problem of QA-specific paraphrasing with guidance from an external QA system and an associated evaluation metric.

Inverse reinforcement learning (IRL) aims to learn a reward function from expert demonstrations. Abbeel and Ng (2004) propose apprenticeship learning, which uses a feature-based linear reward function and learns to match feature expectations. Ratliff et al. (2006) cast the problem as structured maximum-margin prediction. Ziebart et al. (2008) propose maximum entropy IRL in order to solve the problem of expert suboptimality. Recent work involving deep learning in IRL includes Finn et al. (2016b) and Ho et al. (2016). There does not seem to be much work on IRL for NLP. In Neu and Szepesvári (2009), parsing is formalized as a feature-expectation matching problem. Wang et al. (2018) apply adversarial inverse reinforcement learning to visual storytelling. To the best of our knowledge, our work is the first to apply deep IRL to a Seq2Seq task.

Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are a family of unsupervised generative models. A GAN contains a generator and a discriminator, for generating examples from random noise and distinguishing generated examples from real ones, respectively, and the two are trained in an adversarial way. There are applications of GAN to NLP, such as text generation (Yu et al., 2017; Guo et al., 2018) and dialogue generation (Li et al., 2017). RankGAN (Lin et al., 2017) is the one most similar to RbM-IRL, in that it employs a ranking model as the discriminator. However, RankGAN works for text generation rather than sequence-to-sequence learning, and the training of the generator in RankGAN relies on parallel data, while the training of RbM-IRL can use non-parallel data. There are connections between GAN and IRL, as pointed out by Finn et al. (2016a) and Ho and Ermon (2016). However, there are significant differences between GAN and our RbM-IRL model. GAN employs the discriminator to distinguish generated examples from real examples, while RbM-IRL employs the evaluator as a reward function in RL. The generator in GAN is trained to maximize the loss of the discriminator in an adversarial way, while the generator in RbM-IRL is trained to maximize the expected cumulative reward from the evaluator.

6 Conclusion

In this paper, we have proposed a novel deep reinforcement learning approach to paraphrase generation, with a new framework consisting of a generator and an evaluator, modeled as a sequence-to-sequence learning model and a deep matching model, respectively. The generator, which is for paraphrase generation, is first trained via sequence-to-sequence learning. The evaluator, which is for paraphrase identification, is then trained via supervised learning or inverse reinforcement learning, depending on the setting. With a well-trained evaluator, the generator is further fine-tuned by reinforcement learning to produce more accurate paraphrases. The experimental results demonstrate that the proposed method can significantly improve the quality of paraphrase generation over the baseline methods. In the future, we plan to apply the framework and training techniques to other tasks, such as machine translation and dialogue.

Acknowledgments

This work is supported by China National 973 Program 2014CB340301.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In ICML.
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML.
Igor Bolshakov and Alexander Gelbukh. 2004. Synonymous paraphrasing using wordnet and internet. Natural Language Processing and Information Systems, pages 189–200.
Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. In ICLR.
Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. 2017. Joint copying and restricted generation for paraphrase. In AAAI.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP.
Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. In EMNLP.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. 2016a. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. NIPS 2016 Workshop on Adversarial Training.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. 2016b. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In AAAI.
Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In AAAI.
Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In NIPS.
Jonathan Ho, Jayesh Gupta, and Stefano Ermon. 2016. Model-free imitation learning with policy optimization. In ICML.
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In NAACL.
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In EMNLP.
Alon Lavie and Abhaya Agarwal. 2007. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation.
Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In EMNLP.
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In ACL-04 workshop.
Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In NIPS.
Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, and Xuancheng Ren. 2018. Word embedding attention network: Generating words by querying distributed word representations for paraphrase generation. In NAACL.
Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In EACL.
Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Computational Linguistics, 9(1):1–10.
Shashi Narayan, Siva Reddy, and Shay B. Cohen. 2016. Paraphrase generation from latent-variable pcfgs for semantic parsing. In INLG.
Gergely Neu and Csaba Szepesvári. 2009. Training parsers by inverse reinforcement learning. Machine Learning, 77(2):303–337.
Andrew Y. Ng, Daishi Harada, and Stuart Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.
Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual lstm networks. In COLING.
Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase generation. In EMNLP.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2006. Maximum margin planning. In ICML.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL.
Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.
Yu Su and Xifeng Yan. 2017. Cross-domain semantic parsing via paraphrasing. In EMNLP.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Oriol Vinyals and Quoc Le. 2015. A neural conversational model.
Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.
John Wieting and Kevin Gimpel. 2018. Paranmt-50m: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In ACL.
John Wieting, Jonathan Mallinson, and Kevin Gimpel. 2017. Learning paraphrastic sentence embeddings from back-translated bitext. In EMNLP.
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
Wei Wu, Zhengdong Lu, and Hang Li. 2013. Learning bilinear model for matching queries and documents. The Journal of Machine Learning Research, 14(1):2519–2548.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In IJCAI.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI.
Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In EMNLP.
Shiqi Zhao, Xiang Lan, Ting Liu, and Sheng Li. 2009. Application-driven statistical paraphrase generation. In ACL.
Shiqi Zhao, Cheng Niu, Ming Zhou, Ting Liu, and Sheng Li. 2008. Combining multiple resources to improve smt-based paraphrasing model. In ACL.
Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In AAAI.