This post shows how to use Sequence to Sequence Learning (Seq2Seq) to implement Neural Machine Translation (NMT).
How it works
Previously we implemented Chinese word segmentation with a sequence labeling model; sequence labeling is one kind of Seq2Seq task.
This time we use Seq2Seq to implement NMT. Since both the input sentence and the output sentence contain multiple words and their lengths are generally different, this corresponds to the fourth case in the figure above.
The simplest approach is to first encode the entire input sentence into a fixed-length vector representation, then decode it step by step to produce the translated sentence. Both the Encoder and the Decoder can be implemented with RNNs.
For the RNN cell you can choose LSTM or GRU, and extensions such as multi-layer LSTMs or bidirectional LSTMs are also worth considering.
An Attention mechanism can also be added: for the encoder output produced at every input position, compute attention weights and take a weighted combination.
- Instead of using only the Encoder's final output, use the Encoder's output at every step, similar to the image patches in image caption generation
- At each decoding step, the Decoder first computes attention weights from the relationship between its current state and each Encoder output
- The Encoder outputs are then weighted and summed according to these weights, giving the context used at this step
- Based on the context and the previous output, the Decoder updates its state and produces the next output
When computing the attention weights there are two main families of score functions, multiplicative and additive; the former is known as Luong's multiplicative style and the latter as Bahdanau's additive style.
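As a rough sketch of these two score styles and of the weighted context described above (a NumPy toy, not the exact computation performed by the TensorFlow attention wrappers used later; all shapes and weight matrices here are illustrative assumptions):

```python
import numpy as np

def luong_score(s, H, W):
    # multiplicative: score_t = s^T W h_t for every encoder output h_t
    return H @ (W @ s)                                    # (seq_len,)

def bahdanau_score(s, H, W1, W2, v):
    # additive: score_t = v^T tanh(W1 h_t + W2 s)
    return np.tanh(H @ W1.T + s @ W2.T) @ v               # (seq_len,)

def attention_context(scores, H):
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over encoder steps
    return weights @ H                                    # weighted sum -> context vector

# toy example: 5 encoder steps, hidden size 4
H = np.random.randn(5, 4)                                 # encoder outputs, one row per step
s = np.random.randn(4)                                    # current decoder state
context = attention_context(luong_score(s, H, np.random.randn(4, 4)), H)
print(context.shape)                                      # (4,)
```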
Data
We use the Chinese-English parallel corpus provided by the NiuTrans open-source community (www.niutrans.com/). After cleanup, the training set contains 100K sentence pairs, the validation set 1K pairs, and the test set 400 pairs.
Implementation
Here we mainly use the APIs provided by TensorFlow to implement Seq2Seq learning, attention, and beam search, following the implementation of this project: github.com/tensorflow/…
The code covers three parts: training, validation, and inference.
- Training: train the model on the training set and compute the loss
- Validation: evaluate the model on the validation set and compute the loss
- Inference: apply the model to the test set without computing the loss, generate sequences with beam search, and evaluate with the BLEU metric
Load the libraries.
```python
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.utils import shuffle
from keras.preprocessing.sequence import pad_sequences
import os
from tqdm import tqdm
import pickle
```
Load the Chinese and English vocabularies, keeping the 20K most frequent words; all other words are represented as <unk>.
```python
def load_vocab(path):
    with open(path, 'r') as fr:
        vocab = fr.readlines()
        vocab = [w.strip('\n') for w in vocab]
    return vocab

vocab_ch = load_vocab('data/vocab.ch')
vocab_en = load_vocab('data/vocab.en')
print(len(vocab_ch), vocab_ch[:20])
print(len(vocab_en), vocab_en[:20])

word2id_ch = {w: i for i, w in enumerate(vocab_ch)}
id2word_ch = {i: w for i, w in enumerate(vocab_ch)}
word2id_en = {w: i for i, w in enumerate(vocab_en)}
id2word_en = {i: w for i, w in enumerate(vocab_en)}
```
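The data files used below are assumed to be preprocessed so that every token is already in the 20K-word vocabulary (rare words replaced by <unk>). If raw text were fed in instead, a lookup with a fallback would keep out-of-vocabulary tokens from raising a KeyError; a minimal sketch:

```python
def tokens_to_ids(tokens, word2id):
    # map tokens the vocabulary does not contain to the <unk> id
    unk = word2id['<unk>']
    return [word2id.get(w, unk) for w in tokens]

# example: any token missing from vocab.ch is mapped to <unk>
print(tokens_to_ids(['<s>', '你好', '某个罕见词', '</s>'], word2id_ch))
```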
Load the training, validation, and test data, compute the maximum sequence lengths of the Chinese and English data, and pad the corresponding data according to mode.
```python
def load_data(path, word2id):
    with open(path, 'r') as fr:
        lines = fr.readlines()
    sentences = [line.strip('\n').split(' ') for line in lines]
    sentences = [[word2id['<s>']] + [word2id[w] for w in sentence] + [word2id['</s>']]
                 for sentence in sentences]
    lens = [len(sentence) for sentence in sentences]
    maxlen = np.max(lens)
    return sentences, lens, maxlen

# train: training, no beam search, calculate loss
# eval:  no training, no beam search, calculate loss
# infer: no training, beam search, calculate bleu
mode = 'train'

train_ch, len_train_ch, maxlen_train_ch = load_data('data/train.ch', word2id_ch)
train_en, len_train_en, maxlen_train_en = load_data('data/train.en', word2id_en)
dev_ch, len_dev_ch, maxlen_dev_ch = load_data('data/dev.ch', word2id_ch)
dev_en, len_dev_en, maxlen_dev_en = load_data('data/dev.en', word2id_en)
test_ch, len_test_ch, maxlen_test_ch = load_data('data/test.ch', word2id_ch)
test_en, len_test_en, maxlen_test_en = load_data('data/test.en', word2id_en)

maxlen_ch = np.max([maxlen_train_ch, maxlen_dev_ch, maxlen_test_ch])
maxlen_en = np.max([maxlen_train_en, maxlen_dev_en, maxlen_test_en])
print(maxlen_ch, maxlen_en)

if mode == 'train':
    train_ch = pad_sequences(train_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    train_en = pad_sequences(train_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(train_ch.shape, train_en.shape)
elif mode == 'eval':
    dev_ch = pad_sequences(dev_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    dev_en = pad_sequences(dev_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(dev_ch.shape, dev_en.shape)
elif mode == 'infer':
    test_ch = pad_sequences(test_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    test_en = pad_sequences(test_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(test_ch.shape, test_en.shape)
```
Define four placeholders and embed the inputs.
```python
X = tf.placeholder(tf.int32, [None, maxlen_ch])
X_len = tf.placeholder(tf.int32, [None])
Y = tf.placeholder(tf.int32, [None, maxlen_en])
Y_len = tf.placeholder(tf.int32, [None])
Y_in = Y[:, :-1]
Y_out = Y[:, 1:]

k_initializer = tf.contrib.layers.xavier_initializer()
e_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embedding_size = 512
hidden_size = 512

if mode == 'train':
    batch_size = 128
else:
    batch_size = 16

with tf.variable_scope('embedding_X'):
    embeddings_X = tf.get_variable('weights_X', [len(word2id_ch), embedding_size],
                                   initializer=e_initializer)
    embedded_X = tf.nn.embedding_lookup(embeddings_X, X)  # batch_size, seq_len, embedding_size

with tf.variable_scope('embedding_Y'):
    embeddings_Y = tf.get_variable('weights_Y', [len(word2id_en), embedding_size],
                                   initializer=e_initializer)
    embedded_Y = tf.nn.embedding_lookup(embeddings_Y, Y_in)  # batch_size, seq_len, embedding_size
```
Define the encoder, which uses a bidirectional LSTM.
```python
def single_cell(mode=mode):
    if mode == 'train':
        keep_prob = 0.8
    else:
        keep_prob = 1.0
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)
    return cell

def multi_cells(num_layers):
    cells = []
    for i in range(num_layers):
        cell = single_cell()
        cells.append(cell)
    return tf.nn.rnn_cell.MultiRNNCell(cells)

with tf.variable_scope('encoder'):
    num_layers = 1
    fw_cell = multi_cells(num_layers)
    bw_cell = multi_cells(num_layers)
    bi_outputs, bi_state = tf.nn.bidirectional_dynamic_rnn(fw_cell, bw_cell, embedded_X,
                                                           dtype=tf.float32,
                                                           sequence_length=X_len)
    # fw: batch_size, seq_len, hidden_size
    # bw: batch_size, seq_len, hidden_size
    print('=' * 100, '\n', bi_outputs)

    encoder_outputs = tf.concat(bi_outputs, -1)
    print('=' * 100, '\n', encoder_outputs)  # batch_size, seq_len, 2 * hidden_size

    # 2 tuple(fw & bw), 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100, '\n', bi_state)
    encoder_state = []
    for i in range(num_layers):
        encoder_state.append(bi_state[0][i])  # forward
        encoder_state.append(bi_state[1][i])  # backward
    encoder_state = tuple(encoder_state)  # 2 tuple, 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100)
    for i in range(len(encoder_state)):
        print(i, encoder_state[i])
```
Define the decoder, which uses a two-layer LSTM.
```python
with tf.variable_scope('decoder'):
    beam_width = 10
    memory = encoder_outputs

    if mode == 'infer':
        memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
        X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
        encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
        bs = batch_size * beam_width
    else:
        bs = batch_size

    attention = tf.contrib.seq2seq.LuongAttention(hidden_size, memory, X_len, scale=True)  # multiplicative
    # attention = tf.contrib.seq2seq.BahdanauAttention(hidden_size, memory, X_len, normalize=True)  # additive
    cell = multi_cells(num_layers * 2)
    cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hidden_size, name='attention')
    decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)

    with tf.variable_scope('projected'):
        output_layer = tf.layers.Dense(len(word2id_en), use_bias=False,
                                       kernel_initializer=k_initializer)

    if mode == 'infer':
        start = tf.fill([batch_size], word2id_en['<s>'])
        decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_en['</s>'],
                                                       decoder_initial_state, beam_width, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
            decoder, output_time_major=True, maximum_iterations=2 * tf.reduce_max(X_len))
        sample_id = outputs.predicted_ids
    else:
        helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y,
                                                   [maxlen_en - 1 for b in range(batch_size)])
        decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
            decoder, output_time_major=True)
        logits = outputs.rnn_output
        logits = tf.transpose(logits, (1, 0, 2))
        print(logits)
```
Depending on mode, define the loss function and the optimizer if they are needed.
```python
if mode != 'infer':
    with tf.variable_scope('loss'):
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y_out, logits=logits)
        mask = tf.sequence_mask(Y_len, tf.shape(Y_out)[1], tf.float32)
        loss = tf.reduce_sum(loss * mask) / batch_size

if mode == 'train':
    learning_rate = tf.Variable(0.0, trainable=False)
    params = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, params), 5.0)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).apply_gradients(zip(grads, params))
```
The training code. After 20 epochs the training loss drops from above 200 to 52.19 and the perplexity falls to 5.53.
```python
sess = tf.Session()
sess.run(tf.global_variables_initializer())

if mode == 'train':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    if not os.path.exists(OUTPUT_DIR):
        os.mkdir(OUTPUT_DIR)

    tf.summary.scalar('loss', loss)
    summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(OUTPUT_DIR)

    epochs = 20
    for e in range(epochs):
        total_loss = 0
        total_count = 0

        start_decay = int(epochs * 2 / 3)
        if e <= start_decay:
            lr = 1.0
        else:
            decay = 0.5 ** (int(4 * (e - start_decay) / (epochs - start_decay)))
            lr = 1.0 * decay
        sess.run(tf.assign(learning_rate, lr))

        train_ch, len_train_ch, train_en, len_train_en = shuffle(train_ch, len_train_ch,
                                                                 train_en, len_train_en)
        for i in tqdm(range(train_ch.shape[0] // batch_size)):
            X_batch = train_ch[i * batch_size: i * batch_size + batch_size]
            X_len_batch = len_train_ch[i * batch_size: i * batch_size + batch_size]
            Y_batch = train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = len_train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = [l - 1 for l in Y_len_batch]

            feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
            _, ls_ = sess.run([optimizer, loss], feed_dict=feed_dict)
            total_loss += ls_ * batch_size
            total_count += np.sum(Y_len_batch)

            if i > 0 and i % 100 == 0:
                writer.add_summary(sess.run(summary, feed_dict=feed_dict),
                                   e * train_ch.shape[0] // batch_size + i)
                writer.flush()

        print('Epoch %d lr %.3f perplexity %.2f' % (e, lr, np.exp(total_loss / total_count)))
        saver.save(sess, os.path.join(OUTPUT_DIR, 'nmt'))
```
The validation code. The perplexity on the validation set is 11.56.
```python
if mode == 'eval':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    total_loss = 0
    total_count = 0
    for i in tqdm(range(dev_ch.shape[0] // batch_size)):
        X_batch = dev_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_dev_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ls_ = sess.run(loss, feed_dict=feed_dict)
        total_loss += ls_ * batch_size
        total_count += np.sum(Y_len_batch)

    print('Dev perplexity %.2f' % np.exp(total_loss / total_count))
```
The inference code. The BLEU score on the test set is 0.2069, and the generated English translations are written to output_test_diy.
```python
if mode == 'infer':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    def translate(ids):
        words = [id2word_en[i] for i in ids]
        if words[0] == '<s>':
            words = words[1:]
        if '</s>' in words:
            words = words[:words.index('</s>')]
        return ' '.join(words)

    fw = open('output_test_diy', 'w')
    for i in tqdm(range(test_ch.shape[0] // batch_size)):
        X_batch = test_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_test_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ids = sess.run(sample_id, feed_dict=feed_dict)  # seq_len, batch_size, beam_width
        ids = np.transpose(ids, (1, 2, 0))  # batch_size, beam_width, seq_len
        ids = ids[:, 0, :]  # batch_size, seq_len

        for j in range(ids.shape[0]):
            sentence = translate(ids[j])
            fw.write(sentence + '\n')
    fw.close()

    from nmt.utils.evaluation_utils import evaluate

    for metric in ['bleu', 'rouge']:
        score = evaluate('data/test.en', 'output_test_diy', metric)
        print(metric, score / 100)
```
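The evaluate helper above comes from the tensorflow/nmt repository. As an optional sanity check, the corpus BLEU can also be recomputed with NLTK; this is only a sketch, and small differences from the nmt script's own tokenization and smoothing are to be expected.

```python
# optional cross-check of the BLEU score with NLTK (pip install nltk)
from nltk.translate.bleu_score import corpus_bleu

with open('data/test.en') as f:
    references = [[line.strip().split()] for line in f]   # one reference per sentence
with open('output_test_diy') as f:
    hypotheses = [line.strip().split() for line in f]

# the inference loop drops the last partial batch, so trim the references to match
references = references[:len(hypotheses)]
print('BLEU %.4f' % corpus_bleu(references, hypotheses))
```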
Using a ready-made wheel
The following project provides a very complete interface, github.com/tensorflow/… . Different models can be customized with simple configuration; more than 70 configuration options are supported. A few examples:
--num_units --unit_type --num_layers --encoder_type --residual --attention
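For instance, a run that sets a few of these options directly on the command line might look like the sketch below; the flag names come from the list above, while the specific values (bidirectional encoder, scaled Luong attention, and so on) are assumptions for illustration.

```bash
python -m nmt.nmt --src=ch --tgt=en \
    --vocab_prefix=data/vocab --train_prefix=data/train \
    --dev_prefix=data/dev --test_prefix=data/test \
    --out_dir=model_nmt_flags \
    --num_units=512 --num_layers=2 --encoder_type=bi --attention=scaled_luong
```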
If configuring all these options feels too tedious, the project also ships four hparams templates: iwslt15.json is meant for a small dataset (IWSLT English-Vietnamese, 130K pairs), while the other three templates target a large dataset (WMT German-English, 4.5M pairs).
To train a Chinese-to-English model with this project, just run the following command; to train an English-to-Chinese model instead, simply swap the values of src and tgt.
```bash
python -m nmt.nmt --src=ch --tgt=en \
    --vocab_prefix=data/vocab --train_prefix=data/train \
    --dev_prefix=data/dev --test_prefix=data/test \
    --out_dir=model_nmt --hparams_path=nmt/standard_hparams/iwslt15.json
```
The training output includes:
- The last five saved model checkpoints
- train_log, which contains event files that can be inspected with TensorBoard
- output_dev and output_test, the translations of the validation set and the test set respectively
- best_bleu, which holds the five model versions with the highest BLEU score on the validation set
The model reaches a BLEU score of 0.233 on the validation set and 0.224 on the test set.
Run inference with the following command; write the sentences to translate into the corresponding input file, and the generated English translations will be in output_test_nmt.
```bash
python -m nmt.nmt --out_dir=model_nmt \
    --inference_input_file=test.ch --inference_output_file=output_test_nmt
```
Couplet generation
We use the following dataset, github.com/wb14123/cou… , which contains 700K pairs of couplets.
To train the model, copy iwslt15.json to couplet.json and, because there is much more data, increase the amount of training accordingly by changing num_train_steps to 100000; then run the command below.
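One way to make that copy-and-edit, assuming the nmt/standard_hparams layout used in the commands above, is a short Python snippet:

```python
import json

# copy iwslt15.json to couplet.json and raise the number of training steps
with open('nmt/standard_hparams/iwslt15.json') as f:
    hparams = json.load(f)
hparams['num_train_steps'] = 100000   # more data, so train for more steps
with open('nmt/standard_hparams/couplet.json', 'w') as f:
    json.dump(hparams, f, indent=2)
```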
It is fine not to have a separate validation set: just reuse the test set for it, since the required arguments will raise an error if left unset.
```bash
python -m nmt.nmt --src=in --tgt=out \
    --vocab_prefix=couplet/vocab --train_prefix=couplet/train \
    --dev_prefix=couplet/test --test_prefix=couplet/test \
    --out_dir=model_couplet --hparams_path=nmt/standard_hparams/couplet.json
```
Some example results from output_test; in each group of three lines, the first is the input first line of the couplet, the second the reference second line, and the third the generated second line. The character counts, parts of speech, and meaning generally match.
```
腾 飞 上 铁 , 锐 意 改 革 谋 发 展 , 勇 当 千 里 马
和 谐 南 供 , 安 全 送 电 保 畅 通 , 争 做 领 头 羊
改 革 开 放 , 科 学 发 展 促 繁 荣 , 争 做 领 头 羊

风 弦 未 拨 心 先 乱
夜 幕 已 沉 梦 更 闲
雪 韵 初 融 意 更 浓

彩 屏 如 画 , 望 秀 美 崤 函 , 花 团 锦 簇
短 信 报 春 , 喜 和 谐 社 会 , 物 阜 民 康
妙 笔 生 花 , 书 辉 煌 史 册 , 虎 啸 龙 吟
```
To generate a second line for a first line the model has never seen, i.e. to run inference, use the method introduced earlier.
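For instance, mirroring the inference command shown earlier (the input and output file names here are placeholders):

```bash
python -m nmt.nmt --out_dir=model_couplet \
    --inference_input_file=my_first_lines.txt \
    --inference_output_file=output_couplet
```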