This post shows how to use Sequence to Sequence Learning (Seq2Seq) to implement Neural Machine Translation (NMT).
How it works
Previously we implemented Chinese word segmentation with a sequence labeling model; sequence labeling is one kind of Seq2Seq task.
This time we use Seq2Seq to implement NMT. Since both the input sentence and the output sentence contain multiple words and their lengths are generally different, this corresponds to the fourth case in the figure above.
The simplest approach is to first encode the entire input sentence into a fixed-length vector representation, then decode it step by step to produce the translated sentence. Both the Encoder and the Decoder can be implemented with RNNs.
For the RNN cell you can choose LSTM or GRU, and extensions such as multi-layer LSTMs or bidirectional LSTMs are also worth considering.
An Attention mechanism can also be added: for the encoder output produced at every input position, compute attention weights and take a weighted combination.
- Instead of using only the Encoder's final output, use the Encoder's output at every step, similar to the image patches in image caption generation
- At each decoding step, the Decoder first computes attention weights from the relationship between its current state and each Encoder output
- The Encoder outputs are then weighted and summed according to these weights, giving the context used at this step
- Based on the context and the previous output, the Decoder updates its state and produces the next output
When computing the attention weights there are two main families of score functions, multiplicative and additive; the former is known as Luong's multiplicative style and the latter as Bahdanau's additive style.
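As a rough sketch of these two score styles and of the weighted context described above (a NumPy toy, not the exact computation performed by the TensorFlow attention wrappers used later; all shapes and weight matrices here are illustrative assumptions):

```python
import numpy as np

def luong_score(s, H, W):
    # multiplicative: score_t = s^T W h_t for every encoder output h_t
    return H @ (W @ s)                                    # (seq_len,)

def bahdanau_score(s, H, W1, W2, v):
    # additive: score_t = v^T tanh(W1 h_t + W2 s)
    return np.tanh(H @ W1.T + s @ W2.T) @ v               # (seq_len,)

def attention_context(scores, H):
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over encoder steps
    return weights @ H                                    # weighted sum -> context vector

# toy example: 5 encoder steps, hidden size 4
H = np.random.randn(5, 4)                                 # encoder outputs, one row per step
s = np.random.randn(4)                                    # current decoder state
context = attention_context(luong_score(s, H, np.random.randn(4, 4)), H)
print(context.shape)                                      # (4,)
```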
Data
We use the Chinese-English parallel corpus provided by the NiuTrans open-source community (www.niutrans.com/). After cleanup, the training set contains 100K sentence pairs, the validation set 1K pairs, and the test set 400 pairs.
Implementation
Here we mainly use the APIs provided by TensorFlow to implement Seq2Seq learning, attention, and beam search, following the implementation of this project: github.com/tensorflow/…
The code covers three parts: training, validation, and inference.
- Training: train the model on the training set and compute the loss
- Validation: evaluate the model on the validation set and compute the loss
- Inference: apply the model to the test set without computing the loss, generate sequences with beam search, and evaluate with the BLEU metric
Load the libraries.
```python
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.utils import shuffle
from keras.preprocessing.sequence import pad_sequences
import os
from tqdm import tqdm
import pickle
```
Load the Chinese and English vocabularies, keeping the 20K most frequent words; all other words are represented as <unk>.
```python
def load_vocab(path):
    with open(path, 'r') as fr:
        vocab = fr.readlines()
        vocab = [w.strip('\n') for w in vocab]
    return vocab

vocab_ch = load_vocab('data/vocab.ch')
vocab_en = load_vocab('data/vocab.en')
print(len(vocab_ch), vocab_ch[:20])
print(len(vocab_en), vocab_en[:20])

word2id_ch = {w: i for i, w in enumerate(vocab_ch)}
id2word_ch = {i: w for i, w in enumerate(vocab_ch)}
word2id_en = {w: i for i, w in enumerate(vocab_en)}
id2word_en = {i: w for i, w in enumerate(vocab_en)}
```
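The data files used below are assumed to be preprocessed so that every token is already in the 20K-word vocabulary (rare words replaced by <unk>). If raw text were fed in instead, a lookup with a fallback would keep out-of-vocabulary tokens from raising a KeyError; a minimal sketch:

```python
def tokens_to_ids(tokens, word2id):
    # map tokens the vocabulary does not contain to the <unk> id
    unk = word2id['<unk>']
    return [word2id.get(w, unk) for w in tokens]

# example: any token missing from vocab.ch is mapped to <unk>
print(tokens_to_ids(['<s>', '你好', '某个罕见词', '</s>'], word2id_ch))
```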
Load the training, validation, and test data, compute the maximum sequence lengths of the Chinese and English data, and pad the corresponding data according to mode.
```python
def load_data(path, word2id):
    with open(path, 'r') as fr:
        lines = fr.readlines()
    sentences = [line.strip('\n').split(' ') for line in lines]
    sentences = [[word2id['<s>']] + [word2id[w] for w in sentence] + [word2id['</s>']]
                 for sentence in sentences]
    lens = [len(sentence) for sentence in sentences]
    maxlen = np.max(lens)
    return sentences, lens, maxlen

# train: training, no beam search, calculate loss
# eval:  no training, no beam search, calculate loss
# infer: no training, beam search, calculate bleu
mode = 'train'

train_ch, len_train_ch, maxlen_train_ch = load_data('data/train.ch', word2id_ch)
train_en, len_train_en, maxlen_train_en = load_data('data/train.en', word2id_en)
dev_ch, len_dev_ch, maxlen_dev_ch = load_data('data/dev.ch', word2id_ch)
dev_en, len_dev_en, maxlen_dev_en = load_data('data/dev.en', word2id_en)
test_ch, len_test_ch, maxlen_test_ch = load_data('data/test.ch', word2id_ch)
test_en, len_test_en, maxlen_test_en = load_data('data/test.en', word2id_en)

maxlen_ch = np.max([maxlen_train_ch, maxlen_dev_ch, maxlen_test_ch])
maxlen_en = np.max([maxlen_train_en, maxlen_dev_en, maxlen_test_en])
print(maxlen_ch, maxlen_en)

if mode == 'train':
    train_ch = pad_sequences(train_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    train_en = pad_sequences(train_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(train_ch.shape, train_en.shape)
elif mode == 'eval':
    dev_ch = pad_sequences(dev_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    dev_en = pad_sequences(dev_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(dev_ch.shape, dev_en.shape)
elif mode == 'infer':
    test_ch = pad_sequences(test_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    test_en = pad_sequences(test_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(test_ch.shape, test_en.shape)
```
Define four placeholders and embed the inputs.
```python
X = tf.placeholder(tf.int32, [None, maxlen_ch])
X_len = tf.placeholder(tf.int32, [None])
Y = tf.placeholder(tf.int32, [None, maxlen_en])
Y_len = tf.placeholder(tf.int32, [None])
Y_in = Y[:, :-1]
Y_out = Y[:, 1:]

k_initializer = tf.contrib.layers.xavier_initializer()
e_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embedding_size = 512
hidden_size = 512

if mode == 'train':
    batch_size = 128
else:
    batch_size = 16

with tf.variable_scope('embedding_X'):
    embeddings_X = tf.get_variable('weights_X', [len(word2id_ch), embedding_size],
                                   initializer=e_initializer)
    embedded_X = tf.nn.embedding_lookup(embeddings_X, X)  # batch_size, seq_len, embedding_size

with tf.variable_scope('embedding_Y'):
    embeddings_Y = tf.get_variable('weights_Y', [len(word2id_en), embedding_size],
                                   initializer=e_initializer)
    embedded_Y = tf.nn.embedding_lookup(embeddings_Y, Y_in)  # batch_size, seq_len, embedding_size
```
Define the encoder, which uses a bidirectional LSTM.
```python
def single_cell(mode=mode):
    if mode == 'train':
        keep_prob = 0.8
    else:
        keep_prob = 1.0
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)
    return cell

def multi_cells(num_layers):
    cells = []
    for i in range(num_layers):
        cell = single_cell()
        cells.append(cell)
    return tf.nn.rnn_cell.MultiRNNCell(cells)

with tf.variable_scope('encoder'):
    num_layers = 1
    fw_cell = multi_cells(num_layers)
    bw_cell = multi_cells(num_layers)
    bi_outputs, bi_state = tf.nn.bidirectional_dynamic_rnn(fw_cell, bw_cell, embedded_X,
                                                           dtype=tf.float32,
                                                           sequence_length=X_len)
    # fw: batch_size, seq_len, hidden_size
    # bw: batch_size, seq_len, hidden_size
    print('=' * 100, '\n', bi_outputs)

    encoder_outputs = tf.concat(bi_outputs, -1)
    print('=' * 100, '\n', encoder_outputs)  # batch_size, seq_len, 2 * hidden_size

    # 2 tuple(fw & bw), 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100, '\n', bi_state)
    encoder_state = []
    for i in range(num_layers):
        encoder_state.append(bi_state[0][i])  # forward
        encoder_state.append(bi_state[1][i])  # backward
    encoder_state = tuple(encoder_state)  # 2 tuple, 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100)
    for i in range(len(encoder_state)):
        print(i, encoder_state[i])
```
Define the decoder, which uses a two-layer LSTM.
```python
with tf.variable_scope('decoder'):
    beam_width = 10
    memory = encoder_outputs

    if mode == 'infer':
        memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
        X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
        encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
        bs = batch_size * beam_width
    else:
        bs = batch_size

    attention = tf.contrib.seq2seq.LuongAttention(hidden_size, memory, X_len, scale=True)  # multiplicative
    # attention = tf.contrib.seq2seq.BahdanauAttention(hidden_size, memory, X_len, normalize=True)  # additive
    cell = multi_cells(num_layers * 2)
    cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hidden_size, name='attention')
    decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)

    with tf.variable_scope('projected'):
        output_layer = tf.layers.Dense(len(word2id_en), use_bias=False,
                                       kernel_initializer=k_initializer)

    if mode == 'infer':
        start = tf.fill([batch_size], word2id_en['<s>'])
        decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_en['</s>'],
                                                       decoder_initial_state, beam_width, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
            decoder, output_time_major=True, maximum_iterations=2 * tf.reduce_max(X_len))
        sample_id = outputs.predicted_ids
    else:
        helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y,
                                                   [maxlen_en - 1 for b in range(batch_size)])
        decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
            decoder, output_time_major=True)
        logits = outputs.rnn_output
        logits = tf.transpose(logits, (1, 0, 2))
        print(logits)
```
Depending on mode, define the loss function and the optimizer if they are needed.
```python
if mode != 'infer':
    with tf.variable_scope('loss'):
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y_out, logits=logits)
        mask = tf.sequence_mask(Y_len, tf.shape(Y_out)[1], tf.float32)
        loss = tf.reduce_sum(loss * mask) / batch_size

if mode == 'train':
    learning_rate = tf.Variable(0.0, trainable=False)
    params = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, params), 5.0)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).apply_gradients(zip(grads, params))
```
The training code. After 20 epochs the training loss drops from above 200 to 52.19 and the perplexity falls to 5.53.
```python
sess = tf.Session()
sess.run(tf.global_variables_initializer())

if mode == 'train':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    if not os.path.exists(OUTPUT_DIR):
        os.mkdir(OUTPUT_DIR)

    tf.summary.scalar('loss', loss)
    summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(OUTPUT_DIR)

    epochs = 20
    for e in range(epochs):
        total_loss = 0
        total_count = 0

        start_decay = int(epochs * 2 / 3)
        if e <= start_decay:
            lr = 1.0
        else:
            decay = 0.5 ** (int(4 * (e - start_decay) / (epochs - start_decay)))
            lr = 1.0 * decay
        sess.run(tf.assign(learning_rate, lr))

        train_ch, len_train_ch, train_en, len_train_en = shuffle(train_ch, len_train_ch,
                                                                 train_en, len_train_en)
        for i in tqdm(range(train_ch.shape[0] // batch_size)):
            X_batch = train_ch[i * batch_size: i * batch_size + batch_size]
            X_len_batch = len_train_ch[i * batch_size: i * batch_size + batch_size]
            Y_batch = train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = len_train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = [l - 1 for l in Y_len_batch]

            feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
            _, ls_ = sess.run([optimizer, loss], feed_dict=feed_dict)
            total_loss += ls_ * batch_size
            total_count += np.sum(Y_len_batch)

            if i > 0 and i % 100 == 0:
                writer.add_summary(sess.run(summary, feed_dict=feed_dict),
                                   e * train_ch.shape[0] // batch_size + i)
                writer.flush()

        print('Epoch %d lr %.3f perplexity %.2f' % (e, lr, np.exp(total_loss / total_count)))
        saver.save(sess, os.path.join(OUTPUT_DIR, 'nmt'))
```
The validation code. The perplexity on the validation set is 11.56.
```python
if mode == 'eval':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    total_loss = 0
    total_count = 0
    for i in tqdm(range(dev_ch.shape[0] // batch_size)):
        X_batch = dev_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_dev_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ls_ = sess.run(loss, feed_dict=feed_dict)
        total_loss += ls_ * batch_size
        total_count += np.sum(Y_len_batch)

    print('Dev perplexity %.2f' % np.exp(total_loss / total_count))
```
The inference code. The BLEU score on the test set is 0.2069, and the generated English translations are written to output_test_diy.
```python
if mode == 'infer':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))

    def translate(ids):
        words = [id2word_en[i] for i in ids]
        if words[0] == '<s>':
            words = words[1:]
        if '</s>' in words:
            words = words[:words.index('</s>')]
        return ' '.join(words)

    fw = open('output_test_diy', 'w')
    for i in tqdm(range(test_ch.shape[0] // batch_size)):
        X_batch = test_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_test_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]

        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ids = sess.run(sample_id, feed_dict=feed_dict)  # seq_len, batch_size, beam_width
        ids = np.transpose(ids, (1, 2, 0))  # batch_size, beam_width, seq_len
        ids = ids[:, 0, :]  # batch_size, seq_len

        for j in range(ids.shape[0]):
            sentence = translate(ids[j])
            fw.write(sentence + '\n')
    fw.close()

    from nmt.utils.evaluation_utils import evaluate

    for metric in ['bleu', 'rouge']:
        score = evaluate('data/test.en', 'output_test_diy', metric)
        print(metric, score / 100)
```
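The evaluate helper above comes from the tensorflow/nmt repository. As an optional sanity check, the corpus BLEU can also be recomputed with NLTK; this is only a sketch, and small differences from the nmt script's own tokenization and smoothing are to be expected.

```python
# optional cross-check of the BLEU score with NLTK (pip install nltk)
from nltk.translate.bleu_score import corpus_bleu

with open('data/test.en') as f:
    references = [[line.strip().split()] for line in f]   # one reference per sentence
with open('output_test_diy') as f:
    hypotheses = [line.strip().split() for line in f]

# the inference loop drops the last partial batch, so trim the references to match
references = references[:len(hypotheses)]
print('BLEU %.4f' % corpus_bleu(references, hypotheses))
```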
Using a ready-made wheel
The following project provides a very complete interface, github.com/tensorflow/… . Different models can be customized with simple configuration; more than 70 configuration options are supported. A few examples:
--num_units --unit_type --num_layers --encoder_type --residual --attention
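For instance, a run that sets a few of these options directly on the command line might look like the sketch below; the flag names come from the list above, while the specific values (bidirectional encoder, scaled Luong attention, and so on) are assumptions for illustration.

```bash
python -m nmt.nmt --src=ch --tgt=en \
    --vocab_prefix=data/vocab --train_prefix=data/train \
    --dev_prefix=data/dev --test_prefix=data/test \
    --out_dir=model_nmt_flags \
    --num_units=512 --num_layers=2 --encoder_type=bi --attention=scaled_luong
```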
If configuring all these options feels too tedious, the project also ships four hparams templates: iwslt15.json is meant for a small dataset (IWSLT English-Vietnamese, 130K pairs), while the other three templates target a large dataset (WMT German-English, 4.5M pairs).
To train a Chinese-to-English model with this project, just run the following command; to train an English-to-Chinese model instead, simply swap the values of src and tgt.
```bash
python -m nmt.nmt --src=ch --tgt=en \
    --vocab_prefix=data/vocab --train_prefix=data/train \
    --dev_prefix=data/dev --test_prefix=data/test \
    --out_dir=model_nmt --hparams_path=nmt/standard_hparams/iwslt15.json
```
The training output includes:
- The last five saved model checkpoints
- train_log, which contains event files that can be inspected with TensorBoard
- output_dev and output_test, the translations of the validation set and the test set respectively
- best_bleu, which holds the five model versions with the highest BLEU score on the validation set
The model reaches a BLEU score of 0.233 on the validation set and 0.224 on the test set.
Run inference with the following command; write the sentences to translate into the corresponding input file, and the generated English translations will be in output_test_nmt.
```bash
python -m nmt.nmt --out_dir=model_nmt \
    --inference_input_file=test.ch --inference_output_file=output_test_nmt
```
Couplet generation
We use the following dataset, github.com/wb14123/cou… , which contains 700K pairs of couplets.
To train the model, copy iwslt15.json to couplet.json and, because there is much more data, increase the amount of training accordingly by changing num_train_steps to 100000; then run the command below.
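One way to make that copy-and-edit, assuming the nmt/standard_hparams layout used in the commands above, is a short Python snippet:

```python
import json

# copy iwslt15.json to couplet.json and raise the number of training steps
with open('nmt/standard_hparams/iwslt15.json') as f:
    hparams = json.load(f)
hparams['num_train_steps'] = 100000   # more data, so train for more steps
with open('nmt/standard_hparams/couplet.json', 'w') as f:
    json.dump(hparams, f, indent=2)
```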
It is fine not to have a separate validation set: just reuse the test set for it, since the required arguments will raise an error if left unset.
```bash
python -m nmt.nmt --src=in --tgt=out \
    --vocab_prefix=couplet/vocab --train_prefix=couplet/train \
    --dev_prefix=couplet/test --test_prefix=couplet/test \
    --out_dir=model_couplet --hparams_path=nmt/standard_hparams/couplet.json
```
Some example results from output_test; in each group of three lines, the first is the input first line of the couplet, the second the reference second line, and the third the generated second line. The character counts, parts of speech, and meaning generally match.
```
腾 飞 上 铁 , 锐 意 改 革 谋 发 展 , 勇 当 千 里 马
和 谐 南 供 , 安 全 送 电 保 畅 通 , 争 做 领 头 羊
改 革 开 放 , 科 学 发 展 促 繁 荣 , 争 做 领 头 羊

风 弦 未 拨 心 先 乱
夜 幕 已 沉 梦 更 闲
雪 韵 初 融 意 更 浓

彩 屏 如 画 , 望 秀 美 崤 函 , 花 团 锦 簇
短 信 报 春 , 喜 和 谐 社 会 , 物 阜 民 康
妙 笔 生 花 , 书 辉 煌 史 册 , 虎 啸 龙 吟
```
To generate a second line for a first line the model has never seen, i.e. to run inference, use the method introduced earlier.
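For instance, mirroring the inference command shown earlier (the input and output file names here are placeholders):

```bash
python -m nmt.nmt --out_dir=model_couplet \
    --inference_input_file=my_first_lines.txt \
    --inference_output_file=output_couplet
```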