Streamlining Language Technology from Idea to Deployment



“I propose to consider the question, can machines think?”

-Alan Turing, 1950

Introduction

The birth of Natural Language Processing (NLP) is often attributed to Alan Turing’s famous test: can a human distinguish a human conversation from a computer conversation? Seven decades later, the NLP field continues to see vast progress. Attempts to analyze, structure and generate human languages have been tackled by rule-based, statistical and, most recently, neural-net-based approaches.

eBay is unique with its unmatched inventory in many languages and categories, as well as user-generated input and queries. For a global ecommerce company, NLP continues to be at the core of our business, powering tasks such as:

  • Language representation models that improve search relevance and ranking
  • Machine translation that connects buyers and sellers, even if they speak different languages
  • Named entity recognition (NER) that extracts product aspects and brand names in a query
  • Spelling correction that guides our customers to their intended items

Within both the eBay and NLP research communities, these tasks have been approached with several different programming languages and toolkits. Of note, Python-based toolkits have found the most traction in the last couple of years.

PyNLP Overview

To streamline and facilitate NLP usage at eBay, we built an internal framework called PyNLP. It plugs into the AI platform Krylov (see Figure 1).

Figure 1: Simple layered representation of PyNLP and the “Krylov” AI platform components it interacts with.

Several existing pain points led to these efforts, which we group into Exploration, Customization and Deployment.

Exploration

Out-of-the-box NLP approaches almost never work when applied to ecommerce data. When looking at an inventory item such as Hello Kitty Cheer Leader Plush Doll TY 6” !!! PINK & WHITE, it seems clear that a standard part-of-speech analyzer optimized on belles-lettres will not detect the brand “Hello Kitty.” Instead, it may assume a greeting interjection and struggle to detect that 6” is the product size. PyNLP combines models trained on ecommerce-aware data with implementations that make the best use of eBay’s infrastructure and rich data to handle such input.
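To make the difficulty concrete, here is a toy rule-based extractor, written only for illustration: it shows the kind of domain-specific patterns (inch measurements, multi-word brand names) an ecommerce title demands, and why hand-written rules like these quickly fall short of a model trained on inventory data.

```python
import re

def toy_aspect_extract(title: str) -> dict:
    """Toy extractor for illustration only; real ecommerce NER relies on
    models trained on inventory data, not hand-written patterns."""
    aspects = {}
    # A measurement like 6" trips up parsers tuned for prose.
    m = re.search(r'\b(\d+(?:\.\d+)?)\s*(?:"|in(?:ch(?:es)?)?)', title)
    if m:
        aspects["measurement"] = m.group(0)
    # Multi-word brands must be matched before generic tokenization;
    # naive substring matching would also fire on e.g. "STYLE" for "TY".
    for brand in ("Hello Kitty", "TY"):
        if brand in title:
            aspects.setdefault("brands", []).append(brand)
    return aspects

print(toy_aspect_extract('Hello Kitty Cheer Leader Plush Doll TY 6" !!! PINK & WHITE'))
# {'measurement': '6"', 'brands': ['Hello Kitty', 'TY']}
```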

Another exploration hurdle is discovering and utilizing NLP models trained by other teams across eBay. PyNLP aims to serve as the single source of truth for NLP models. Onboarding new models and accessing existing models are streamlined via a pythonic interface designed to work smoothly on top of the existing eBay AI infrastructure. Standardized interfaces allow for easy and intuitive access, both for exploration and for batch processing. This ensures that our users only need to think through implementation details when they actively want to switch implementations themselves.

Customization

The sheer number of publicly available NLP solutions is already overwhelming. Each toolkit typically comes with its own software environment requirements and, often, sparse documentation. While this is true for many software products, it is especially true in machine learning, where open-source software often stems from research prototypes and the code can accrue technical debt.

If a user wants to train on a downstream task, it is easy to pass the training data through all available models and directly compare their performance to find the best matching solution. For the models themselves, PyNLP shows how to customize many of them via QuickStart training and fine-tuning recipes in Jupyter notebooks. The requirements for each model are encapsulated and maintained in Docker build files.
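The compare-every-model workflow can be sketched as follows. The two “models” below are local stand-ins for PyNLP microservice stubs, which all expose the same callable interface; the names and data are invented for illustration.

```python
# Stand-in models: in PyNLP these would be docker-backed microservice stubs.
def rule_based_ner(text):
    return {"brand": "Hello Kitty"} if "Hello Kitty" in text else {}

def neural_ner(text):
    # Stand-in for a neural model that also catches measurements.
    return {"brand": "Hello Kitty", "measurement": '6"'} if "Kitty" in text else {}

# Labeled downstream-task data (invented examples).
test_data = [
    ('Hello Kitty Plush Doll 6"', {"brand": "Hello Kitty", "measurement": '6"'}),
    ("Plain Brown Teddy Bear", {}),
]

# Because every model serves the same interface, comparison is one loop.
scores = {}
for name, model in [("rule-based", rule_based_ner), ("neural", neural_ner)]:
    scores[name] = sum(model(text) == gold for text, gold in test_data)
    print(f"{name}: {scores[name]}/{len(test_data)} exact matches")
```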

Deployment

New language modelling paradigms, such as neural embeddings (e.g., BERT [1] ), can take several days to retrain with suitable data on a rack of dozens of GPUs, so they should be handled with prudence. It is crucial that these efforts can be shared across all teams rapidly, even during the development stage. PyNLP makes onboarding easy for researchers to explore and for model developers to share by providing templates and libraries that help connect to PyNLP’s core. With a vast collection of best practices and helper tools, we eliminate roadblocks on the way from a PyNLP microservice prototype to a production-ready service in a Kubernetes environment.
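To give a feel for what such a microservice template provides, here is a minimal sketch using only the Python standard library. The real PyNLP templates are internal and also wire in Swagger, Sphinx and Prometheus support; everything here (endpoint name, payload shape) is an illustrative assumption.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ModelHandler(BaseHTTPRequestHandler):
    """Minimal sketch of a model-serving template with a single endpoint."""

    def do_POST(self):
        if self.path != "/process":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        text = json.loads(self.rfile.read(length))["text"]
        # Code slot for the concrete model implementation; here we just
        # return the first token as a dummy entity.
        result = {"entities": [{"text": text.split()[0], "label": "UNKNOWN"}]}
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep demo output quiet

# Start the service on an ephemeral port and call it once, REST-style.
server = HTTPServer(("127.0.0.1", 0), ModelHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/process",
    data=json.dumps({"text": "Hello Kitty Plush Doll"}).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.loads(urllib.request.urlopen(req).read())
print(response)  # {'entities': [{'text': 'Hello', 'label': 'UNKNOWN'}]}
server.shutdown()
```

Because the HTTP plumbing lives in the template, a model developer only fills in the slot inside `do_POST`, which is the “connect to PyNLP’s core” idea described above.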

Below, we explore how we tackle these challenges by laying out three use cases: exploring existing solutions, customizing specific models, and converging prototypes into deployed services.

Use Case: Exploration

This use case assumes that a tech-savvy applied researcher wants to explore existing solutions for a specific task, in this case a Named Entity Recognition (NER) service.

[Code snippet: loading an eBay NER model in four lines of Python]
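The original snippet is not reproduced here; the following mock reconstruction follows the description of `registry`, `load_model` and the `NER` stub, but the exact module paths and signatures are assumptions, and the registry is a local stand-in for the Krylov-backed model management system.

```python
class _MockRegistry:
    """Local stand-in for PyNLP's global model registry."""

    _models = {"ner-ecommerce": lambda text: [("Hello Kitty", "brand")]}

    def load_model(self, name):
        # Real PyNLP pulls a docker image, mounts the model data and
        # starts a microservice here; the mock just returns a callable.
        return self._models[name]

registry = _MockRegistry()

# The four-line exploration workflow:
ner = registry.load_model("ner-ecommerce")
entities = ner('Hello Kitty Cheer Leader Plush Doll TY 6" !!! PINK & WHITE')
print(entities)  # [('Hello Kitty', 'brand')]
```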

The four lines above serve a powerful neural net trained on eBay-specific data. Under the hood, registry connects to Krylov Core (cf. Figure 1) to access the model management system. A global registry endpoint can tell an eBay researcher which models have already been onboarded and which docker image has the matching implementation details to serve the model data. load_model will download the image from the internal docker container registry, mount the model data into the container and start a microservice. NER is a stub that connects to this microservice via REST, providing a pythonic way of interacting with the model. If the model is already provided as a central microservice, PyNLP can also connect to it directly without having to host the microservice itself. However, we need to ensure that older model versions remain runnable and reproducible in case a downstream application needs their output to run properly.

All interfaces and result types are predefined by PyNLP. All NER microservices serve the same interface, and with a simple Python for loop you can process specific test data through them. Since they are started via docker, the underlying implementation might use Python, Java, C++ or any other programming language. The modeling paradigms might also range from simple rule-based solutions to modern neural approaches. Comparing NLP task performance across updated model data, different implementations or even paradigms is made very easy through this process. For our example, the deep neural net model can finally tell us that Hello Kitty is a brand, Plush is a material, Doll is a type, 6” is a measurement and that PINK and WHITE are colors.
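The idea of predefined interfaces and result types can be sketched with a dataclass and a structural protocol. The concrete PyNLP definitions are internal, so the names and fields below are assumptions.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Entity:
    """Sketch of a shared NER result type."""
    text: str
    label: str

class NERModel(Protocol):
    """Any callable mapping text to entities satisfies the interface."""
    def __call__(self, text: str) -> List[Entity]: ...

def run_batch(model: NERModel, items: List[str]) -> List[List[Entity]]:
    # The same simple loop works for every implementation, rule-based or
    # neural, Python or Java behind a microservice, because they all
    # serve this interface.
    return [model(item) for item in items]

# A dummy implementation with the right shape:
def dummy_model(text: str) -> List[Entity]:
    return [Entity(text=text.split()[0], label="UNKNOWN")]

results = run_batch(dummy_model, ["Hello Kitty Doll", "Teddy Bear"])
print(results[0][0])  # Entity(text='Hello', label='UNKNOWN')
```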


Use Case: Customization

Assume that, through exploration, a specific model caught the eye of an interested researcher. For most applied research tasks, a first assessment of the baseline performance is the beginning of the model development lifecycle. The next use case of PyNLP is to provide easy access to the model implementation itself and to allow for customization of the algorithm or fine-tuning of the model.

For typical recipes such as fine-tuning a BERT embedding, you can start the images in Jupyter notebook [2] mode and work your way through provided notebooks. Since these notebooks can be started in the AI platform Krylov, researchers also have direct access to the Hadoop clusters and other data access points. This allows them to request heavier hardware support if needed. 


Figure 2: Sphinx-generated documentation page for PyNLP. The Jupyter notebooks depicted are executable.

A central design aspect of PyNLP is its living documentation. Our code does not speak for itself – it speaks with you. Through docker base images, interface templates and framework libraries for our microservices, we provide code slots for the concrete model implementation parts. This means that even a first prototype implementation that was just copied and pasted most likely already comes with a battery of onboarding functionality. This includes support for inline documentation, Sphinx [3] integration, a SwaggerUI [4] with examples to showcase the APIs and support for Prometheus [5] metrics.


Figure 3: SwaggerUI interface for an in-house BERT model. It interactively connects with the microservice so that you can run quick examples to test the API.

Use Case: Deployment

Assuming that a strong model has been identified, customized, and optimized for a specific task, ensuring business impact in a commercial environment is the next step in the lifecycle. Of course, security considerations and service-level agreements need to be tackled individually.

We can now run microservices in a cloud-native environment as a containerized and dynamically orchestrated platform, as well as monitor logs and metrics. For deployment into Tess, our cloud infrastructure based on Kubernetes [6] , PyNLP facilitates as much of the process as possible. Sherlock.io is our event processing system that logs both the standard out/err of the microservices and the aforementioned Prometheus metrics. To simplify all of the above, we use Helm charts [7] , a template-based approach, to deploy PyNLP microservices into Tess and manage their lifecycle. With this, users can deploy a service with a single command, shielding them from the complexity of Kubernetes resource creation.
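The original command illustration did not survive; a hypothetical Helm invocation of roughly this shape (release name, chart name and values all invented for illustration) conveys the single-command idea:

```
helm install my-ner-service pynlp/microservice-chart --set image=ner-ecommerce:latest
```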


Figure 4: Grafana board reporting on metrics such as average response time. Individual metrics can be emitted by the microservice implementation, if needed.

Conclusion

PyNLP is built to significantly accelerate the NLP model development lifecycle by reducing obstacles in terms of software-specific requirements, data exchange, and interaction with the eBay infrastructure. We foster comparability and reproducibility through standardized interfaces and use living documentation to share (language-specific) considerations and best practices. While moving NLP microservices into production can never be a fully automated process, due to security considerations, we facilitate it as much as possible.

