Apache Spark for Large-Scale Distributed Natural Language Processing

TripAdvisor builds natural-language regression models to predict the probability that a user will answer “yes” or “no” to each tagging question. The models are trained not only on reviews of locations that already have tag votes, but also on a large amount of unlabeled data. With Spark as a base, working with all of this data turns out to be a straightforward process.

Here at TripAdvisor we have a lot of reviews, several hundred million according to the last announcement. I work with machine learning, and one thing we love in machine learning is putting lots of data to use.

I’ve been working on an interesting problem lately and I’d like to tell you about it. In this post, I’ll set up the problem and the underlying technology that makes it possible. I’ll get into the algorithm itself in follow-up posts.

If you’ve been to Tripadvisor recently, you may have noticed that we have various pieces of metadata that we attach to hotels, restaurants, and attractions that appear on the site. Some of these (we call them tags) are simple yes or no questions that we get from various data sources. Does a hotel have a pool? Is this an Italian restaurant? Etc.

This information is great when you can get a reliable source, but what happens when you can’t? Maybe there’s a particular part of the world where you don’t have a good data source (we deal with all parts of the world, after all). Maybe there’s a very subjective question, like “is this a romantic hotel?” Any hotel manager would probably say yes to that question. A visitor may disagree.

So how do you get around these problems? It turns out that when you have as awesome a user base as ours, you can just ask them. For a while now, we’ve been asking our users these simple yes or no questions at the end of our “write a review” form and other places. On average, they answer about 3 questions per review. That is incredibly useful to us, and by extension to other visitors on the site.

For the last little bit, I’ve been working on making the best possible use of these answers on a project we call “Auto Tagging”. You can simply ask a reviewer one of these tagging questions whenever you are not sure of an answer, but that’d waste their time and limit the coverage of the results. Instead, we build regression models based on natural language to predict the probability a user will answer “yes” or “no” to each question. That way we only have to ask if we are not sure of the answer given all the data at our disposal.

Specifically, we try to predict the probability that a user, when filling out a review form that we haven’t seen yet, would answer “yes” to the given tagging question. Note that this is different from trying to predict a “yes” on a review form given the text on the form. That’s because we are not trying to predict whether the current reviewer had a romantic stay or if the hotel was very family friendly to them. We are trying to predict whether the next visitor to the hotel will have that experience. I’ve found that when people have a very romantic stay, they may vote the hotel romantic based just on their experience, and not necessarily on any property of the hotel itself. By predicting the expected proportion of future yes votes at a particular location, we can average out this noise and get an honest estimate of just how romantic, family-friendly, or whatever a particular hotel, restaurant, or attraction is.
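
One way to write that down (my notation, not the post’s): if a trained model assigns review i at a location a probability p_i of a yes vote, then the quantity being estimated for a location with N reviews is the expected proportion of yes votes,

    \hat{p}_{\text{location}} \approx \frac{1}{N} \sum_{i=1}^{N} p_i

so a single unusually enthusiastic (or grumpy) reviewer gets averaged out rather than setting the tag on their own.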

We solve this problem using a semi-supervised form of logistic regression. A large portion of the model consists of “bag of words” type features from user-submitted reviews on the properties. Since it is a semi-supervised technique, not only do we use the reviews on locations that we have tag votes on during training, we also use a large chunk of unlabeled data. Also, when applying the model to get the end results, we need to read and process all our reviews. On top of that, we have hundreds of different tags.
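
The post doesn’t spell out the implementation, and the actual model is a custom semi-supervised variant the author plans to cover in a follow-up post, so the following is only a rough illustration of the supervised core: hashing “bag of words” features out of review text and fitting a plain logistic regression with Spark MLlib’s Java API. labeledReviews, Review, getText, and isYesVote are hypothetical names.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.classification.LogisticRegressionModel;
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    import org.apache.spark.mllib.feature.HashingTF;
    import org.apache.spark.mllib.regression.LabeledPoint;

    HashingTF tf = new HashingTF(1 << 18);          // hash words into a fixed-width feature vector

    // labeledReviews: JavaRDD<Review> of reviews that have a tag vote attached (hypothetical).
    JavaRDD<LabeledPoint> training = labeledReviews.map(r ->
        new LabeledPoint(r.isYesVote() ? 1.0 : 0.0,
            tf.transform(Arrays.asList(r.getText().toLowerCase().split("\\s+")))));

    LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(2)
        .run(training.rdd());
    model.clearThreshold();                         // predict() now returns a probability, not a 0/1 label

    double pYes = model.predict(
        tf.transform(Arrays.asList("a wonderfully romantic stay".split(" "))));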

Think about that for a second. Millions and millions of reviews times hundreds of tags. The algorithm itself is pretty cool and I’ll talk about it in another blog post. Today, I’d like to talk about the technology that makes a solution possible.

We use Spark to power this algorithm. Spark is an excellent data-parallel engine that allows you to spread your data among all the nodes in your cluster. It’s different from Map/Reduce in two important ways:

  1. It’s a lot easier to read and understand a Spark program because everything is laid out step by step without a lot of boilerplate. For example, check out the difference in implementing a word count (the “hello world” of big data) in Spark and Map/Reduce; a minimal Spark version is sketched just after this list.
  2. It allows you to operate in memory, spilling to disk only when needed.
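
For concreteness, here is roughly what that word count looks like with Spark’s Java API. This is a minimal sketch assuming the Spark 2.x Java API; the input and output paths are placeholders.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class WordCount {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));

            JavaRDD<String> lines = sc.textFile("hdfs:///input/text");             // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())     // lines -> words
                .mapToPair(word -> new Tuple2<>(word, 1))                          // (word, 1) pairs
                .reduceByKey((a, b) -> a + b);                                     // sum counts per word

            counts.saveAsTextFile("hdfs:///output/wordcounts");                    // placeholder path
            sc.stop();
        }
    }

The equivalent Hadoop Map/Reduce program needs a mapper class, a reducer class, and a driver; that boilerplate is the readability difference the comparison above is pointing at.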

Given Spark as a base, it’s a pretty straightforward process working with all this data. We just read the reviews into memory spread across a bunch of the nodes in the cluster, and iteratively work one tag at a time. The expensive step of reading the reviews and processing them into a usable format is done just once, at the beginning of the process.
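
As a rough sketch of that shape in Java (the paths, the tags list, and the scoreForTag helper are all hypothetical, and an existing JavaSparkContext sc is assumed):

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    // Read and parse the reviews once, and keep them cached across all tag iterations.
    JavaRDD<String> reviews = sc.textFile("hdfs:///reviews/part-*")             // placeholder path
        .persist(StorageLevel.MEMORY_AND_DISK);   // stay in memory, spill to disk only if needed

    for (String tag : tags) {                     // tags: List<String> with hundreds of entries
        JavaRDD<String> scored = reviews
            .map(review -> tag + "\t" + scoreForTag(review, tag));              // hypothetical scorer
        scored.saveAsTextFile("hdfs:///autotag/" + tag);                        // one output per tag
    }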

The entire process is broken down into three stages.

  • Dataset generation: We read all the reviews for all locations (restaurants / hotels / attractions) that have votes on them and generate a data set for each tag. This consists of feature selection, creating the feature vectors, and choosing the cross-validation sets. (The last 2 steps are actually nontrivial for this problem.)
  • Training: For each tag, profile the regularization parameter and train the model. This is an “embarrassingly parallel” stage that’s done pretty straightforwardly with Spark’s “parallelize” operation (see the sketch after this list). I just broadcast the data set to each node and parallelize the parameters I want to profile. Each task simply returns the out-of-sample error estimate that I use to select the model.
  • Application: Read all the reviews into memory. Yes, all of them. Iteratively score each tag on each location, dumping the results to Hadoop’s file system. Import these into Hive with “load data”, which is basically a quick HDFS rename operation.
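
The training stage might look roughly like this in Java. Dataset, tagDataset, and trainAndValidate are hypothetical stand-ins, the regularization values are just examples, and an existing JavaSparkContext sc is assumed.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    Broadcast<Dataset> data = sc.broadcast(tagDataset);          // ship the per-tag data set to every node once
    List<Double> lambdas = Arrays.asList(0.01, 0.1, 1.0, 10.0);  // regularization values to profile

    List<Tuple2<Double, Double>> errors = sc.parallelize(lambdas)
        .map(lambda -> new Tuple2<>(lambda,
            trainAndValidate(data.value(), lambda)))             // hypothetical: returns out-of-sample error
        .collect();

    double bestLambda = errors.stream()
        .min(Comparator.comparingDouble((Tuple2<Double, Double> t) -> t._2()))
        .get()._1();                                             // keep the value with the lowest error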

This is all done remarkably efficiently. Spark gives me control over what to keep in memory and what to flush when it is no longer needed. I can also choose the way data is partitioned among the nodes. I make sure to partition by location ID so that reductions and grouping require very little inter-node communication.
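
A sketch of that partitioning in Java: Review, getLocationId, and scoreForTag are hypothetical names, and the partition count is illustrative.

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;
    import scala.Tuple2;

    // Assumes reviews is a JavaRDD<Review> of parsed review objects.
    JavaPairRDD<Long, Review> byLocation = reviews
        .mapToPair(r -> new Tuple2<>(r.getLocationId(), r))      // key every review by its location ID
        .partitionBy(new HashPartitioner(200))                   // co-locate each location's reviews
        .persist(StorageLevel.MEMORY_AND_DISK);

    // mapValues keeps the partitioner, so the reduceByKey below needs no cross-node shuffle.
    JavaPairRDD<Long, Double> expectedYesVotes = byLocation
        .mapValues(r -> scoreForTag(r, "romantic"))              // hypothetical per-review probability
        .reduceByKey((a, b) -> a + b);                           // per-location totals stay node-local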

On top of that, I’m able to use real, readable Java code. I don’t have to restrict myself to working in some dialect of SQL, or deal with the large amount of boilerplate and confusing code required by Map/Reduce.

I’m really happy with the control Spark gives me over our cluster and highly recommend you check it out.

Translator: Zhao Yihua, computational advertising engineer at Sogou and former biomedical engineer, focused on recommendation algorithms and machine learning. Via CSDN.
