The Industrialization of Big Data Analytics

数据观 • 7 years ago


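As the name suggests, industrialization means automation: achieving more with less effort. A farmer once needed several days to plow a field with an ox; with a tractor it now takes a few hours. In the same way, companies can now use advanced algorithms to "plow" vast "fields of data". A factory whose deliverable product is insight may be an even more fitting analogy. Imagine, for example, an assembly line that lets you collect, sort, and classify data so that it is ready for modeling, analysis, and insight generation. Is this where we are headed? Yes. Is it necessary? Yes again.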

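The reason is that to make better use of the volume, velocity, and variety of big data, and to make big data work for them, companies need process, structure, and transparency, and industrialization provides all three. If you really want to extract value from your data and keep your company running as smoothly as a well-oiled machine, you must be able to scale, yet the ability to scale is one of the hardest problems in big data. Industrialization is the answer. Industrialization is defined by its transformative ability to scale, and scaling almost always means automating work that has traditionally been done by hand. The assembly line is the obvious example.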

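The assembly-line approach rests on establishing a set of processes that support data analytics. It is a collaborative approach that requires cross-functional cooperation and a push from C-level executives for participation across the company. But how can the process of gaining insight from data be automated?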

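Let us look at industrialization in manufacturing, where these processes originated. For years, production managers have stressed quality control and process improvement. To industrialize data analytics, the same quality-control measures must be applied to the analytics and to the business operations they drive. Any solution you build should take the following into account: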

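· Data management: data scientists creating analytic data sets should capture data lineage, provide appropriate governance, and avoid the data swamp of unrecognizable assets. This also covers documentation, notes, code, data samples, and change logs, along with checks that the assets are in order and ready for consumption.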

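· Development: this refers to modeling tools integrated into a single workbench together with visualization and data-exploration interfaces. It also includes knowledge management for storing information about the models you are building.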

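· Deployment: this is where the production model is created for later use in operations. It requires model management, such as maintaining version history and training data sets for auditing, as well as model promotion processes. Efficiency and controlled execution should be emphasized. Data platforms offer many options for analytical processing, but the business logic must stay intact when the model is deployed on another platform.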

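· Maintenance: operational systems are the "bookends" of the process. You source data from operational systems at the start, and your analysis is the final deliverable, consumed by applications and operational processes. Because of the operational dependencies inherent in these processes, strict rules of the road should be in place, including operational logging of all scoring activities and a record of irregularities when model drift occurs.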
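To make the maintenance point concrete, here is a minimal Python sketch of operational logging for scoring plus an irregularity record when drift is suspected. The article does not prescribe any implementation, so everything here is an assumption for illustration: the `ScoringMonitor` name, the baseline statistics, and the simple z-score test on the batch mean, which stands in for whatever drift measure a real team would choose.

```python
import logging
import statistics
from dataclasses import dataclass, field

logger = logging.getLogger("scoring")


@dataclass
class ScoringMonitor:
    """Hypothetical helper: logs each scoring call and records suspected drift."""
    baseline_mean: float            # mean score observed at training time
    baseline_stdev: float           # score spread at training time (assumed > 0)
    z_threshold: float = 3.0        # illustrative cut-off, not a recommended value
    irregularities: list = field(default_factory=list)

    def score_batch(self, model, model_version, batch):
        scores = [model(row) for row in batch]
        # Operational log entry for every scoring activity.
        logger.info("model=%s scored=%d", model_version, len(scores))
        if not scores:
            return scores
        batch_mean = statistics.fmean(scores)
        # Standard error of a batch mean under the training-time distribution.
        standard_error = self.baseline_stdev / len(scores) ** 0.5
        z = (batch_mean - self.baseline_mean) / standard_error
        if abs(z) > self.z_threshold:
            record = {"model_version": model_version,
                      "batch_mean": batch_mean,
                      "z": z}
            self.irregularities.append(record)
            logger.warning("possible model drift: %s", record)
        return scores
```

A batch whose mean score sits far from the training baseline ends up in `irregularities`, which gives operations staff an audit trail to review alongside the ordinary scoring log.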

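As data and analytics tools proliferate, companies will keep seeking the power of large data sets, because where there is data there is insight, and where there is insight there is value. To get there, however, the principles of industrialization must be built into data analytics.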

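When these processes are designed and implemented as a whole rather than piecemeal, and analytics is driven and sustained over time, "industrialization" takes shape. This is Analytics Ops, in other words DevOps for data science. With analytics industrialized, once a certain velocity is reached, companies can lower costs, speed up innovation, and bring new capabilities to market.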

Original text:

Industrialization, by definition, implies automation. It lets you do more with less. Just as a farmer can now plow a field with a tractor in a couple of hours instead of days with a horse, organizations can potentially plow through vast fields of data with advanced algorithms. Perhaps a better analogy is the factory—a manufacturing plant where the deliverables are insights. Imagine, for example, an assembly line that allows you to collect data, sort it, classify it, and prepare it for modeling, analysis, and insight generation. Is that where we are headed? Yes. And is it necessary? Also yes.

Here’s why. What organizations need in order to expand their access to big data’s volume, velocity, and variety—and make it work to their advantage—are three things that industrialization has baked into it: process, structure, and transparency. If you really want to get value from your data and run your organization like a well-oiled machine, you have to be able to scale. However, the ability to scale is one of the biggest conundrums of big data. The answer is industrialization. Industrialization is defined by its transformative ability to scale, and scaling almost always means automating what has traditionally been done by hand. Think assembly line.

An assembly line approach is based on defining a set of processes that support analytics. It’s a collaborative approach that requires cross-functional alignment and a commitment from the C-suite to drive participation. But how do you automate the process of gleaning insights from data?

Let’s look at how industrialization happens in manufacturing, where the processes were originally developed. Manufacturing managers have insisted on quality controls and process refinement for years. If our industry is going to industrialize analytics, we need to apply the same types of quality control measures for the analytics and the operations that they power. Any solution you build should take into account the following:

• Data management: This involves the creation of analytic data sets by data scientists in a manner that captures lineage, provides appropriate governance, and avoids the dreaded data swamp of unrecognizable assets. It also includes documentation, notes, code, data samples, and a change log as well as checks and balances to ensure that the assets are ready for consumption.

• Development: This refers to modeling tools that are integrated into a single workbench with visualization and interfaces designed for data exploration. It also includes knowledge management to store information about the models you are building.

• Deployment: This is where the production model is created that will later be used for operations. It requires model management, such as maintaining version history and training data sets for auditing, and model promotion processes. An emphasis should be placed on efficiency and controlled execution. Data platforms offer many options for analytical processing, but this approach must promise that business logic stays intact if the model is deployed on another platform.

• Maintenance: Operational systems are the bookends of the process. You source data at the beginning from your operational system and your analysis is the end deliverable that is consumed by the application or operational process. Strict rules of the road should be in place due to the operational dependencies inherent in these processes, including operational logging of all scoring activities, and a process to log irregularities when model drift occurs.
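The deployment bullet's "version history and training data sets for auditing, and model promotion processes" can be pictured with a small Python sketch. This is only an illustration under stated assumptions: `ModelRegistry`, `ModelVersion`, and the idea of fingerprinting training data with a SHA-256 hash are hypothetical choices, not something the article specifies.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    number: int
    training_data_sha256: str       # fingerprint of the training set, kept for audits
    registered_at: datetime


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; a real one would persist its state."""
    versions: list = field(default_factory=list)
    promotions: list = field(default_factory=list)
    production: Optional[int] = None

    def register(self, training_rows):
        # Serialize deterministically so identical data yields an identical hash.
        payload = json.dumps(training_rows, sort_keys=True).encode("utf-8")
        version = ModelVersion(
            number=len(self.versions) + 1,
            training_data_sha256=hashlib.sha256(payload).hexdigest(),
            registered_at=datetime.now(timezone.utc),
        )
        self.versions.append(version)
        return version

    def promote(self, number, approved_by):
        # Controlled execution: only a registered version can reach production.
        if not any(v.number == number for v in self.versions):
            raise KeyError(f"unknown model version {number}")
        self.promotions.append({"from": self.production,
                                "to": number,
                                "approved_by": approved_by,
                                "at": datetime.now(timezone.utc)})
        self.production = number
```

Keeping the promotion history separate from the version list means an auditor can answer both "what was trained on which data" and "who moved which version into production, and when".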

As the availability of data continues to explode and the tools to analyze it proliferate, companies will continue to seek the power that big data sets promise because where there is data, there are insights, and where there are insights, there is value. But in order to get there, we need to embed the principles of industrialization into the process.

When these processes are designed and implemented on the whole—not in piecemeal format—“industrialization” will start to occur where analytics are driven and sustained over time. This is Analytics Ops, or in other words, Dev Ops for data science. With industrialization of analytics, once a certain velocity is achieved, companies can ultimately lower costs, speed innovation, and bring new capabilities to market.

Article image: Abstract in industry. (source: Roland Keates on Flickr).

Original title: The industrialization of analytics

Source: https://www.oreilly.com/ideas/the-industrialization-of-analytics?imm_mid=0e72a8&cmp=em-data-na-na-newsltr

Compiled by: 车品觉
