ECiDA: Evolutionary Changes in Data Analysis

F.J. Blaauw, R. Overbeek, T. Albers, J. Vlek, M. Maessen, J. Gooijer, E. Lazovik, F. Arbab, A. Lazovik.

Mar 20, 2019

Abstract

Modern data analysis platforms all too often assume that the application and its underlying data flow are static. That is, such platforms generally cannot update individual components of a running pipeline without restarting it, and they rely on data sources remaining unchanged while in use. In reality, these assumptions do not hold: data scientists come up with new methods to analyze data all the time, and data sources are almost by definition dynamic. Companies performing data science analyses must either accept that their pipeline goes down during an update, or run a duplicate setup of their often costly infrastructure to keep the pipeline operating.

In this research we present the Evolutionary Changes in Data Analysis (ECiDA) platform, with which we show how evolution and data science can go hand in hand. ECiDA aims to bridge the gap between engineers who build large-scale computation platforms on the one hand, and data scientists who perform analyses on large quantities of data on the other, while making change a first-class citizen. ECiDA allows data scientists to build their data science pipelines on scalable infrastructures and to change them while they remain up and running. Such changes can range from parameter changes in individual pipeline components to general changes in network topology. Changes may also be initiated by an ECiDA pipeline itself as part of a diagnostic response: for instance, it may dynamically replace a data source that has become unavailable with one that is available. To make sure the platform remains in a consistent state while performing these updates, ECiDA uses a set of automatic formal verification methods, such as constraint programming and AI planning, to transparently check the validity of updates and prevent undesired behavior.
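To make the idea of checking an update before applying it concrete, the following is a minimal sketch and not ECiDA's implementation. It stands in for the constraint programming and AI planning mentioned above with two hand-written checks (matching data types on every edge and an acyclic topology). All names here (`Component`, `Pipeline`, `violations`, `replace_component`, `try_update`) and the example components are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    name: str
    in_type: Optional[str]   # None for data sources
    out_type: Optional[str]  # None for sinks


@dataclass(frozen=True)
class Pipeline:
    components: dict   # name -> Component
    edges: frozenset   # set of (upstream name, downstream name)


def has_cycle(p):
    """Kahn's algorithm: a cycle exists if not every node can be removed."""
    nodes = set(p.components) | {n for edge in p.edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, down in p.edges:
        indegree[down] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for up, down in p.edges:
            if up == node:
                indegree[down] -= 1
                if indegree[down] == 0:
                    ready.append(down)
    return removed != len(nodes)


def violations(p):
    """Return a list of human-readable constraint violations (empty if valid)."""
    problems = []
    for up, down in sorted(p.edges):
        if up not in p.components or down not in p.components:
            problems.append(f"edge {up}->{down} references a missing component")
        elif p.components[up].out_type != p.components[down].in_type:
            problems.append(f"type mismatch on {up}->{down}")
    if has_cycle(p):
        problems.append("topology contains a cycle")
    return problems


def replace_component(p, old, new):
    """Build a candidate pipeline in which `old` is swapped for `new`."""
    components = {k: v for k, v in p.components.items() if k != old}
    components[new.name] = new
    edges = frozenset(
        (new.name if up == old else up, new.name if down == old else down)
        for up, down in p.edges
    )
    return Pipeline(components, edges)


def try_update(current, candidate):
    """Switch to the candidate only if it is valid; otherwise keep running as-is."""
    problems = violations(candidate)
    return (current, problems) if problems else (candidate, [])


pipeline = Pipeline(
    components={
        "sensor_a": Component("sensor_a", None, "reading"),
        "clean": Component("clean", "reading", "clean_reading"),
        "sink": Component("sink", "clean_reading", None),
    },
    edges=frozenset({("sensor_a", "clean"), ("clean", "sink")}),
)

# Swap an unavailable source for a compatible one: accepted.
backup = Component("sensor_b", None, "reading")
pipeline_after, problems = try_update(
    pipeline, replace_component(pipeline, "sensor_a", backup)
)
```

In this sketch a rejected candidate leaves the running pipeline untouched and returns the reasons for rejection, which mirrors the goal of keeping the platform in a consistent state during updates. A real verifier would cover far richer constraints than these two checks.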

In earlier work, we showed that an initial implementation of ECiDA on top of the Apache Spark ecosystem performed well and introduced an acceptable amount of overhead to the data pipeline [@Lazovik2016; @Albers2018]. The platform is built in collaboration with a large Dutch water company and is developed with their use cases in mind. ECiDA will, for example, be used to (i) improve water distribution monitoring and automation, (ii) enable the prediction of water quality, and (iii) determine the structural reliability of pipes in order to perform predictive maintenance. These use cases emphasize different aspects and a variety of issues that might arise in a practical setting, and ensure that ECiDA is built as a generic data science solution, one that should therefore be applicable to any data science project.