Two weeks ago I was in Halifax, Nova Scotia, attending the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017). KDD is one of the largest and most respected conferences in the data science community. It offers a wealth of knowledge on the latest research in data science, data mining, big data, and predictive analytics, and it brings researchers and professionals together to learn about and discuss novel ideas and technologies for solving challenging problems. My areas of interest were time series analysis, IoT streaming, and big data, so here are a few of the interesting things I learned in those areas.
Time Series Analysis
One of my favorite workshops was titled "Mining and Learning from Time Series." It explained the challenges that modern time series pose for traditional time series analysis techniques: series that are multivariate, high-dimensional, heterogeneous, spatiotemporal, or sparsely or irregularly sampled. With those challenges laid out, the workshop moved on to the next generation of temporal mining algorithms.
One of these next-generation algorithms that I found particularly interesting was the Matrix Profile. A Matrix Profile is a data structure that annotates a time series and can be used for a wide variety of problems, including motif discovery, anomaly detection, rule discovery, segmentation, and more. The key claim made during this presentation was that, given the Matrix Profile, most time series data mining problems become trivial or easy.
There is a lot to learn about Matrix Profiles that I won’t go into here, but at a very high level, the ith value of the Matrix Profile records the distance from the subsequence starting at the ith location of the time series to its nearest neighbor elsewhere in the series, measured by z-normalized Euclidean distance. Below is a visualization of a simple synthesized time series (in red) and its Matrix Profile depicted as a companion time series (in blue).
Example Time Series and Corresponding Matrix Profile (Abdullah & Keogh, 2017)
In this example, both the pattern in the time series and the equivalent pattern in the Matrix Profile are easy to recognize. In more complex time series, however, the Matrix Profile can identify patterns and sequences that cannot be spotted by visual inspection alone. If you are interested in learning more about Matrix Profiles, I highly recommend this tutorial PowerPoint.
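To make the definition above concrete, here is a minimal brute-force sketch in Python with NumPy. It is not the algorithm presented in the tutorial, which is far faster; the function names, the window length `m`, and the half-window exclusion zone used to skip trivial self-matches are all my own assumptions for illustration.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    std = x.std()
    if std == 0:
        return np.zeros_like(x, dtype=float)
    return (x - x.mean()) / std

def naive_matrix_profile(ts, m):
    """Brute-force matrix profile of `ts` for subsequence length `m`.

    Returns (profile, index): profile[i] is the z-normalized Euclidean
    distance from subsequence i to its nearest neighbor, and index[i] is
    where that neighbor starts.
    """
    ts = np.asarray(ts, dtype=float)
    n_sub = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n_sub)])
    excl = max(1, m // 2)  # assumed exclusion zone around i
    profile = np.full(n_sub, np.inf)
    index = np.full(n_sub, -1)
    for i in range(n_sub):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        lo, hi = max(0, i - excl), min(n_sub, i + excl + 1)
        dists[lo:hi] = np.inf  # ignore overlapping, trivial matches
        j = int(np.argmin(dists))
        profile[i], index[i] = dists[j], j
    return profile, index

# Synthetic example: a sine wave with period 40 and a bump injected at 200..209.
t = np.arange(400)
ts = np.sin(2 * np.pi * t / 40)
ts[200:210] += 2.0
profile, index = naive_matrix_profile(ts, m=20)
print("most anomalous subsequence starts at", int(np.argmax(profile)))
```

Low values in the profile mark repeated patterns (motifs), and high values mark subsequences with no close match anywhere else (anomalies). In this toy series the peak lands on a window that overlaps the injected bump.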
Related to time series analysis, I also attended a Machine Learning for Survival Analysis workshop. Survival analysis, also known as time-to-event analysis, takes a series of observations and attempts to estimate when a particular event of interest will occur. While survival analysis originated in the medical field to determine how long a patient would survive, it has many practical applications in other fields such as marketing, engineering, education, and finance.
Survival analysis at its core is made of six components:
- Birth event - the starting point of the period being measured
- Death event - the event of interest that ends that period
- Time scale - seconds, minutes, days, weeks, years, etc.
- Right censored - the “death event” is not observed
- Left censored - the “birth event” is not observed
- Survival function - function that gives the probability that the death event occurs after some specified time
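The components above can be put together in a small sketch. The Python below estimates the survival function from durations (birth to death, or birth to censoring) and right-censoring flags using the Kaplan-Meier product-limit formula. The function name and the sample data are hypothetical, and the time scale is assumed to be weeks.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time with at least one observed death event.

    durations[i]: time from subject i's birth event to its death event,
                  or to the moment it was censored.
    observed[i]:  True if the death event was seen, False if right censored.
    """
    deaths = Counter(t for t, d in zip(durations, observed) if d)
    exits = Counter(durations)  # deaths plus censorings leaving at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk subjects that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical data: five subjects, two of them right censored.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
for t, s in kaplan_meier(durations, observed):
    print(f"P(survive past week {t}) is about {s:.2f}")
```

Censored subjects never trigger a drop in the curve, but they still count toward the at-risk total until they leave, which is how the estimator uses partial information instead of discarding it.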
There are quite a few different methods for survival analysis, both statistical and machine learning based. Statistical methods include Kaplan-Meier, Cox regression, and linear regression. Machine learning methods include survival trees, Bayesian methods, neural networks, and support vector machines. Each of these methods has its own use cases; however, one of the most commonly used models for survival analysis is a form of Cox regression known as the Cox Proportional Hazards model. In addition to the survival function, the Cox Proportional Hazards model works with a hazard function, which gives the event rate at time t conditional on survival until time t or later.
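For readers who prefer formulas, the hazard function and the Cox model are usually written as below. The notation is the standard textbook one rather than anything taken from the workshop slides: T is the time of the death event, x is a subject's feature vector, β is the coefficient vector, and h_0 is the baseline hazard.

```latex
% Hazard: instantaneous event rate at time t, given survival up to t
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}

% Cox Proportional Hazards: features scale a shared baseline hazard
h(t \mid x) = h_0(t) \, \exp(\beta^{\top} x)

% Survival function implied by the hazard
S(t \mid x) = \exp\!\left( -\int_0^t h(u \mid x) \, du \right) = S_0(t)^{\exp(\beta^{\top} x)}
```

The "proportional" part of the name comes from the second line: the ratio of the hazards of two subjects depends only on their features, not on t.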
Massive Online Analysis
Moving on from time series and into big data and streaming, I learned about a tool for data stream mining called MOA (Massive Online Analysis). MOA is related to the WEKA project and performs big data stream mining in real time as well as large-scale machine learning on streams. It offers a wide assortment of machine learning algorithms for tasks such as classification, regression, clustering, and outlier detection. There is also a distributed stream mining alternative related to MOA, called Apache SAMOA, that can be used with distributed stream processing engines such as Apache Flink or Apache Storm.
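To give a feel for what learning on a stream means, here is a small Python sketch of the test-then-train loop: each instance is seen once, used first to test the model and then to update it. This is not MOA's API (MOA is a Java framework). The class, the method names, and the synthetic stream are all hypothetical, and a simple perceptron stands in for the stream classifiers a real tool would provide.

```python
import random

class OnlinePerceptron:
    """Tiny incremental classifier: constant memory, one update per instance."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def learn_one(self, x, y):
        error = y - self.predict(x)  # 0 if correct, +1 or -1 on a mistake
        if error:
            self.w = [wi + error * xi for wi, xi in zip(self.w, x)]
            self.b += error

def synthetic_stream(n, seed=0):
    """Yield n labeled points; the label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    produced = 0
    while produced < n:
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        s = x[0] + x[1]
        if abs(s) < 0.2:
            continue  # keep a margin so the toy problem is cleanly separable
        produced += 1
        yield x, (1 if s > 0 else 0)

def prequential_accuracy(model, stream):
    """Test on each instance first, then train on it, in a single pass."""
    correct = total = 0
    for x, y in stream:
        correct += model.predict(x) == y
        model.learn_one(x, y)
        total += 1
    return correct / total

accuracy = prequential_accuracy(OnlinePerceptron(2), synthetic_stream(2000))
print(f"prequential accuracy: {accuracy:.3f}")
```

Nothing is stored beyond the model's weights, so memory use stays constant no matter how long the stream runs. That constraint is what separates stream mining from batch learning.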
The KDD 2017 Conference was a great experience and provided a wealth of knowledge in so many areas of data science. It offered cutting-edge practical solutions to problems commonly faced by companies and individuals seeking to integrate data science and analytics into their processes. What are your thoughts on some of the topics discussed?