In my previous blog post I discussed several key takeaways from the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017), including time series analysis, survival analysis, and massive online analytics. In this post, I will take a deeper dive into survival analysis. I’ll go into detail on what survival analysis is, how it originated, the components involved, different methods that can be utilized, real-world examples, and open source libraries available.
What is Survival Analysis?
Survival analysis, also known as time-to-event analysis, is a technique that takes a series of observations and attempts to estimate when a particular event of interest will occur. Survival analysis originated in the medical field, where it was used to determine how long a patient would survive. However, it has many practical applications in other fields such as marketing, engineering, education, and finance.
Survival analysis at its core is made of five components:
- Birth event
- Death event
- Time scale
- Censorship
- Survival function
Birth/Death Event: For every subject analyzed, there is a birth event and a death event. The birth event marks the start of the period being measured, while the death event marks its end.
Time Scale: The time scale is simply the unit in which the time between the birth and death events is measured. This could be seconds, minutes, days, weeks, months, years, etc.
Censorship: In survival analysis, events can be censored or uncensored. Uncensored events are ones where both the birth and death events are observed. Censored events are therefore ones where either the birth or the death event is not observed. Left-censored events occur when the birth event is not observed, and right-censored events occur when the death event is not observed. The image below provides a helpful depiction of left- and right-censored events.
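To make the idea concrete, here is a minimal sketch of how a right-censored observation might be encoded as a duration plus an "observed" flag, which is the shape most survival tooling expects. The `Observation` class, the `to_observation` helper, and the `study_end` parameter are hypothetical names introduced for illustration, and the sketch covers right censoring only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    duration: float  # time from the birth event to death, or to the end of observation
    observed: bool   # True if the death event was seen, False if right-censored


def to_observation(birth: float, death: Optional[float], study_end: float) -> Observation:
    """Encode one subject, assuming observation stops at study_end (hypothetical cutoff)."""
    if death is not None and death <= study_end:
        # Death happened inside the observation window: uncensored.
        return Observation(duration=death - birth, observed=True)
    # Death not seen before the cutoff: we only know the subject survived this long.
    return Observation(duration=study_end - birth, observed=False)


# Subject who dies at time 7 versus one still alive when the study ends at time 10.
uncensored = to_observation(birth=2, death=7, study_end=10)
censored = to_observation(birth=4, death=None, study_end=10)
```

A censored duration is a lower bound on the true survival time, not the survival time itself, which is why the flag has to travel with it.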
Illustration of Survival Data (Michal Pesta)
Survival Function: The survival function gives the probability that the event of interest occurs after some specified time. In other words, it relates time to the probability of surviving beyond a given time point. There are many different estimators that can be utilized to compute this probability. Two popular estimators are the Kaplan-Meier estimator and the Nelson-Aalen estimator (the latter estimates the cumulative hazard, from which a survival estimate can be derived). I’m not going to go into detail on these estimators here, but if you’re interested in learning more about them this UCSD lecture is a good resource.
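For readers who want the notation, the survival function and the two estimators named above are commonly written as follows. The symbols are not from the original post and are assumptions of this sketch: T is the random time of the death event, t_i are the distinct observed death times, d_i is the number of deaths at t_i, and n_i is the number of subjects still at risk just before t_i.

```latex
\begin{align*}
S(t) &= P(T > t) \\
\hat{S}_{\mathrm{KM}}(t) &= \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \\
\hat{H}_{\mathrm{NA}}(t) &= \sum_{t_i \le t} \frac{d_i}{n_i},
\qquad \hat{S}(t) \approx \exp\!\left(-\hat{H}_{\mathrm{NA}}(t)\right)
\end{align*}
```

Censored subjects contribute to the at-risk counts n_i for as long as they are observed, but never to the death counts d_i.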
You may wonder why complex estimators are required for a survival function. If we observed the survival time (both the birth and death events) of all subjects, we could simply estimate the probability of surviving beyond time t as the ratio of the number of patients surviving beyond t to the total number of patients. Censorship is why we cannot. In the presence of censoring, this simple estimator cannot be used because the numerator is not always defined: for a subject censored before time t, we do not know whether they survived beyond t.
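The difference can be seen on a small made-up data set. The sketch below is a plain-Python version of the standard Kaplan-Meier product-limit formula next to a naive ratio that simply discards censored subjects; the function names and the toy durations are hypothetical, and this is an illustration rather than a replacement for a tested library.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct death time using the product-limit formula."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    surv = 1.0
    curve = []
    for t in event_times:
        # Everyone whose duration reaches t is still at risk, censored or not.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, observed) if e and d == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def survival_at(curve, t):
    """Evaluate the step function: last estimate at or before t, or 1.0 before any death."""
    value = 1.0
    for time, surv in curve:
        if time <= t:
            value = surv
    return value


def naive_survival(durations, observed, t):
    """Naive ratio that throws away censored subjects (shown only for comparison)."""
    complete = [d for d, e in zip(durations, observed) if e]
    return sum(1 for d in complete if d > t) / len(complete)


# Toy data (made up): six subjects, two of them right-censored (observed=False).
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, False, True]

curve = kaplan_meier(durations, observed)
km_at_4 = survival_at(curve, 4)
naive_at_4 = naive_survival(durations, observed, 4)
```

On this toy data the Kaplan-Meier estimate of surviving beyond time 4 is about 0.67, while the naive ratio gives 0.5, because it ignores the fact that the censored subjects were known to be alive for part of the study.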
There are quite a few different methods for survival analysis, including statistical methods as well as machine learning methods. Some examples of statistical methods for survival analysis include Kaplan-Meier, Cox regression, and linear regression. Examples of machine learning methods for survival analysis include survival trees, Bayesian methods, neural networks, and support vector machines. Each of these methods has its own specific use cases; however, one of the most commonly used models for survival analysis is a form of Cox regression known as the Cox Proportional Hazards model. In addition to the survival function, the Cox Proportional Hazards model works with a hazard function, which gives the event rate at time t conditional on survival until time t or later.
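In symbols, the hazard function and the usual form of the Cox Proportional Hazards model look like this. The notation is an assumption of this sketch rather than something taken from the post: T is the time of the death event, x is a subject's vector of covariates, β is the vector of fitted coefficients, and h_0(t) is the baseline hazard shared by all subjects.

```latex
\begin{align*}
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \\
h(t \mid x) &= h_0(t)\, \exp\!\left(\beta^{\top} x\right) \\
\frac{h(t \mid x_1)}{h(t \mid x_0)} &= \exp\!\left(\beta^{\top} (x_1 - x_0)\right)
\end{align*}
```

The last line is where the "proportional hazards" name comes from: the ratio of hazards for two subjects does not depend on t, so covariates scale the baseline hazard up or down by a constant factor.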
Since survival analysis originated in the clinical field, it is only fitting to provide a real-world example of survival analysis applied to this field. Let’s look at a clinical research study on cancer where there are two treatment options, a placebo and a developmental drug. Data from a study like this should include patient information, when the patient entered the study, the treatment type, and the survival time of the patient. Using this data, survival analysis could potentially predict the time to death. Survival analysis of this study may also indicate a difference in survival times between the treatment subgroups.
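One common way to check for such a difference between two groups is the log-rank test. The post does not name this test, so treat the sketch below as one possible approach under that assumption. It is a plain-Python version of the standard two-group statistic, the function name is hypothetical, and the durations are made up rather than taken from any real trial.

```python
import math


def logrank_test(durations_a, observed_a, durations_b, observed_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value) with 1 degree of freedom."""
    event_times = sorted(
        {t for t, e in zip(durations_a, observed_a) if e}
        | {t for t, e in zip(durations_b, observed_b) if e}
    )
    obs_minus_exp = 0.0  # observed minus expected deaths in group A, summed over death times
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in A just before t
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in B just before t
        d_a = sum(1 for d, e in zip(durations_a, observed_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, observed_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the death count in group A at time t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0  # no information to separate the groups
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up survival times in months; False marks a right-censored patient.
placebo_t, placebo_e = [3, 5, 6, 8, 9], [True, True, True, True, False]
drug_t, drug_e = [7, 10, 12, 14, 15], [True, False, True, True, False]

chi2, p_value = logrank_test(placebo_t, placebo_e, drug_t, drug_e)
```

A small p-value suggests the two survival curves differ, but with only a handful of patients per arm, as in this toy data, the result should not be read as more than an illustration of the mechanics.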
On a less grim note, survival analysis can also be applied to predict the survival time of industrial machinery. When survival analysis is applied to the engineering field, it is often called reliability analysis instead. The reliability analysis example here is based on my graduate capstone project to predict the remaining useful life of a turbofan engine used in large aircraft. Data for this example was provided by NASA and includes hundreds of sensor readings throughout a variety of simulated normal and failure runs of the turbofan engine. The goal is to use reliability analysis to accurately predict how much longer an engine is expected to function properly and, as a secondary goal, to know when it needs to undergo maintenance.
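A typical first step for this kind of problem is to turn run-to-failure records into a remaining-useful-life target. The sketch below is not the capstone project's code. It assumes a hypothetical layout in which each record is a (unit_id, cycle) pair and each engine's last recorded cycle is the point of failure. Neither assumption comes from the post.

```python
from collections import defaultdict


def add_remaining_useful_life(records):
    """Given (unit_id, cycle) pairs, return (unit_id, cycle, rul) triples.

    Assumes every unit was run until failure, so its highest cycle is the failure point.
    """
    last_cycle = defaultdict(int)
    for unit, cycle in records:
        last_cycle[unit] = max(last_cycle[unit], cycle)
    # Remaining useful life is the number of cycles left before the unit's last cycle.
    return [(unit, cycle, last_cycle[unit] - cycle) for unit, cycle in records]


# Hypothetical records: engine 1 fails after 3 cycles, engine 2 after 2 cycles.
records = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
labelled = add_remaining_useful_life(records)
```

If some engines were still running when data collection stopped, their last cycle would be a right-censored observation rather than a failure, and this simple labelling would understate their remaining life.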
Open Source Projects
Have a problem that survival analysis could solve? Let us know and get the conversation started!