Can we train AI to predict who survives the Titanic disaster? Kaggle, an online data science community, seems to think so. Many have taken up the challenge, but very few (less than 5%) manage to achieve accuracies higher than 80%. Can we do better?
Spoiler alert: Yes we can, and I’m going to use this opportunity to show you that expert analysis and deep understanding of your data at a fundamental level is absolutely crucial to building successful ML models.
To summarise the challenge, we are given the morbid task of predicting how likely a person was to survive the sinking of the Titanic, given a set of passenger data:
- Ticket class (1st – 3rd)
- Name (e.g. “Harper, Mr. Henry Sleeper”)
- Sex
- Age
- Number of siblings or spouses aboard
- Number of parents or children aboard
- Ticket number
- Fare
- Cabin number
- Embarkation point: Southampton, Cherbourg, or Queenstown
Models are created using a dataset that contains the “Survived?” column. The models are then tested by uploading them to Kaggle, which gives us a final accuracy score on an ‘unseen’ data set.
Kaggle provides a benchmark submission for us to start with, where they have simply predicted that all the males in the dataset die, and all the females survive. This ‘dumb’ model gives us a baseline accuracy to aim for of 76.6%. Remember that it was “women and children first” to the lifeboats (and floating doors).
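The benchmark rule is simple enough to write in a couple of lines. A minimal sketch, assuming the standard Kaggle column name `Sex`:

```python
import pandas as pd

def gender_baseline(passengers: pd.DataFrame) -> pd.Series:
    """Kaggle's benchmark rule: females survive (1), males do not (0)."""
    return (passengers["Sex"] == "female").astype(int)

# Tiny illustrative sample (not real leaderboard data):
sample = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
print(gender_baseline(sample).tolist())  # [0, 1, 1, 0]
```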
Now it is our turn to try something a bit more intelligent.
“THROW ALL DATA INTO MACHINE LEARNING MODEL” APPROACH
Let’s try creating a model in the wrong way, and blindly throw our data at an ML model without any considerations. We won’t do a review of the data and we’ll only do some basic data engineering to deal with missing values, feed the data into a model, cross our fingers, and hope for the best. Think of this as the “2 hours before the deadline and it’s 10pm” model.
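The “blind” approach might look something like this sketch. I'm assuming the standard Kaggle column names (`Pclass`, `Age`, `SibSp`, `Parch`, `Fare`, `Survived`); `RandomForestClassifier` is an illustrative stand-in, since the post doesn't depend on any particular architecture:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def blind_model(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """The 10pm-before-the-deadline model: keep only the numeric columns,
    fill gaps with medians, fit, and hope for the best."""
    features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
    X = train[features].fillna(train[features].median())
    X_test = test[features].fillna(train[features].median())
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, train["Survived"])
    return pd.Series(model.predict(X_test), index=test.index)
```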
We train a simple ML model and predict the survival outcomes on the unseen dataset. This achieves a final accuracy of…
A significant improvement on… hold up… that’s much worse than the 76.6% achieved by the dumb model we started with. I think this illustrates that blindly throwing data into a model does not yield good results.
Let’s be smarter about this… We need some ‘expert insight’ to understand the context and meaning behind the data.
“DEEP UNDERSTANDING OF DATA” APPROACH
We should think about each of the input tags and do some feature engineering. Often, the best first step when creating an ML model is to have a play with the data – plot it out in graphs, have a look at some samples of raw data, use some basic statistical analysis.
Ticket class: This is a proxy for social class. Higher class people were on the upper decks, rather than shovelling coal, and were more likely to survive.
Name: This can contain useful information like title and surname, but we need to extract that information first.
Title: Titles tell us if a person is considered an adult or a child. This is especially useful because age is missing for 20% of the data. Non-standard titles are converted to Mr/Mrs/Master/Miss by using the age and sex tags.
Surname: Often, if someone survives, their entire family will have survived. We can link these people by extracting surnames. Using this feature will create a horrendously overfit model (that is, only suited to this specific data), but are we expecting this model to generalise to the sinking of other early 20th century passenger ships? I sure hope not.
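The name parsing described above (title plus surname) might be sketched as follows. The age cut-off of 14 is my own illustrative choice, not something specified in the challenge:

```python
def parse_name(name: str) -> tuple[str, str]:
    """Split 'Surname, Title. Given names' into (surname, title)."""
    surname, rest = name.split(",", 1)
    title = rest.strip().split(".", 1)[0]
    return surname.strip(), title.strip()

def normalise_title(title: str, sex: str, age: float) -> str:
    """Collapse non-standard titles (Dr, Rev, Col, ...) into Mr / Mrs /
    Master / Miss using the sex and age tags. The child/adult threshold
    of 14 is an illustrative assumption."""
    if title in {"Mr", "Mrs", "Master", "Miss"}:
        return title
    if sex == "male":
        return "Master" if age < 14 else "Mr"
    return "Miss" if age < 14 else "Mrs"

print(parse_name("Harper, Mr. Henry Sleeper"))  # ('Harper', 'Mr')
print(normalise_title("Dr", "male", 44))        # 'Mr'
```

Note that a missing age (`NaN`) fails the `age < 14` comparison, so untitled passengers with unknown ages default to adults.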
Sex: Made redundant by the title feature.
Age: We could try using a separate ML model to ‘fill in’ the missing ages, but it’s often difficult enough to tune one model, let alone two, and we want to be practical. We will ignore this data in favour of title.
Number of siblings / spouses: Useful to know the size of the group the passenger was with.
Number of parents / children: Another proxy for group size.
Ticket number: The usefulness of this tag is hidden behind the assumption that ticket number is unique to each person. Not so! Ticket numbers are shared by groups who (likely) purchased them at the same time. This can be seen in the data for the Allison family and their companion Miss Cleaver (their nanny?):
| Name | Age | Siblings / Spouses | Parents / Children |
| --- | --- | --- | --- |
| Allison, Mr. Hudson Joshua Creighton | 30 | 1 | 2 |
| Allison, Miss. Helen Loraine | 2 | 1 | 2 |
| Allison, Master. Hudson Trevor | 0.92 | 1 | 2 |
| Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | 25 | 1 | 2 |
| Cleaver, Miss. Alice | 22 | 0 | 0 |
Ticket group size: We can add a ‘group size’ feature based on the number of passengers sharing each ticket number. This ensures that passengers like Miss Cleaver are not considered ‘lone’ passengers (who are more likely to die) just because their numbers of siblings/spouses/parents/children are zero.
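A minimal sketch of the group-size feature, assuming the standard Kaggle `Ticket` column (`TicketGroupSize` is a name I've made up for illustration):

```python
import pandas as pd

def add_ticket_group_size(df: pd.DataFrame) -> pd.DataFrame:
    """Count how many passengers share each ticket number."""
    df = df.copy()
    df["TicketGroupSize"] = df.groupby("Ticket")["Ticket"].transform("count")
    return df

tickets = pd.DataFrame({
    "Name": ["Allison, Mrs. H", "Allison, Mr. H", "Cleaver, Miss. Alice", "Solo, Mr. X"],
    "Ticket": ["113781", "113781", "113781", "370376"],
})
print(add_ticket_group_size(tickets)["TicketGroupSize"].tolist())  # [3, 3, 3, 1]
```

Miss Cleaver now gets a group size of 3 despite having zero relatives aboard.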
Fare: Fare seems like an objective measure of social class. However, it has a hidden secret designed to trip up data scientists. Let’s examine a large family travelling in third class; the Anderssons. They all have the same ticket number and fare:
| Name | Ticket number | Fare |
| --- | --- | --- |
| Andersson, Master. Sigvard Harald Elias | 347082 | 31.275 |
| Andersson, Miss. Ebba Iris Alfrida | 347082 | 31.275 |
| Andersson, Miss. Ellis Anna Maria | 347082 | 31.275 |
| Andersson, Miss. Ingeborg Constanzia | 347082 | 31.275 |
| Andersson, Miss. Sigrid Elisabeth | 347082 | 31.275 |
| Andersson, Mr. Anders Johan | 347082 | 31.275 |
| Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren) | 347082 | 31.275 |
However, this fare is much more than many first- and second-class passengers paid! Did the poor Anderssons get swindled? No! The fare is actually the total paid for all passengers sharing the same ticket number. If we divide the Anderssons’ fare by the size of their ticket group (seven), the fare becomes 4.468 per person, much more in line with other third-class passengers.
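The fare adjustment can be sketched the same way, assuming the standard `Ticket` and `Fare` columns (347082 is the Anderssons’ ticket number in the dataset; `AdjustedFare` is an illustrative name):

```python
import pandas as pd

def add_adjusted_fare(df: pd.DataFrame) -> pd.DataFrame:
    """The listed fare is the total for the whole ticket, so divide it
    by the number of passengers sharing that ticket number."""
    df = df.copy()
    group_size = df.groupby("Ticket")["Ticket"].transform("count")
    df["AdjustedFare"] = df["Fare"] / group_size
    return df

anderssons = pd.DataFrame({"Ticket": ["347082"] * 7, "Fare": [31.275] * 7})
print(add_adjusted_fare(anderssons)["AdjustedFare"].round(3).tolist())
# [4.468, 4.468, 4.468, 4.468, 4.468, 4.468, 4.468]
```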
Deck letter: The cabin number gives an indication of a passenger’s location in the ship. The exact cabin numbers are likely not very useful, but we can extract the deck letter A-G. People on the upper decks are more likely to have survived.
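Extracting the deck letter is nearly a one-liner, assuming the standard `Cabin` column (the ‘U’ for unknown is an illustrative placeholder for the many missing cabin values):

```python
import pandas as pd

def add_deck(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the leading deck letter from the cabin number;
    missing cabins become 'U' (unknown)."""
    df = df.copy()
    df["Deck"] = df["Cabin"].str[0].fillna("U")
    return df

cabins = pd.DataFrame({"Cabin": ["C22 C26", "B78", None, "G6"]})
print(add_deck(cabins)["Deck"].tolist())  # ['C', 'B', 'U', 'G']
```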
Embarkation point: May provide more evidence of social class, or location in the ship.
To summarise, we have the following inputs:
- Ticket class
- Title (new)
- Surname (new)
- # siblings / spouses
- # parents / children
- Ticket group size (new)
- Adjusted fare (new)
- Deck letter A-G (new)
- Embarkation point
Now let’s use these features to train the ML model. We will use the same ML model architecture as in the previous run.
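Putting the feature list together, the training step might look like this sketch. Again I'm assuming the standard Kaggle column names; `RandomForestClassifier` stands in for the unspecified architecture, the derived column names are my own, the title extraction here skips the rare-title normalisation for brevity, and the integer category encoding is a quick illustrative choice:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Pclass", "Title", "Surname", "SibSp", "Parch",
            "TicketGroupSize", "AdjustedFare", "Deck", "Embarked"]

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the engineered columns from the raw Kaggle columns."""
    out = df.copy()
    out["Surname"] = out["Name"].str.split(",").str[0]
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
    out["TicketGroupSize"] = out.groupby("Ticket")["Ticket"].transform("count")
    out["AdjustedFare"] = out["Fare"] / out["TicketGroupSize"]
    out["Deck"] = out["Cabin"].str[0].fillna("U")
    return out

def train_model(train_df: pd.DataFrame) -> RandomForestClassifier:
    X = build_features(train_df)[FEATURES].copy()
    # Encode categorical columns as integer codes (illustrative choice).
    for col in X.select_dtypes("object"):
        X[col] = X[col].astype("category").cat.codes
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X.fillna(-1), train_df["Survived"])
    return model
```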
This achieves a final accuracy of…
An improvement of nearly 10% on the previous model, and 6.2% on the benchmark! This places us in the top 1.5% of models on the Kaggle leaderboard (at the time of writing), many of which use much more complex ML architectures. What a relief! I am allowed to keep my job as an ML engineer.
WHAT DOES THIS HAVE TO DO WITH AI PREDICTING WIND TURBINE FAILURE?
What I’ve demonstrated here is that having expert knowledge of your data can dramatically improve results when using ML models. You can’t expect an ML model to figure everything out for itself!
When it comes to creating predictive maintenance models, we often receive dozens or even hundreds of SCADA tags to use. We can use algorithms that sift through these tags to try and find correlations and relevant information, but this is no match for expert knowledge that can inform us of the physical processes behind the data streams.
At ONYX, we have over a decade of experience in monitoring wind turbines and other rotating machinery. We can quickly pinpoint pertinent data streams, fine-tune models, and get immediate feedback on faults from in-house experts. This expertise makes our models quicker to build and, more importantly, more likely to accurately identify faults.
My next blog post will show an example of a fault detected on a real wind turbine using ML. Much like the Titanic challenge, there are icebergs waiting to catch us out!
Jonathan Hodson – Machine Learning Engineer