One of the main Machine Learning applications in healthcare is the identification and diagnosis of diseases that are considered hard to diagnose. These range from cancers, which are tough to diagnose at the initial stage, to many genetic diseases.
Problems to solve
- How to predict if a patient is likely to have breast cancer?
- How to detect a risk from characteristics of breast mass cell nuclei?
- How to help doctors make more effective diagnoses?
- Can machine learning help in these matters, and how accurate can predictive models be at detecting breast cancer?
Benefits of TADA
Doctors and medical staff could use predictive models to help them in their diagnosis. However, they are not data scientists and they may not have the required skills in machine learning or the coding experience to build models. Moreover, most data handled by these professionals are Small Data, meaning that their historical data often contains a limited number of patients and surely not hundreds of thousands (Big Data). Traditional machine learning tools work well with Big Data but do not perform well with Small Data.
MyDataModels allows domain experts, in this case doctors and researchers, to automatically build predictive models from the data they have collected. No training is required, and they can use their data directly without normalizing it or handling outliers. No feature engineering is required. Thanks to this limited data preparation, we obtained results from this specific dataset in a few clicks and in less than a minute, on a regular laptop.
MyDataModels brings a self-service solution for those who have Small Data and no data scientists.
Conclusion
Breast cancer is the most common cancer among women, accounting for 25% of all cancer cases worldwide. It affected 2.1 million people in 2015. Early diagnosis significantly increases the chances of survival. However, research indicates that most experienced physicians can diagnose cancer with 79% accuracy, while 91% correct diagnosis is achieved using machine learning techniques.
In this breast cancer prediction use case, the results obtained from MyDataModels’ predictive models are satisfying with a 97% accuracy rate.
The medical world could make more use of machine learning to detect diseases in general, and breast cancer in particular. This would allow doctors, who are not data experts, to spend less time on data analysis and more time on providing the right treatment to their patients, faster.
Case study
Solution
Automated Machine Learning solutions use historical data to predict future outcomes.
To predict a future result, you must bring your descriptive data and the past results obtained.
TADA allows you to simply create a relevant predictive model from your data and apply it to future data.
In this case, the descriptive data comes from a digitized image of a fine needle aspirate of a breast mass.
The goal is to predict whether a tumor is malignant (M) or benign (B).
To generate a model, the steps are the following:
- Create your project and load your data as a CSV table (with data in rows and variables in columns).
- Select the variable you want to predict, called Goal.
In this case, the Goal is the variable "Diagnosis" (a visualization of the variable is provided).
- Select your data for the model generation. This step is called "Creating the Variable set" and allows you to manually select the descriptive variables you want to use. By default, they are all selected.
TADA identifies the relevant descriptive variables by itself, which affects the calculation time required to create the model.
The fewer variables selected, the faster the model creation.
- Create your model.
At creation, default values are proposed to you: Name of models, Population, Iteration. You only need to validate the default values to start model generation. Best practices are at your disposal to guide you in the choice of these parameters. Depending on the size of the descriptive data file, this step can take between a few seconds and ten minutes. Once the model is created, you can see the results of the model using metrics and charts so you can judge its relevance.
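The data preparation described in these steps can be sketched in a few lines of Python. This is only an illustration of the idea (a CSV table with one patient per row, a Goal column, and a variable set), not TADA's own code. The column names `radius_mean` and `texture_mean`, the helper functions, and the three sample rows are hypothetical and made up for the example.

```python
import csv
import io

# Toy CSV standing in for the real file. The values and the feature
# column names are made up for illustration only.
SAMPLE_CSV = """ID,radius_mean,texture_mean,Diagnosis
1,17.9,10.4,M
2,13.5,14.4,B
3,12.4,15.7,B
"""


def load_table(text):
    """Parse CSV text into a list of row dictionaries (one per patient)."""
    return list(csv.DictReader(io.StringIO(text)))


def build_variable_set(rows, goal, excluded=()):
    """Return the descriptive variables: every column except the Goal.

    By default nothing else is excluded, mirroring the "all selected" default.
    """
    return [name for name in rows[0] if name != goal and name not in excluded]


rows = load_table(SAMPLE_CSV)
goal = "Diagnosis"
# Dropping the ID column is a choice made for this sketch, not a TADA rule.
variables = build_variable_set(rows, goal, excluded=("ID",))
targets = [row[goal] for row in rows]

print("Variable set:", variables)
print("Goal values:", targets)
```

Keeping the Goal separate from the variable set is what lets a model be trained on past diagnoses and then applied to new rows where the diagnosis is unknown.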
Note:
To apply a model that you think is relevant, you can:
- Retrieve the associated mathematical formula and apply it (for example in Excel).
- Retrieve the source code of the formula and use it yourself (valid only on TADA paying offers). The source code is available in R, Java, C++ and soon Python.
- Use our "Predict" feature in the product: upload the file containing the data to be predicted, and you will receive a downloadable file containing the given data with the calculated predictions.
Dataset information
The screenshot below shows an extract of the public dataset.
Each row is a patient and each column is a variable.
Variables:
- ID number
- Variables 4 to 32 are patient’s descriptive variables
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
- Diagnosis (M = malignant, B = benign) is our Goal.
Task Type: binary classification
Number of variables: 32
Number of rows: 569
Goal: Diagnosis (M = malignant, B = benign)
Weight: Positive class (B) 63%, Negative class (M) 37%
Variables are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
[K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
Results
The results of the model are available after the generation of the model.
They present the performance of the predictive model.
The type of predictive model and its associated performance indicators depend on the Goal (the variable to be predicted) and the values this variable takes.
The type of model you make is shown on the model results display.
According to the type of the Goal (in our case, the Goal is "Diagnosis"), we can make three types of predictions:
- Binary classification: Discrete value taking only two values (yes / no for instance, M and B in this case)
- Multiclass classification: Discrete value taking more than two values (for instance a machine status with values such as: On, Risk of breakdown, Down, etc.)
- Regression: Continuous value that can take an infinite number of values (a temperature, a pressure, a turnover, the price of a house for instance)
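To make the three cases concrete, here is a minimal sketch of how a prediction type could be guessed from the values of the Goal column. The function name `infer_task_type`, the `max_classes` threshold, and the rule itself are assumptions made for illustration. The source does not describe how TADA actually decides.

```python
def infer_task_type(goal_values, max_classes=20):
    """Guess the prediction type from the Goal column.

    Heuristic for illustration only (not TADA's documented rule):
    - exactly two distinct values            -> binary classification
    - many distinct values, all numeric      -> regression
    - anything else with 3+ distinct values  -> multiclass classification
    """
    distinct = set(goal_values)
    if len(distinct) < 2:
        raise ValueError("the Goal needs at least two distinct values")
    if len(distinct) == 2:
        return "binary classification"

    # Check whether every distinct value can be read as a number.
    try:
        for value in distinct:
            float(value)
        numeric = True
    except (TypeError, ValueError):
        numeric = False

    if numeric and len(distinct) > max_classes:
        return "regression"
    return "multiclass classification"


print(infer_task_type(["M", "B", "B", "M"]))
print(infer_task_type(["On", "Risk of breakdown", "Down"]))
print(infer_task_type([20.0 + 0.5 * i for i in range(50)]))
```

The `max_classes` cut-off is arbitrary. A numeric Goal with only a handful of distinct values (for example a grade from 1 to 5) is treated as multiclass here, which may or may not be the right call for a given dataset.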
At the generation of the model, and according to the practices and state of the art of Machine Learning, your dataset will be divided into three parts by TADA:
- A training part, which represents 40% of your dataset and is used to train a certain number of formulas,
- A validation part, which represents 30% of your dataset and is used to validate and select the best formulas found in the previous step,
- A test part, which represents the last 30% of your dataset and is used to test the formulas approved by the preceding stage. The performance measurement and the evaluation of your model should mainly be done on this partition (the standard practice in Machine Learning), because its data were not used in the training and validation phases and serve only to measure the model's performance.
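The 40% / 30% / 30% partition above can be sketched as follows. This is a minimal illustration assuming a plain random shuffle with a fixed seed. The function name `split_dataset` and its parameters are hypothetical, and the source does not say how TADA shuffles, rounds, or balances classes when it builds the partitions.

```python
import random


def split_dataset(rows, train=0.4, validation=0.3, seed=0):
    """Shuffle the rows and cut them into training, validation and test parts.

    The test part receives whatever remains after the first two cuts,
    so the three parts always cover the whole dataset exactly once.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_part = shuffled[:n_train]
    validation_part = shuffled[n_train:n_train + n_validation]
    test_part = shuffled[n_train + n_validation:]
    return training_part, validation_part, test_part


# 569 row indices stand in for the 569 patients of the dataset.
training_part, validation_part, test_part = split_dataset(range(569))
print(len(training_part), len(validation_part), len(test_part))
```

The exact partition sizes depend on how fractions are rounded, so they will not necessarily match the sizes TADA produces. What matters is that the test rows never take part in training or in formula selection.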
ACC (Accuracy) represents the overall accuracy rate of the model: it is the percentage of predictions that are correct (here, 97.67% of the predictions are correct).
TPR (True Positive Rate) represents the share of actual positive cases, i.e. of the "yes/B" class, that the model predicts correctly.
TNR (True Negative Rate) represents the share of actual negative cases, i.e. of the "No/M" class, that the model predicts correctly.
MCC (Matthews Correlation Coefficient) summarizes the quality of the predictions across both classes in a single value between -1 and 1, where 1 means the model separates the two classes perfectly.
Confusion matrix
Here, the confusion matrix provides a visual way of interpreting the metrics.
In this case, TADA predicted 99 times that a patient had no cancer and was wrong once (we missed 1 cancer).
In parallel, TADA predicted 73 times that a patient had cancer and was wrong 3 times (we told 3 people that they had cancer when they actually did not).