Real estate valuation is at the core of buying and selling property. Setting the right asking price is paramount to closing a deal. Conversely, getting that initial figure wrong can mean either drastically underselling the property or failing to find a buyer.
Problems to solve
- How can we predict the value of a house or an apartment? Which method can be used for a quick and objective appraisal?
- Can machine learning help here, and how accurately can predictive models estimate real prices?
Benefits of TADA
Real estate is the largest asset class in the world. It makes up, on average, 5.1% of any institutional portfolio (Andonov, Eichholtz, and Kok [2013]).
Finding the true market value of a property is an essential skill for appraisers, and it ensures a fair negotiation. Real estate professionals and investors can use predictive models to get realistic market values.
However, they are not data scientists and may have neither the machine learning skills nor the coding experience to build models. Moreover, they mostly handle Small Data: historical datasets for a given area contain a few hundred or a few thousand properties, rarely the millions that characterize Big Data. The machine learning tools that work well with Big Data may not perform as well with Small Data.
By using an automated machine learning solution such as TADA, real estate professionals can now estimate the price of their properties more quickly and accurately. Machine Learning holds great promise for real estate.
MyDataModels allows real estate professionals to build predictive models from Small Data automatically and without training. They can use their collected data directly, without normalization, outlier management, or feature engineering. Thanks to this limited data preparation, the results for this specific dataset were obtained with a few clicks, in less than a minute, on a regular laptop.
The results obtained from MyDataModels’ predictive model are satisfying: the typical prediction error (RMSE) is around $3,600 per square meter.
Conclusion
Pricing is key in real estate, as people want the most accurate price for the property they wish to purchase or sell. Hence, it is vital for real estate agencies to provide them with a precise estimate of the property price.
To successfully determine property prices, real estate specialists need to extract value from all the available information to complement their domain expertise and help them close deals with their customers.
In this real estate valuation use case, the results obtained from MyDataModels’ predictive models are very satisfying with an average error of 16%.
By using an automated machine learning solution such as TADA, real estate specialists can now easily estimate the value of a property according to different environmental factors and data. This prediction is made quickly, with great precision, which allows them to move forward fast and provide their customers with the most accurate valuation for the property they wish to purchase or sell.
Case study
Solution
Automated Machine Learning solutions predict future outcomes from historical data. To predict a future result, you must provide your descriptive data together with the past results already observed.
TADA allows you to simply create a relevant predictive model from your data and apply it to future data.
In this case, the descriptive data is information about houses.
The goal is to predict the price of a house: it’s a regression task, meaning that the purpose of the model is to predict a numerical value.
To generate a model, the steps are the following:
- Create your project and load your data as a CSV table (with data in rows and variables in columns).
- Select the variable you want to predict, called Goal. In this case, the Goal is the variable "Y_house_of_price_unit_area" (a visualization of the variable is provided).
- Select your data for the model generation. This step is called "Creating the Variable set" and allows you to manually select the descriptive variables you want to use. By default, they are all selected. TADA identifies the relevant descriptive variables by itself, which affects the calculation time required to create the model. The fewer variables selected, the faster the model creation.
- Create your model. At creation, default values are proposed to you: Name of models, Population, Iteration. You only need to validate the default values to start model generation. ‘Best practices’ are at your disposal to guide you in the choice of these parameters. Depending on the size of the descriptive data file, this step can take between a few seconds and ten minutes.
Once the model is created, you can see the results of the model using metrics and charts so you can judge its relevance.
Note:
To apply a model that you think is relevant, you can:
- Retrieve the associated mathematical formula and apply it (for instance in Excel).
- Retrieve the source code of the formula and use it yourself (valid only on TADA paying offers). The source code is available in R, Java, C++ and soon Python.
- Use our "Predict" feature in the product: upload a file containing the data to be predicted, and you will receive a downloadable file containing that data along with the calculated predictions.
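As a rough illustration of the last option, the sketch below appends a prediction column to a CSV table. It is not TADA's implementation: `add_predictions`, `placeholder_formula`, its coefficients, and the two sample rows are all made up for the example, and a real formula would come from the model you generated.

```python
import csv
import io


def add_predictions(csv_text, formula, column="prediction"):
    """Return CSV text with one extra column holding formula(row) for each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # CSV values arrive as strings; convert them before applying the formula.
        values = {name: float(value) for name, value in row.items()}
        row[column] = formula(values)
        writer.writerow(row)
    return out.getvalue()


def placeholder_formula(values):
    # Placeholder only: NOT a formula produced by TADA. Coefficients are invented.
    return 40.0 - 0.25 * values["X2"]


# Two invented rows (house age X2, distance to MRT X3), for demonstration only.
sample = "X2,X3\n10,500\n20,1500\n"
print(add_predictions(sample, placeholder_formula))
```

Passing the formula as a function keeps the file handling separate from the model, so the same helper works with any exported formula.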
Dataset information
Each row is a house and each column is a variable which can be used in the model.
Historical values are shown in the last column of the table (“Y_house_price of unit area”).
Task Type: Regression
Number of variables: 7
Number of rows: 290
Goal: Y= house price of unit area
Assessing the market value of real estate is a daunting task. The real estate market is exposed to many fluctuations in prices because of existing correlations with many variables, some of which cannot be controlled or might even be unknown. Housing prices can increase rapidly (or, in some cases, drop very fast).
Appraisers still manually evaluate the value of assets that are sometimes worth billions of dollars by comparing an asset to a small set of previously transacted reference buildings that are somehow comparable.
Machine Learning holds great promise for real estate pricing models.
This case study is based on real data from a public dataset originally found in the UCI data repository (https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set).
The objective is to estimate the price of a house per square meter in Taiwan dollars. (In the original dataset, the unit of area is the ‘Ping’, which corresponds to 3.3 square meters. However, we kept the original unit.)
The figure below shows an extract of this public dataset.
The model uses the variables as follows:
X1=the transaction date (for instance, 2013.250=2013 March, 2013.500=2013 June, etc.)
X2=the house age (unit: year)
X3=the distance to the nearest MRT station (unit: meter)
X4=the number of convenience stores in the living circle on foot (integer)
X5=the geographic coordinate, latitude (unit: degree)
X6=the geographic coordinate, longitude (unit: degree)
The Goal
Y= house price of unit area (square meter).
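Two of these encodings are easier to read once decoded. The sketch below turns the fractional-year transaction date (X1) into a year and month, and converts a price per Ping into a price per square meter using the 3.3 factor quoted above. The function names are ours, and the handling of a `.000` fraction as December of the previous year is an assumption: the dataset description only gives the two examples listed for X1.

```python
PING_IN_SQUARE_METERS = 3.3  # conversion factor quoted in the text


def decode_transaction_date(x1):
    """Convert a fractional year such as 2013.250 into (year, month)."""
    year = int(x1)
    month = round((x1 - year) * 12)
    if month == 0:
        # Assumption: a .000 fraction denotes December of the previous year.
        return year - 1, 12
    return year, month


def per_ping_to_per_square_meter(price_per_ping):
    """Convert a price per Ping into a price per square meter."""
    return price_per_ping / PING_IN_SQUARE_METERS


print(decode_transaction_date(2013.250))  # (2013, 3)
print(decode_transaction_date(2013.500))  # (2013, 6)
```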
Results
Once the model has been generated, its results become available.
They present the performance of the predictive model.
The type of predictive model and its associated performance metrics depend on the Goal (the variable to be predicted) and on the values this variable takes.
The type of model you make is shown on the model results display.
Depending on the type of the Goal (in our case, the Goal is "Y_Price_of_unit_area"), three types of predictions are possible:
- Binary classification: Discrete value taking only two values (yes / no for instance)
- Multiclass classification: Discrete value taking more than two values (for instance a machine status with values like: On, Risk of breakdown, Down, etc.)
- Regression: Continuous value that can take an infinite number of values (a temperature, a pressure, a turnover, the price of a house, etc.)
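The distinction between the three types can be sketched with a simple heuristic on the values of the Goal. This is only an illustration, not the rule TADA applies: `infer_task_type` is a hypothetical helper, and it would misread class labels that happen to be coded as numbers.

```python
def infer_task_type(goal_values):
    """Guess the prediction type from the values taken by the Goal."""
    distinct = set(goal_values)
    if len(distinct) == 2:
        return "binary classification"
    # bool is a subclass of int in Python, so exclude it explicitly.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric:
        return "regression"
    return "multiclass classification"


print(infer_task_type(["yes", "no", "yes"]))
print(infer_task_type(["On", "Risk of breakdown", "Down"]))
print(infer_task_type([37.9, 42.2, 47.3]))
```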
When the model is generated, TADA divides your dataset into three parts, in line with standard machine learning practice:
- A training part, which represents 40% of your dataset and is used to train a number of candidate formulas.
- A validation part, which represents 30% of your dataset and is used to validate and select the best formulas found in the previous step.
- A test part, which represents the remaining 30% of your dataset and is used to test the formulas approved in the preceding stage. The performance of your model should mainly be evaluated on this partition, as is standard in machine learning, because its data were not used during training or validation and serve only to measure performance.
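A 40/30/30 partition of this kind can be sketched in a few lines. The sketch assumes the rows are shuffled before splitting and uses a fixed seed for repeatability; how TADA actually assigns rows to each part is not described here, and `split_dataset` is a name chosen for the example.

```python
import random


def split_dataset(rows, seed=0):
    """Shuffle rows and split them 40% / 30% / 30% into train, validation, test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(0.4 * len(shuffled))
    n_valid = round(0.3 * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains, about 30%
    return train, valid, test


# With the 290 rows of this dataset:
train, valid, test = split_dataset(range(290))
print(len(train), len(valid), len(test))  # 116 87 87
```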
How good is this model?
The metrics yielded by TADA under the Metrics heading are shown in the table below and refer to a run of one minute.
We can make a few observations.
- The Maximum error, defined as the difference between the actual value and the predicted one, can be negative.
- For every regression task, that is, wherever a numeric value is predicted or ‘fitted’, we may judge the error of the model with respect to the standard deviation 𝜎 (the ‘spread’) of the data used to construct the model. A prediction error of the same order of magnitude as 𝜎 indicates a good prediction.
- For our starting data, the spread of the price per unit surface is 𝜎 = $4,260. This is a considerable spread around the mean value of $11,500/sqm (or the close median value of $11,650/sqm). The regression error is of the same order of magnitude, with an RMSE of around $3,600 per square meter. Thus, the model can be judged as acceptable, within the limits of the initial data quality.
The last point means that the model produces a prediction which is no more uncertain than the original data. A model can only be as good as the data used to generate it: it cannot improve on the quality of the initial data, and nobody can!
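The comparison between RMSE and spread can be reproduced on any set of actual and predicted values. The sketch below is a minimal version under two assumptions: the spread is taken as the population standard deviation (TADA may use another convention), and the helper names are ours. The final lines only reuse the two figures quoted above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def population_std(values):
    """Standard deviation of the values around their mean (population form)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def max_error(actual, predicted):
    """Signed error (actual - predicted) with the largest magnitude."""
    return max((a - p for a, p in zip(actual, predicted)), key=abs)


# Figures quoted in the text: RMSE of about 3,600 against a spread of 4,260.
ratio = 3600 / 4260
print(round(ratio, 2))  # 0.85
```

A ratio below 1 means the model's typical error is smaller than the spread of the prices themselves.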