Research on factors influencing house value: taking California as an example

Housing prices are a popular and important topic in today's society. This article aims to identify the factors that affect housing prices. To examine the relationships between factors, it uses multiple linear regression to test the significance of each factor. 1,000 samples of California block groups from 1990 are selected for this research. Based on its assumptions, this research chooses 8 explanatory variables for the analysis. Because the explanatory variables are related to one another, the article also adds interaction terms between latitude and longitude, and between population and total bedrooms, to address the multicollinearity problem among explanatory variables. To optimize the model, this research compares the significance, VIF values, and GVIF values of the explanatory variables. The results show that geographical location (latitude and longitude), housing median age, total bedrooms, population, and median income have significant impacts on housing value. Among these factors, median income is the main one.


Introduction
Housing prices, one of the most closely watched economic indicators in today's society, strongly affect the daily and economic life of the population. Housing has significant meaning for each individual [1]. It is also one of the important factors in people's health and social welfare [2]. At the same time, housing has significant impacts on economic development [3]. This demand has made house value an important topic for society. Real estate is also a popular financial product, which makes house value even more significant for research. For nearly 40 years, U.S. home prices have grown quickly and at times unstably. Understanding the factors that influence house value is important for predicting housing prices and their wider economic effects. This paper uses California's house values as an example to predict the median housing price from different possible factors.
In practice, housing prices have complex relationships with many kinds of potential factors. In academia, finding the factors that influence house value is also a popular topic. Mao et al. used the King County house sales data to analyze geographic features for housing price prediction. Their research used multiple linear regression with 10-fold cross-validation and examined the number of bedrooms, latitude, and longitude as features of housing price. Their analysis is predictive and easy to understand [4]. Graha et al. used exploratory data analysis of population changes in Lisbon to find the relationship between population and housing. Their research examined the relationship between different groups of people and different types of housing, and shows that the relationship between population and housing price is not merely two-sided. This analysis is complicated but accurate [5]. Hao et al. also used California's housing data to examine the possible features of housing prices. They used multiple linear regression to analyze the relationships between housing price and median house value, median income, median housing age, total rooms, total bedrooms, population, and households [6]. Paul-Francois et al. used linear and nonlinear ARDL models to evaluate the effects of the economic, financial, and political risk factors of country risk on the prices of different segments of houses [7]. Na Li et al. studied the effect of policies such as macroeconomic regulation and control and the two-child policy on Chinese housing prices [8]. Onur Özsoy et al. used the CART approach, and their results indicate that size, elevators, security, central heating units, and a view are the most important variables affecting housing prices in Istanbul [9]. Wei-Shong Lin et al. compared the influence of different factors on housing prices in the American Northeast and the West [10]. In summary, this article uses the multiple linear regression model to analyze the influence of longitude, latitude, housing median age, total rooms, total bedrooms, population, households, and median income on California's housing prices.

Data source
This research uses a dataset from the Kaggle website (California House Price). The dataset is based on the 1990 California census conducted by the US Census Bureau. The US Census Bureau uses block groups as the smallest geographical units for the samples in this dataset, which contains 20,640 block groups (samples). This research randomly selected 1,000 of them as samples.

Data preprocessing
The original dataset has 207 null values for total bedrooms. To fix this, all null values are filled with the median of that variable. In addition, one variable (ocean proximity) is categorical, and this research chose to remove it. Finally, 1,000 of the 20,640 samples in the original dataset are chosen randomly as the dataset for this research. The data contains 8 explanatory variables (longitude, latitude, housing median age, total rooms, total bedrooms, population, households, and median income) and 1 target variable (median house value). The symbols and meanings of each variable are shown in Table 1.
The multiple linear regression model explains the linear relationship between the target variable and more than one explanatory variable. It uses ordinary least squares to estimate the regression coefficient of each explanatory variable so that the residual sum of squares between the actual and predicted values of the target variable is minimized.
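The preprocessing steps described in this section (median imputation of total bedrooms, removal of the categorical column, and random sampling) can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say which software was used, the function name `preprocess`, the dictionary keys, the seed, and the toy rows are all hypothetical, and missing values are assumed to be represented as `None`.

```python
import random
import statistics


def preprocess(rows, n_samples=1000, seed=0):
    """Fill missing total_bedrooms with the median, drop ocean_proximity, then sample."""
    # Median of the observed (non-missing) total_bedrooms values.
    observed = [r["total_bedrooms"] for r in rows if r["total_bedrooms"] is not None]
    median_bedrooms = statistics.median(observed)

    cleaned = []
    for r in rows:
        # Drop the categorical column (assumed key name).
        row = {k: v for k, v in r.items() if k != "ocean_proximity"}
        if row["total_bedrooms"] is None:
            row["total_bedrooms"] = median_bedrooms
        cleaned.append(row)

    # Random sample without replacement; the seed is an assumption for reproducibility.
    rng = random.Random(seed)
    return rng.sample(cleaned, min(n_samples, len(cleaned)))


# Tiny hypothetical rows, not real census data.
toy = [
    {"total_bedrooms": 100.0, "population": 300.0, "ocean_proximity": "INLAND"},
    {"total_bedrooms": None, "population": 500.0, "ocean_proximity": "NEAR BAY"},
    {"total_bedrooms": 300.0, "population": 800.0, "ocean_proximity": "INLAND"},
]
sample = preprocess(toy, n_samples=2, seed=0)
```

As in the text, the imputation is done on the full dataset before the random sample is drawn, so the median is taken over all observed values rather than over the sample.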

Multiple linear regression
To increase the accuracy of the prediction, factors that have no or only weak correlation with the median house value must first be identified. The result of this process is shown in Fig. 1.

Figure 1. Correlation analysis between the dependent variable and the independent variables.
In Figure 1, the Pearson test gives the correlation coefficient between each factor and the median house value. The data show that median income has the strongest positive relationship with the median house value; total rooms, housing median age, households, total bedrooms, and longitude have positive relationships, in order from strong to weak; and latitude and population have negative relationships with the median house value.
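For reference, the Pearson correlation coefficient used in this step can be computed as in the minimal sketch below. The function name `pearson_r` and the toy lists are illustrative assumptions, not the paper's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: a perfectly increasing and a perfectly decreasing pair.
income = [1.0, 2.0, 3.0, 4.0]
value = [10.0, 20.0, 30.0, 40.0]
r_pos = pearson_r(income, value)                  # perfect positive relationship
r_neg = pearson_r(income, list(reversed(value)))  # perfect negative relationship
```

A coefficient near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one, which is how the factors in Figure 1 are ranked.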
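After the Pearson test, a multiple regression analysis was applied to the dataset. The general mathematical model for multiple linear regression is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_8 X_8 + \varepsilon,$$

where $\beta_0$ is the intercept (constant), $\beta_1, \dots, \beta_8$ are the regression coefficients of the explanatory variables $X_1, \dots, X_8$, and $\varepsilon$ is the residual. In the fitted model (Table 2), the p-values of $X_4$ and $X_7$ both exceed the significance level, while those of the other 6 explanatory variables do not. This means that 6 explanatory variables have a significant impact on the target variable $Y$, whereas the impact of $X_4$ and $X_7$ is not significant. Therefore, these two variables were removed from later analysis to improve the model's prediction accuracy. The relevant multiple linear regression equation can now be obtained from Table 3.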
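To make the estimation step concrete, the sketch below fits a multiple linear regression by ordinary least squares using the normal equations. It is a pure-Python illustration under stated assumptions: the helper names `solve_linear_system` and `fit_ols` and the toy data are hypothetical, the paper's own software is not specified, and the t-tests and p-values reported in the tables are not computed here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest pivot into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, target):
    """Return [beta_0, beta_1, ...] minimizing the residual sum of squares."""
    design = [[1.0] + list(row) for row in features]  # leading 1 for the intercept
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 - 3*x2 (hypothetical, noise-free).
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
target = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in features]
betas = fit_ols(features, target)
```

Because the toy target is generated without noise, the fitted coefficients recover the generating intercept and slopes up to floating-point error. With real data, a numerically more stable routine (for example a QR-based least-squares solver) would usually be preferred over the normal equations.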
The equation is used to predict the original data, with the following results: the correlation coefficient R of this model is approximately 0.798; the R-squared of the fitted multiple linear regression is 0.637; and the adjusted R-squared is 0.6348. This indicates that the multiple linear regression equation has a moderate ability to explain the relationship between the target variable and the explanatory variables. After the linear regression analysis, the Normalized P-P plot in Figure 2 was constructed from the processed dataset. As the plot shows, the measured cumulative probability plotted against the expected cumulative probability follows approximately a straight line, which suggests that the standardized residuals are approximately normally distributed.
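The reported fit statistics can be cross-checked with the standard adjusted R-squared formula. The sketch below assumes n = 1000 samples and 6 predictors (the model after total rooms and households are removed); these two inputs are inferred from the text rather than stated alongside the statistics, and the function name is illustrative.

```python
import math


def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


r_squared = 0.637  # value reported in the text
adj = adjusted_r_squared(r_squared, n_samples=1000, n_predictors=6)  # assumed n and p
multiple_r = math.sqrt(r_squared)  # multiple correlation coefficient R
```

Under these assumptions the computed values round to 0.6348 for the adjusted R-squared and 0.798 for R, which agrees with the figures reported above.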

Interaction terms
In a model without interaction terms, if the effect of one explanatory variable depends on the value of another explanatory variable, the estimated regression coefficients may be highly biased. An analysis of interaction terms among the factors that influence median house value is therefore necessary for the model's prediction accuracy. Longitude and latitude may combine into a single factor, position, and population and total bedrooms are probably strongly linked. To address this, interaction terms are added to the previous equation, which gains the two terms $\beta_7 X_1 X_2 + \beta_8 X_5 X_6$. Here $X_1 X_2$ and $X_5 X_6$ are the interaction terms for latitude-longitude and total bedrooms-population, with regression coefficients $\beta_7$ and $\beta_8$, respectively. The results of the multiple linear regression model with interaction terms are shown in Table 4.
The result of this analysis is shown in Table 5.
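A short worked derivation may help show why an interaction term matters. The notation here is illustrative and not taken from the paper: $X_a$ and $X_b$ stand for any interacting pair (for example longitude and latitude), and $\beta_a$, $\beta_b$, $\beta_{ab}$ are assumed names for their coefficients.

```latex
% Two-variable model with an interaction term (illustrative notation)
Y = \beta_0 + \beta_a X_a + \beta_b X_b + \beta_{ab} X_a X_b + \varepsilon

% Marginal effect of X_a: it now depends on the value of X_b
\frac{\partial Y}{\partial X_a} = \beta_a + \beta_{ab} X_b

% Without the interaction term (\beta_{ab} = 0) the effect is the constant \beta_a
```

With the product term included, the effect of one variable is allowed to vary with the other, which is the dependence the interaction terms are meant to capture.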

Conclusion
This research randomly selected 1,000 samples from the 1990 California census dataset of the US Census Bureau. The selected samples are preprocessed and contain 1 target variable and 8 explanatory variables. This research uses multiple linear regression analysis to obtain an accurate and detailed picture of the relationships between variables. During the analysis, 4 multiple linear regression analyses are compared to find the possible relationships between the explanatory variables and the target variable. To explore these relationships further, the research also adds interaction terms between explanatory variables to the multiple linear regression analysis. As a result, the factors that have significant impacts on house values are geographical location (latitude and longitude), housing median age, total bedrooms, population, and median income. Among these factors, median income is the main one. Total rooms and households cannot be shown to have an impact on house values in this research.
With this result, the government and related companies have a reference for adjusting their strategies. Individuals have a reference for setting a suitable housing budget from different angles. However, this research cannot fully explain the relationships between factors. The sample size is relatively small, and the number of factors is also relatively small. In addition, the dataset is from 1990, which means that some time-sensitive factors will affect the accuracy of the results. To improve on this, more recent data and more factors should be considered in further study.

Authors Contribution
All the authors contributed equally and their names were listed in alphabetical order.

Figure 2. Normalized P-P plot of regression standardized residuals.

Table 1. Symbols and Meanings of Variables.

This paper uses a multiple linear regression model to analyze the factors influencing housing values. To obtain the optimized model, the research compares the accuracy of several multiple linear regression models using different combinations of variables.

Table 2 shows the regression coefficients of the multiple linear regression model. The p-values of the t-tests of $X_1$, $X_2$, $X_3$, $X_5$, $X_6$, and $X_8$ did not exceed 0.001. However, the p-values of $X_4$ and $X_7$ both exceed the significance level.

Table 2. Linear regression coefficient table.

Table 3 is the regression coefficient table of the multiple linear regression without the data of total rooms and households.

Table 3. Linear regression coefficient table without $X_4$ and $X_7$.

Table 4. Linear regression coefficient table with interaction terms.

Once the interaction terms $X_1 X_2$ and $X_5 X_6$ are included in the model and show a positive effect on median house value prediction, some of the other factors, such as $X_1$, become less significant for prediction. The model's accuracy improves after $X_1$ is removed. Based on several tests on the other factors, only $X_1$ has a negative effect on prediction, while the other factors contribute to it. As a result, the model is adjusted again, and this time only $X_2$, $X_3$, $X_5$, $X_6$, $X_8$, $X_1 X_2$, and $X_5 X_6$ are kept. The final multiple linear regression equation uses only these terms.