Analyzing Classified Listings at an E-Commerce Site by Using Survival Analysis

Sahibinden.com is a leading e-commerce site in Turkey where sellers (buyers) may advertise their goods (needs) with or without a fee. Because it draws a large volume of traffic to its classified car listings, the site plays an important role in determining the market value of used cars. In this study, we first randomly selected 200 car classifieds from the 950 new classified ads posted on February 22, 2012. We then observed these listings daily for a month to record any updates and deletions of the ads. We assume that an ad being taken down means that the car has been sold. In addition to the cars' features, we recorded the posted price and the number of daily views of each ad throughout the data collection. One can therefore construct survival models to study the effects of the features and price of a car on the life of its ad. In other words, it is possible to study which features and price levels expedite the sale of used cars.


INTRODUCTION
The automotive industry is one of the leading contributors to GDP in developed countries. Considering that automobiles make up 70% of this industry (in terms of number of units), the importance of the automobile trade is obvious (Onat, 2007).
In addition, second-hand (used) car (Wikipedia, 2013) sales far exceed new car sales in many countries, which shows the importance of used car sales in the world economy (Asilkan, 2009). For example, the volume of used car sales in the U.S.A. is more than twice that of new car sales (Lee, 2006).
The Internet has become a very important medium for the second-hand car market in developed countries, since most people in these countries have Internet access and second-hand car dealers and buyers can readily reach each other online. In Turkey, a developing country, the amount of purchases made over the Internet is increasing rapidly. For instance, there was an increase of almost 20 percent in the January-February period of 2013 compared to the same period of the previous year, which brought the annual monetary volume of total Internet sales in Turkey to 5.2 billion Turkish lira (Dünya, 2013).
Sahibinden.com is one of the leading e-commerce sites in Turkey, with more than 2 million ads. According to data from Sahibinden.com, the number of vehicles sold or rented in the first three months of 2013 increased by 17 percent over the same period of the previous year, to around 347,000 vehicles.
Considering that one vehicle is sold or rented every 23 seconds, the volume of used vehicle listings (advertisements or ads) at Sahibinden.com is very significant. Indeed, automobile listings account for 59% of the listings across the 16 different vehicle types in the Vehicles category, and the Vehicles category in turn accounts for 42% of all listings at Sahibinden.com.
A particular car listing may stay active for a number of days, and the time that a listing stays active can be considered a random variable. There may be various reasons for an ad to be taken off the web site. However, it is reasonable to assume that the car has been sold by the time its ad is removed. The main purpose of this study is to determine the important automobile characteristics that directly affect automobile sales. To achieve this, we collected data (Gökçegöz, 2012) by observing a random sample of automobile listings at Sahibinden.com for 30 days. We then analyzed the data to determine the impact of the various car features on sales. In Section 2, we give background information on survival analysis. Section 3 describes the data used in this paper. Statistical analysis of the data is given in Section 4. Section 5 concludes the paper.

SURVIVAL ANALYSIS
Survival analysis is used to analyze data obtained at the realization of a predetermined event (such as death or failure) at some point in time. The main challenge in analyzing survival data is that the predetermined event may not have occurred by the time we stop observing an object. In other words, the object may survive beyond the period over which observations are collected. Such cases are called right-censored observations and mostly have longer survival times (Nelson, 1982). The analysis of such data has been one of the main problems facing statisticians. Like the rest of the data, censored observations should be used correctly to achieve better results.
There are various approaches to solving problems in survival analysis. In one of these approaches, survival analysis is conducted by using a variety of parametric survival distributions. Another approach is based on nonparametric distribution analysis, which can be used without any prior assumptions about the statistical distribution. In this study, the outcomes of the analyses are presented for both the parametric and the nonparametric approaches.
Because there are two major analysis methods, the analysis of censored survival data leads to a problem of choice. The advantages of the nonparametric method of analysis are its simple calculations and the understandability of its outcomes. Among nonparametric methods, the Kaplan-Meier estimator (Kaplan & Meier, 1958) is one of the most commonly used. On the other hand, parametric models are unbiased even if the underlying distribution hypothesis is no longer valid, as they are robust methods.
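As a rough illustration of the Kaplan-Meier calculation mentioned above, the following minimal sketch computes the product-limit estimate from durations and event indicators. The function name `kaplan_meier` and the toy data are assumptions made for illustration only; they are not the study's data or software.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations: number of days each listing was observed
    events:    1 if the listing was removed (treated as sold),
               0 if it was still active at its last observation (right censored)
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # every listing leaves the risk set at its duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Both sold and censored listings are removed from the risk set
        at_risk -= exits[t]
    return curve


# Hypothetical toy data (not taken from the study)
durations = [3, 5, 5, 8, 10, 10]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.4f}")
```

Censored listings contribute to the risk set up to their last observed day but never trigger a drop in the curve, which is how the estimator uses them without treating them as sales.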
Parametric modeling yields superior results when the chosen parametric distribution matches the data. However, censored data in particular may lead to poor outcomes when used with parametric methods. In short, the most suitable survival analysis methods have been used in this paper to overcome the problems that stem from real-life data.

Survival Function
Survival time is, in the classical setting, the time from a person's exposure to a specific disease until he or she heals or dies. The survival time of an individual or system, denoted by T, is a random variable. The probability that an individual lives longer than a certain time t is called the survival function. The survival function is given by the following equation (Wang et al., 2002):
$$S(t) = P(T > t) = 1 - F(t) = \int_t^{\infty} f(u)\,du,$$
where F(t) is the cumulative distribution function of the survival time and f(u) is the probability density function. Equivalently, the hazard function can be defined based on the survival function as
$$h(t) = \frac{f(t)}{S(t)},$$
which specifies the instantaneous failure rate at time t (Wang et al., 2002). Both the survival and the hazard functions are used extensively for analyzing survival data in practice.
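To make the relation between the survival and hazard functions concrete, the sketch below evaluates both for a lognormal survival time, the distribution this paper later selects for its parametric analysis. The function names and the parameter values `mu` and `sigma` are hypothetical and chosen only for illustration; they are not fitted values from the study.

```python
import math
from statistics import NormalDist


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - F(t) for a lognormal survival time (t > 0)."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(t))


def lognormal_density(t, mu, sigma):
    """f(t): normal density of log(t), divided by t (change of variables)."""
    return NormalDist(mu, sigma).pdf(math.log(t)) / t


def lognormal_hazard(t, mu, sigma):
    """h(t) = f(t) / S(t), the instantaneous failure rate at time t."""
    return lognormal_density(t, mu, sigma) / lognormal_survival(t, mu, sigma)


# Hypothetical parameters on the log-day scale (not estimates from the paper)
mu, sigma = 3.0, 1.0
median = math.exp(mu)  # the lognormal median, where S(t) equals one half
s_at_median = lognormal_survival(median, mu, sigma)
h_at_median = lognormal_hazard(median, mu, sigma)
print(s_at_median, h_at_median)
```

Evaluating at the median is a convenient sanity check, since half of the listings are expected to survive beyond it under the assumed distribution.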

DATA COLLECTION
The data used in this paper were gathered from Sahibinden.com. Automobile listings at Sahibinden.com are presented with car features including condition (used or new), price, brand, model, type, mileage, color, engine capacity, engine power, fuel type, gear type, body type, transmission, warranty status, trade-in options, and the responsible party (owner or dealer). The number of page views (i.e., how many times the listing was seen by site visitors) is also shown on listing pages. All of these features were collected by special software developed to fetch the web pages of the 200 random listings used in this paper. The cleaned data were stored in an Excel file for further processing.
The data gathering software was run every day to collect data for each listing in order to detect price changes, the number of views, and the status of the listing, i.e., whether it had been removed since the previous day. If a particular listing was no longer accessible, it was assumed that the car had been sold; the variable representing death (failure) was then set to 1 and the death time, t, was recorded for that listing. The removal of a listing thus corresponds to death or failure in our survival analysis approach.
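The encoding described above can be sketched as follows. This is only an illustration under assumed data structures: the dictionary of daily accessibility flags, the function name `to_survival_records`, and the listing identifiers are hypothetical and do not describe the actual data gathering software.

```python
def to_survival_records(daily_status, study_days=30):
    """Convert daily accessibility flags into (duration, event) pairs.

    daily_status: {listing_id: [bool, ...]}, one flag per day,
                  True if the listing was still accessible on that day
    Returns {listing_id: (duration_in_days, event)} where event is
    1 for a removed listing (assumed sold) and 0 for a right-censored one.
    """
    records = {}
    for listing_id, statuses in daily_status.items():
        # Default: never seen removed, so censored at the end of the study
        duration, event = study_days, 0
        for day, active in enumerate(statuses[:study_days], start=1):
            if not active:
                # First day the listing is missing is taken as the death time
                duration, event = day, 1
                break
        records[listing_id] = (duration, event)
    return records


# Hypothetical snapshots for two listings
snapshots = {
    "A": [True, True, False],  # no longer accessible on day 3
    "B": [True] * 30,          # still active after 30 days
}
records = to_survival_records(snapshots)
print(records)
```

Listings that remain accessible for the whole observation window are kept in the dataset as censored observations rather than being dropped.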
Once the 30-day long data collection task was completed, the dataset was preprocessed to convert prices given in foreign currencies to Turkish Lira (TL) based on the exchange rates on the day of original data collection. Some other minor discrepancies were also resolved during the data preprocessing step.
We give summary charts in Figure 1 and Figure 2 to depict the content of the dataset. Figure 1 summarizes the model year of the cars in the dataset. Most of the cars are less than ten years old. Figure 2 summarizes the composition of the dataset by brand. In Figure 2, "1" represents the sold cars and "0" represents the unsold cars. The figure makes it easy to see the distribution of the cars by brand in the collected dataset.

ANALYZING THE DATA
In this section we report our results on analyzing the data by survival and regression analyses. Figure 3 depicts the empirical and nonparametric survival function based on 200 observations.
A distribution analysis carried out on the dataset with Minitab found that the lognormal distribution is the most suitable distribution for the available data. Therefore, the parametric analysis was performed by using the lognormal distribution. Figure 4 depicts the lognormal survival function. Figure 5 shows the nonparametric survival function generated by the Kaplan-Meier method.
We also analyzed the effects of the listing (car) features on the survival function. Figure 6 depicts the effect of engine size (whether less than 1600 cc or not) in the top plot and the type of seller (whether dealer or not) in the bottom plot. Cars with engines smaller than 1600 cc are sold more quickly than cars with larger engines. Interestingly, the car listings posted by dealers were removed earlier than listings posted by owners, which may indicate faster sales by dealers.
The effects of the engine types on the survival functions are depicted in Figure 7. The top plot shows the effect of diesel engines: during the first 20 days, non-diesel cars are sold more quickly, whereas in the last 10 days of the period diesel cars are sold much faster.
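The lognormal fit reported above was obtained with Minitab. As a rough sketch of what a parametric fit to right-censored data involves, the code below evaluates the censored lognormal log-likelihood and maximizes it with a coarse grid search. The function names, the grid ranges, and the toy data are all assumptions made for illustration; this is not Minitab's procedure and the numbers are not the study's results.

```python
import math


def censored_lognormal_loglik(durations, events, mu, sigma):
    """Log-likelihood of a lognormal(mu, sigma) model with right censoring.

    Removed listings (event = 1) contribute log f(t);
    censored listings (event = 0) contribute log S(t).
    """
    total = 0.0
    for t, e in zip(durations, events):
        x = math.log(t)
        z = (x - mu) / sigma
        if e == 1:
            # log f(t): log normal density at log(t), minus log(t)
            total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi) - x
        else:
            # log S(t) = log P(Z > z), computed with erfc for accuracy in the tail
            total += math.log(0.5 * math.erfc(z / math.sqrt(2)))
    return total


def fit_by_grid(durations, events):
    """Coarse grid search; returns (log-likelihood, mu, sigma) of the best cell."""
    best = None
    for i in range(10, 41):       # mu from 1.0 to 4.0 in steps of 0.1
        for j in range(3, 21):    # sigma from 0.3 to 2.0 in steps of 0.1
            mu, sigma = i / 10, j / 10
            ll = censored_lognormal_loglik(durations, events, mu, sigma)
            if best is None or ll > best[0]:
                best = (ll, mu, sigma)
    return best


# Hypothetical toy data (not taken from the study)
durations = [3, 5, 5, 8, 10, 10, 30, 30]
events = [1, 1, 0, 1, 0, 0, 0, 0]
best_ll, best_mu, best_sigma = fit_by_grid(durations, events)
print(best_ll, best_mu, best_sigma)
```

A grid search is used here only to keep the sketch dependency-free; a numerical optimizer would normally replace it.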
Notice that there are also cars with LPG engines. We see very little difference in survival functions of gasoline vs. non-gasoline engines at the bottom plot of Figure 7.

Regression Models
Any analysis without proper models to determine the important factors that affect the time to death (in our case, the time to sell the car) will not be complete (Kleinbaum & Klein, 2005). R-squared values are reasonable for all the regression models, and the p-values indicate that the models are significant. The partial regression coefficients are in line with expectations. For example, the variable ViewRatio has a negative sign in the models, which indicates that the more a car listing is seen on average per day, the sooner the car will be sold. The partial regression coefficients of Diesel are higher than those of the other engine types, which indicates that it takes longer to sell diesel cars on average. As the partial regression coefficients of Dealer are negative, it takes less time to sell the cars listed by dealers. The variable KM has both positive and negative signs across the six models presented in Table 1, which suggests that KM may not be significant at all. Small engine size again has a negative sign, which indicates that it is easier to sell cars with smaller engines.

DISCUSSION AND CONCLUSION
We presented the results of a statistical analysis of data collected from an e-commerce site about car listings. We successfully applied methods from survival analysis to analyze such data. We then built regression models to analyze the factors that affect the time to sell the car (or remove the car listing). The outcomes of the survival functions and the regression models agree with each other.
Since price data were also collected in our study, we can determine the price elasticity of used cars, as prices may vary from day to day for a given car listing. We can also construct classification models to predict whether a certain vehicle will be sold within a specified time period. We plan to conduct such studies in our forthcoming paper.