Ye Li,Fan Zhang ,Bo Gan ,and Chengzhong Xu ,2
(1.Shenzhen Institutesof Advanced Technology,Chinese Academy of
Sciences,Shenzhen 518055,China;
2.Department of Electrical and Computer Engineering,Wayne State
University,MI 48202,USA)
Abstract Smart refueling can reduce costs and lower the possibility of an emergency.Refueling intelligence can only be obtained by min?ing historical refueling behaviors from big data;however,with?out devices,such as fuel tank cursors,and cooperation from drivers,these behaviors are hard to detect.Thus,detecting refu?eling behaviors from big data derived from easy-to-approach trajectories is one of the most efficient retrieve evidences for re?search of refueling behaviors.In this paper,we describe a com?plete procedure for detecting refueling behavior in big data de?rived from freight trajectories.This procedure involves the inte?gration of spatial data mining and machine-learning tech?niques.The key part of the methodology is a pattern detector that extends the naive Bayes classifier.By drawing on the spa?tial and temporal characteristics of freight trajectories,refueling behaviors can be identified with high accuracy.Further,we present a refueling prediction and recommendation system to show how our refueling detector can be used practically in big data.Our experiments on real trajectories show that our refuel?ing detector is accurate,and the system performs well.
Keyw ords spatial data mining;trajectory processing;big data
R esearch into smart refueling has become more im?portant as drivers become more sensitive to fluctu?ating fuel prices.Previous research has been heavi?ly focused on the use of theoretical models to ana?lyze refueling behavior in small sets of experimental data[1]-[5].However,intelligence on refueling behaviors can also be directly obtained from massive trajectories.Collective refu?eling intelligence is useful because refueling behaviors are summarized as a result of smart decisions by experienced freight drivers.Traffic systems using collective intelligence have performed well,both in scientific research and in re?al-world applications[6].However,without additional devices and cooperation from drivers,refueling records are hard to col?late and are often inaccurate.Traditional methods for monitor?ing refueling behaviors,such as floating cursors in a fuel tank,are not widely used because of the cost of equipment and main?tenance.Thus,there are not huge volumes of refueling records,and we have to turn to trajectories to obtain data.Embedded GPSequipment has become common in the freight and logis?ticsindustries.
With elaborate algorithms,patterns can be recognized in tra?jectories;for example,we can identify whether a car is moving or not.If a car hasstopped near a gasstation,it islikely refuel?ing.However,the subtle distinction between refueling and ca?sual parkingcannot bemade.
We have designed algorithms that can accurately identify re?fueling stops along trajectories.To benefit drivers,we have al?so built a refueling prediction and recommendation system based on collective intelligence.The main motivations of our research are no additional cost and powerful collective intelli?gence.We design the refueling-detection algorithms only by mining mass data from trajectories,and no additional devices are required.GPSis already installed in most cars;hence,our algorithms can be widely used without having to install,test,and maintain new devices.This saves costs.We were also in?spired to design our data-mining algorithms partly because of successful use of collective intelligence in taxi services.Simi?larly,in the logistics and freight industries,refueling behaviors derived from collective intelligence can reveal industrial ad?vantagesand disadvantages.
Given enough spatial,temporal,and other information,a de?tector can follow simple rules,including rules about spatial constraints and constraints on traveled distance,to identify po?tential refueling stops.However,such methods rarely work well in practice because of a lack of valid prior knowledge.One crucial problem of all naive methodsis the lack of reliable prior knowledge and training data set.To our knowledge,there are no publicly available records on refueling behavior,espe?cially for the freight industry.Some data on refueling has been collected as part of research into fuel consumption and vehicle efficiency rather than refueling behavior[1].Without custom?ized sensors in the fuel tanks of cars,such information can on?ly be obtained if drivers are willing to cooperate.This contrib?utes to data inaccuracy and deficiency.Also,for rules-based decisions,prior knowledge extends further than simply routes and spatial distribution of gas stations.A solution therefore needs to be found for reliably collecting prior knowledge about refueling behavior.As well as valid prior knowledge,there are no training data sets because nothing can be directly observed from the spatial trajectories.Detectors cannot be optimized without training.Prior knowledge may offer the foundation for constructingatrainingdataset.
The major drawback of naive methods is that they neglect the mutual influence of consecutive stops at nearby gas sta?tions.For example,there may be two gas stations that are spa?tially and temporally close along a trajectory.The car has trav?eled such a distance from its last refueling stop that it is likely the driver will refuel at one of these gas stations.The detectors must classify one of these two eligible candidates as the refuel?ing stop.Using a naive method,the first eligible candidate where the car firstly stopped by is classified as a refueling stop,and the result of this classification affects the other candi?date.If the first stop near the first gas station is treated as the refueling stop,then the traveled distance from the last refuel?ingstop isdiminished for thenext stop,and thismakesthesec?ond candidate ineligible,even though the stop may be a possi?ble alternative refueling stop[2].The multiple hypothesis mechanism can handle thisproblem.
To overcome the above problems,we introduce several ad?vanced data-mining and AI techniques,and we elaborate our algorithms.First,we propose a hill-climbing algorithm in local search and optimization.Thisalgorithmisused togeneratereli?able prior knowledge about refueling behavior and construct the training data set.Then,we use a naive Bayes classifier al?gorithm to detect refueling stops with regard to multiple hy?potheses.Last,we suggest that our detector has good extensi?bility by suggesting possible applications.Through abundant evaluationsand data analysis,we show therobustnessof our al?gorithms.The main contributionsof thispaper are
·a new method of inferring reliable prior knowledge about re?fueling behavior.This prior knowledge is derived from his?torical freight trajectories
·a method for determining freight refueling patterns based on our observationsand processingof spatial trajectory datasets
·a solution to accurately detecting refueling stops along a sin?gle trajectory.This solution isbased on naive Bayes.
·a real-time stream processing and batch-processing frame?work for processingand miningtrajectory data
·a practical application for recommending refueling alterna?tivestodrivers.Thisapplication isan extension of our detec?tion algorithms.
The remainder of this paper is organized as follows:In sec?tion 2,we define terms used in this paper and outline prob?lems.In section 3,we describe our methods.In section 4,we introduce a prediction and recommendation application based on our proposed algorithms.In section 5,we describe the de?sign of our system.In section 6,we present our experimental results.In section 7,we discuss related work.Section 8 con?cludesthepaper.
The data set for freight trajectories is similar to that for non-freight trajectories[7].However,the interests of freight and non-freight sectors are slightly different,and both look for different patterns.This means that different measurements are used,and there are differences between freight GPS records and non-freight GPSrecords.
Freight GPSrecord:A GPSrecord data reported by a freight car.In our research,a GPSrecord ricontains a time stamp ts,longitude lon,latitude lat,and odometer reading odm and is given by ri={tsi,loni,lati,odmi}.
Single freight trip:An integrated,temporally continuous se?ries of freight GPSrecords form a single freight trip,which is given by T={r1,r2,...,ri}.A freight trajectory must have a distinct origin and destination.The time stamp and odometer reading of the first freight GPSrecord are minimums.Typical?ly,a trip starts after an unusually long period of parking and ends with another unusually long period of parking that is also considered thestart of thenext trip.
Gas station location:The gas station location is given by gsi={gsidi,loni,lati,rfcounti}.The last attribute rfcountiis the number of refueling behaviors that occur at a gas station within a certain timeframe of the data set.The GPScoordinates of a gas station are an approximate location generated after map digitalization.
Vehicular stop:Vehicular stops emerge along the freight tra?jectories.A vehicular stop contains a set of continuous GPSre?cords.A vehicular stop has the following attributes:locations,time stamp for when the stop begins,stopping interval,and minimum bound rectangle(MBR).This vehicular stop is given by VSi={xi,yi,tsi,tii,,where(Xi,Yi)is the expected coordinates of the stop,ts is when the stop be?gins,and tiiis how long the stop lasts.The boundary of the MBR of GPSrecords for this stop is given by MBR=,y ,y ).We divide the vehicular stops into the following two definitions.
Refueling stop:This is a specific kind of vehicular stop that occurs when the driver refuels during a stop.Refueling stops are distinct from other stops in that they only occur around gas stations.A refueling is given by rsi={υsi,gsk}where the j th ve?hicular stop isidentified asthe i th refueling stop,and the driv?er refueled at gasstation k.
Casual stop:a stop that is not a refueling stop.The main dif?ference between a casual stop and a refueling stop is that the former may occur near a gas station or not;however,the latter occur at a gasstation.A casual stop isgiven by csi={υsi}.
With enough prior definitions,we can now state our prob?nimaximinlem:Sets of vehicular stops extracted from trajectories must be correctly classified as refueling stopsor casual stops.
In the following,we base our algorithms and applications on thedefinitionsabove.
When a given a vehicular stop is fed to the classifier,the classifier can correctly identify this stop as either casual or re?fueling.This is the goal of our algorithm.To optimize the clas?sifier,reliable prior knowledge needs to be retrieved.Essential prior knowledge is factors affecting the classifier parameters.For our data set,these factors can be observed from odometer readings and temporal intervals.With prior knowledge,it is possible to predict how far a car has to travel from the refuel?ingstop beforeit hastorefuel.It should alsobepossibletopre?dict how longtherefuelingstop will last.
The predicted distance a car travels from the last refueling stop before refueling isυ,and the time of a refueling stop isτ.These parameters tend to converge to two constants for the same route.The greatest distance a car can drive on a full tank is mainly determined by the load of the car,the capacity of the tank,the driver's driving habits,highway conditions,and dis?tribution of gas stations.In long-distance transport,refueling behavior is usually unique for the same driver using the same vehicle on the same route.The driver is always optimizing their decisions to gain the maximum benefit.A refueling pro?cess comprises pumping the gas and paying.Hence,the time of the whole process only decreases by a relatively small amount.
Because the patterns are known,a reasonable method is re?quired to calculate these variables.Authentic patterns can be mined from data obtained from real refueling stops.We pro?pose a naive method with hill climbing algorithm to approxi?mate these variables.Algorithm1 isthenaivemethod for a giv?enυ.

Algorithm1.Naive Bayes(VS;υ)Require:VS={vs 1,vs 2,...,vsn},aset of vehicular stops,when vsi={x,y,odm,t i,gsid,gsdis,lastdis,r i=0}.υ,the spatial parameter.Ensure:VS={vs 1,vs 2,...,vsn},where vsi.r i isdetermined.RS={rs 1,rs 2,...,rs m 1},aset of refuelingstops.CS={cs 1,cs 2,...,cs m 2},aset of casual stops.1:lastodm=0,m 1=0,m 2=0,RS←null,CS←null;2:for i=1;i≤n;i++do 3:vsi.lastdis=vsi.odm-lastodm;

4:if vsi.lastdis>υthen 5: lastodm=vsi.odm;6: vsi.ri=1;7: rsm 1=vsi;8: RS:add(rsm 1);9: m 1=m 1+1;10:else 11: csm 2=vsi;12: CS.add(csm 2);13: m 2=m 2+1;14:end if 15:end for 16:return{VS,RS,CS};
The main drawback of the naive method is that it cannot dis?tinguish between nearby refuelingstops and casual stops.How?ever,it can give reliable answers for highly distinguishable events.For example,there is a car which stops consecutively at two gas stations at a distance of more thanυapart and with?out other stops near them.In this case,we can confidently de?tect two reliable refueling stops and mine the patterns.Never?theless,we have to eliminate indistinguishable neighbors of therefuelingstops.
3.1.1 Eliminating Indistinguishable Neighbors
For each refueling stop detected using the naive Bayes meth?od,there are neighboring stops that can be either refueling stops or casual stops.These neighboring stops are located at a distance d0and time t0fromtherefueling stop.These neighbor?ing stops are indistinguishable,so we proceed to the elimina?tion of the pool of this detection and its neighbors without con?sidering these indistinguishable stops in our hill climbing algo?rithm.Then,we obtain the remaining set of refueling stops RSremainand casual stops CSremain.
After eliminating unreliable detections,we calculate the dis?tancebetween thepatternsfromreal dataυusing

Ifιis small enough,we can confidently say thatυcorre?sponds to authentic patterns.Initially,υcannot be determined fromraw trajectories data set because of the first problem men?tioned in section 1.To overcome this problem,we propose a hill-climbing algorithm to approximate the bestυwith the low?est possibleι.The hill-climbing algorithm is shown in Algo?rithm2.

Algorithm2.Hill-Climbing({VS},υinit,increment)Require:{VS}={VS 1,VS 2,...,VSn},a set of trajectory,when VS i={vs 1 i,vs 2 i,...,vs ni}υinit,an randomly initial spatial parameter.increment,aconstant standsfor theincrement in iterations.

Ensure:υbest,thebest spatial parameter.1:υbest=υinit;2:whileυcurrent≠υbest do 3:υcurrent=υbest;4:ι1←DistCal({VS},υcurrent);5:ι2←DistCal({VS},υcurrent-increment);6:ι3←DistCal({VS},υcurrent+increment);7:υbest←argmin{ι1,ι2,ι3,};8:end while 9:returnυbest;

Algorithm3.DistCal({VS},V)Require:{VS}={VS 1,VS 2,...,VS n},aset of trajectory,when{VS i}={vs 1 i,vs 2 i,...,vsni}υ,a given spatial parameter.Ensure:ι,the distance between real data and given spatial parame?ter.1:ι←null;2:for i=1;i≤n;i++do 3:{VS i,RS i,CS i}←NaiveMethod(VS i,υ);4:if RS i=null then 5: ιcurrent=|VSi.allodm-υ|;6: else 7: {RS ,CS }Eliminations(VSi,RS i,CSi);8: ιcurrent=|RS .lastdis-υ;|9: end if 10:ι:add(ιcurrent);11:end for 12:returnι;i Remain i Remain i Remain
The algorithm extractsυbest,which produces the smallestι from{υ,υ-increment,υ+increment}in line 7.The sub-al?gorithm DistCal is shown in Algorithm 3.The eliminations pro?cess in the algorithm conforms to the definition of eliminating indistinguishableneighborspreviously given.The hill-climbing algorithm starts from a randomly initiated spatial parameterυinitfromand then iterates untilυ reaches a local minimum.This algorithm can generate a reli?able spatial parameter because of its convergence.The number of vehicular stops in a single trajectory with VSiis given by nmax.Then,regardless of howυchanges,the number of detected refueling stops nrlies in[0,nmax].Asυdecreases,nrincreases until it reaches nmax.However,υcontinues to decrease,which makeincreasewhen(VSi×allodm)/nmaxis a constant.The results are the same whenυcontinually in?creases.Hence,the best approximatedυcan be obtained from the real data set asιdecreases.
3.1.2 Constructingthe Training Data Set
Whenυbesthas been determined,we can construct the train?ing data set for the naive Bayes classifier.First,we use the na?ive method on the real data set withυ=υbest.Then,we elimi?nate indistinguishable neighbors in the outcome{VS}and ob?tain the training data set,which comprises{RSremain}and{CSre?main}.The set{RSremain}is the set of positive examples in the na?ive Bayes classifier,and{CSremain}is the set of negative exam?plesin thena?ve Bayesclassifier.
3.2 A Naive Bayes Classifier with Multiple Hypotheses
Detecting refueling stops can be thought of as a classifica?tion problem.An event is unambiguously classified according to evidence derived from observations.A classifier should be able to correctly identify whether a given vsi is a refueling stop or casual stop by observing the time of the stop and distances traveled fromlast refueling.
3.2.1 A Naive Bayes Classifier Framework
We use a naive Bayes classification framework[8]-[12],which isbased on Bayes'formula[13],[14]and isgiven by

where Y is the class variable and X is the attribute for each Xicomprising d attributes[9].Fromour observations,X={x1,x2},where d=2,x1=vsi×lastdis and x2=vsi×ti;and Y={y1,y2},where y1=0 is a casual stop,and y1=1 is a refueling stop.Then,classification problemcan be briefly expressed as vsi×(P(Y=yi|X)).However,beforeusing(2),P(Y),
P(Y|X)has to be calculated first.In all circumstances,P(X)isaconstant[8],[9].
Given a raw data set,P(Y)can be directly observed and cal?culated.However,construction of training data set requires in?distinguishable neighbors to be eliminated.Hence,we have to eliminate the effects of data loss.There are nppositive exam?ples and nnnegative examples,so we know that we have elimi?nated nvindistinguishable stops,of which there are nrrefueling stopsthat have been detected using the naive Bayes method.

If we assume Gaussian distributions of continuous attributes x1,x2(section 3.1),the class-conditional probability for any at?tribute Xican be expressed as

whereμijcan be estimated using the sample mean xiof Xifor all training data that belong to the class yj,andσ2ijcan be esti?mated using the sample variance of the same subset of training data[8].
The naive Bayes classifier is suitable for detecting refueling stops based on similar observations from our data set and the characteristicsof thenaive Bayesclassifier[9].
The naive Bayes Classifier is not sensitive to isolated noise points because the class-conditional probabilities are estimat?ed over all data.A single observation of location data may be erroneous due to errors in the GPS receivers,odometer,and other sensors.However,these noises are eliminated by averag?ingout.
The naive Bayes classifier is not sensitive to irrelevant attri?butes.In our data set,the distance from last refueling and the length of timespent at astop areonly loosely related.
3.2.2 Multiple-Hypotheses Framework
A naive Bayesclassifier doesnot yet solve the problemof in?distinguishable neighbors.We propose a framework that com?bines multiple hypotheses with a naive Bayes classifier to over?comethisproblem.
For aset of stopsand neighboringstops(including both refu?eling and casual stops)within a distance of d0and time t0,it needs to be determined whether there is only refueling stop in a set or whether there is no refuelingstop in a set.
For a set of neighboring stops VS of sizes m,there are m+1 possibleresultsfromthefollowingequation:

where k=0 indicates that none of m is a refueling stop,and k≠0 indicates that the k th vehicular stop is a refueling stop.We solve the indistinguishable neighbor problem by calculat?ing all possible situations in(4)and determining the highest probability by usingargmax{P′}1×(m+1).

Algorithm 4 the Naive Bayes Classifier and Multiple Hypothe?sis Combined Framework Require:VS={υs 1,υs 2,...,υsn},whenυsi={x,y,odm,ti,gsid,gsdis,lastdis,ri=0}.{μij},{σij},the set of parametersfrom the training data.d 0,t 0,twoparameterstodetermineneighbors.Ensure:S={υs 1,υs 2,...,υsn},whenυsi.ri isdetermined.1:for i=1;i≤n;i++do 2:NeighborSet i 1×m←GetNeighborSet(υsi);3: if NeighborSet i 1×m=null then 4: υsi.ri=argmax(P(Y=y j|X));5: else 6: k=argmax{P′}1×(m+1);y j k

7: for j=0;j<m;j++do 8: if j+1≠k then 9: υsi+j.ri=y 1=0;10: else 11: υsi+j.ri=y 2=1;12: end if 13: end for 14: i=i+m;15:end if 16:end for 17:return VS;
In this section,we describe an application that uses the pre?viously mentioned algorithms to output reliable indications of refueling behavior.Combining gas station background informa?tion and collective intelligence derived from drivers,the appli?cation,called refueling prediction and recommendation,pro?vides optimized refueling choices.The application is straight?forward.With well-designed logical models,we show how refu?eling behavior can be determined by mining driver intelligence fromfreight trajectory data.
Given real-time information about the car,and given back?ground information from other sources,the application tells a driver when and wherethey should refuel.
The real-time information of a car changes when the car needs refueling.This information includes the distance from where the car last refueled.This attribute is exactly the same as that used in the above algorithms.However,here it is moni?tored in real time in order to provide instant evidence for our recommendation.The real-time spatial attribute of a specific time stamp t is dt.The statistical spatial constant parameters μcandσcare the successions of the definition in section 3.2.1.These constant parameterscan be computed from the set of de?termined refuelingstopsby usingtheabovealgorithms.
Background information can also dominate the recommenda?tion.Background information has two major parameters which represent the popularity of one gas station and the distance of one gas station from the real-time location of one car.The pop?ularity of one gas station can be obtained by using the Naive Bayes Classifier with Multiple Hypothesis Algorithm on a mas?sivedataset and then collectingtheresults.It emitsapossibili?ty show that how likely a refueling will happen on one specific gas station.And the distance of one gas station from the instant location of one car can be used to estimate how likely the car will take a refueling action when it arrives at this gas station,in other words,refuelingprediction.Thepopularity of agassta?tion i and the distance between this gas station and the car's instant location at time t aredenoted by Pgiand dti,respectively.
By integrating real-time and background information,we can model the problem in this prediction and recommendation system.Given dtof a car at time t,what is the probability Ptgithat the car will refuel at gasstation i with background informa?tion Pgiand dti?This probability can be computed using

In order to recommend a specific gas station to this car at time t,we can compute(5)for all gas stations and recommend the gas station with the highest probability argmax{Ptgi}.In this real application,we construct the set of candidate gas stations with direction and distance thresholds in order to eliminate in?valid candidatesfor efficiently computing(5).
Our framework of a refueling detection and recommendation/prediction system comprises data preprocessor,knowledge dis?covery,and intelligence generator(Fig.1).This design is gen?eral and scalable.We embed our refueling detector and recom?mendation/prediction intotheframework.
The data preprocessor contains a spout and bolts in a Storm cluster that is especially designed to preprocess data from spa?tial trajectories.The spout continually emits GPS measure?ments;however,the level of the bolts is determined by logical complexity of the preprocessing tasks.In our system,these bolts clean unqualified GPSrecords,reorganize the trajectory,and extract vehicular stops.This module constantly receives GPSrecords and transforms them into trajectories,and the last bolts extract vehicular stops along each trajectory.The output of thelast boltsisfed tothenext knowledge-discovery module.
The knowledge-discovery module is built on the Hadoop batch-processing platform.This module is mainly responsible for carrying out the most computationally expensive tasks in the system.The batch-processing platform runs these tasks routinely,or only in certain conditions,at low frequency and with latency,unlike the real-time data preprocessor and intel?ligence generator components.In our case,the naive Bayes classifier with multiple hypotheses algorithm is implemented in MapReduce and run in the Hadoop clusters.The output of algorithms is stored in database systems as knowledge and ex?perience extracted frombig data.
The intelligence generator comprises real-time applications that interact with users.It also comprises a spout and bolts,which are located inside the Storm cluster.The intelligence generator and data preprocessor may even share the same spout.However,the data preprocessor digests all the data from the source while the intelligence generator extracts and puri?fies real-time information from the data source with some of bolts to handle dataselections and transformations.Further,af?ter real-time information is captured,the application in a bolt is connected to database systems where knowledge and experi?ence are stored.By combing real-time information and back?ground knowledge,applications in the intelligence generator can accurately provide recommendations in real time.In our case,the bolts in the intelligence generator monitor real-time information for each car,and our recommendation/prediction algorithm is applied to every bolt.Then,a recommendation is made as to whether the driver should refuel.
These three components are highly integrated and are all built on scalable,fault-tolerant distributed systems on Hadoop and Storm platforms.Comprehensive use of these platforms means our system can provide both real-time services and complex data-mining services.Additional applications can be easily integrated into the existing framework in the form of Ha?doop MapReduce jobs or Storm bolts,taking the advantage of reusableresources.

▲Figure1.Integrated system framework.
The GPSdata set was collected from 14,800 cars involved in the logistics and freight industries in China.Each car constant?ly reported its location about every 15 s.The data was collect?ed from October 2009 to April 2012,and real-time data is still incrementing by several Gigabytes every day.Tables 1 shows the data information of the dataset,Tables 2 shows the fields in each record.
Because of the variety of freight routes,our trajectory data set covers most of China.Fig.2 shows an example route from our data set.This route originates in Shenzhen and ends in Shanghai.Thedrivingdistanceisabout 2000 km.
Besides,we have location information of gas stations all round the country.These massive data of gas stations contain location records of 114,000 stations in China.The number of gas stations matches our knowledge of the number of all gas stations in China.Fig.3 shows the spatial distributions of all thesegasstations.
In the initial part of our experiment,we selected a part of our data set of 1000 cars.This raw data was 12 GB.First,we input this data into the data preprocessor module and collected the output of several vehicular stops from different routes.Sec?ond,we ran our Naive Bayes Classifier with Multiple Hypothe?sis algorithm withυinit=700.Then,we obtained the result of a set of determined refueling stops with spatial statistical param?etersμcandσc.Third,we randomly chose 3000 instant records from different cars and ran our prediction and recommendation application.The input for was instant information from these cars and background information generated in previous proce?dures.By the end of the three stages,we had obtained 3000 in?stant records from different cars and routes and recommended a gasstation for each car.

▼Table1.Dataset descriptions

▼Table2.Data fields

?Figure2.A routefrom Shenzhen to Shanghai.
Given the 3000 instant records and 3000 gas stations gener?ated by our application,we designed an indicator called drift to measure the differences between ground truth and assigned recommendations.For the i th instant record,gr(i)is the recom?mended gas station,and gt(i)is the actual gas station where the car refueled.Then,driftican be obtained for the ith instant record:

The distance(gr(i),gt(i))function returns driving distanc?es,recommended gas station gr(i),and actual gas station gt(i),which are calculated from historical trajectories instead of straight-linedistances.
Fig.4 shows the cumulative distribution function of driftimeasurements for all 3000 samples.This result shows that our recommendation/prediction system is precise,and nearby alter?native gas stations can be accurately recommended for refuel?ing.More than 50%of our recommendations were close to the actual gas stations with no drifts.More than 60%of the recom?mendations drifted within 100 kilometers.Almost 85%of rec?ommendations drifted within 300 kilometers.Assuming an av?erage speed of 120 km/h on highway in China,we provided 85%of users with alternative refueling gas stations within a driving time of around 2.5 hours.This is acceptable because the assigned alternative gas stations are always more attractive todriversbecausethey provide better services.
However,15%of recommendations were far away from the actual refueling spots.These should be considered failed rec?ommendations.Possible causes of failed recommendations are unavoidable randomcorruption in raw GPSrecordsand insuffi?cient experimental data.Although some unqualified records could be removed in the data-cleaning process,defective odometer readings and abrupt discontinuance of trajectories af?fect the precision of the refueling detector.This may have a detrimental impact on our recommendation system.A partial data set in the big data may cause bias in when calculating the popularity of gas stations.This leads to a lack of equilibrium in the statistical distribution.Accuracy can be improved by drop?ping inaccurate raw data and adding more reliable raw data in tradeof larger systemscale.
Much previous research has been centered on obtainingknowledge from large numbers of trajectories and has resulted in the creation of route recommendation systems and traffic prediction systems[6],[7].Research on refueling behavior has alsobeen done[1],[3]-[5],[8],[10],[11],[13].

Figure3.?Spatial distribution of gasstationsin China.

▲Figure4.Cumulativedistribution of drift.
In[1],refueling behavior of gasoline-car drivers was ana?lyzed,and it wasfound that refuelingbehavior isstrongly corre?lated to characteristics how a car is driven.This research spurred us to use algorithms that detect refueling stops based on the driving distance.Analysis of refuel availability in[3],[4]model of optimization of refueling stations locating in[5]presented research on spatial distribution of gasoline stations and quantified refuel availability in order to optimizing the lo?cating of gasoline stations.Our refueling recommendation/pre?diction system evaluates refueling availability with spatial dis?tributions of gasoline stations combined with their popularity among all drivers.However,such previous research work did not use a huge amount of trajectories data,which means that these research results limited in a small range both spatially and temporally.In comparison,our system is based on a big volume of real trajectories data.Moreover,we introduced the classification methodologies for data mining[8]-[11]into our algorithmstoachievehigh accuracy under thescopeof dataun?certainty.
A smart route service system for taxicab drivers based on a large amount of historical taxi trajectories was proposed in[6],as well as a graph model to estimate traffic conditions.Our re?search work shared the same designing ideas with[6]in the da?ta preprocessing model.However,we further integrated batch processing platform Hadoop and real-time stream processing platform to process big data for real-time applications.Our ar?chitecture and framework have a systematically advantage in generality and scalability.
In this paper,we have integrated machine-learning and da?ta-mining techniques to create a refueling detection algorithm called Naive Bayes Classifier with Multiple Hypotheses.This algorithm detects whether a car has refueled.We have also de?signed a practical application to recommend refueling alterna?tives.This application is based on real-time information taken from a vehicular trajectory and background knowledge about the distribution of gas stations and driver experience.Further?more,we have developed a general framework for integrating Storm real-time processing platform and Hadoop batch-pro?cessing platform in order to process and data mine the trajecto?ry under the scope of real-time big data.Our experiment shows that,our recommendation/prediction system recom?mendsacceptablegasstation alternativeswith low drifts.
Acknowledgment
This work was supported by a grant from the Science Tech?nology and Innovation Committeeof Shenzhen Municipality.