
Analyzing Cross-domain Transportation Big Data of New York City with Semi-supervised and Active Learning

Huiyu Sun and Suzanne McIntosh

Computers, Materials & Continua, October 2018

Abstract: The majority of big data analytics applied to transportation datasets suffer from being too domain-specific: they draw conclusions for a dataset based on analytics performed on that same dataset. As a result, a model trained on one domain (e.g. taxi data) transfers poorly to a different domain (e.g. Uber data). To achieve accurate analyses on a new domain, substantial amounts of data must be available, which limits practical applications. To remedy this, we propose to use semi-supervised and active learning of big data to accomplish the domain adaptation task: selectively choosing a small number of datapoints from a new domain while achieving performance comparable to using all the datapoints. We choose the New York City (NYC) transportation data of taxi and Uber as our dataset, simulating different domains with 90% as the source data domain for training and the remaining 10% as the target data domain for evaluation. We propose semi-supervised and active learning strategies and apply them to the source domain for selecting datapoints. Experimental results show that our adaptation achieves performance comparable to using all datapoints while using only a fraction of them, substantially reducing the amount of data required. Our approach has two major advantages: it can make accurate analytics and predictions when big datasets are not available, and even when big datasets are available, it chooses the most informative datapoints, making the process much more efficient without having to process huge amounts of data.

Keywords: Big data, taxi and Uber, domain adaptation, active learning, semi-supervised learning.

1 Introduction

New York City (NYC) transportation data is available to the public across a number of different domains: taxi, Uber, etc., and many works [Deri, Franchetti and Moura (2016); Freire, Silva, Vo et al. (2014); Sun and McIntosh (2016); Poulsen, Dekkers, Wagennar et al. (2016); Aslam, Lim and Pan (2012); Cramer and Krueger (2016); Sun, Hu, McIntosh et al. (2018)] have been proposed to take advantage of it to perform various analytics and tasks such as predicting traffic, classifying datasets and offering services. But most works rely on data from the same domain to draw conclusions, and when the analytics and models are applied to a different domain, the results are often not very promising. This is due to variations between datasets leading to differences in the trained models. It is difficult to accurately adapt a model trained on one dataset (domain) to a different dataset without training instances available.

Many works have investigated learning approaches on big data using machine learning or neural network based models [Kasun, Zhou, Huang et al. (2013); Najafabadi, Villanustre, Khoshgoftaar (2015); Chen and Lin (2014); Wu, Zhu, Wu et al. (2014); Suthaharan (2014); Belkin, Matveeva and Niyogi (2004); Lv, Duan, Kang et al. (2015)]. Most of these analytics are a form of supervised learning, which requires large quantities of data.

Figure 1: Processing pipeline of our semi-supervised and active learning approach

By contrast, active learning methods [Sun, Grishman and Wang (2017); Akbik, Thilo and Christoph (2014); Sun and Grishman (2018); Sun, Grishman and Wang (2016)] aim to achieve high performance with as little labeling effort as possible. Semi-supervised [Deri, Franchetti and Moura (2016)] and active learning allow us to selectively choose informative datapoints on the new domain, with the goal that the selected data approaches the performance achieved using all the datapoints on the target domain.

The features for our model are the data entries for a given datapoint: VendorID, pick-up date time, drop-off date time, passenger count, trip distance, pick-up longitude, pick-up latitude, RateCodeID, store and forward flag, drop-off longitude, drop-off latitude, payment type, fare amount, extra, mta tax, tip amount, tolls amount, improvement surcharge, and total amount. Entries common to Uber and taxi are used as features. We found that variations between datasets are distinguishable from these features alone, so the features are enough to tell the datasets apart. The label assigned to each datapoint is simply the transportation dataset it belongs to: yellow taxi, green taxi, or Uber. We investigate domain adaptation across these NYC transportation datasets: yellow taxi, green taxi and Uber.

We use a semi-supervised bootstrapper to classify taxi and Uber data; bootstrapping has been extensively applied in prior work. Active learning is then used to select informative datapoints on the target domain. First, we take a semi-supervised approach by investigating ways to bootstrap new data. Then, once the most optimized bootstrapping system is achieved, we turn our system into an active learner that selects positive and negative datapoints at each iteration. The selected datapoints are given a label (yellow taxi, green taxi or Uber) and are used to classify the target domain data. The pipeline of our system is shown in Fig. 1.

2 Semi-supervised learning

In semi-supervised learning, we use a bootstrapper that starts out with a set of seed datapoints. It then selects new data based on different sampling strategies. We first consider the edit distance strategy. Each taxi and Uber data entry can be viewed essentially as a string of characters containing time, location, fare amount, etc. The minimum edit distance [Levenshtein (1966)] between two strings (datapoints) is the minimum number of editing operations required to transform one string into the other. The generalized Levenshtein algorithm [Damerau (1964); Bard (2007); Brill and Moore (2000)] with weights and backtrace is used to compute the edit distances. The edit distance scores are normalized to fall in the range [0, 1], with smaller scores indicating more similar datapoints. Given a set of seed datapoints, we compute the edit distances from a candidate datapoint to each seed datapoint. This set of edit distances forms an edit distance vector, E, for the candidate datapoint. Then Eq. (1) is applied to E to give a final score, B(x), for the candidate. This is computed for each candidate, and the datapoints with the smallest scores are selected and added to the seed data set for the next iteration.

This edit distance approach is used when bootstrapping. However, another approach is used for comparison: the measure of entropy. Entropy has been used in many works [Zhu, Wang, Yao et al. (2008); Becker, Hachey, Alex et al. (2005)] to measure uncertainty [Lewis and William (1994)]. When negative data are introduced, there are two labels for a datapoint: ‘+’ if it is an example of the relation, and ‘-’ if it is not. The entropy of a datapoint can then be computed by Eq. (2), where P(+) is the probability of a datapoint being labeled positive and P(-) is the probability of a datapoint being labeled negative.

For bootstrapping, we pick the candidate data with the smallest entropy, as they are the most reliable data. For active learning, on the other hand, candidate data with the largest entropy are picked since they represent the highest uncertainty.

3 Active learning

An active learner starts out with a set of seed data. It then selects new seed data based on different sampling strategies. The basic idea is to select informative examples that maximize the likelihood of reaching optimal performance. We investigate several different active learning strategies for sample selection.

3.1 Algorithm

The active learning algorithm is described in Fig. 2. We start with a few seed datapoints. Then the active learning strategy is applied to a set of candidate data, and those satisfying the strategy are selected. The iteration continues until a stopping criterion is met. We randomly select 90% of the taxi and Uber dataset as the source domain for training and the remaining 10% as the target domain for evaluating our learner. Besides active learning, different search, ranking and information extraction methods [Sun and McIntosh (2016); Sun, McIntosh and Li (2017); Fu, Ren, Shu et al. (2015); Zhou, Yang, Chen et al. (2016); Fu, Sun, Liu et al. (2015); Yuan, Xia, Sun et al. (2017); Xia, Wang, Sun et al. (2016); Xia, Wang, Sun et al. (2014)] have been proposed.
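The iterative loop of Fig. 2 can be sketched as follows. The `strategy` callable stands in for any of the sampling strategies in this section, and the function and parameter names are illustrative, not from the paper:

```python
from typing import Callable, List

def active_learning_loop(
    seeds: List[str],
    candidates: List[str],
    strategy: Callable[[str, List[str]], float],  # informativeness score
    batch_size: int = 1000,
    max_iterations: int = 10,
) -> List[str]:
    """Each iteration moves the batch_size most informative candidates
    (highest strategy score) from the unlabeled pool into the labeled
    set, stopping when the pool is exhausted or after max_iterations."""
    labeled = list(seeds)
    pool = list(candidates)
    for _ in range(max_iterations):
        if not pool:  # stopping criterion: no candidates left
            break
        pool.sort(key=lambda x: strategy(x, labeled), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend(batch)  # in a real run, the batch is labeled here
    return labeled
```

With a batch size of 1000 and 10 iterations, as in the experiments below, at most 10,000 datapoints are ever selected from the pool.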

3.2 Edit distance

One approach to quantifying similarity is edit distance [Levenshtein (1966)]. The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion and replacement) required to transform one string into the other. It is used in speech recognition, computational biology and other fields [Damerau (1964); Bard (2007); Brill and Moore (2000)].

For a datapoint X of length n and a datapoint Y of length m, we define D(i, j) as the edit distance between X[1..i] and Y[1..j]. The edit distance between X and Y is thus D(n, m). We use the generalized Levenshtein algorithm [Levenshtein (1966); Damerau (1964)] to produce alignments and alignment scores between two datapoints. It uses a backtrace to keep track of alignments. Weights on the edit operations are also considered, since some operations are more likely than others. The algorithm is shown in Fig. 3. With the edit distance approach, the more similar a candidate datapoint is to the seed datapoint, the smaller the edit distance will be. This allows us to rank each candidate datapoint accordingly.
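A sketch of the dynamic program of Fig. 3 with per-operation weights and a backtrace; the uniform default weights are an assumption, and a character match is recorded as a zero-cost substitution:

```python
def weighted_levenshtein(x: str, y: str,
                         w_ins: float = 1.0,
                         w_del: float = 1.0,
                         w_sub: float = 1.0):
    """Weighted edit distance D(n, m) with backtrace: returns the
    distance and the aligned sequence of operations."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i * w_del, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j * w_ins, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0.0 if x[i - 1] == y[j - 1] else w_sub)
            moves = [(D[i - 1][j] + w_del, "del"),
                     (D[i][j - 1] + w_ins, "ins"),
                     (sub, "sub")]
            D[i][j], ptr[i][j] = min(moves)
    # Backtrace from D(n, m) recovers the alignment of operations.
    i, j, ops = n, m, []
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op == "sub":
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    return D[n][m], list(reversed(ops))
```

Raising, say, `w_sub` relative to `w_ins` and `w_del` biases the alignment toward insert/delete pairs, which is how operation likelihoods can be encoded as weights.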

Figure 2: Active learning algorithm for selecting new datapoints on the target domain

Figure 3: Generalized Levenshtein algorithm with weights and backtrace for edit distance

3.3 Improve with entropy

We observe a limitation of our current active learning methods: any particular seed datapoint is limited in that it can only pick out a small set of the available data. Using our active learning methods to quantify similarity between a seed datapoint and potential candidates picks out different ways to describe a specific datapoint, but it sometimes fails to introduce new instances. Here we combine our edit distance methods with entropy to counter this limitation. Given a set of candidate datapoints, we first calculate a set of edit distances. Each edit distance is then divided by the length of the datapoint, so that long strings (datapoints) result in an even lower value. Next we calculate the entropy of the candidate set L the same way as in semi-supervised learning, using Eq. (2).

Figure 4: F1 curves for different active learning methods for 10 iterations on the target domain

Entropy has been used in many works to measure uncertainty. Here, we use it as an additional metric in combination with the scores calculated using the edit distance method. The uncertainty component might seem counter-intuitive at first, since we are mainly trying to discover similar datapoints, but it has the benefit of introducing completely different datapoints.

4 Experimental results

4.1 Active learning

We start with a few seed datapoints. The default batch size we used is 1000, meaning that in each active learning iteration 1000 new datapoints are selected and added to the data set. We compare performance in terms of the F-measure, specifically the F1 score. It measures both accuracy and sensitivity as a combination of recall and precision, calculated by Eq. (3).
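Eq. (3), the harmonic mean of precision and recall, can be computed from classification counts as follows:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    # Eq. (3): F1 = 2 * precision * recall / (precision + recall),
    # computed from true positives, false positives and false negatives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 punishes an imbalance between precision and recall, which is why it is preferred over plain accuracy for comparing the sampling strategies.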

We compare the performance of the edit distance and entropy based active learning methods introduced in the previous section. We also use a random selection strategy for comparison. Fig. 4 shows the F1 curves for the different methods. Using the 90% split of the source domain, we get an F1 of 48.62% on the target domain, which is iteration 0. The F1 score then improves with each iteration. We performed 10 iterations for each method. Randomly selecting datapoints clearly performs much worse than the selective sampling methods, showing how much the choice of sampling method matters. The edit distance method, the simplest sampling strategy, outperforms random selection by a large margin. The F1 scores after 10 iterations for the four methods are shown in Tab. 1.

Table 1: F1 scores for random selection and different active learning methods after 10 iterations. Supervised score is also shown

The entropy active learner performs similarly to edit distance in the first few iterations. But after the 4th iteration, it clearly outperforms edit distance and continues to do so in the later iterations. This shows that as more and more datapoints are selected, entropy is the better method.

As expected, our combination of entropy with the edit distance active learner achieved the best performance of the four, with a better F1 in every iteration of active learning. After 10 iterations, it beats the next best (entropy) by 5% and beats random selection by 18%. This shows the advantage of leveraging both similarity and uncertainty in sample selection. Using the supervised method, i.e. all the datapoints from the 10% target domain, we get an F1 of 82.37%; that is the upper bound we are trying to approach with active learning. With the edit distance + entropy active learner, we achieved an F1 of 71.26% after 10 iterations, 86.52% of the supervised score. We only used a fraction of the datapoints from the new domain, which contains millions of data entries. In terms of data reduction, we therefore used less than 0.01% of the data while achieving a score comparable to the supervised method (86.52% of the supervised score).

4.2 Data splits

To get an overall measure of our active learner's performance, we need to take the split into account. In our experiments, we did a random 90%/10% split on the train and test data. We want to see the effect that different data splits have on the results, since a given split might pick out a more or less representative subset. We performed 20 of these random splits on train and test and recorded the scores on the test set. Distributions of the frequency of F1 scores on the test set for the entropy and edit distance + entropy learners are shown in Fig. 5. We see that the variation in scores roughly conforms to a normal distribution.

An overlooked aspect of these big data analytics is the substantial variance in F1 score across different random splits of the dataset, as seen in the figure. The standard deviation (about 2.5% F-measure) is comparable to the difference between systems. To avoid this problem it is necessary either to completely specify the split, thereby eliminating the variance, or to take the mean of n splits (thus reducing the standard deviation by a factor of √n). In the active learning experiments, we took a split that achieved a score close to the average of the 20 random splits to minimize the variance in scores.
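The √n reduction can be demonstrated with simulated split scores; the `run_split` stand-in below draws fake F1 values with the roughly 2.5-point spread reported above instead of training a real model, so everything about it is illustrative:

```python
import math
import random
import statistics

def run_split(seed: int) -> float:
    # Stand-in for evaluating one random 90%/10% split; a real run
    # would train the learner and score the held-out 10%. Here we
    # simulate an F1 score with an assumed mean and spread.
    rng = random.Random(seed)
    return rng.gauss(71.26, 2.5)

scores = [run_split(s) for s in range(20)]
sd_single = statistics.stdev(scores)          # spread of one split's score
# Reporting the mean of n independent splits shrinks the standard
# deviation of the reported number by a factor of sqrt(n).
sd_of_mean = sd_single / math.sqrt(len(scores))
```

With 20 splits, the reported mean is roughly 4.5 times more stable than any single split's score, which is why averaging (or fully specifying the split) matters when systems differ by only a few F1 points.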

Figure 5: Frequency distributions of F1 scores after 20 random 90%/10% splits on train and test with entropy (left) and edit distance + entropy (right) active learners

5 Conclusion

In this paper, we have analyzed the feasibility of domain adaptation on big datasets and found that accurate adaptation between the domains of the NYC transportation datasets is very feasible. We used semi-supervised learning to train a classification model for taxi and Uber data, then turned the semi-supervised approach into an active learner to adapt it to a new domain, which further increased the performance of our system. Using a fraction of the labeling effort, we achieved performance similar to using all datapoints on the new domain, and the active learner combining edit distance and entropy performed best. Overall, the process of bootstrapping followed by active learning achieved an F1 score of 71.26% (86.52% of the supervised score) using only a tiny fraction of the dataset, showing that domain adaptation on big datasets can yield performance comparable to supervised methods.
