1. Introduction
Position services [
1], including GPS and positioning of cellular communication networks, have become widespread. Multi-source mobility trajectory datasets generated by mobile phones, networks, and vehicle navigation, contain rich information about human activities. More and more researchers are working on application based on location information, including socio-economic development and wellbeing [
2], human individual mobility characteristics [
3,
4], traffic analysis [
5,
6,
7], travel guides [
8,
9,
10,
11], user behavior modelling [
12], cross domain recommendation [
13], identifying and predicting lifestyles [
14], construction for 3D geospatial data [
15],and so on.
In real-world applications, trajectory data is typically generated by a multi-source platform, or data is temporally separated. An identification algorithm is very necessary to link the trajectory information of users at different time periods, which focuses on improving the quantity and quality of mobility data. In addition, multi-source data based identification algorithms have been extensively studied, but there are few studies specifically for asynchronous trajectory information.
In this paper, we focus on the problem of user identification based on asynchronous mobility trajectory data. In other words, it matches the trajectories from two different periods to determine whether two users’ trajectories belong to the same person. The study that determines the user to which the trajectory belongs by measuring the similarity between the two trajectories has potential application value in social security prevention. A typical application scenario is that criminals often use multiple mobile phones in turn. Without the real name system of mobile phones, it is impossible to know that these mobile phones (e.g., 2) are from the same person. When we acquire the trajectory data of the mobile phone by mobile base station or GPS positioning, if the trajectories of the two phones (we do not know the ownership in advance) are highly similar, they are likely generated by different phones of the same user. In this way, we can judge that these phones are held by the same user.
It needs to be explained that the asynchronous trajectory means two trajectories do not coincide at all time. Due to the fact that the human behavior always changes with time, there is a certain inconsistency in the mobility trajectory, which thus results in difficulties for asynchronous identification resolution. In order to facilitate the description of the method, we use the following symbols to describe the problem. The trajectory of an object can be defined as the chronological sequence as follows:
Among them, the id is the person’s identity, which is a positive integer. Each position point is represented by a three-dimensional coordinate. and , respectively, represent the longitude and latitude of the k-th recorded point of the track. denotes the recording time.
Two sets of trajectories of multiple objects at two different periods are described as
The problem to be solved is to match the trajectory in S with the trajectory in Q, provided that each trajectory in S has at least one trajectory in Q belonging to the same object. This is not a simple problem because the trajectories in S and Q do not have any overlap in time, and we only use the trajectory information of two objects to calculate the similarity without exploiting any other information. We determine the trajectory most similar to the target as the matching result.
This paper proposes a framework called TOP-N (which uses TOP-N most frequently distributed regions) to achieve mobility identification resolution. In addition, a similarity method based on probabilistic deviation, angle cosine and weighted Jaccard similarity is used to measure similarity of two trajectories. The location where the the trajectory points are frequently distributed is usually more identifiable. For example, users usually stay in residences, companies, etc. for a long time; that is, track points are frequently distributed in these areas. For the areas that the user routinely passes every day, even if the speed is very fast, but as long as the time span is large enough, they are still the areas where the track points are frequently distributed. The infrequently distributed areas are those that the user randomly visits, which have no user identity implication. So, our method first finds TOP-N most frequently distributed regions of the trajectory, called TOP-N region set. As for the set of TOP-N regions, we calculate the scalar product of all pairs of co-distributed regions and then compute the product of the moduli of two TOP-N region sets, taking the quotient of the two as its similarity, namely, cosine similarity. Obviously, the more common regions that the two trajectory distributions share, the greater their similarity. However, there is an alternative possibility. More precisely, the number of co-distributed areas in trajectories T and is n while the number of non-coinciding areas is ; the number of co-distributed areas of T and is also n while the number of non-co-distributed areas is . Assuming that > , it is very likely that is more similar to T than . Therefore, both the co-distributed and the non-co-distributed areas must be accounted for. In another situation, there are three trajectories, and their co-distribution region’s number and non-co-distributed region’s number are the same. In this case, if we want to distinguish which two trajectories are more similar, we need to consider the frequency values of the trajectory points distributed in each region. In short, when calculating the similarity between two trajectories, two sequences need to be compared and reordered to find the pairs of co-distributed regions.
The main contributions are summarized as follows.
An user identification resolution method based on TOP-N frequent distribution regions of asynchronous mobility trajectory is proposed.
The effectiveness of TOP-N method was verified on two real mobility trajectory datasets.
We prove that in the identification task of asynchronous trajectory, the length in time of the trajectory has a great influence on the identification accuracy.
2. Related Work
Our work in this paper is closely related to the topic of the user identification and similarity search based on human mobility data. In recent years, some researchers studied identity resolution in cyberspace [
1,
16,
17,
18,
19,
20], while [
21,
22,
23,
24,
25,
26,
27] exploited identity identification through trajectory information. In [
25,
26,
28,
29,
30] studied the uniqueness of the trajectories of different objects. Ref. [
25] particularly showed that it is enough to identify users in multiple trajectories based on this uniqueness. Because the behaviors and habits of the same person are generally periodic, their activity trajectories also show a certain periodicity [
30]. For example, [
14] proved that sometimes, some people’s expected lifestyles can be analysed by the trajectory periodicity.
The identification task based on mobility information is roughly divided into two scenarios of synchronization and asynchronous, and the synchronization scenario generally refers to cross-platform trajectory matching. In [
26], a robust distance function called EDWP (Edit Distance with Projections) is proposed to match the trajectory data at inconsistent and variable sampling rates by dynamic interpolation. Ref. [
23] proposed a hotspot-matrix method, using cosine of the distribution matrix of two trajectories to express the similarity. This method can achieve a certain degree of recognition accuracy, but the amount of calculation is huge. Ref. [
24] introduced a Signal-Jaccard co-filtering method for matching multi-source trajectory, which can also be used to identify asynchronous trajectory. Only considering the same distribution areas, [
22] calculates cosine of all pairs in the same regions and uses learning-to-rank to improve the identification. Particularly, a inverted-index strategy is used to reduce the computational cost. Although the computational complexity can be substantially reduced, it needs to find all uniform regions from a large number of regions of two trajectories. Ref. [
31] presented a concept to capture both spatial and temporal diversity aspects of the linkage, called K-L diversity. In [
32], based on space-time matching points, a real-time trajectory similarity measure method was proposed. Similarly, in [
21], the authors proposed to achieve identification by time-space matching trajectory points. The matching points of the two trajectories require that the points are close in both time and space. This method is only for synchronous scenarios that is unavailable in our problem.In a related work, [
27] recently proposed a method called CDTraj2vec for user trajectory matching across social networks. This method uses PV-DM (Distributed Memory Model of Paragraph) model and paragraph2vec algorithm to extract the positional access order features in the trajectory sequence, which obtains the vector representation of the user trajectory and thus represents the trajectory similarity by the vector. However, this method ignores the distribution frequency of track points and the important information of track transition status.
4. Experimental Evaluation
4.1. Description of Datasets
The experiments in this paper are conducted on two real datasets, including GeoLife [
34] and CabSpotting [
35]. For GeoLife dataset, we only use the latitude and longitude of each recorded point, while the time information associated with each point in CabSpotting dataset is used. In addition, in order to facilitate the verification of the identification accuracy, trajectories need to be tagged as their id. The labels are encoded by a series of numbers, which are used to check whether the result of matching is correct or not after the identification.
GeoLife is a GPS-recorded dataset that records the trajectory information from 182 users from April 2007 to August 2012, collected by Microsoft Research Asia. This dataset records the wide-ranging outdoor activities of the tested users, including both routines for going home and going to work and some entertainment and sports activities such as shopping, sightseeing, dining, hiking and cycling. The trajectories were recorded by different GPS loggers and GPS phones, and have a variety of sampling rates. 91.5% of the trajectories are logged in a dense representation, e.g., every 1–5 s or every 5–10 m per point. Trajectories of 112 users are roughly located in Beijing, and, thus, the latitude and longitude mostly span between 39.5–41 and 115.4–117.6, respectively. For the sake of calculation, it needs to delete all points whose latitude and longitude are out of this range. Furthermore, in order to retain more experimental data, we also keep the trajectory points in Beijing more than 30%, so a total of 164 user trajectories are used for experiments. As most of the points in the trajectory are too time-intensive (every 1–5 s per point), one point is taken by every four points in experiments. As a result, we can reduce the computational cost without affecting the identification accuracy.
Cabspotting dataset is a public GPS dataset that contains trace information of 536 taxis in San Francisco for less than a month (17 May 2008–10 June 2008). This dataset has been recently used by Piorkowski et al. [
36] to show that certain macroscopic characteristics specific to clustered mobile wireless networks are prevalent in real mobility traces. Due to the inherent characteristics of this dataset(such as common routes of taxis) and the fact that the trajectories are spatially constrained to lie on the streets, the points are less unique. Because the tracks contain a lot of noise points, we delete the high-speed moving point ahead, for instance, the points in which the adjacent speed exceeds 20 m/s.
4.2. Experimental Method
Each trajectory in GeoLife and Cabspotting datasets is divided into two segments, so that two verification sets are obtained respectively. Particularly, the GeoLife dataset is split into and , and Cabspotting dataset is split into and . During the experiments, each trajectory in and is divided into 15 equal parts. We then take 9 copies in order (1–9, 6–15) one time, so that every test trajectory can include 2 sub-trajectories for testing. For and , times of identification experiments are performed with 164 trajectories while identification experiments are performed with 536 trajectories for and . The ratio of successful identification times to total identification times is taken as the identification accuracy.
Prior to recognition, all training trajectories and test trajectories are used to acquire their TOP-N sequences. Thus, calculating the similarity of two trajectories only needs to compare their TOP-N sequences. It notes that the recognition accuracy is related to the selection of parameters, including the size of steps and of the latitude and longitude as well as the number of TOP-N areas where the trajectories are most frequently distributed. In addition, the time length of the trajectory also affects the identification accuracy. In experiments, our TOP-N method includes three variants by embedding different similarity metrics, denoted by TOP-N-PRO, TOP-N-COS and TOP-N-WJS, respectively.
4.3. Experimental Results and Analysis
4.3.1. Effects of and
We have before explained that and determine the size of the small mesh area after the partition. Here the effect of and on the experimental results is discussed briefly. The side length of the small grid is taken as 50–500 m. As a result, the corresponding value of is 0–0.006 and the value of is 0–0.005.
Experimental results on two datasets (GeoLife and Cabspotting) are shown in
Figure 1. As shown in
Figure 1, as the longitude
increases, the identification accuracy tends to decrease. The smaller the
is, the more precise the actual position of each grid region becomes. We thus can easily distinguish the actual distributed locations of the user`s trajectories. Conversely, the larger the
is, the less accurate the locations are. As a result, one grid is likely to correspond to a number of completely different physical locations. On the other hand, when
is too small, it is possible to divide one physical location into multiple grids, thus resulting in inaccurate matching. In addition, the
accuracy graph exhibits obvious oscillating situation. The fluctuation may be due to the division rule on urban areas. As mentioned before, if one real area is divided by different
, there will be multiple situations. In a simple example, two different users (
a and
b) often pass by the same large park, where the user A passes the street on the west side of the park and the user B passes through the small road on the east side. When the division by
is small, the east and west of the park are separated into two grids. When the division of
is large, they will fall into the same grid. Obviously, the division of the latter makes two users’ trajectories more similar due to the increase in their same distribution areas.
4.3.2. Effects of Top-N
In order to discuss the effect of TOP-N on the identification accuracy, we should control the values of
and
. The experimental results from
Section 4.3.1 show that as the longitude
increases, the recognition accuracy tends to decrease. Therefore, three specific
and
parameter values are selected, namely
and
,
and
,
and
, respectively, which accordingly correspond to a small grid side length of 50 m, 100 m, 150 m. We set the TOP-N value from 5 to 100 and performed an experimental analysis on
vs.
and
vs.
.
Figure 2 shows comparisons among probability deviation method, weighted Jaccard similarity and cosine method. In
Figure 2a,b we reveal the experimental results with
,
, (c) and (d) show the experimental results of
,
, and (e) and (f) takes
and
.
As seen from the six plots in
Figure 2, with the increasing number of TOP-N areas, the identification accuracy increases first, then stabilizes, and eventually tends to decrease. This is due to the fact that the number of frequently distributed areas of an object’s trajectory is limited. When TOP-N grows to a certain value and continues to increase, infrequently distributed areas are introduced. These infrequently distributed regions have weak regularity and do not exhibit periodicity. Therefore, the larger TOP-N will not only fail to improve identification accuracy, but also even reduce it. In addition, the TOP-N-COS method is significantly superior over the TOP-N-PRO method because the recognition accuracy of the TOP-N-PRO method is related to the probability of the user appearing in a given area. Moreover, it is affected by the method of dealing with the probability deviation of two trajectories appearing in non-co-distributed regions.
4.3.3. Effects of Time Length
To further explore the effect of trajectory time length on recognition accuracy, we performed three sets of experiments on the GeoLife dataset, setting the length of time to 10 days, 30 days, and 50 days, respectively. For experimental comparisons, we retained 54 trajectories in the GeoLife dataset that are longer than 50 days in length to ensure that the trajectories of each set are from the same users. We follow the same experimental method as the previous section where each trajectory is divided into two segments. According to the experimental conclusions of
Section 4.3.1 and
Section 4.3.2, and considering the compromise between the amount of calculation and the identification accuracy, we set the
,
and TOP-N = 39.
The experimental results are shown in
Table 1. We can find that the identification accuracy of the three methods substantially decreases with the shortening of the time length of the trajectory. When the trajectory time length is 50 days, the accuracy exceeds 80%. When the time length is reduced to 10 days, four methods are all reduced by about 20% in identification accuracy. This fact shows that the length of the trajectory has a great influence on the identification performance. Meanwhile, although the trajectory time is only 10 days, there is also a considerable possibility for successful identification.
4.3.4. Comparisons with Other Algorithms
To validate user identification, we performed comparisons of our method with previous representative methods. Since our approach is designed for asynchronous scenarios, many existing identification algorithms are not applicable. For example, Kondor et al. proposed to achieve identification by time-space matching trajectory points [
21]. It requires that the points of two trajectories are close in both time and space. Therefore, we compare our method with three feasible existing algorithms.
Hao et al. [
22] proposed a cross-domain trajectory similarity method (UNICORN) based on TF-IDF (term frequency-inverse document frequency). TF-IDF transforms the original user trajectory into grid representation, where each cell can be treated as a word, and each user trajectory is seen as a paragraph. Each user trajectory is transformed into a vector by using the TF-IDF method to obtain trajectory similarity.
Zhang et al. [
23] presented a very concise and effective method, called hot-matrix (HOT-MATRIX). HOT-MATRIX method produces hotspot matrix by meshing the map and counting the frequency of the trajectory appearing in each grid. Then the hotspot matrix is used to represent trajectory similarity.
Automatic user identification (AUI)presented in [
24] matches the users by co-filtering the ID pairs with signal base similarity(SIG) and weighted Jaccard similarity (WJS).
In [
27], a cross-domain trajectory matching algorithm based on paragraph2vec (CDTraj2vec) is presented. The method maps the trajectory into the grid, treats each small grid as a word and each trajectory as a paragraph, and then uses the PV-DM model in the paragraph2vec algorithm to extract the positional access order features in the trajectory sequence to obtain the vector representation of user trajectory. Finally, the similarity of the vector is used to represent the trajectory similarity.
We then show the performance evaluation of our TOP-N method. In order to find the best combination of parameters to reduce the amount of calculation and improve the recognition accuracy, we set the parameters TOP-N,
, and
to 5–100, 0–0.006, 0–0.005, and did a lot of experiments. Finally, the parameter settings for datasets
vs.
and
vs.
are shown in
Table 2 and
Table 3, respectively.
As seen from
Table 4, all methods have better performance on dataset
vs.
than
vs.
. This is due to the difference of information recorded in the two datasets. First, many trajectories in GeoLife dataset used for experiments are more than 50 days in time span, and most of the time intervals between the adjacent points are about 5 s. In contrast, the time span of Cabspotting dataset is only 23 days, and the points are sparse. Second, GeoLife dataset records the wide-ranging outdoor activities of the tested users, including not only routines for going home and going to work, but also some entertainment and sports activities such as shopping, sightseeing, dining, hiking and cycling. Instead, because Cabspotting dataset is collected by devices embedded in taxis, the routes and trajectories are spatially limited to street, thus lacking private access. Therefore, the points in the Cabspotting dataset are not so discriminative, thus making identification more challenging. For the above reasons, these two datasets result in very different identification accuracy in our experiments.
From the experimental results, we can see the advantages of the TOP-N method in the asynchronous identification resolution provided that all three TOP-N variants outperform other comparison methods. HOT-MATRIX method achieves a relatively high accuracy while the UNICORN method and the AUI method obtain relatively poor identification results. For the UNICORN method, only the co-distributed regions are considered, but the influence of the non-co-distribution areas is ignored. When the trajectories of different users are sufficiently similar, it is difficult to distinguish them. For the CDTraj2vec method, although it can extract the order features of the location point access of the trajectory, it ignores the frequency of the visiting locations. The premise that the AUI method and CDTraj2vec method can identify users is that the employed dataset contains many stay points. However, the Cabspotting dataset contains very few stop points. Therefore, AUI and CDTraj2vec are not applicable to Cabspotting dataset that contains almost no stay points.
At last, as far as the computational cost is concerned, we give a brief discussion. HOT-MATRIX method needs a very huge computation because it uses cosine of the hot-matrix of two trajectories to express the similarity. UNICORN calculates cosine of all pairs in the same regions with a inverted-index strategy to reduce the computational cost. Although the computational complexity can be considerably reduced, it needs to find all co-distribution regions from a large number of trajectories. In our method, we only consider TOP-N frequently distributed regions in calculation similarity, which thus leads to reduced computational complexity. Experimentally, the TOP-N value corresponding to the highest identification accuracy stably falls between 30 and 100.