Measuring and Understanding Crowdturfing in the App Store

Hu, Qinyu; Zhang, Xiaomei; Li, Fangqi; Tang, Zhushou; Wang, Shilin

doi:10.3390/info14070393

Open AccessArticle

Measuring and Understanding Crowdturfing in the App Store

by

Qinyu Hu

¹,

Xiaomei Zhang

^1,*,

Fangqi Li

²,

Zhushou Tang

³ and

Shilin Wang

²

¹

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

²

School of Eletronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

³

QI-ANXIN Technology Group Inc., Shanghai 201101, China

^*

Author to whom correspondence should be addressed.

Information 2023, 14(7), 393; https://doi.org/10.3390/info14070393

Submission received: 31 May 2023 / Revised: 4 July 2023 / Accepted: 6 July 2023 / Published: 11 July 2023

(This article belongs to the Section Information Processes)

Download

Browse Figures

Versions Notes

Abstract

:

Application marketplaces collect ratings and reviews from users to provide references for other consumers. Many crowdturfing activities abuse user reviews to manipulate the reputation of an app and mislead other consumers. To understand and improve the ecosystem of reviews in the app market, we investigate the existence of crowdturfing based on the App Store. This paper reports a measurement study of crowdturfing and its reviews in the App Store. We use a sliding window to obtain the relationship graph between users and the community detection method to binary classify the detected communities. Then, we measure and analyze the crowdturfing obtained from the classification and compare them with genuine users. We analyze several features of crowdturfing, such as ratings, sentiment scores, text similarity, and common words. We also investigate which apps crowdturfing often appears in and reveal their role in app ranking. These insights are used as features in machine learning models, and the results show that they can effectively train classifiers and detect crowdturfing reviews with an accuracy of up to 98.13%. This study reveals malicious crowdfunding practices in the App Store and helps to strengthen the review security of app marketplaces.

Keywords:

crowdturfing; App Store; review analysis; app reviews

1. Introduction

Crowdsourcing refers to outsourcing tasks to a large number of workers or volunteers, allowing them to contribute services or produce goods to achieve cumulative results [1]. Representative crowdsourcing sites include Amazon Mechanical Turk (https://www.mturk.com, accessed on 7 July 2023) and Microworkers (https://www.microworkers.com, accessed on 7 July 2023). Crowdturfing leverages the crowdsourcing operation by hiring a lot of low-paid workers to perform suspicious and malicious tasks online [2]. The purpose of crowdturfing is to increase the exposure and visibility of the item, thus improving the ranking in the search engines and attracting more genuine users to find and visit it, e.g., posting fake reviews, reposting and likes on social media, and falsifying the number of downloads and installations in mobile application markets. These malicious behaviors not only threaten the credibility and reliability of online information but also threaten the reputation and profitability of enterprises.

According to the survey in [3], 98% of consumers at least “occasionally” read online reviews when researching businesses, and 76% of people “always” or “regularly” read online reviews for businesses. It is evident that reviews are one of the important considerations of brands. However, in the online markets, most consumers do not have opportunities to learn about the goods and can only rely on the product information and other users’ reviews [4]. Enterprises that realize that user-generated content plays an important role in potential customer consumption decisions seek crowdturfing services to enhance competitiveness. As a result, the targets of their published malicious information are gradually shifting toward application stores.

Crowdturfing’s tasks in the app markets are usually to improve or reduce the ratings and reviews of apps, which then misleads potential consumers. For instance, low-paid workers will imitate genuine users’ search keywords, click to download the app after finding it, and give positive reviews in the comment area to improve rankings. Such manipulation of consumer experience is forbidden in the App Store [5], but there are still developers who escape regulation and break the rule purposely, resulting in many misleading reviews and disrupting the ecosystem of the application markets. Therefore, measuring and investigating crowdturfing in app markets will help researchers better understand the characteristics, scale, and goals of crowdturfing and make improvements accordingly. It is an important and worthwhile question to counteract this phenomenon in order to protect the rights of users and market participants.

Previous research has explored various methods of detecting crowdturfing, including supervised learning-based detection of machine-generated reviews, graph-based models for identifying crowdturfing activity and reviewers, semisupervised learning-based methods, and network analysis [6,7,8,9]. However, most studies have mainly focused on online social networks and online shopping platforms, with little focus on investigating the extent and situation of crowdturfing on application market platforms. Therefore, it is more necessary to investigate whether there are differences between the features of crowdturfing groups and those of genuine users and then provide better text mining results.

In this paper, we aim to explore the under-researched area by measuring crowdturfing in the App Store based on reviews and relationships between users. To achieve such a measurement, we first obtain review data from Apple, preprocess the data and use a sliding window and community detection to help us obtain the ground-truth data. Then, we examine the identified crowdturfing and explore the differences with genuine users’ behavioral preferences via sentiment analysis, content similarity, activity frequency, and common words. We also investigate which categories of applications they often appear in and their impact on app rankings. It was found that crowdturfing worked successfully, evading detection by app stores. Finally, we combine several features to train classification models to detect crowdturfing reviews and evaluate the classification performance of several machine learning models and feature importance, with a maximum accuracy of 98%.

In summary, the contributions of this paper are listed below:

We analyze the features of crowdturfing accounts, reviews, and behaviors, then compare them with genuine users to discover differences, giving us new insights into crowdturfing in the App Store. Our disclosure of them will help take further actions and seek more effective detection methods.
Our research on the delivery time of the crowdsourcing platform that provides reviews informs the size of the sliding window. We divide the process of constructing graphs into smaller tasks along the time dimension, which can reduce overhead and increase efficiency.
We compare six commonly used machine learning algorithms to reveal which algorithms and features are effective for the detection of crowdturfing reviews. Based on these features, our classifier models achieve the highest accuracy of 98% in detection.

The rest of this paper is organized as follows. Section 2 provides background information for our study. In Section 3, we list the related work. Section 4 briefly describes the measurement workflow and method. Section 5 describes our measurement results and analysis of crowdturfing. Section 6 evaluates the performance of the classifiers, followed by a discussion of the results in Section 7. Finally, we conclude the paper in Section 8.

2. Background

In this section, we describe the review system in the App Store and how users report suspected spam reviews to Apple. Then, we introduce the data feed from EPF Relational and explain the metadata in the App Store, which plays an important role in our method.

2.1. Review System in the App Store

The App Store by Apple Inc. is one of the dominant app markets today, offering nearly two million apps [10]. In the App Store, users can only write one review for an app they have purchased or gotten [11]. These users may be people who provide suggestions for the app or spammers who are hired to spread spam reviews. With the proliferation of spam reviews, the experience of other users and the app market environment have been seriously affected. Apple has noticed this problem [12]. Apple has strictly confined and taken steps to prevent the posting of fake reviews. If developers are found to manipulate reviews, they will be expelled from the Apple Developer Program [13]. In order to lock spam reviews, users can choose a review they want to report in the App Store and long press on the review, then a “Report a Concern” button will pop up. After tapping the button, users can choose four reason options, including “It looks like spam” [14]. In this way, users can report reviews that look like spam to Apple. However, there is usually no reply or results from Apple after submitting the report [15,16], and users do not know whether the reported reviews will be deleted or if other measures will be taken.

2.2. Metadata Definition

The Enterprise Partner Feed (EPF) is a service provided by Apple to its performance partners. By utilizing EPF, partners can automate the retrieval of data related to Apple’s products, including data feeds of the complete set of metadata from Apple Music, Music downloads, and TV and Movie downloads [17]. This enables partners to stay up to date with the latest product data and to ingest, build, and maintain databases by accessing the EPF. The full export of the EPF is generated weekly and contains iTunes metadata of artists, collections, songs, applications, and videos. Every collection, song, application, and other content in iTunes has a unique iTunes ID to distinguish them from each other. When the content is an application, this iTunes ID is also the application_id. Among them, application metadata define all iPhone and iPod touch applications in the App Store. The metadata contain many elements, such as application_id, title, artist_name, version, etc. application_id is the unique identification of an app, and each app corresponds to an application_id. version is the version number provided by the application developer. user_id is the unique identifier for each user accessing the App Store.

3. Related Work

Many studies have focused on the activities and impacts of crowdturfing from different perspectives. In this section, we present some examples of crowdturfing and application market security.

Crowdturfing. Several researchers proposed methods to detect crowdturfing with a focus on analyzing social networks. Wang et al. [18] studied two popular online crowdturfing platforms in China, analyzing detailed information on the activities offered and performed on these sites. They also revealed the activity and impact of crowdturfing on Sina Weibo and compared the sources and activities of workers in crowdturfing in different countries, but they did not explore the content of workers’ tasks in detail. On the contrary, this is the main focus of our research. In [1], a method to detect target objects for crowdsourcing tasks on Twitter by analyzing the manipulation patterns of crowdturfing customers was proposed. Then, the authors Lee et al. [19] comprehensively analyzed crowdsourcing in Fiverr and Twitter and developed models to detect crowdturfing tasks and malicious crowd workers. The authors of [20] investigated the impact of crowdsourcing activities on Instagram by analyzing the profile characteristics of social media accounts. Liu et al. [9] proposed a semisupervised learning framework capable of codetecting crowdturfing microblogs and spammers by simultaneously modeling user behavior and message content on the Sina Weibo platform. However, some features used in their model, such as retweets and number of followers, are not applicable in the application marketplace, which does not have the function of user retweeting. In our approach, we extract useful features for reviews and users’ information in the App Store.

For online review systems, Kaghazgaran et al. [21] proposed a system to uncover crowdsourced review manipulators that target online review systems. This method distinguished deceptive reviewers through community embedding structure. They used an approach for propagating suspiciousness on the graph, computing a suspiciousness score for each user based on their connections to other users in the graph. We use a community partitioning method that does not require the selection of seeds for random walk propagation at the beginning and has low time complexity for large-scale networks. In [8], the authors presented a new aggregation method, CrowdDet, to identify crowdsourced reviewers by aggregating the factors of reviewers in Review-space and Sociality-space.

Corradini et al. [22] first proposed a multidimensional model to fully investigate negative reviews in Yelp and applied the concepts and techniques of social network analysis to it. This multidimensional social network model can be used to deal with different relationships and can be dynamically selected depending on the analysis to be performed. The concept of k-bridge refers to a user who belongs to exactly k different communities in that social network. And the properties and knowledge patterns that characterize k-bridges are general in social networks and can be used to discover the best targets for a marketing campaign and new products to propose [23]. Due to Yelp’s purely crowdsourced nature, these investigations also provide new perspectives and ideas for exploring crowdsourced activities in online review systems. Cauteruccio et al. [24] used social network analysis to extract and study text patterns present in NSFW adult content on Reddit. The User Content Network can identify virtual communities of users who use the same patterns and may involve users who have never interacted with each other. This can help us understand and analyze the key elements such as information dissemination, cooperation, and decision making in the crowdturfing process through models, as well as understand the way information is disseminated and communicated among them and how collaborative relationships are formed. These approaches can provide different insights for studying crowdturfing.

With the development and massive use of mobile smartphones in recent years, researchers have begun to focus on the application markets. Li et al. [25] initially investigated the use of crowdsourcing platforms to manipulate app reviews and determined that it did exist in app marketplaces but did not propose an effective model that could identify this crowdturfing. In our study, we not only further measure the activity of crowdturfing reviews in the App Store but also combine the extracted features with multiple machine learning methods to obtain an effective detection model. The authors of [2] discovered and measured crowdturfing content hidden behind the UIs of apps and proposed a new triage methodology to identify iOS apps that may contain hidden crowdturfing UIs. They focused more on program analysis and semantic analysis in the UI and how crowdturfing blends into the app markets by hiding in the apps. Differently, we focus on how crowdturfing uses reviews to have an impact on other users and app rankings.

App markets security. Previous research has investigated multiple security issues in the application markets. In contrast to Google Play, applications with security issues in the iTunes App Store have not been explored much. In terms of detecting scam reviews, CrowdApp [26] can automatically identify important and high-quality user feedback from user reviews and check the impact of user reviews on mobile applications, helping developers to get useful information about complaints from reviews. Their focus was to design an app to collect relevant information on mobile devices and combine these data to make judgments about the fairness and accuracy of ratings and reviews. In [27], the authors proposed a deep graph neural-network-based spammer detection model and formulated a graph neural network framework to generate feature expressions for social graphs. And they tested it on two datasets, Twitter and Sina Weibo, to provide a new method to detect spam reviews.

In terms of detecting suspicious apps, Dou et al. [28] investigated the download fraud problem in the mobile app markets, then designed features and trained machine learning models to identify fake downloads and suspicious apps involved in download fraud activities. Differently, we investigate and classify fraudulent reviews and find meaningful patterns of fraud activities of crowdturfing reviewers. Hu et al. [29] detected and characterized the correlations between user comments and market policies and identified violations of apps through user feedback. They focused on finding suspicious applications that violated market policies through user reviews, while we focused more on finding suspicious users based on reviews. The authors of [30] used four parameters that include ratings, reviews, in-app purchases, and Contains ad to predict the probability of an app deceiving consumers and used various models to detect whether the app was fraudulent. However, they did not measure and analyze the detected fraudulent apps, and the machine learning methods used are basically used in this paper.

In terms of detecting fake reviewers, Rathore et al. [31] proposed a framework for detecting fake reviewer groups in app marketplaces and validated the framework in a dataset from the Google Play Store. They used a semisupervised clustering approach to identify groups of fake reviewers, while we used a supervised learning approach and then applied the model to unlabeled data for prediction, which was able to make accurate and effective predictions on the given labeled data. The MHT-GNN [32] framework can help detect click farming fraudsters in the WeChat app by capturing the spatial information and heterogeneity of heterogeneous temporal graphs. This method can effectively block click farming fraud activities. Click farming refers to the manual operation of a large number of virtual accounts. Its purpose is also to cheat the algorithm system to improve the ranking of a web page or application, so as to gain improper benefits, similar to crowdturfing. So, the approach of capturing the dynamics and diversity of data by multiview learning and graph representation learning in MHT-GNN gives some possible ideas to detect crowdturfing.

In this paper, we analyze crowdturfing review strategies and targets. Ground-truth data are obtained by community partitioning and expert classification. The differences between various features of crowdturfing reviews and genuine reviews are identified through these data. Based on these features, we train and compare multiple machine learning methods and evaluate which features are most useful in detection. Compared with the related literature, one of the main contributions of this work is an in-depth analysis of the current crowdturfing situation in the App Store, while also considering the content of posted reviews and the behavioral features of the accounts.

4. Research Method

4.1. Overview

In this section, we provide an overview of the workflow and methods for measuring crowdturfing accounts and reviews in the App Store. On the one hand, such groups of manipulated reviews mostly work in teams to accomplish tasks, and they all belong to the same or similar communities [31,33]. On the other hand, due to the huge amount of data, we need a method to improve the efficiency of the computation. We propose our method based on these facts. Figure 1 shows an overview of the research method. This method can be divided into 5 steps: (1) Collecting and preprocessing review data: We download the review data of applications in the way provided by Apple. Then, we design a preprocessing and filtering method for our dataset to retain useful content for us. (2) Using a sliding window to construct a User Graph: The window slides according to the increasing direction of the date. We created edges between users who posted reviews on the same app within a window. (3) Community detection: After the window finishes sliding, we can obtain a large User Graph structure. Then, we use the Louvain community detection method [34] on the graph. (4) Measurement and analysis: We construct a ground-truth dataset to investigate the difference between various features of crowdturfing workers and genuine users and analyze the impact and effect brought by crowdturfing on the app. (5) Training classifiers and detecting crowdturfing: We extract multiple features to use them for training the classifiers to further identify more crowdturfing reviews. We describe the methods and steps in detail.

4.2. Dataset Description

Our study starts with inspecting the reviews of apps. In order to obtain the reviews, we collect the unique iTunes IDs of apps in advance. An iTunes ID is important for artists who publish their work in iTunes because it identifies one or more works from a single artist.

4.2.1. Data Collection

Collecting IDs through the EPF Relational project. Thanks to the Apple EPF Relational project, we collected complete iOS app IDs compared with those crawling through iTunes Search API [35]. Among the metadata of EPF, the application_id element is the unique iTunes ID for an iOS app, as mentioned in Section 2.2. In order to obtain such IDs, we appled to be an Apple Enterprise Partner and obtained access to the EPF data such that we could download (https://feeds.itunes.apple.com, accessed on 7 July 2023) and parse the EPF data to obtain the IDs. In this way, we collected a total of 170,120 unique application_ids from the App Store.

Collecting reviews according to the IDs. In order to obtain reviews of iOS apps, we inspected how to use the old version (e.g., version 12.6 (https://ipsw.me/iTunes, accessed on 7 July 2023)) of iTunes to access the reviews. Then, we imitated iTunes to obtain the related reviews. We called the iTunes review interface and passed in the ID to specify the app. By using the HTTPS parser to obtain the review data, the following request can obtain the first 100 reviews of Facebook (application_id: 284882215):

https://itunes.apple.com/WebObjects/MZStore.woa/wa/userReviewsRow?id=284882215&displayable-kind=11&startIndex=0&endIndex=100&sort=1&appVersion=current, accessed on 7 July 2023

The request also indicates that we can acquire fresh reviews by setting the “sort” parameter. By taking this request, we crawled reviews daily to supplement our reviews dataset. Finally, we collected 76,473,321 reviews in Chinese from 20 July 2008 to 4 October 2022.

Collecting users according to the reviews. Each review we store in the database contains several fields, such as the body of the review, the date of the review, the user’s name, etc. Since there is a possibility of reusing or changing the user’s name, we chose to utilize the viewUsersUserReviewsUrl field, which is formatted as follows:

https://itunes.apple.com/cn/reviews?userProfileId=106****312, accessed on 7 July 2023

The number connected behind the UserProfileid is the user_id of the user who posted this review, which identifies a single user. Based on this information, we extracted 37,744,151 unique user_ids from the review data.

4.2.2. Data Preprocessing

The first challenge of our work is how to process the review data we obtained. Due to the consumption of time and space and given the size of our data, it was more appropriate to filter first. We focused on filtering based on temporal information and the features of user reviews:

If the reviews are posted before 31 December 2017, we filter them.
Filter reviews whose body length is below the predefined threshold.

The reason for this selection is, first of all, that due to the fast pace of the Internet, reviews that are too old do not provide us with much useful information today. Therefore, we chose to filter out all reviews from 2017 and before. Second, we will generate word embeddings in later work. However, having too few words does not provide us with effective information. For instance, there is a review written like this, “I LOVE IT”, with no relevant context or valuable information to help us determine anything from it.

We then performed a preliminary exploratory analysis of the data to facilitate better feature engineering and model building at a later stage. Table 1 reports the distribution of reviews and the number of different users for each year after filtering the dataset. It helps us to have a general understanding of the distribution of the dataset in terms of years. Next, we observe the data type of each column of the dataset and check to make sure that there are no missing values and no outliers (e.g., if there are ratings out of the range [1,5] or nonintegers).

We also analyzed the distribution of time. Figure 2 plots the time of posting reviews by 3500 randomly selected users over a month. It can be seen that the distribution of pixel points within different dates is roughly similar. Interestingly, we found that the number of user activities decreases significantly between 18:00 and 24:00 each day, meaning that some Chinese users do not often post reviews during this period. The date and time will play a very important role in our later work.

4.3. Problem Definition

To identify crowdturfing reviews and accounts in the app markets, let

G = (V, E, w)

be an undirected weighted graph, where V is the set of user nodes (vertices), E is the set of edges, and w denotes the weight of an edge. Each

v_{i} \in V

is the unique

u s e r_i d

who posts reviews on the app markets. An edge

e \in E

from a user node

v_{i} \in V

to another user node

v_{j} \in V

exists if both

v_{i}

and

v_{j}

post reviews on the same app. The weight w indicates the number of applications coreviewed by users of related nodes. The initial weight of each edge is 1. We also refer to graph G as User Graph. See Figure 3 for a real example.

CopyCatch [36] pioneered the work of clustering the aggregated behavior of users in online social networks. The algorithm is used to detect ill-gotten Page Likes in Facebook by finding suspicious lockstep behavior. It is a clustering method that aims to find groups that act together within the same time window in noisy data by restricting graph traversal, i.e., finding temporally-coherent near bipartite cores (TNBC) with edge constraints. The method can be extended to a bipartite graph of users to products or a general graph of the user to user connections. CopyCatch provides a serial algorithm and a MapReduce implementation and discusses variations in computation time on Hadoop clusters. However, in the serial algorithm, the computation time will be very long if the number of nodes and edges is huge.

There are two main issues here. The first is the large amount of data, and the processing steps are very time-consuming, because they need to combine the users in each app in pairs to form edges. If n users have reviewed an app, the time complexity of constructing a graph for this app is

O (n^{2})

. If there are m apps in total, the time complexity grows to

O (m \times n^{2})

. This would be unrealistic, because in the subsequent steps, CPU-bound and I/O-bound, would limit the speed of data loading and influence the speed of subsequent computations. The second problem is that there are many unnecessary edges. For example, if there is an edge between the user nodes corresponding to reviews

r_{1}

and

r_{2}

, but the two reviews were posted 4 years apart, we call this edge an unnecessary edge. In terms of time and fact, there is essentially no relationship between these two users, but the existence of such edges not only does not provide us with valid information but also adds to the computational overhead.

Based on this information, we need a method that can reduce the calculation time, remove unnecessary edges, and retain information for subsequent data updates. In our work, we decided to use a sliding window to implement our solution, which solves the two problems raised above.

4.4. Sliding Window

The use of a sliding window aims to remove unnecessary edges in the graph and improve computational efficiency. Since a sliding window is effective for finding current trends in the most recent past, and the reviews in our dataset have a time field, we can sort the reviews according to dates. The sliding window will define a date range and only read reviews posted within this range.

We first define some notations for the sliding window that slides with the date. Let W represent the sliding window and

D = {D_{1}, . . ., D_{i}, . . ., D_{n}}

be the set of dates. The dates

D_{i}

are sorted in ascending order, and the window also slides according to the increasing direction of the date. The step width of each slide is t day, and if the window size is s, then the date range covered by W is

[D_{i}, D_{i + s - 1}]

, and each slide will have an

\frac{s - t}{s}

overlap with the previous window. Figure 4 shows a schematic diagram of the sliding window, where reviews are sorted from oldest to newest by date of posting, with the window gradually sliding in the direction of the new date until it reaches the latest date.

To explain how the window works exactly after each slide, we use the reviews of

A P P 1

as an example to describe the workflow of the window sliding twice from date

D_{1}

to

D_{5}

, and then compare it with the process of not using the sliding window. As shown in Figure 5, the process is briefly described as follows:

Window W starts from the date $D_{1}$ , and the current date range is $[D_{1}, D_{3}]$ . There exists a user who posts a review in $D_{1}$ and another user who posts a review in $D_{2}$ . Since both comment on $A P P 1$ , an edge is created between the two vertices. User Graph G currently has 1 edge.
The window slides back one day with a step width of 1, and the current date range is $[D_{2}, D_{4}]$ . There are two new users who post reviews on the new date $D_{4}$ . Thus, edges are generated between new vertices and between old vertices that exist in the current range, i.e., excluding the vertex corresponding to $D_{1}$ that is no longer in the current window date range. User Graph G currently has 4 edges.
The window slides to $D_{5}$ and the current date range is $[D_{3}, D_{5}]$ . New vertices in $D_{5}$ are connected with vertices in $D_{4}$ . User Graph G currently has 6 edges.

Then, keep sliding backward and generating edges according to the rules until W slides to the last date

D_{n}

. If the sliding window is not used, the constructed graph will be a complete graph. After sliding twice, we can construct a graph G with 4 fewer edges than a complete graph. It can be seen that the more times the window slides, the greater the number of reduced edges. Also, if there are more vertices, the number of reduced edges will be greater. According to our estimation, the original need to construct more than 100 million edges can be reduced to tens of millions by this method. Now, the detailed process of sliding and constructing the graph can be depicted in the Algorithm 1.

Algorithm 1: Framework of sliding window and constructing graph

Different window sizes affect the ability to capture trends. In the later phase, we mainly want to obtain some crowdturfing communities. So, in order to set the most appropriate value for this parameter, we used Google to search for relevant search terms and found many paid review providers, such as https://asoworld.com, https://appreviews.ninja, https://proreviewsapp.com accessed on 7 July 2023, etc. By comparing the promotional content on these providers’ websites, we found that the delivery time varies from 24 h to 30 days, and most of them guarantee 7 days of delivery. So, we decide to select a window size of 7 days.

4.5. Detecting Communities

When the window W finishes all the sliding, we can obtain a User Graph G with huge and complex relationships over a period of time. We formulate the problem as a community detection problem on an undirected graph with attributed vertices and edges. By having this user relationship graph, we can implement community detection using the Louvain community detection method, which can obtain disjoint partitions with good quality and a strong colluding relation between users.

In order to improve the utilization efficiency of each vertex and computation speed in the graph, we first obtain the degree of each vertex and the weight of each edge. We find the nodes and edges whose degree and weight exceed the set thresholds. Then, we use the Louvain community detection method. The community detection method is composed of two steps. The first step is to find small communities to locally optimize the modularity. The second step is to aggregate the vertices in each small community and use these aggregated vertices to build a new network. Then, the two steps are iterated until the modularity of the network is maximized.

5. Measurement

In this section, we investigate the crowdturfing in the ground-truth data and compare some of their features with those of genuine users, respectively.

5.1. Ground-Truth Data

Since spammer groups give a good context for judgment and comparisons, it is much easier to label fake review groups than to label individual review spammers [37]. Therefore, we further manually check the results of community partition based on this fact. We invited three experts who are very familiar with the content of crowdturfing to evaluate the validity of the results. Before the evaluation, we developed evaluation criteria for judging crowdturfing communities to reduce the influence of human bias.

After our observation, the partition results are more reasonable when the weight is greater than or equal to 2. With a sliding window size of 7 and a date span of 90 days, we detected a total of 3636 communities (no community with a size equal to 1), from which we randomly selected 727 communities for manual evaluation and labeling. Finally, we obtained the ground-truth data containing 104 crowdturfing communities and 623 genuine communities. If at least 20% of the reviews on an account are judged to be fake, we treat that account as a crowdturfing sender.

5.2. Genuine vs. Crowdturfing Analysis

The goal of this section is to focus on the differences between crowdturfing reviews and genuine users’ reviews. We focus on comparing ratings, ratio of positive reviews and sentiment scores of texts. Crowdturfing is more likely to submit positive reviews.

5.2.1. Star Ratings

The star rating of an app is also one of the key elements in the app marketplace, so we start our analysis with it. Figure 6 shows the distribution of star ratings for crowdturfing reviews and genuine reviews. The two types of reviews show different rating distributions. Genuine reviews have a higher number of ratings with 1 star (36.17%) and 5 stars (35.35%), and the distribution is relatively average. Crowdturfing reviews are mostly concentrated in 5 stars (71.72%). The overall observation is that the ratings of both reviews are positive. This suggests that employers usually use crowdturfing to provide 5 stars reviews, and the goal of hiring crowdsourcing is to improve the application’s rating. Then, crowdturfing accounts will post some non-5-star ratings to make their accounts more like genuine users in order not to be detected by the App Store. It is worth mentioning that during our observation, we found that some genuine users will give 5 stars ratings, but the content of the review is negative, the reason is to let more people see their criticism.

5.2.2. Ratio of Positive Reviews

We are also curious about other features of reviews posted by crowdturfing accounts, so we chose to observe the ratio of positive ratings of users over a period of time. Figure 7 compares the distribution of positive review rates between crowdturfing users and genuine users over 90 days, with the horizontal coordinate representing the value of users’ positive review rates and the vertical coordinate representing the percentage of users with the same value. For genuine users, the values of positive ratios in the range 0–0.5 account for a total of 62.92%, which shows that most of them do not give positive reviews very often. Among them, 30.32% do not give positive reviews at all. The crowdturfing accounts, on the other hand, have an extremely high rate of positive reviews, with values in the range 0.9583–1.0, accounting for 97.34%, which is 67.92% higher than the genuine users’. The observed results correspond to the distribution of ratings in Figure 6. It is evident that the difference in the frequency of posting positive reviews between the two types of users is significant and can be used as one of the reference indicators in detection.

5.2.3. Sentiment Analysis of Review Data

Reviews contain various emotions and sentiment inclinations of users, which mainly divide into three categories: positive, negative, and neutral. We use the SnowNLP library for scoring, where a score of 0.5 indicates a neutral sentence, less than 0.5 represents a negative sentence, and toward 1 represents a positive sentence. By calculating the sentiment scores of each crowdturfing and genuine review, we arrive at Figure 8. We find that the majority of crowdturfing review texts are positive, with 91.42% of scores greater than 0.8. They also use good vocabulary while giving high ratings. Less than 5% of the crowdturfing text data had negative sentiment. This indicates that the tasks received by these crowdsourcing platforms are basically to promote and boost the rating of the application. Most of the genuine users will choose to post reviews only when they feel dissatisfied with the app, as can be seen from the figure which shows that 58.71% of them have an emotional score of less than 0.4.

5.3. Crowdturfing Characteristics Analysis

This section investigates the categories of applications in which crowdturfing is most frequently found and investigates the behavioral characteristics of crowdturfing more deeply through reviews, such as comparing the similarity between reviews and the words often used.

5.3.1. Categories of Apps

To understand which categories of apps prefer to hire crowdturfing or buy reviews, we summarized the information obtained according to 27 categories of apps on the App Store [38]. Figure 9 shows the percentage distribution of the detected 16 categories. The results show that apps belong to the category “Utilities” with more crowdturfing reviews (20.49%), followed by the categories “Education” (12.56%) and “Games” (12.12%). For these categories, there are many other apps of the same category in the market, so there will be competition among them. For these categories, there are many other apps of the same category in the market, so there will be competition among them. In order to stand out from them, they choose to hire crowdturfing services to improve their rankings and ratings, as well as to create the illusion of high usage and downloads. On the other hand, these apps usually include some in-app purchases. If crowdturfing works, the apps will get more profits from genuine users.

5.3.2. Similarity of Reviews

To understand the similarity of app review content, we chose to use the SimText library to calculate the Jaccard similarity of the text. Two sets of results were obtained by calculating the reviews in each crowdturfing account and each app separately. Figure 10a illustrates the similarity grouping results between each two reviews for each crowdturfing account, i.e., self-similarity. The results show that the reviews between each account are mostly dissimilar, with a similarity of less than 0.1 accounting for the majority. The similarity grouping results of reviews in each app are presented in Figure 10b, and about 12.22% of the reviews have a similarity greater than 0.2. This indicates that most of the reviews in a single account are not similar, while there may be duplicate reviews within a single app.

To identify the cause of this situation, we found that these crowdturfing workers will post targeted reviews matching app functionality and will simulate the language characteristics of genuine users to avoid detection and identification. And crowdsourcing platforms sometimes receive tasks with specific requirements, such as the need to include certain keywords in the reviews. Another scenario is when crowdturfing workers receive a task that also provides a template for commenting. In order to complete the task quickly, they choose to copy and paste the template directly or modify parts of the template to get commissions. Therefore, the similarity of the crowdturfing reviews in the app will be higher.

5.3.3. Common Words

Extracting the most frequently occurring keywords in crowdturfing reviews can help us better understand the purpose and patterns behind these groups. An analysis of frequently occurring keywords can reveal trends hidden in the data and can further identify which reviews need to be filtered and deleted. To compare and highlight the characteristics of common keywords in reviews, we randomly select 83,687 genuine reviews and 100,497 crowdturfing reviews for keyword analysis. After removing punctuation and segmenting the sentences, we utilize TF-IDF and TextRank methods to extract text keywords. Comparing the two methods comprehensively, the results are visualized in the two word clouds in Figure 11. For crowdturfing, they tend to use more positive adjectives for apps, such as “worth”, “awesome”, and “satisfied”. While genuine users also express dissatisfaction and describe the problems they encounter when writing reviews, so the keywords will include more specific nouns, such as “ads”, “subscription”, and “error”. So, there are also negative adjectives such as “annoying”, “frustrating”, and “wrong”. Real users value the experience of use more, while crowdturfing is more monotonous praise.

5.3.4. Number of Reviews Provided

After receiving tasks, crowdsourcing platforms will distribute them to workers. As there are usually multiple tasks, accounts of crowdsourcing workers will frequently engage in activities, such as posting a large number of reviews or forwarding the same content in a short period of time. To gain more insight into the frequency of commenting activity on crowdturfing accounts, we count the total number of reviews given by such accounts over a two-year period. The groups in Figure 12 are based on the number of reviews posted by each account. The total number of reviews given by these accounts ranges between 2 and 134. With 47.69% of crowdturfing accounts posting 11 or more reviews, it is clear that they engage in frequent activity. Their frequency of reviewing is once every 11.2 days on average. In contrast, genuine users mostly only post one or two reviews.

5.4. Relations between the Ratio of Crowdturfing Reviews and App Rankings

The purpose of hiring crowdturfing services is usually to improve the ranking and exposure of the item to attract more real users to download and use it. After our investigation, we found that the App Store only shows the top 500 app rankings in the overall ranking or each ranking category. So, we only counted 117 ranked apps that were detected to have more crowdturfing reviews. Figure 13 illustrates the relationship between the percentage of crowdturfing reviews and the application ranking. It can be seen that 49% of the apps ranked in the top 250 have a crowdturfing review ratio greater than 0.5, which shows that a fake and large number of positive reviews can indeed help improve rankings. There are also some apps that rank high but have a low portion of crowdturfing reviews, possibly because these apps themselves have more users and higher rankings, so they no longer need too many crowdsourcing services. Although apps ranked after 250 have also hired many reviews, they have not received high rankings. According to our analysis, the reason for this situation is that application stores usually use a series of algorithms that consider a variety of metrics to determine app rankings, such as user retention rates. Thus, it can be seen that crowdturfing can have a positive effect on rankings in some cases, but this effect is usually limited.

6. Crowdturfing Detection

When given a new review, our goal is to identify what type of review it is. In this section, we investigate whether the information collected can identify crowdturfing reviews. For this purpose, we first introduce the selected features, then investigate their ability to train machine learning classifiers and compare them in order to find the best classifier for this dataset.

6.1. Features Selection

We chose two types of features, user behavior (UB) features and review (RE) features. UB features include the total number of reviews posted by the user and the ratio of positive reviews (4–5 stars). RE features are extracted from the metadata of a review and contain the review text, sentiment score, character length, and rating. Table 2 shows the details of the features. We aimed to find the best combination that can identify crowdturfing reviews the most accurately. We chose to analyze the review text itself by word embedding. Word embedding is a technique for representing natural language in a fixed-dimension vector. Word2Vec [39] can be used to generate such vectors and can measure the predictive ability of a certain word to its surrounding words. So, we ran Word2Vec on each review, mapping the words in the review to a vector describing its semantics.

6.2. Evaluation Metrics

This paper chooses precision, recall, F1-score, and accuracy metrics to measure the performance of classification. The calculation process is defined by the equations as follows, where TP (True Positive) refers to the number of reviews classified correctly by the classifier as crowdturfing reviews, TN (True Negative) is the number of reviews classified correctly as noncrowdturfing reviews, FP (False Positive) is the number of reviews incorrectly labeled as crowdturfing reviews, and FN (False Negative) refers to the number of reviews incorrectly labeled as noncrowdturfing. The larger these values, the better the performance of the classifier.

P r e c i s i o n = \frac{T P}{T P + F P}

(1)

R e c a l l = \frac{T P}{T P + F N}

(2)

F 1 - s c o r e = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

(3)

A c c u r a c y = \frac{T P + T N}{T P + T N + F P + F N}

(4)

6.3. Performance of Classifiers

We describe how to use UB and RE for classification detection. Considering that the textual content of reviews plays an important role in the identification, we select textual content as features in each classifier. We input both UB and RE features into each classifier individually and combined and evaluate the results using precision, recall, F1-score, and accuracy.

To evaluate the performance of our method, we first applied the method on the ground-truth dataset. Table 3 shows the results of the six classification models using three features in the ground-truth data. The results indicate that both the UB and RE features play an important role in classification, and when combining the two features, the classification results are better in most cases. Specifically, both the RF and LR models show the best performance when only the UB features are used. When using only the RE features, the best model is RF. When the features are UB+RE, the LR model has the best results of all evaluations, with 98.13% accuracy, precision, and recall and a 98.11% F1-score.

To further validate the selected model and features, we then extracted another 423 crowdturfing communities with 39,745 accounts from the remaining communities’ data described in Section 5.1. The performance results are shown in Table 4. The trained model can identify crowdturfing reviews with the highest accuracy of 89.44%. After observation, when using the UB features, the evaluation results of the LR model are the best. The RF model is the best when only the RE features are used. When the features are UB+RE, the DT and RF models show the same and the highest accuracy. Among them, only the evaluation results of the KNN model decreased. The differences between the various best models obtained above and the conclusions drawn from the ground-truth data are not significant. The RF model can achieve the best accuracy in some features of both data. So, we picked the best combination of the RF model and UB+RE features. The reason for the decrease in evaluation indicators compared with Table 3 may be that there is a situation in the remaining communities’ data where crowdturfing and genuine users are divided into the same community, which has an impact on the training results.

6.4. Experiments on the Unlabeled Data

To compare the number of crowdturfing reviews detected by different models, we continued to measure the unlabeled data. We obtained the unlabeled community partition data with a sliding window size of 7 and a date span of 90 days. There are 1,441,816 unlabeled reviews in total. We selected the best-performing LR model in Table 3 and the best-performing RF and DT models in Table 4, respectively. Both the UB+RE features are used in the experiments. As depicted in Figure 14, the LR model detected 802,135 crowdturfing reviews. In total, 822,496 and 758,752 crowdturfing reviews were detected by the RF and DT models, respectively. The results show that the RF model outperforms the LR and DT models in terms of the number of detected crowdturfing reviews. On the other hand, the detected crowdturfing reviews can account for up to 57% of the total number of reviews, indicating that the App Store has been invaded by a lot og crowdturfing and the app marketplace environment is being damaged.

6.5. Feature Importance

In prediction projects, measuring feature importance can help improve the efficiency and effectiveness of model predictions, and it can highlight the relevance of features to the target. We chose to perform a feature importance analysis on the RF and DT models, which obtained the highest accuracy in Table 4. Random forest calculates the average of the contributions made by each feature on each tree and finally compares the contribution size among the features. The feature importance of the decision tree is computed using the Gini importance [40]. We use the methods provided by sklearn in Python for evaluation.

Figure 15 shows the percentage of importance of each feature in the two models. For the RF model, the review text (52.6%) and the ratio of positive reviews (17.8%) are the two most important features. In the DT model, the sentiment score (76.8%) and the review text (12.4%) have higher importance. It can be observed that the review text has a high proportion of importance in different models. There is also a significant difference in the ratio of positive reviews and sentiment scores between crowdturfing and genuine reviews, so they have a strong correlation with the target in prediction.

7. Discussion

Crowdturfing spammers constantly change the template and phrasing of fake reviews to evade vetting, such as using abbreviations or emojis instead of words. Groups on other platforms or channels may also have different behaviors and strategies. Therefore, we do not claim that the results in this paper generalize to all crowdturfing workers and accounts. Although we do not currently know all the types of crowdturfing reviews, we can report the results of the measurement based on the types we classified and know. We focus on this dataset to understand the current state of the activity in the Apple App Store to facilitate further research. More types of crowdturfing and their strategies can be identified through the large-scale recruitment of fake promotion reviews. In addition, several machine learning algorithms achieved essentially 90% or higher accuracy when classifying. This suggests that machine learning is applicable to this classification problem and can provide useful assistance.

This paper also reveals the impact of crowdturfing on app ranking during the measurement process, which shows that such manipulation of reviews can disrupt the environment of fair competition in the market and also affect consumers’ trust in developers. Merchants should consciously abide by the rules of the platform, and consumers should also have a basic understanding of false comments and learn to distinguish them in order to prevent losses from falling into traps. The platforms should also use machine learning or deep learning methods to enhance the detection of fake review intrusion to protect the interests of normal merchants and consumers.

The measurements we obtained can serve as a reference for future research on crowdturfing, fake review detection, natural language processing, and similar issues (e.g., click farming). Detecting crowdturfing can reduce the presence of fake or manipulated user reviews and ratings. By identifying and combating crowdturfing, app marketplaces can ensure a more transparent and trustworthy environment for users to make informed decisions about downloading and using applications. This helps maintain fair competition among developers and fosters a more reliable user experience. And, this situation is not only found in application markets but also in shopping, social, and other online platforms. Our method can be applied to other environments to detect such groups and better promote the healthy development of online marketplaces.

In future work, the user-to-user homogeneous graphs built in this paper can be extended to heterogeneous graphs that can contain more information, or add more attributes to the edges (such as timestamps or IP addresses of users, etc.) to build larger social networks. It can also extend the approach to dynamic networks to monitor crowdturfing users in real time. Secondly, social network analysis can provide insights about the network features and behavioral patterns of crowdturfing participants. By analyzing metrics such as network structure, node centers, and social relationships, the cooperation patterns and decision paths between participants can be revealed. Therefore, combined with better social network analysis methods, we will be able to better distinguish between genuine users and provide better classification results.

8. Conclusions

In this paper, we presented the results of measuring crowdturfing in the App Store. During this study, the use of a sliding window and community detection helped us created the ground-truth data, and then a large-scale measurement analysis of crowdturfing was conducted. We showed the features of their reviews, explored their operational targets, and compared them with genuine users. We trained multiple classifiers to identify crowdturfing reviews and achieved an accuracy of up to 98%, which demonstrates that features of reviews and user behavior can be used to detect crowdturfing reviews.

For future work, we plan to test our proposed approach in other application markets and further extend and explore the types of crowdturfing and detection methods. To address the vagaries of crowdturfing reviews, we will collect more crowdturfing review datasets and use their different feature relationships to effectively identify fake reviews in other venues. This will enhance review security and balance competition in application markets, as well as protect the interests of consumers.

Author Contributions

Methodology, Q.H., X.Z. and Z.T.; analysis, Q.H.; investigation, F.L. and S.W.; resources and data, Z.T.; writing—original draft preparation, Q.H.; writing—review and editing, Z.T., F.L. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China, under Grants 61802252.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The metadata are available on the EPF Relational project [17]. The data presented in this study are available from the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

Song, J.; Lee, S.; Kim, J. Crowdtarget: Target-based detection of crowdturfing in online social networks. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 12–16 October 2015; pp. 793–804. [Google Scholar]
Lee, Y.; Wang, X.; Lee, K.; Liao, X.; Wang, X.; Li, T.; Mi, X. Understanding iOS-based Crowdturfing Through Hidden {UI} Analysis. In Proceedings of the 28th {USENIX} Security Symposium ({USENIX} Security 19), Santa Clara, CA, USA, 14–16 August 2019; pp. 765–781. [Google Scholar]
Local Consumer Review Survey. 2023. Available online: https://www.brightlocal.com/research/local-consumer-review-survey/ (accessed on 20 February 2023).
Shihab, M.R.; Putri, A.P. Negative online reviews of popular products: Understanding the effects of review proportion and quality on consumers’ attitude and intention to buy. Electron. Commer. Res. 2019, 19, 159–187. [Google Scholar] [CrossRef]
App Store Review Guidelines: Developer Code of Conduct. 2022. Available online: https://developer.apple.com/app-store/review/guidelines/#code-of-conduct (accessed on 7 March 2023).
Yao, Y.; Viswanath, B.; Cryan, J.; Zheng, H.; Zhao, B.Y. Automated crowdturfing attacks and defenses in online review systems. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 30 October–3 November 2017; pp. 1143–1158. [Google Scholar]
Su, N.; Liu, Y.; Li, Z.; Liu, Y.; Zhang, M.; Ma, S. Detecting Crowdturfing “Add to Favorites” Activities in Online Shopping. In Proceedings of the 2018 World Wide Web Conference, Republic and Canton of Geneva, CHE, Lyon, France, 23–27 April 2018; pp. 1673–1682. [Google Scholar]
Feng, Q.; Zhang, Y.; Kuang, L. Crowdturfing Detection in Online Review System: A Graph-Based Modeling. In Proceedings of the Collaborative Computing: Networking, Applications and Worksharing: 17th EAI International Conference, CollaborateCom 2021, Virtual Event, 16–18 October 2021; pp. 352–369. [Google Scholar]
Liu, B.; Sun, X.; Ni, Z.; Cao, J.; Luo, J.; Liu, B.; Fu, X. Co-Detection of crowdturfing microblogs and spammers in online social networks. World Wide Web 2020, 23, 573–607. [Google Scholar] [CrossRef]
App Store—Apple. Available online: https://www.apple.com/app-store/ (accessed on 1 January 2023).
Xie, Z.; Zhu, S. AppWatcher: Unveiling the underground market of trading mobile app reviews. In Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks, New York, NY, USA, 22–26 June 2015; pp. 1–11. [Google Scholar]
New “Report a Problem” Link on Product Pages. Available online: https://developer.apple.com/news/?id=j5uyprul (accessed on 20 February 2023).
App Store Review Guidelines: Business. Available online: https://developer.apple.com/app-store/review/guidelines/#business (accessed on 1 October 2021).
How to Report a Concern in Google Play and App Store. Available online: https://appfollow.io/blog/how-to-report-a-concern-in-google-play-and-app-store (accessed on 20 February 2023).
How Do I Report Scam Apps. Available online: https://discussions.apple.com/thread/3801852 (accessed on 20 February 2023).
Reporting an Apple Store App That Is a SCAM. Available online: https://discussions.apple.com/thread/253565159?answerId=256677556022#256677556022 (accessed on 20 February 2023).
Enterprise Partner Feed Relational. Available online: https://performance-partners.apple.com/epf (accessed on 1 May 2023).
Wang, G.; Wilson, C.; Zhao, X.; Zhu, Y.; Mohanlal, M.; Zheng, H.; Zhao, B.Y. Serf and turf: Crowdturfing for fun and profit. In Proceedings of the 21st international Conference on World Wide Web, New York, NY, USA, 16–20 April 2012; pp. 679–688. [Google Scholar]
Lee, K.; Webb, S.; Ge, H. Characterizing and automatically detecting crowdturfing in Fiverr and Twitter. Soc. Netw. Anal. Min. 2015, 5, 1–16. [Google Scholar] [CrossRef]
Voronin, G.; Baumann, A.; Lessmann, S. Crowdturfing on Instagram-The Influence of Profile Characteristics on The Engagement of Others. In Proceedings of the Twenty-Sixth European Conference on Information Systems 2016, Portsmouth, UK, 23–28 June 2018. [Google Scholar]
Kaghazgaran, P.; Caverlee, J.; Squicciarini, A. Combating crowdsourced review manipulators: A neighborhood-based approach. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, New York, NY, USA, 5–9 February 2018; pp. 306–314. [Google Scholar]
Corradini, E.; Nocera, A.; Ursino, D.; Virgili, L. Investigating negative reviews and detecting negative influencers in Yelp through a multi-dimensional social network based model. Int. J. Inf. Manag. 2021, 60, 102377. [Google Scholar] [CrossRef]
Corradini, E.; Nocera, A.; Ursino, D.; Virgili, L. Defining and detecting k-bridges in a social network: The yelp case, and more. Knowl.-Based Syst. 2020, 195, 105721. [Google Scholar] [CrossRef]
Cauteruccio, F.; Corradini, E.; Terracina, G.; Ursino, D.; Virgili, L. Extraction and analysis of text patterns from NSFW adult content in Reddit. Data Knowl. Eng. 2022, 138, 101979. [Google Scholar] [CrossRef]
Li, S.; Caverlee, J.; Niu, W.; Kaghazgaran, P. Crowdsourced app review manipulation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 7–11 August 2017; pp. 1137–1140. [Google Scholar]
Saab, F.; Elhajj, I.; Chehab, A.; Kayssi, A. CrowdApp: Crowdsourcing for application rating. In Proceedings of the 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), Doha, Qatar, 10–13 November 2014; pp. 551–556. [Google Scholar]
Guo, Z.; Tang, L.; Guo, T.; Yu, K.; Alazab, M.; Shalaginov, A. Deep graph neural network-based spammer detection under the perspective of heterogeneous cyberspace. Future Gener. Comput. Syst. 2021, 117, 205–218. [Google Scholar] [CrossRef]
Dou, Y.; Li, W.; Liu, Z.; Dong, Z.; Luo, J.; Yu, P.S. Uncovering download fraud activities in mobile app markets. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 671–678. [Google Scholar]
Hu, Y.; Wang, H.; Ji, T.; Xiao, X.; Luo, X.; Gao, P.; Guo, Y. CHAMP: Characterizing Undesired App Behaviors from User Comments based on Market Policies. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Virtual, 25–28 May 2021; pp. 933–945. [Google Scholar]
Joshi, K.; Kumar, S.; Rawat, J.; Kumari, A.; Gupta, A.; Sharma, N. Fraud App Detection of Google Play Store Apps Using Decision Tree. In Proceedings of the 2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM), Pradesh, India, 23–25 February 2022; Volume 2, pp. 243–246. [Google Scholar]
Rathore, P.; Soni, J.; Prabakar, N.; Palaniswami, M.; Santi, P. Identifying groups of fake reviewers using a semisupervised approach. IEEE Trans. Comput. Soc. Syst. 2021, 8, 1369–1378. [Google Scholar] [CrossRef]
Xu, Z.; Sun, Q.; Hu, S.; Qiu, J.; Lin, C.; Li, H. Multi-view Heterogeneous Temporal Graph Neural Network for “Click Farming” Detection. In Proceedings of the PRICAI 2022: Trends in Artificial Intelligence: 19th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2022, Shanghai, China, 10–13 November 2022; pp. 148–160. [Google Scholar]
Li, N.; Du, S.; Zheng, H.; Xue, M.; Zhu, H. Fake reviews tell no tales? dissecting click farming in content-generated social networks. China Commun. 2018, 15, 98–109. [Google Scholar] [CrossRef]
Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef] [Green Version]
Tang, Z.; Tang, K.; Xue, M.; Tian, Y.; Chen, S.; Ikram, M.; Wang, T.; Zhu, H. iOS, Your {OS}, Everybody’s {OS}: Vetting and Analyzing Network Services of iOS Applications. In Proceedings of the 29th {USENIX} Security Symposium ({USENIX} Security 20), Boston, MA, USA, 12–14 August 2020; pp. 2415–2432. [Google Scholar]
Beutel, A.; Xu, W.; Guruswami, V.; Palow, C.; Faloutsos, C. Copycatch: Stopping group attacks by spotting lockstep behavior in social networks. In Proceedings of the 22nd International Conference on World Wide Web, New York, NY, USA, 13–17 May 2013; pp. 119–130. [Google Scholar]
Wang, Z.; Hou, T.; Song, D.; Li, Z.; Kong, T. Detecting review spammer groups via bipartite graph projection. Comput. J. 2016, 59, 861–874. [Google Scholar] [CrossRef]
Compare Categories. Available online: https://developer.apple.com/app-store/categories/ (accessed on 30 March 2023).
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar] [CrossRef]
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]

Figure 1. The workflow of measuring crowdturfing.

Figure 2. A scatter plot of users posting reviews within a month: The x-axis represents the date, ranging from the 1st to the 31st. The y-axis represents the 24 h of the day. Each pixel point corresponds to the specific date and time of a particular user’s review.

Figure 3. An illustration of two graph models: User Relationship Graph indicates users (user1 to user4 and crowdturfing1), APPs, and reviews, respectively. u1 to u4 and ct1 in the User Graph correspond in the structure to the User Relationship Graph. For example, the User Graph of APP3 indicates that both u2 and ct1 have reviewed APP3. Since only one person, user4, has posted a review for APP2, there is no User Graph for APP2.

Figure 4. The process of window sliding: The window size s in the figure is 3 days and the step width t is 1 day.

Figure 5. The pattern of vertices corresponds to the pattern of dates, indicating that a certain user posted a review on that day: The area covered by the blue shadow indicates the vertices that appear within the current window, and edges are established between these vertices. The red dashed lines represent the edges that are not generated compared with the complete graph, and these edges are reduced by this method.

Figure 6. Distribution of star ratings between crowdturfing and genuine reviews.

Figure 7. Distribution of positive review rates between crowdturfing and genuine users.

Figure 8. Comparison of sentiment scores between crowdturfing and genuine reviews.

Figure 9. Distribution of categories of apps that solicit crowdturfing.

Figure 10. Comparison of review similarity: (a) The self-similarity grouping results of each two reviews for each crowdturfing account. (b) The grouping result of similarity of reviews in each application.

Figure 11. Word clouds composed of keywords in two types of reviews.

Figure 12. Number of reviews provided per crowdturfing account.

Figure 13. The relationship between the percentage of crowdturfing reviews and app rankings.

Figure 14. Measurement results of unlabeled data.

Figure 15. Relative importance of features to detect crowdturfing reviews.

Table 1. Overview of the number (#) of reviews and users per year.

	2018	2019	2020	2021	2022
# of reviews	9,250,284	12,189,435	8,091,701	8,979,512	7,135,120
# of users	7,038,175	8,445,133	5,821,240	6,350,432	5,218,392

Table 2. Features selected.

Category	Feature	Type	Example
User behavior (UB)	# reviews (total)	Int	11
	% positive reviews (4–5 stars)	Float	0.5
Review (RE)	Review text	String	Nice.
	Sentiment score	Float	0.6163
	Character length	Int	26
	Rating	Int	5

Table 3. Performance scores of the proposed classifiers on the ground-truth data.

Classifier	Features	Precision	Recall	F1-Score	Accuracy
SVM	UB	0.9394	0.9403	0.9355	0.9403
	RE	0.8952	0.8806	0.8411	0.8806
	UB+RE	0.9091	0.9067	0.8887	0.9067
RF	UB	0.9628	0.9627	0.9608	0.9627
	RE	0.9701	0.9701	0.9691	0.9701
	UB+RE	0.9775	0.9776	0.9771	0.9776
MLP	UB	0.9435	0.9440	0.9437	0.9440
	RE	0.9665	0.9664	0.9650	0.9664
	UB+RE	0.9668	0.9664	0.9666	0.9664
DT	UB	0.9242	0.9254	0.9247	0.9254
	RE	0.9616	0.9590	0.9598	0.9590
	UB+RE	0.9560	0.9515	0.9528	0.9515
LR	UB	0.9624	0.9627	0.9617	0.9627
	RE	0.9550	0.9552	0.9535	0.9552
	UB+RE	0.9813	0.9813	0.9811	0.9813
KNN	UB	0.9560	0.9552	0.9531	0.9552
	RE	0.9296	0.9291	0.9235	0.9291
	UB+RE	0.9332	0.9328	0.9280	0.9328

Table 4. Performance scores of the proposed classifiers on the remaining communities’ data.

Classifier	Features	Precision	Recall	F1-Score	Accuracy
SVM	UB	0.8702	0.8400	0.8244	0.8400
	RE	0.7918	0.6954	0.6017	0.6954
	UB+RE	0.8319	0.7839	0.7506	0.7839
RF	UB	0.8776	0.8627	0.8539	0.8627
	RE	0.8959	0.8814	0.8746	0.8814
	UB+RE	0.9083	0.8944	0.8888	0.8944
MLP	UB	0.8456	0.8473	0.8441	0.8473
	RE	0.8763	0.8733	0.8689	0.8733
	UB+RE	0.8829	0.8838	0.8825	0.8838
DT	UB	0.8317	0.8343	0.8317	0.8343
	RE	0.8425	0.8383	0.8299	0.8383
	UB+RE	0.8961	0.8944	0.8916	0.8944
LR	UB	0.8888	0.8773	0.8705	0.8773
	RE	0.8853	0.8684	0.8594	0.8684
	UB+RE	0.8978	0.8879	0.8824	0.8879
KNN	UB	0.8672	0.8513	0.8405	0.8513
	RE	0.8703	0.8587	0.8499	0.8587
	UB+RE	0.8591	0.8538	0.8466	0.8538

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hu, Q.; Zhang, X.; Li, F.; Tang, Z.; Wang, S. Measuring and Understanding Crowdturfing in the App Store. Information 2023, 14, 393. https://doi.org/10.3390/info14070393

AMA Style

Hu Q, Zhang X, Li F, Tang Z, Wang S. Measuring and Understanding Crowdturfing in the App Store. Information. 2023; 14(7):393. https://doi.org/10.3390/info14070393

Chicago/Turabian Style

Hu, Qinyu, Xiaomei Zhang, Fangqi Li, Zhushou Tang, and Shilin Wang. 2023. "Measuring and Understanding Crowdturfing in the App Store" Information 14, no. 7: 393. https://doi.org/10.3390/info14070393

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Measuring and Understanding Crowdturfing in the App Store

Abstract

1. Introduction

2. Background

2.1. Review System in the App Store

2.2. Metadata Definition

3. Related Work

4. Research Method

4.1. Overview

4.2. Dataset Description

4.2.1. Data Collection

4.2.2. Data Preprocessing

4.3. Problem Definition

4.4. Sliding Window

4.5. Detecting Communities

5. Measurement

5.1. Ground-Truth Data

5.2. Genuine vs. Crowdturfing Analysis

5.2.1. Star Ratings

5.2.2. Ratio of Positive Reviews

5.2.3. Sentiment Analysis of Review Data

5.3. Crowdturfing Characteristics Analysis

5.3.1. Categories of Apps

5.3.2. Similarity of Reviews

5.3.3. Common Words

5.3.4. Number of Reviews Provided

5.4. Relations between the Ratio of Crowdturfing Reviews and App Rankings

6. Crowdturfing Detection

6.1. Features Selection

6.2. Evaluation Metrics

6.3. Performance of Classifiers

6.4. Experiments on the Unlabeled Data

6.5. Feature Importance

7. Discussion

8. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI