Uncovering the Dark Side of Telegram: Fakes, Clones, Scams, and Conspiracy Movements

Telegram is one of the most used instant messaging apps worldwide. Part of its success lies in providing high privacy protection and social network features like channels: virtual rooms in which only the admins can post and broadcast messages to all their subscribers. However, these same features contributed to the emergence of borderline activities and, as is common with Online Social Networks, the heavy presence of fake accounts. Telegram started to address these issues by introducing the verified and scam marks for channels. Unfortunately, the problem is far from being solved. In this work, we perform a large-scale analysis of Telegram by collecting 35,382 different channels and over 130,000,000 messages. We study the channels that Telegram marks as verified or scam, highlighting analogies and differences. Then, we move to the unmarked channels. Here, we find some of the infamous activities also present on privacy-preserving services of the Dark Web, such as carding and the sharing of illegal adult and copyright-protected content. In addition, we identify and analyze two other types of channels: the clones and the fakes. Clones are channels that publish the exact content of another channel to gain subscribers and promote services. Fakes, instead, are channels that attempt to impersonate celebrities or well-known services. Fakes are hard to identify even for the most advanced users. To detect fake channels automatically, we propose a machine learning model that is able to identify them with an accuracy of 86%. Lastly, we study Sabmyk, a conspiracy theory that exploited fakes and clones to spread quickly on the platform, reaching over 1,000,000 users.

1. Introduction

Telegram is likely the most controversial instant messaging platform. On the one hand, it has become extremely popular for its focus on user privacy, with many users moving from WhatsApp to Telegram in early 2021 (fortune), following the update of WhatsApp's privacy policy. On the other hand, as is often the case with platforms that offer greater privacy protection, people abuse it for illegal purposes. Indeed, Telegram has hit the news several times in recent years for the infamous activities run on the platform. In Indonesia, terrorists used Telegram to promote radicalism and give instructions for carrying out attacks (bbcterrorist). Neo-Nazi groups leverage Telegram to share their ideologies (telegramNeoNazi). Crypto investors coordinate large groups through Telegram channels to arrange market manipulations like pump-and-dump frauds (la2020pump). To make matters worse, revenge porn (revengePorn) and channels with child pornography (chlidporn) are not uncommon. All this shows that Telegram is a complex ecosystem with a largely unexplored dark side.

In this paper, we present TGDataset, a collection of more than channels. Analyzing the dataset, we find channels related to porn, carding, incitement to violence, hacking, and white supremacism. We study two particular kinds of channels, the verified and the scam. The former are channels that Telegram has verified as the official channel of a public figure, a creator, or a company. Scam channels, instead, are those that users report to Telegram because the admins arrange frauds or impersonate public figures or, more generally, channels that are better not trusted. We discover that it is possible to reach scam channels in less than 2 hops by following the flow of forwarded messages starting from a verified channel. Thus, unaware users can easily end up in those dangerous channels. Then, we deal with the widespread phenomenon of misleading accounts, which affects online social networks and Telegram channels alike. We define two different kinds of channels, the fakes and the clones. A fake channel is a channel that pretends to be the official one of a celebrity or organization and posts messages different from those of the official one. A clone channel is a channel that mimics an official one by publishing its exact content. Lastly, we carry out a thorough investigation of Sabmyk, a new conspiracy theory that intensively exploited fake and clone channels to become popular.

Our main contributions are the following:

  • TGDataset. We build a new dataset made of channels. To the best of our knowledge, TGDataset is the first collection of Telegram channels that takes a snapshot of the actual Telegram ecosystem instead of focusing on a particular topic. Moreover, we release the dataset publicly to help researchers in further investigations.

  • Fake channels detection. We define the phenomenon of fake channels on Telegram, and we propose a machine learning model able to detect fake channels with an accuracy of . With the proposed model, we detected allegedly fake accounts of which we could confirm .

  • Clone channels analysis. We find 83 clone channels in our dataset. Analyzing them, we discover that the admins of clones exploit the popularity gained to promote cryptocurrencies and ideologies, or to sell goods.

  • Sabmyk spread strategy. We investigate Sabmyk. Analyzing our dataset as a graph, we identify the channels composing the Sabmyk network and analyze their strategy to reach a large audience quickly.

2. Telegram

Telegram is a popular instant messaging platform that launched in 2013 and had more than 500 million active users by 2021 (TelegramUsers). On Telegram, users can share text messages, images, videos, audio, stickers, and files of up to 2 GB. Aside from the standard one-to-one messaging, Telegram provides group chats and channels. Both have a unique username on the platform, a title, and a description, and they can be private or public. While groups allow many-to-many messaging (any member can write) and have a limit of 200,000 members, channels provide one-to-many communication (only admins can post content) and unlimited subscribers. Moreover, channels do not show information about the subscribers, except their total number. Although they serve different purposes, private chats, groups, and channels are not isolated but linked through message forwarding. This functionality allows users and channels to forward content posted in a chat to a different user, group, or channel, showing the author of the original message. In particular, Telegram channels are an effective solution for spreading information to a large pool of people. Indeed, several institutions, public figures, and companies opened an official Telegram channel to broadcast announcements and news (covidTelegram). Likewise, channels that aim to impersonate these official channels, or that leverage Telegram channels and groups to sell fake products or services, have started to appear on the platform. To face this phenomenon, Telegram introduced the verified and the scam marks. Channels, groups, and bots can obtain the verified mark by proving to Telegram that the profile has the verified status on at least two social media platforms (e.g., TikTok, Facebook, Twitter, Instagram) (howVeryfyTelegramCh). Telegram flags a channel or a group as a scam, instead, if several users reported it for fraud (telegramScam).

3. The dataset

In this section, we describe the approach we used to build our dataset, the TGDataset. Then, we explore its main characteristics, focusing primarily on verified, scam, and standard channels. Finally, we investigate the topics covered by the retrieved channels.

3.1. Building the dataset

Existing Telegram datasets are designed for specific studies. Thus, they contain only channels related to a particular topic, language, or country. For instance, (hashemi2019telegram) contains exclusively Iranian channels, while (hoseini2021globalization) only has channels related to QAnon. Finally, the PushShift dataset (baumgartner2020pushshift) focuses on channels related to right-wing extremist politics or cryptocurrencies. Conversely, our work aims to study the Telegram ecosystem broadly to understand how potentially malicious actors behave and which topics are mainly discussed on the platform. Thus, we need a dataset representing an actual snapshot of Telegram and covering many popular and connected channels. For these reasons, we build the TGDataset.

To explore Telegram, and in particular the most popular and connected channels, we start from seed channels covering different topics and expand the dataset by adding, for every forwarded message in the seed channels, the original channel of the message, as previously done in (baumgartner2020pushshift). To select the seed channels, we leverage Tgstat (tgstat), a popular service that indexes more than Telegram channels and collects statistics about them. Tgstat reports for each channel information such as the number of subscribers, category (topic), growth percentage, and language. Among the other statistics, Tgstat reports the rank of the top channels by the number of users. From this rank we retrieve all the categories these channels belong to, finding the following categories: Sales, Humor and entertainment, News and Mass media, Video & Movies, Business & Startups, Cryptocurrencies, Politics, Technologies, Sport, Marketing, Economics, Games, Religion, Software and Applications, Lifehacks, Fashion & Beauty, Medicine, Psychology, and Adults. Then, we select as seeds the most popular channels by the number of subscribers from each category.

Overall, we obtain a total of seed channels. From each seed channel, we download the last messages through the Telethon APIs (telethonAPI), an open-source Python wrapper of the official Telegram APIs. Although a channel can contain more than messages, we decide not to download more than that. Indeed, we empirically notice that messages are a fair trade-off between having a good representation of the channel's content and not abusing Telegram's API. After downloading the data, we parse the messages to discover new channels by analyzing the forwarded messages. Finally, to further expand the TGDataset, we use the newly discovered channels as new seeds and iterate the above-described procedure. After three iterations, more than 50% of the seeds do not contribute anymore since all their forwarding flows are completely explored.
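The iterative expansion described above can be sketched as a breadth-first "snowball" over forwarded messages. This is only an illustration under stated assumptions: `fetch_messages` is a hypothetical callable standing in for the actual download step, the `forwarded_from` key is an invented field name, and the toy `ARCHIVE` replaces real Telegram data.

```python
def snowball(seeds, fetch_messages, max_iterations=3):
    """Expand a set of channels by following forwarded messages.

    fetch_messages(channel) is a hypothetical callable that returns the
    downloaded messages of a channel as dicts; a forwarded message is assumed
    to carry the original channel under the key "forwarded_from".
    """
    known = set(seeds)
    frontier = list(seeds)
    for _ in range(max_iterations):
        discovered = []
        for channel in frontier:
            for message in fetch_messages(channel):
                source = message.get("forwarded_from")
                if source is not None and source not in known:
                    known.add(source)
                    discovered.append(source)
        if not discovered:
            break  # every forwarding flow has been explored
        frontier = discovered  # newly found channels become the next seeds
    return known


# Toy in-memory archive used instead of a real download (illustrative only).
ARCHIVE = {
    "news": [{"text": "hello", "forwarded_from": "sports"}, {"text": "own post"}],
    "sports": [{"text": "goal", "forwarded_from": "fans"}],
    "fans": [{"text": "hi", "forwarded_from": "news"}],
}


def fetch_from_archive(channel):
    return ARCHIVE.get(channel, [])


channels = snowball(["news"], fetch_from_archive)
```

Capping the number of iterations mirrors the observation that later rounds contribute progressively fewer new channels.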

For each channel, we store the following information: the title, the description, the username, the ID, the creation date, the number of subscribers, and whether it is marked as scam or verified. For the messages, we store the author, the timestamp, and, in the case of forwarded messages, the original author, the original posting date, and the original channel. Finally, we store the content of the text-based messages, but only the title and the file format of the media-based messages.

3.2. Dataset overview

Data collection ended on 15 March 2021. With the described approach, we found channels, 1,205 of which were no longer accessible and had been deleted. Overall, the TGDataset is 121 GB in size and contains messages and different channels. Among the channels, () are verified channels and () are scam channels. Manually investigating the scam channels, we found that they are mainly related to trading, cryptocurrencies, and fake accounts of political figures (e.g., Donald Trump, Mike Pompeo, and Ivanka Trump). For the sake of clarity, we will refer to the channels that are neither scam nor verified as standard channels. In the following sections, we analyze these three kinds of channels separately to highlight differences in behavior.

Figure 1. (a) Subscribers: CDF of the number of subscribers for scam, verified, and standard channels. (b) Forwarded messages: CDF of the ratio of forwarded messages for scam, verified, and standard channels.

3.3. Subscribers and Messages

As shown in Fig. 1(a), the most popular of the three kinds of channels are the verified ones. Indeed, these channels represent celebrities or services, and hence it is very likely that they have a large base of subscribers. On average, a verified channel has subscribers, with the most popular channel in this category (Telegram News, the official channel of Telegram) having subscribers, while the smallest one is Russian MFA, the official channel of the Russian Foreign Ministry. Scam channels are the runner-up on this statistic, with an average number of subscribers of . This result is not completely surprising. Indeed, the most popular scam channels attempt to impersonate popular people, and careless users may not recognize that these channels are not official. For instance, the most popular scam channel, with subscribers, claims to represent "Donald J Trump" (https://t.me/trumps). Lastly, we have the standard channels. Among them, the largest channel is HINDI HD MOVIES KGF LATEST, a channel with subscribers for downloading copyrighted films. Nevertheless, standard channels usually have far fewer users: of these channels (the beginning of the knee in Fig. 1(a)) have fewer than subscribers.

Figure 2. (a) Text-based messages and (b) media-based messages: CDFs of the number of text-based and media-based messages posted by the three kinds of channels.

Then, we analyze the number of text-based and media-based messages shared by the three categories of channels. Fig. 2(a) and Fig. 2(b) reveal that verified channels tend to share more messages in general, both text-based and media-based. On average, verified channels post text messages and media content, while scam and standard channels post and text messages and and media, respectively. As displayed in Fig. 1(b), verified and scam channels forward fewer messages than standard ones: on average, verified and scam channels forward and of their messages, respectively, while of the messages posted in standard channels are forwarded. Among the scam channels, the one that forwards the most messages ( messages from different channels) is Hype Royale, a hacking channel for a mobile game.

3.4. The Graph of the Dataset

The dataset can be represented as a directed graph $G = (V, E)$ in which the nodes in $V$ are the channels and an edge $(u, v)$ in $E$ represents the presence in channel $u$ of a message originally posted in $v$ and forwarded to $u$ by the admin of channel $u$. Since the users of channel $u$ can navigate the forwarded message and land on channel $v$, the edge represents in a natural way the possible flow through channels of a user following forwarded messages.

The resulting graph has 7,551 strongly connected components: a giant component of 27,672 nodes (78% of the channels), 139 components of two or more nodes, and 7,412 isolated nodes. The giant component includes 54 seed channels, while 50 seed channels are actually isolated nodes. The main Tgstat categories to which these 50 channels belong are Fashion & Beauty (7), Software and Applications (6), News and Mass media (5), and Humor and entertainment (5). The channel that posts forwarded messages from the largest set of other channels (i.e., the node with the largest out-degree) is Rekt plebs, an entertainment channel that jokes about the losses of unwary cryptocurrency investors; it posts forwarded messages from 3,308 other channels, covering 9% of the channels in the dataset. Instead, the channel whose messages are posted as forwarded messages by the largest set of other channels (i.e., the node with the largest in-degree) is a Russian news channel (Раньше всех. Ну почти.) with in-degree 3,371.
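As a rough illustration of how the forwarding graph and its degree statistics could be derived, the sketch below builds adjacency sets from (forwarding channel, original channel) pairs. The pair format, the function names, and the choice to drop self-forwards are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def build_forward_graph(forwards):
    """Build adjacency sets from (forwarding_channel, original_channel) pairs.

    An edge goes from the channel that posts the forwarded message to the
    channel where the message was originally published.
    """
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for forwarder, original in forwards:
        if forwarder == original:
            continue  # assumption of this sketch: self-forwards are ignored
        out_edges[forwarder].add(original)
        in_edges[original].add(forwarder)
    return out_edges, in_edges


def max_degree_node(edges):
    """Return the node with the largest number of distinct neighbours."""
    return max(edges, key=lambda node: len(edges[node]))


# Toy forwarding log; repeated pairs count once because neighbours are sets.
forwards = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b"), ("d", "c")]
out_edges, in_edges = build_forward_graph(forwards)
top_forwarder = max_degree_node(out_edges)   # largest out-degree
most_forwarded = max_degree_node(in_edges)   # largest in-degree
```

Using sets means the degree counts distinct channels rather than the number of forwarded messages, which matches the "largest set of other channels" wording.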

One intriguing detail to investigate is whether and how seed channels are connected to scam channels. This is interesting since seed channels are the most popular and, thanks to their status, could be considered trusted by the users. Looking at the distance (shortest path) between seed and scam channels, we notice that seed channels are very close to the scam channels. Twelve of the scam channels can be reached with two hops and another dozen with three hops from a seed channel. One of the seed channels, Donald J. Trump, is even itself a scam channel. In the case of verified channels, we find a similar situation. Indeed, 22 out of 26 scam channels can be reached with only two or three hops from verified ones. Furthermore, 116 verified channels ( of the total number) are connected to all 26 scam channels. These results show that a user of a verified channel can very easily navigate to a scam channel.
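Hop distances of this kind can be computed with a multi-source breadth-first search over the forwarding edges. The sketch below is a minimal illustration; the function name and the toy graph are invented for the example.

```python
from collections import deque


def hops_from(sources, out_edges):
    """Multi-source BFS: fewest forwarding hops from any source to each node.

    out_edges maps a channel to the channels whose messages it forwards.
    Nodes that cannot be reached are absent from the result.
    """
    distance = {source: 0 for source in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbour in out_edges.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Toy graph: a verified channel forwards from a blog, which forwards from a scam.
toy_edges = {
    "verified": {"blog"},
    "blog": {"scam"},
    "scam": set(),
    "lonely": {"verified"},
}
distances = hops_from(["verified"], toy_edges)
```

Because edges are directed, a channel that only forwards from the source (here `lonely`) is not reachable from it.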

Scam channels are almost isolated from one another. Sixteen of the 26 scam channels are not connected to any other scam channel. The remaining ten are at an average distance of 5 hops from the closest scam channel, and one of the scam channels is 11 hops away from the nearest. Thus, it is likely that there is no collaboration between the scam channels in our dataset. Verified channels show the opposite behavior. 120 of them () are less than 3 hops away from the closest verified channel, with 58 channels at only 1 hop. Nonetheless, we also find 62 verified channels disconnected from the other ones.

Another interesting aspect concerns which channels are the most influential. These are the nodes that spread information more frequently and faster (kempe2005influential). One of the most popular approaches to identify the influential nodes is to use centrality metrics like PageRank (brin1998anatomy; chen2013identifying). The idea is to define as the most relevant nodes those that have the highest PageRank. Fig. 3(a) shows the CDF of the PageRank values for verified, scam, and standard channels. Verified channels have higher PageRank values. In contrast, standard and scam channels have a similar distribution. This shows that verified channels are the influential nodes within the graph and the main engine of information dissemination within Telegram. Interestingly, these results suggest that PageRank could be a very relevant feature to identify verified channels.
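For readers unfamiliar with the metric, the following is a minimal power-iteration sketch of PageRank on a forwarding graph. The damping factor of 0.85, the fixed number of iterations, and the uniform redistribution of rank from nodes without outgoing edges are conventional choices assumed here; the source does not state which settings were used.

```python
def pagerank(out_edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping node -> set of successors."""
    nodes = set(out_edges)
    for targets in out_edges.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_edges.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in out_edges.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: three channels forward from "official", which forwards nothing.
toy_graph = {
    "a": {"official"},
    "b": {"official"},
    "c": {"official", "a"},
    "official": set(),
}
ranks = pagerank(toy_graph)
most_influential = max(ranks, key=ranks.get)
```

With edges pointing from the forwarding channel to the original one, a channel whose content is forwarded by many others accumulates rank.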

Figure 3. (a) PageRank values: CDF of the PageRank values of scam, verified, and standard channels. (b) Copied messages: CDF of the ratio of copied messages.

3.5. Discovering Topics

In this subsection, we investigate the topics covered by the channels in our TGDataset using Topic Modeling (hofmann2001unsupervised). This is a data mining tool that produces a brief description of the topics addressed by the messages of a channel. For this analysis, we consider only channels that post material in English. We first pre-process the messages by normalizing and cleaning them. Then, to detect the language of the channels, we leverage LangDetect (nakatani2010langdetect), a language detection library implemented by Google with precision over for 53 languages. In this way, we find 7,101 ( of our dataset) channels that we can use as input to the topic modeling.

To discover the latent topics addressed within the channels, we use Latent Dirichlet Allocation (LDA) (blei2003latent) as the Topic Modeling algorithm. LDA relies on the idea that documents are generated by a particular probabilistic model, according to which each document is composed of a mixture of a small number of topics, and each word belongs to one of them. LDA needs the number of topics as input, so we use the UMass measure (mimno2011optimizing) to select the optimal one. UMass is an intrinsic measure of Topic Coherence (stevens2012exploring). It is based on the log-likelihood that two words representing the topic occur in the same documents. In particular, the higher the coherence of the words representing the topics, the closer to 0 the value of UMass. In our case, we compute the UMass value for each number of topics from 10 to 25. The best result is obtained with 14 topics, with a UMass value of -0.97. Tab. 3 in the Appendix shows the inferred topics and the top 10 keywords for each of them.
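To make the coherence measure concrete, the sketch below computes a UMass-style score for one topic from document co-occurrence counts, summing log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words. It is an illustration only: the function name and toy corpus are invented, skipping words that never occur is an assumption of this sketch, and library implementations may aggregate or normalize the pairwise terms differently (for example, by averaging), so absolute values are not comparable with the -0.97 reported above.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: topic keywords ordered from most to least probable.
    documents: iterable of token lists (or sets).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denominator = doc_freq(top_words[l])
            if denominator == 0:
                continue  # assumption: skip words that never appear
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denominator)
    return score


toy_docs = [
    ["card", "bank", "sell"],
    ["card", "bank"],
    ["music", "movie"],
    ["card", "music"],
]
coherent = umass_coherence(["card", "bank"], toy_docs)
incoherent = umass_coherence(["card", "movie"], toy_docs)
```

In the toy corpus, "card" and "bank" co-occur often and score close to 0, while "card" and "movie" never co-occur and score lower.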

Having identified the topics, we group the channels according to the topic. We use the values obtained from LDA as input features to K-means (hartigan1979ak), a classical clustering algorithm based on partitioning.

Figure 4. Telegram service messages.

Tab. 1 reports the topics discovered in the TGDataset and the number of channels associated with each topic. As we can see, the topics that emerge are quite different from those covered by the seed channels (see Sec. 3.1). Indeed, Sport, Marketing, Humor and entertainment, Sales, Business & Startups, Medicine, Psychology, and Fashion & Beauty disappear from our dataset, while some other interesting topics come up. One of these is related to carding. Carding is the practice of selling full details of stolen credit cards or selling prepaid cards or other goods purchased with them. Similarly to what happens in dark web forums (kigerl2020behind), carders (the people who hold the stolen credit cards) use Telegram channels to place gift cards or goods for sale. In the TGDataset, we find 74 channels (1.04% of the English part of the dataset) that offer this service. Another unusual cluster of channels is about violated terms. These are 110 (1.55%) channels whose messages have been obscured by Telegram because they incited violence, published illegal pornographic content, or shared content protected by copyright. In this case, Telegram replaced some or all of the channel's messages with text explaining the reasons for the obfuscation, such as the messages reported in Fig. 4. Interestingly, the channel itself and the metadata (e.g., the posting date) related to the original messages are still available. Despite Telegram's commitment to obfuscating these channels, we still find public channels within our dataset that promote neo-Nazi ideologies or call for violence (e.g., White Aryan Woman, Feuerkrieg Division **OFFICIAL**). Therefore, the problem is still far from being solved. The channels mentioned above belong to the group of channels that we identify as Religion and supremacism, the largest group (3,352 channels) in the TGDataset. In this group, there are channels strictly related to religion and channels that praise the supremacy of the white race.

Intrigued by this surprising mixture of topics, we dig deeper into these channels by reading the shared content. We discover that the supremacist channels in this group make many references to the Christian religion or use religion itself as a motivation for their ideology. Thus, it is likely that the mixed tones used in this kind of channel lead our topic modeling approach to build this peculiar topic. Finally, we find that the other topics are more aligned with the ones covered by the seed channels of the TGDataset.

Topic                          # channels       # scam   # verified
Religion and supremacism       3352 (47.20%)    1        5
News                           1244 (17.52%)    8        30
India news/career              651 (9.17%)      0        4
Adult Content                  564 (7.94%)      0        1
Games hacking                  269 (3.79%)      1        0
Free music/movie               220 (3.10%)      0        2
Software                       208 (2.93%)      0        9
Cryptocurrencies               206 (2.90%)      0        0
Violated terms/pornographic    110 (1.55%)      0        0
Hacking                        103 (1.45%)      0        0
Carding                        74 (1.04%)       1        0
Telephone modding              40 (0.56%)       0        0
Trump supporters               39 (0.55%)       0        6
Games discussion               21 (0.30%)       0        1
Table 1. Discovered topics and number of channels

4. Clone channels

A curious aspect of Telegram is the presence of sets of two or more channels that post identical messages. Clearly, the actual creator of the content is only one in the set, and we refer to it as the original channel. Instead, we call clone channels the channels that publish the exact content of the original one. To understand the reasons behind the creation of a clone channel and how common this phenomenon is, we examine the TGDataset.

To find the clone channels, we compare the messages of each channel with those of all other channels. To reduce the search space significantly, we compare only channels written in the same language. To avoid messages that could be coincidentally identical, we only take into consideration messages longer than 5 words, and we do not consider forwarded messages or messages about violated terms, such as the ones in Fig. 4. Finally, we analyze the distribution of copied messages in our dataset (Fig. 3(b)). As we can see, more than of channels have fewer than identical messages in common with other channels. To find the clones, we restrictively select the tail of the CDF (the orange dot in the figure), which represents the channels with or more identical messages in common with another channel. We consider a channel $c$ a clone of the original channel $o$ if, for each common message, the copy in $c$ has a publication date later than that of the copy in $o$.
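The matching rule can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the input format, the function names, and in particular the `threshold` on the number of shared messages are assumptions (the actual cut-off is not recoverable from the text), and the same-language restriction and the filtering of forwarded or service messages are assumed to have been applied beforehand.

```python
def first_posts(messages, min_words=5):
    """Map each sufficiently long message text to its earliest timestamp."""
    posts = {}
    for text, timestamp in messages:
        if len(text.split()) > min_words:  # keep messages longer than 5 words
            if text not in posts or timestamp < posts[text]:
                posts[text] = timestamp
    return posts


def find_clones(channels, threshold):
    """Return (clone, original) pairs.

    channels: dict mapping a channel name to a list of (text, timestamp).
    threshold: hypothetical minimum number of identical messages.
    A channel is reported as a clone only if every shared message appears
    in it later than in the other channel.
    """
    index = {name: first_posts(msgs) for name, msgs in channels.items()}
    clones = []
    for clone, clone_posts in index.items():
        for original, original_posts in index.items():
            if clone == original:
                continue
            common = clone_posts.keys() & original_posts.keys()
            if len(common) >= threshold and all(
                clone_posts[text] > original_posts[text] for text in common
            ):
                clones.append((clone, original))
    return clones


toy_channels = {
    "orig": [
        ("Bitcoin will reach a new high today", 1),
        ("The market closed lower than expected yesterday", 2),
    ],
    "copy": [
        ("Bitcoin will reach a new high today", 5),
        ("The market closed lower than expected yesterday", 6),
        ("Join now", 7),
    ],
    "other": [("Completely unrelated content about cooking pasta at home", 3)],
}
detected = find_clones(toy_channels, threshold=2)
```

The pairwise comparison is quadratic in the number of channels, which is why restricting comparisons to channels in the same language helps in practice.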

With this approach, we find 83 clone channels. In particular, we find 37 written in English (), 20 in Russian (), and the others in Bulgarian, Farsi, German, Estonian, Hindi, Indonesian, Marathi, and Arabic. Manually investigating the English channels, we find that the target of a clone is often the official channel of a celebrity or service. In particular, 5 clones have a different name from the original channels, but they post all the messages of the original ones. Moreover, they interleave the original messages with links to an external platform to buy goods (e.g., books, microwaves) or links to join other channels. For instance, we find a clone of a cryptocurrency-related channel that promotes another channel that arranges pump-and-dump operations (la2020pump). Another 5 channels clone the official channel of a celebrity and have a name very similar to that of the original. In these clones, we find additional messages with controversial political content such as anti-vaccine campaigns. Then, we find a group of 10 channels cloning channels of politicians close to Donald J. Trump or Republican news channels. In this case, all the messages not taken from the original channels promote a new cryptocurrency called Trump coin.

There are also 2 perfect clones with the same content, title, description, and profile image as another channel. These 2 channels copied the original channel for weeks and then started to post messages about a new conspiracy theory called Sabmyk (see Sec. 6).

Interestingly, we find 4 clones that, like the original channels, post books protected by copyright. We believe that the admin of the clones is the same as the admin of the original channels and uses the clones as a backup of the shared material. If this is the case, this technique appears to be effective. Indeed, checking the original channels a month after we downloaded the data, we found that Telegram had removed the content of the original channels while the clones continued their activity. Finally, for the other clones, we notice nothing suspicious other than their being clones. However, it is important to remark that they are the clones with fewer subscribers (less than 1,000). Thus, they may not have become active yet, or the admin may have stopped the cloning activity, as we found in one case. Looking at their behavior, it is clear that the goal of clone channels is to take advantage of the popularity and content generated by the original channel to gain subscribers and promote other services. This strategy is very effective. Indeed, the average number of subscribers of the clone channels is 28,491.06. The largest clone channel is one of the perfect clones, with over 100,000 subscribers. This is not surprising since, in this case, the clone and the official channel are virtually indistinguishable without knowing the channel's username.

5. Fake channels

As with fake accounts in Online Social Networks (cao2012aiding; xiao2015detecting), fake channels are widespread on Telegram. A fake channel, like a fake account, attempts to impersonate an important service or person. A characteristic of fake channels is that the title is the exact name of the target or a slight variation of it (e.g., with an emoji in the title). Indeed, they attempt to present themselves as official by using words like official, real, and verified or by adding the verified mark to the profile image. Fake channels are different from clone channels since they do not replicate the messages of the original channel. In this section, we first present our machine learning model to detect fake channels. Then, we apply our detector to the TGDataset.

5.1. The Fake Channels dataset

Training a machine learning model able to detect fake channels requires a ground truth: a good number of channels whose status as official or fake we are sure about. Thus, we create the Fake Channels dataset. To build it, we use the following approach. We first leverage the telemeter.io (telemeterIO) service to retrieve a list of verified channels. Then, for each verified channel found, we look for fake channels claiming to be the official one. At the end of the process, the Fake Channels dataset consists of 342 different channels, 184 of which are official and 158 fake. Of course, we discarded clones and channels of the TGDataset from the Fake Channels dataset, so that we can use it as the training set.

5.2. Features and classifier

To build a classifier that detects fake channels, we leverage what we learned in Sec. 3 along with other features that capture differences between official and fake channels in the writing style, references to other channels, and time of activity. We tried several sets of features and classifiers to build our model. In the following, we describe the configuration that achieved the best performance.

Writing style features: average message length, average number of emojis per message, average number of non-alphanumeric characters per message, number of non-alphanumeric characters in the title and description, and average number of non-alphanumeric characters in the channel’s title.

Temporal features: number of text messages published in the last 3, 6, 9 months and average posting time between two consecutive messages.

External interaction features: number of forwarded messages, standard deviation of the number of source channels for the forwarded messages, number of shared links, and number of duplicate messages containing at least one link.
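As an illustration of how a few of the writing-style features listed above might be computed, the sketch below derives them from a channel's title, description, and message texts. The function names and the decision not to count whitespace as a non-alphanumeric character are assumptions of this example; emoji counting and the temporal and interaction features are left out.

```python
def non_alnum_count(text):
    """Count characters that are neither alphanumeric nor whitespace.

    Excluding whitespace is an assumption of this sketch.
    """
    return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())


def writing_style_features(title, description, messages):
    """Compute a subset of the writing-style features for one channel."""
    count = len(messages) or 1  # avoid division by zero for empty channels
    return {
        "avg_message_length": sum(len(m) for m in messages) / count,
        "avg_non_alnum_per_message": sum(non_alnum_count(m) for m in messages) / count,
        "non_alnum_title_description": non_alnum_count(title) + non_alnum_count(description),
    }


# "\u2705" is the check-mark emoji that some fake channels add to their title.
features = writing_style_features(
    "Real Trump \u2705",
    "Official!",
    ["Hello!!", "Hi"],
)
```

Emojis and punctuation both count as non-alphanumeric here, so a title decorated with a check mark raises the corresponding feature.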

We use these features to train a Multilayer Perceptron (MLP) (gardner1998artificial) of linear layers with the Rectified Linear Unit (ReLU) function (hara2015analysis) as the activation function, the Adam optimization algorithm (zhang2018improved) as the optimizer, and binary cross-entropy (BCE) (mannor2005cross) as the loss function. We train the model for 50 epochs and evaluate its performance through 5-fold cross-validation (anguita2012k), achieving an accuracy of and a weighted F1-score of . To further assess our model, we run the detector on the verified and scam channels in our dataset. For this experiment, we select all the 191 verified channels and the scam channels that attempt to impersonate users or services, accounting for channels. Here the model detects as official out of channels and fakes out of . Thus, in this experiment, the model achieves a global weighted F1-score of and an accuracy of .
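For clarity on the two reported metrics, the sketch below computes accuracy and the support-weighted F1-score from lists of true and predicted labels. It is a plain-Python illustration with invented labels and predictions; it is not the evaluation code used for the experiments.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += f1 * count / total
    return score


# Invented example: four official and two fake channels.
y_true = ["official", "official", "official", "official", "fake", "fake"]
y_pred = ["official", "official", "official", "fake", "fake", "official"]
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```

Weighting by support keeps the larger class from being under-represented in the average, which matters when official and fake channels are not perfectly balanced.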

Figure 5. SHAP values of the 5 features that contribute most to model prediction.

5.3. Features analysis

To understand which features contribute most to the performance of our model, we use Shapley Additive Explanations (SHAP) values (lundberg2017unified). SHAP determines the contribution of each feature based on game theory principles and local explanations. Fig. 5 shows the SHAP values of the 5 features that contribute most to the predictions of the model. According to the SHAP values, the 3 most significant features are the number of links posted within a channel, the number of text messages posted in the last 3 months, and the number of non-alphanumeric characters in the title. Interestingly, a high number of links suggests to the model that the channel is an official one. Indeed, analyzing the Fake Channels dataset, we find that official channels tend to post many more social media links (on average 2006.24) than fake channels (on average 682.74). Moreover, a large number of posts published in the last 3 months inclines the model to consider a channel official. The cause could be that some fake channels, unlike official ones, tend to have a short period of activity. In contrast, non-alphanumeric characters in the title lead the model to flag a channel as fake, since several fake channels use emojis (especially the one resembling the verified channel symbol) in their title to attract users.
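To make the game-theoretic idea behind SHAP concrete, the following sketch computes exact Shapley values for a toy scoring function by enumerating all feature subsets. It is only an illustration: SHAP libraries approximate these values for real models, whereas this brute-force version is feasible only for a handful of features. The toy linear model, its weights, and the baseline used to represent an "absent" feature are assumptions introduced for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of `x` relative to `baseline`.

    A feature outside the coalition is replaced by its baseline value.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    """Hypothetical 'officialness' score: links, recent posts, title symbols."""
    links, recent_msgs, title_symbols = z
    return 0.002 * links + 0.01 * recent_msgs - 0.5 * title_symbols
```

For a linear model such as `toy_score`, each Shapley value reduces to the feature weight times the difference between the feature and its baseline, and the values sum to the difference between the prediction and the baseline prediction.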

Surprisingly, during our study, we notice that using the number of subscribers as a feature has a negative effect on the model’s performance. The reason is that the number of subscribers highly depends on the popularity of the target channel. In fact, several fake channels have more subscribers than official channels representing niche services or less famous people. In Sec. 3.4 we saw that verified channels have a very high PageRank value with respect to other kinds of channels. So, PageRank could be a powerful feature to improve the model. However, using PageRank requires knowing the graph of Telegram channels, which is not always possible. For this reason, we do not use PageRank as a feature in our experiment.

5.4. Discovering fake channels

After validating our classifier, we leverage it to detect fake channels on the TGDataset. For this task, we consider only English channels. We ignore channels dealing with carding, hacking, and video games, as they do not have websites from which it is possible to verify their identity. Hence, we collect the channels that have in their title, description or username the words real, official or verified. Indeed, as we said, several fake channels use these words to persuade users that they are official. To further expand the dataset, we consider all the channels that have a similar name (edit distance less than 3) to one of the verified channels. In the end, we collected a set of 502 channels. The classifier returns as fake 198 channels out of 502. Since we do not have a ground truth for this set of channels, we check all of them manually to assess the results. In particular, we consider a channel:

  • Official: if Telegram marked it as verified or there exists an official source (e.g., Website, Facebook, Instagram, Twitter) of the person/service indicating the Telegram channel as the official one.

  • Fake: if there is another channel that we consider official with the same name or an official source states that there is no official Telegram channel.

  • Allegedly fake/official: if our classifier detects the channel as fake/official, but there is no evidence of their status. In particular, we have no channel with the same or a similar name that we consider official and the related official web pages or social media pages do not mention any Telegram channel.
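The candidate selection step described before the labeling criteria (keyword match, or a name within edit distance less than 3 of a verified channel) can be sketched as follows. This is a minimal illustration, not the authors' code: the channel schema, the function names, the lowercase normalization, and the use of substring matching for the keywords (which also matches words such as "really") are assumptions.

```python
KEYWORDS = ("real", "official", "verified")


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_candidate(channel, verified_names, max_distance=3):
    """Return True if the channel should be passed to the classifier.

    `channel` is a dict with hypothetical keys 'title', 'description',
    and 'username'.
    """
    text = " ".join(
        [channel["title"], channel["description"], channel["username"]]
    ).lower()
    if any(word in text for word in KEYWORDS):
        return True
    name = channel["title"].lower()
    return any(
        edit_distance(name, verified.lower()) < max_distance
        for verified in verified_names
    )
```

Comparing every title against every verified name is quadratic, which is acceptable for a few thousand channels but would need indexing at a larger scale.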

After the manual investigation, we mark as fake or official channels out of . In particular, among the channels recognized as fakes by our model, there are fakes, allegedly fakes, and official. Among the channels classified as official, instead, there are actual official channels, allegedly official, and fakes. Thus, for the channels for which we have evidence of their status, our classifier was able to classify channels out of correctly, equivalent to an accuracy of , which is aligned with the results obtained in cross-validation.

Among the channels we verified to be fake, are of political figures from the Republican party, including claiming to be Donald Trump, are of celebrities, mostly actors, and are of news. Interestingly, fake channels, including those of actors and political figures, mainly forward messages about a conspiracy theory called Sabmyk.

6. SABMYK: a new conspiracy theory

Analyzing the fake channels detected by the model described in the previous section, we notice a group of channels related to Sabmyk. This is a conspiracy theory that presents itself as a better alternative to QAnon and promotes a singular quasi-religion centered around a messianic figure known as Sabmyk (independetSabmyk). According to the "HOPE not hate" organization, the Sabmyk network has over a million members distributed across about one hundred Telegram channels (HopeSabmyk). In particular, the mastermind of this operation is a German artist, Sebastian Bieniek, who has previously used social media to publicize his work. Intrigued by the considerable number of members reached by this conspiracy theory, we leverage the TGDataset to investigate how Sabmyk operated and how it built this strong network.

We start by discovering the Telegram channels involved in spreading the Sabmyk theory, leveraging the graph we built in Sec. 3.4 and a community detection algorithm. A community in a graph is a subset of nodes that are densely connected to each other and weakly connected to nodes in other communities. To uncover the Sabmyk community, we used the Leiden algorithm (traag2019louvain). At the end of the process, we find communities, of which one contains all the channels we already know. The community discovered is made up of channels that, through manual investigation, we confirm are related to Sabmyk. Now, analyzing these channels, we examine how the people behind Sabmyk operate.
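The definition of a community given above can be quantified with modularity, a quality function commonly maximized by Louvain- and Leiden-style algorithms. The sketch below only scores a given partition and does not implement the Leiden algorithm itself. Which quality function the authors optimized is not stated here, and treating the graph as undirected and unweighted is a simplifying assumption.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs and `community_of` maps each node
    to a community label. For each community c, the score adds the
    fraction of edges inside c minus the fraction expected at random.
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in c
    degree_sum = defaultdict(int)  # sum of node degrees in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(
        intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )
```

A partition that groups densely connected nodes scores higher than one that puts every node in a single community, which always scores zero.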

6.1. Spreading a Conspiracy Theory

What is surprising about Sabmyk is how quickly it has spread. In a few months, these channels went from zero subscribers to an average of more than , with the biggest channel Great Awakening Channel with . Concerning the temporal aspect, we find that the creation date of the first channel of the network is April 2020 and of the following channels is December 2020. However, it is only in 2021 that most of them appeared on Telegram ( in January and in February).

Attractive topics. The goal is to spread the messages as widely as possible. Thus, the need is to create channels dealing with topics that can attract many subscribers. As seen before, channels of services or public figures attract numerous subscribers, and it is challenging to distinguish an official channel from a fake or a clone one. Sabmyk exploited this idea by creating fake channels of famous people ( of the network), institutional entities (e.g., Department Of Defence, US Navy Channel, US Marines Channel), or news (, e.g., Liverpool Times, London Post, Chicago Reporter). Another technique was to create channels that target specific kinds of users close to the Sabmyk theory. This category includes channels related to QAnon (), far-right (), or other conspiracy theories (e.g., Obama Gate Truth, Chemtrails News). In the Appendix (Tab. 2), we report the complete list of Sabmyk-related channels we discovered.

Reuse of content. Producing content for about channels could be a laborious and time-consuming task. In detail, messages have been shared within the Sabmyk network. However, analyzing these messages, we find that the number of distinct messages is only . Indeed, most of the messages are forwarded multiple times within the network. The most shared messages are an image related to the "Great Awakening Channel" posted times ( of total messages), the invitation link to join the channel of "John F. Kennedy Jr." posted times, and the message asking people to follow and share the Great Awakening Channel, posted times.

(Almost) Random content. Sabmyk intensively reuses its content. Through the analysis of the forwarded messages of each channel, it is possible to note that there is no care about forwarding messages coherent with the channel’s name. This is evident in channels whose names refer to specific domains. For instance, in the channel named "Satoshi Nakamoto Official" (the pseudonym of the inventor of Bitcoin), of the forwarded messages are not related to the Bitcoin or cryptocurrency world. However, the admins pay attention to two aspects. The first concerns the content created by the channel. Indeed, in this case, the content topic is related to the name of the channel. Looking again at the "Satoshi Nakamoto Official" channel, all the content created by this channel is related to the cryptocurrency world. The second aspect is the language of the messages. Indeed, the content written in English is forwarded over the whole network. Instead, messages written in other languages (e.g., Italian, German) are forwarded only in the channels that specifically target users whose native language is the same as the content (e.g., QAnonItaliano, Great Awakening DE). Fig. 6(b) shows the percentage of the network reached by each message. About of messages are shared between and of the network, while almost of messages by nearly the whole network. Of particular interest are the messages that have never been forwarded ( in the figure), accounting for about . Indeed, they are all messages that belong to the initial activity (before the channels start to forward messages) of the network’s clone channels. Finally, the remaining of the messages forwarded by less than of the channels are the ones not written in English.
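The share of the network reached by each message can be computed as the number of distinct channels that forwarded the message divided by the network size. The sketch below illustrates this under assumptions: the `(message_id, channel)` record format and the function name are hypothetical, and messages that were never forwarded are assigned a coverage of zero.

```python
from collections import defaultdict


def network_coverage(forwards, all_message_ids, network_size):
    """Fraction of the network's channels that forwarded each message.

    `forwards` is an iterable of (message_id, channel) pairs; repeated
    forwards of the same message in the same channel count once.
    """
    channels_by_message = defaultdict(set)
    for message_id, channel in forwards:
        channels_by_message[message_id].add(channel)
    return {
        message_id: len(channels_by_message[message_id]) / network_size
        for message_id in all_message_ids
    }
```

A histogram of the returned values gives the kind of distribution summarized in Fig. 6(b).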

High coordination. The admins forward messages in the channels as soon as new content is available. Fig. 6(a) shows the delay in forwarding messages from the creation time of the content. The first forward of a new message happens in of the cases within minutes. This is likely because the content creator also manages other channels and immediately forwards the messages to them. The whole network forwards a new message remarkably quickly. of messages cover the network in just minutes, and more than in the first 24 hours. Since the messages do not cover the whole network simultaneously, we believe that the forwarding is not managed by software or a single person but by many highly coordinated people.
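The first, average, and last forwarding delays of a message can be derived from its creation time and the timestamps of its forwards. The following is a minimal sketch of that computation; the function name, the input format, and the choice of minutes as the unit are assumptions for illustration.

```python
from datetime import datetime


def forwarding_delays(created_at, forwarded_at):
    """Return first, average, and last forwarding delay in minutes.

    `created_at` is the creation time of the original message and
    `forwarded_at` is a list of datetimes, one per forward. Returns
    None when the message was never forwarded.
    """
    delays = sorted(
        (t - created_at).total_seconds() / 60 for t in forwarded_at
    )
    if not delays:
        return None
    return {
        "first": delays[0],
        "average": sum(delays) / len(delays),
        "last": delays[-1],
    }
```

The spread between the first and last delay indicates how long a message takes to cover all the channels that eventually forward it.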

(a) Message forwarding time
(b) Copied messages
Figure 6. (a) Forwarding time of the first, average, and last forwards of messages. (b) Percentage of message sharing within the Sabmyk network.

A core channel. By analyzing the graph of the Sabmyk network as we did in Sec. 3.4, we find that it consists of 2 strongly connected components. One consists of a single node, the channel entitled Sabmyk, and the other contains the remaining channels. Interestingly, the Sabmyk channel is the only one in the network that never forwards a message, whereas the whole network forwards all messages posted in the Sabmyk channel. Therefore, all the channels of the Sabmyk network are at 1 hop from the Sabmyk channel. Thus, it could be quite easy for users who joined one of the network channels to end up in the Sabmyk channel. Conversely, users who joined the Sabmyk channel directly could remain unaware of the rest of the network. Finally, observing the content created, we notice that, although every channel created at least one piece of content, only channels created more than of the messages.
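The structure described above can be reproduced on a toy graph. The sketch below finds strongly connected components with Kosaraju's two-pass algorithm; it is an illustration, not the authors' analysis code. The edge direction (an edge from A to B means that A forwards messages from B) and the toy channel names are assumptions. Under that convention, a channel that never forwards has no outgoing edges and therefore forms a component on its own.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; `graph` maps a node to its successors."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)

    # First pass: record nodes in order of DFS completion.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {n: [] for n in nodes}
    for u, successors in graph.items():
        for v in successors:
            reverse[v].append(u)

    # Second pass: explore the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Toy network: three channels forward each other and all forward "Sabmyk".
toy_network = {
    "a": ["b", "Sabmyk"],
    "b": ["c", "Sabmyk"],
    "c": ["a", "Sabmyk"],
    "Sabmyk": [],
}
```

On `toy_network`, the function returns two components: the single node `Sabmyk` and the set of the three remaining channels.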

7. Related work

There are several works focused on the Telegram ecosystem or emerging research issues related to it. Hashemi et al. (hashemi2019telegram) collect Iranian channels and Iranian groups on Telegram to identify high-quality groups, such as business groups, among low-quality groups (e.g., dating groups). They show that high-quality groups distinguish themselves from low-quality ones by longer messages and more user engagement. Nobari et al. (dargahi2017analysis) present a structural and topical analysis of messages posted on Telegram. In particular, they build a dataset of more than groups or channels and form a graph based on mentions. This study indicates that the PageRank algorithm is not suitable for detecting high-quality channels in Telegram. Jalilvand et al. (jalilvand2020channel) address the problem of finding an ordered list of channels related to a user request in Telegram. Baumgartner et al. (baumgartner2020pushshift) publish a dataset of over thousand channels and messages from million unique users. Their dataset includes a wide range of right-wing extremist groups, as well as protest movements. In their work, Weerasinghe et al. (weerasinghe2020pod) reveal that Telegram hosts several organized groups, called pods, where each member interacts with each other’s content to increase the popularity of their Instagram accounts. Other works (xu2019anatomy; la2020pump) reveal a vast presence on Telegram of channels and groups focused on cryptocurrency pump and dump. This market manipulation involves artificially inflating the price of a cryptocurrency held and then reselling it at a higher price. Finally, several studies focus on the activity of terrorist organizations, like ISIS, that utilize Telegram for disseminating content and recruiting new followers (cao2017dynamical; yayla2017telegram).

8. Conclusions and Future Work

Telegram is becoming more popular every day, both as a classic instant messaging app and as a platform to deliver live updates and content to a large audience. Thus, it becomes increasingly important to understand what happens on the platform and how it will evolve in the future.

In this paper, we present our dataset of Telegram’s channels, the TGDataset, made of more than channels. To the best of our knowledge, this is the first dataset of Telegram channels that attempts to take a general-purpose snapshot of the channels present on Telegram. Thus, we will make it publicly accessible (dataset) to help researchers in further investigations. Starting to scratch the surface of our dataset, we discovered several borderline activities running their business on public channels, as well as channels dealing with ethically dubious content. In our study, we mainly focused on the widespread phenomenon of fake and clone channels. We characterize these kinds of channels and try to understand how their admins take advantage of them. We propose a machine learning model that achieves an accuracy of in detecting fake accounts. Running our detector on a subset of our dataset, we found allegedly fake accounts, of which we could confirm . Given the extent of the phenomenon and the difficulty of distinguishing fake channels from official ones, even for technically knowledgeable users, the need for institutions, famous people, and organizations to obtain the verified status for their channels is on the rise. Indeed, while assessing our results, we notice that only a few official channels leverage this opportunity. Finally, we investigated Sabmyk, a conspiracy theory that exploited fake and clone channels to reach a broad audience.

With this work, we shed light on several dark sides of Telegram. However, we believe further investigations are needed to illuminate the Telegram ecosystem completely. Indeed, in our research, we leverage only text-based messages. By also considering media-based content, we think it is possible to achieve a more refined topic modeling classification. Moreover, we believe Telegram public groups are a vast portion of Telegram and deserve further exploration. Indeed, in public groups it is possible to easily access the complete list of subscribers, which compromises users’ privacy and makes it possible to impersonate administrators to carry out fraud.

Appendix A The Sabmyk network

Table 2. Titles, number of subscribers, and username of the Sabmyk channels. The largest channel of the network is in bold. The URL of a channel can be obtained by prepending the string https://t.me/ to its username (https://t.me/username).

Appendix B Topics discovered within our dataset

| Topic | Top 10 keywords |
| --- | --- |
| Porn | videos, interview, drop, porn, fetish, queen, teen, click, vent, bitch |
| Religion and supremacist | 4chan, jew, jesus, christ, states, hitler, church, lord, proud, identity, antifa |
| Software | php, linux, web, vulnerability, item, readhacker, scan, software, privacy, hours |
| Carding | trick, iphone, apple, carding, amazon, method, samsung, payment, cracking, prime |
| Telephone modding | xda, feedproxy, appeared, android, galaxy, samsung, developer, download, pixel, oneplus |
| Hacking | premium, proxy, mod, click, hack, server, netflix, port, log, apk |
| Cryptocurrencies | reddit, bitcoin, btc, ift, crypto, rss, eth, xrp, stock, exchange |
| Games hacking | root, pubg, password, hack, mod, aimbot, antiban, esp, mobile, recoil |
| India news/career | india, upsc, minister, affairs, foreign, exam, promotion, articleshow, express, hindu |
| Trump supporters | tweet, donald, realdonaldtrump, joebiden, zerohedge, hashtag, tribunal, dbongino, wire, february, shapiro |
| News | biden, coronavirus, vaccine, reuters, election, donald, court, pandemic, capitol, lockdown |
| Free music/movie | download, 720p, genre, artist, audio, tumblr, title, album, size, spotify |
| Violated terms/pornographic | displayed, violated, terms, pornographic, post, placed, rockwell, infringement, unavailable, copyright |
| Games discussion | castle, defense, gold, lvl, protected, points, fox, successfully, potato, pouch |
Table 3. Top 10 keywords within LDA topics.