Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform

June 15, 2020 · Derek Lim et al. · Cornell University

Many platforms collect crowdsourced information primarily from volunteers. As this type of knowledge curation has become widespread, contribution formats vary substantially and are driven by diverse processes across differing platforms. Thus, models for one platform are not necessarily applicable to others. Here, we study the temporal dynamics of Genius, a platform primarily designed for user-contributed annotations of song lyrics. A unique aspect of Genius is that the annotations are extremely local – an annotated lyric may just be a few lines of a song – but also highly related, e.g., by song, album, artist, or genre. We analyze several dynamical processes associated with lyric annotations and their edits, which differ substantially from models for other platforms. For example, expertise on song annotations follows a “U shape” where experts are both early and late contributors with non-experts contributing intermediately; we develop a user utility model that captures such behavior. We also find several contribution traits appearing early in a user's lifespan of contributions that distinguish (eventual) experts from non-experts. Combining our findings, we develop a model for early prediction of user expertise.


1. Crowdsourced Lyric Annotation

“Lookin’ around and all I see
Is a big crowd that’s product of me”

— Kendrick Lamar, A.D.H.D

Note: See https://github.com/cptq/genius-expertise for the dataset and code to reproduce experiments.

Online platforms for crowdsourced information such as Wikipedia and Stack Exchange provide massive amounts of diverse information to people across the world. While different platforms have varying information structure and goals, they share a fundamental similarity: the source of the content is a community of users who contribute their time and expertise to curate knowledge. Understanding the users that enable the success of these platforms is vital to ensure the continual expansion of their utility to the general public, but doing so requires special consideration of the particular structure and form of information that users contribute.

The activity and expertise distribution amongst users on these crowdsourced information platforms is often heavy-tailed, and there is substantial effort in both understanding the sets of users that make meaningful and/or voluminous contributions (Pal et al., 2011; Movshovitz-Attias et al., 2013), and how contributions change over time (Anderson et al., 2012). One example problem is expertise detection (Zhang et al., 2007). On platforms such as Quora and the TurboTax live community, experts are not clearly defined but longitudinal data can help identify experts (Pal et al., 2011; Patil and Lee, 2015). In contrast, Stack Exchange users have explicit reputation scores accumulated over time, and there is a focus on early determination of user expertise, where ex ante predictions leverage user behavior and their temporal dynamics (van Dijk et al., 2015; Pal et al., 2012).

Here, we take expertise detection and characterization as test problems to study the temporal dynamics of user contributions on Genius (genius.com), a platform primarily for crowdsourced lyric transcriptions and annotations of songs. Genius was launched as Rap Genius (after a brief stint as Rap Exegesis) in 2009, focusing on the lyrical interpretation of English rap songs (https://genius.com/Genius-about-genius-annotated). Since then, Genius has grown to encompass lyrics in many genres and languages, and alexa.com ranks the web site 408th in terms of global engagement as of May 2020 (https://alexa.com/siteinfo/genius.com).

Figure 1. Screenshot of an annotation of the lyrics: “I’m livin’ big, I swear to God I’m Liu Kang kickin’” in Young Thug’s song Digits. (https://genius.com/8869644). (Left) Most recent state of the annotation as of May 2020. (Right) Two edits of the annotation. The bottom version is the third edit and the top is the fourth edit. This annotation has contributions from high-IQ users and contains rich structure such as hyperlinks, a quote, and an image — all examples of what we call quality tags (Section 3.1).

On Genius, lyrics are originally sourced from users who work together to transcribe songs. After, users annotate the lyrics for interpretation, and the annotations are edited over time (Fig. 1). The structure of annotations on Genius differs substantially from that of contributions on other large and popular crowdsourced platforms such as Stack Exchange, Quora, and Wikipedia. For one, Genius has no explicit question-and-answer format. Instead, the transcription of newly released songs offers an implicit question generation mechanism, namely, what is the meaning of these lyrics? Similar to Wikipedia, annotations are continually edited to provide a singular authoritative narrative. However, annotations are extremely localized in the song; an annotated lyric may be just a couple of bars in a rap song or the chorus of a folk song. Still, annotations are related within lyrics of the same song, album, artist, or genre. Finally, the content is much less formal than Wikipedia; annotations often contain slang, jokes, and profanity. The unique structure of Genius leads to user dynamics that do not align with existing models for other crowdsourced information sites.

We focus much of our analysis on user IQ — Genius’s aggregate measure of user experience and activity that is analogous to reputation on Stack Exchange — which also serves as a proxy for user expertise. While tools from the study of other crowdsourced information sites apply to user behavior and IQ on Genius, the structure of annotations also necessitates specialized metrics. Thus, we define metrics of annotation quality, coverage of lyrics by annotations, and lyric originality, which help uncover key characteristics of user contribution behavior that distinguish experts.

We find several patterns in the temporal dynamics of annotations and subsequent edits on Genius that have not been observed in prior studies of other crowdsourced information platforms. One distinguishing set of characteristics is the “U-shaped” relationships in annotations made on a song over time. For example, early annotations on a song are made by experienced users, intermediate annotations are made by less-experienced users, and the most recent annotations are again made by experienced users. The quality of annotations and originality of annotated lyrics also largely follow this pattern — the earliest and latest annotations are on average of higher quality and on more original lyrics. We conceptualize this through an IQ diamond, in which an annotation travels from top to bottom, being considered by experienced users in the narrow top and bottom, and by the bulk of less-experienced users in the wider middle.

Our IQ diamond model for annotation dynamics contrasts sharply with answer arrival dynamics on Stack Overflow. For example, Anderson et al. (2012) described a “reputation pyramid” model of user organization in which questions are first answered by higher-reputation users and then by lower-reputation users. This reputation pyramid does not carry over to Genius annotations, as it does not explain the increase in IQ for later song annotations. Furthermore, the model does not agree with the editing dynamics on Genius; later edits on an annotation tend to be made by more experienced users and increase the quality of the annotation.

To explain the materialization of the IQ diamond and to further understand latent factors inducing user annotation patterns, we develop a model of user utility based on network effects and congestion. In this model, users of different experience levels gain different utility from creating new annotations on a song, given the fraction of lyrics that are already annotated. Fitting simple parametric functions for network effects and congestion matches empirical differences in user behavior between high- and low-IQ users.

Similar to studies on Stack Overflow (Movshovitz-Attias et al., 2013) and various rating sites (Danescu-Niculescu-Mizil et al., 2013; McAuley and Leskovec, 2013), we also analyze user evolution stratified by eventual IQ levels. We find inherent traits of eventual experts visible in their early contributions: even in their first annotations on the site, eventual experts create higher quality annotations, are more often an early annotator on a song, are more often an early editor of an annotation, and annotate lyrics that are more original. We use these features to design a simple discriminative model that successfully makes early predictions of eventual super experts — users with very high IQ on the site.

1.1. Additional Related Work

The basic statistics of Genius have been analyzed and the trustworthiness of annotations has been modelled (Al Qundus, 2018; Al Qundus and Paschke, 2018). More broadly, the platform is used for African American literature courses (Rambsy, 2018) and for understanding music history (Dawson, 2018). In this paper, we analyze Genius in the context of other well-studied crowdsourced information sites, such as Stack Overflow (Ravi et al., 2014; Posnett et al., 2012; Tian et al., 2013), Quora (Wang et al., 2013; Maity et al., 2015), Yahoo Answers (Adamic et al., 2008), and Wikipedia (Beschastnikh et al., 2008; Mesgari et al., 2015). The temporal dynamics of user activity on such sites have been studied in several contexts (Anderson et al., 2012; Jurgens and Lu, 2012; Paranjape et al., 2017; Patil and Lee, 2015; Almeida et al., 2007).

Expertise plays a central role in the study of crowdsourced information. For example, employee-labeled “superusers” have been studied in the TurboTax live community (Pal et al., 2011), and Quora’s manually identified “Top Writers” can be identified by temporal behavior (Patil and Lee, 2015). On Stack Overflow, early user activity and textual features have been used for predicting eventual experts (Movshovitz-Attias et al., 2013; van Dijk et al., 2015; Pal et al., 2012).

2. Data Description

We scraped data directly from genius.com with crawls running from September 2019 to January 2020. We refer to any page with lyrics as a song, even though a small fraction of these pages correspond to other content such as transcriptions of podcasts or parts of a movie. Users create annotations on specific lyrics within a song (often, an annotated lyric may be just a few lines). After an annotation is made, users can create an edit for an annotation, and the most recent version of the annotation is displayed on the web page.

We collected lyrics, annotations, and edits from 223,257 songs. Of these, 33,543 songs have at least one annotation. In total, we collected 322,613 annotations and 869,763 edits made by 65,378 users. For each annotation, we have the complete edit history and content. We also have timestamps on every annotation and edit. Annotation and edit content consists of text and HTML.

Figure 2. (Left) Distribution of annotation counts of users. (Right) Distribution of annotation counts on songs.

For each user, we have every annotation and edit that they have made on the collected songs. We also recorded the IQ of each user, which, as mentioned in the introduction, is a quantity analogous to reputation on Stack Exchange. IQ is an aggregate measure that accounts for various contributions on Genius, such as writing annotations and transcribing songs. A user accumulates more IQ if their annotation earns upvotes from other users (https://genius.com/10606101). We find that the distribution of the number of annotations made by a user and the number of annotations on a song are heavy tailed (Fig. 2).

In addition to annotations and edits, we also collected other Genius user actions such as suggestions, questions, answers, comments, and transcriptions. We say that a user has contributed to a song if they have recorded some interaction with the song.

3. Metrics for Annotations

The annotations that users create and edit on Genius are a unique form of crowd contribution. In order to study user behavior and dynamics, we first define metrics for annotation related to quality, coverage, and originality.

3.1. Annotation Quality

To better understand the generation of content on Genius, we would like to quantify the quality of an annotation. We find that an effective proxy for quality is simply the number of certain HTML tags that indicate rich content creation. Specifically, we consider the following quality tags: <a>, <img>, <iframe>, <blockquote>, <twitter-widget>, <ul>, <ol>, and <embedly-embed>. The annotation in Figure 1 has three unique quality tags: <blockquote>, <img>, and <a>.
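To make this metric concrete, the following sketch counts quality tags in an annotation’s HTML body. It assumes the annotation content is available as a raw HTML string; the function names and the use of BeautifulSoup are illustrative choices rather than a description of our exact pipeline.

```python
# Sketch: count quality tags in one annotation's HTML content.
from bs4 import BeautifulSoup

QUALITY_TAGS = {"a", "img", "iframe", "blockquote",
                "twitter-widget", "ul", "ol", "embedly-embed"}

def count_quality_tags(annotation_html: str) -> int:
    """Total number of quality-tag elements in the annotation."""
    soup = BeautifulSoup(annotation_html, "html.parser")
    return sum(len(soup.find_all(tag)) for tag in QUALITY_TAGS)

def count_unique_quality_tags(annotation_html: str) -> int:
    """Number of distinct quality tags present (e.g., 3 for the annotation in Fig. 1)."""
    soup = BeautifulSoup(annotation_html, "html.parser")
    return sum(1 for tag in QUALITY_TAGS if soup.find(tag) is not None)
```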

Many of these quality tags are associated with the quality of user-generated content on other sites. For instance, featured articles on Wikipedia have more links and images than other articles (Stvilia et al., 2005), the probability of answer acceptance on Stack Overflow positively correlates with the number of URLs in the answer (Calefato et al., 2015), and the probability of retweeting on Twitter positively correlates with the presence of URLs in the tweet (Suh et al., 2010). Later, we will show that higher quality tag counts in annotations distinguish high-IQ users, even in their earliest annotations on the site.

We also measure annotation quality by length, i.e., the number of characters in the content. This has been an effective metric for quality of content in Wikipedia, Yahoo Answers, and Stack Exchange (Blumenstock, 2008; Stvilia et al., 2005; Adamic et al., 2008; Gkotsis et al., 2014). In later sections, we detail nuances in using annotation length as a proxy for annotation quality, as we find evidence that possible time-pressure to annotate early may cause the first annotations on a song to be shorter.

3.2. Annotation Coverage

Figure 3. Mean annotation coverage as a function of number of contributing users on a song (left) and number of page views on a song (right), computed over songs with at least one annotation. Increased popularity with site contributors and visitors correlates with higher coverage. (Note: Genius only lists view counts for songs with at least 5,000 views).

The extent to which crowdsourced contributions satisfy the needs of information seekers is vital to a platform’s success. To this end, we consider coverage, or the amount of the information sought by visitors that is actually present on the site. Various notions of coverage have been used for analyzing Stack Overflow discussions (Parnin et al., 2012), accuracy on Wikipedia (Mesgari et al., 2015; Giles, 2005), and topical concerns of Wikipedia (Samoilenko and Yasseri, 2014; Brown, 2011).

As the fundamental function of Genius is to provide annotations on documents, we study coverage of lyrics by annotations. While some lyrics may be fillers or seemingly lack meaning, there is often still potential for interesting annotations. For example, annotations on such lyrics may provide references to other similar lyrics, references to related external social media content, or historical context.

Starting from all the lyrics of a song, we compute its annotation coverage as follows. First, we remove lyrical headers (e.g., “[Verse 1: Kanye West]” or “[Refrain]”). This leaves T total text characters available for coverage. Next, let A be the union of all parts of the lyrics for which there is an annotation. Then the annotation coverage for the song is |A| / T, the fraction of available characters covered by at least one annotation. We find a positive relationship between annotation coverage and both the number of users contributing to the song and the number of views of a song (Fig. 3).
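The computation can be sketched as follows, assuming the lyrics and the annotated lyric fragments are available as plain strings; the header-stripping regular expression and the character-mask treatment of overlapping fragments are illustrative choices.

```python
# Sketch: annotation coverage of a song, as defined above.
import re

HEADER_RE = re.compile(r"\[[^\]]*\]")  # e.g., "[Verse 1: Kanye West]", "[Refrain]"

def annotation_coverage(lyrics: str, annotated_fragments: list[str]) -> float:
    body = HEADER_RE.sub("", lyrics)          # remove lyrical headers
    total_chars = len(body)                   # T: characters available for coverage
    if total_chars == 0:
        return 0.0
    covered = [False] * total_chars           # character mask for the union A
    for fragment in annotated_fragments:
        fragment = HEADER_RE.sub("", fragment)
        start = body.find(fragment)
        if start >= 0:
            for i in range(start, start + len(fragment)):
                covered[i] = True
    return sum(covered) / total_chars         # |A| / T
```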

3.3. Lyric Originality

Figure 4. Mean annotation coverage over songs as a function of number of page views, stratified by songs in the upper third of originality (blue), and songs in the lower third of originality (red). Songs with more original lyrics tend to have higher annotation coverage.

We find that annotation coverage also depends on the originality of the lyrics of a song. To measure originality, we first compute inverse document frequencies (idfs), where documents are songs, i.e., idf(w) = log(N / n_w), where N is the total number of songs and n_w is the number of songs containing word w. Following ideas from Ellis et al. (2015), we define the originality of a lyric ℓ appearing in a song as the following L-estimator:

originality(ℓ) = (1 / |P|) · Σ_{p ∈ P} Q_p(ℓ),   (1)

where Q_p(ℓ) denotes the pth percentile of the idf values of the unique words in ℓ and P is a small set of large percentiles. We use large percentiles as such words are of interest to web site visitors and annotators, and many words in song lyrics may be fillers for aesthetic reasons (e.g., leading to a rhyme). Only computing percentiles over unique words prevents long, repetitive songs from achieving high originality scores. We find that more original songs tend to have higher annotation coverage (Fig. 4).
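The score can be sketched as below. The particular percentile set (90th, 95th, and 99th here), the tokenizer, and the function names are illustrative assumptions; Eq. (1) only requires a small set of large percentiles.

```python
# Sketch: idf over songs and the lyric-originality L-estimator of Eq. (1).
import math
import re
from collections import Counter

import numpy as np

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def build_idf(songs: list[str]) -> dict[str, float]:
    """idf(w) = log(N / n_w) with N songs and n_w songs containing word w."""
    n_songs = len(songs)
    doc_freq = Counter()
    for lyrics in songs:
        doc_freq.update(set(tokenize(lyrics)))
    return {w: math.log(n_songs / n_w) for w, n_w in doc_freq.items()}

def originality(lyric: str, idf: dict[str, float],
                percentiles=(90, 95, 99)) -> float:
    """Average of large percentiles of idf values over the lyric's unique words."""
    idfs = [idf[w] for w in set(tokenize(lyric)) if w in idf]
    if not idfs:
        return 0.0
    return float(np.mean([np.percentile(idfs, p) for p in percentiles]))
```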

4. Dynamics of Annotations and Edits

Here, we investigate relationships involving the temporal order of annotations on a song or of edits on an annotation. We define the time rank of an annotation on a song as the numerical position in which it was created; for example, the third earliest annotation on a song has a time rank of 3. Similarly, the proportional time rank of an annotation with time rank r on a song with at least two annotations is given by (r − 1) / (n − 1), where n is the number of annotations on the song; this normalization to [0, 1] allows for comparison of annotations with similar relative positions in a song’s lifespan across songs with different numbers of annotations. We use analogous definitions for edits on an annotation, considering the first edit on an annotation to be the creation of the annotation itself.
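For concreteness, a minimal sketch of these quantities, computed from the creation timestamps of the annotations on one song (field names are illustrative):

```python
# Sketch: time ranks and proportional time ranks for one song's annotations.
def proportional_time_ranks(timestamps: list[float]) -> list[float]:
    """Map each annotation to (time rank - 1) / (n - 1); requires n >= 2."""
    n = len(timestamps)
    order = sorted(range(n), key=lambda i: timestamps[i])  # indices, earliest first
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank                                  # time rank of each annotation
    return [(r - 1) / (n - 1) for r in ranks]
```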

Below, we analyze how temporal orders relate to both contributing users and to the content itself. One of our main findings is that user experience and measures of annotation quality exhibit “U-shaped” patterns with respect to the proportional time rank of annotations. In other words, more experienced users and higher-quality content appear both early and late in time, with less experienced users and lower quality content in the intermediary. We develop an “IQ diamond” model of user behavior based on ideas of economic utility. In contrast, for edits, experience and quality grow over time.

4.1. Dynamics of Annotations

First, we analyze temporal dynamics of annotation creation. Measuring the IQ of a user making an annotation as a function of the proportional time rank (Fig. 5, top left), we find the aforementioned “U shape”. Early annotations on a song are made by high-IQ users, and then the mean IQ decreases monotonically with proportional time rank until around half the total annotations have been made. After these middle annotations, the mean IQ increases monotonically. This differs from other platforms such as Stack Overflow, where there is a monotonic decrease in user reputation over time (Anderson et al., 2012). We also find that the number of total annotations made by a user follows the same trend (Fig. 5, top right). Thus, the “U shape” is present whether we measure user experience by an aggregate measure such as IQ or simply by the total number of annotations the user has made.

Figure 5. Various user and content statistics as a function of the proportional time rank of an annotation. Mean IQ (top left), total number of annotations made by the annotating user (top right), and number of quality tags (bottom left) follow a “U shape” with respect to proportional time rank.

We also consider annotation quality as a function of proportional time rank (Fig. 5, bottom row). Here, the number of quality tags follows the same “U shape”. Thus, not only are the earlier and later annotations on songs made by more experienced users, but in fact they are also of higher quality under this metric. However, our other quality metric — annotation length — largely increases with proportional time rank. A possible contributing factor is that early annotators may feel time pressure to annotate lyrics before others, incentivizing shorter annotations that are faster to create. This trend indicates that annotation length measures other factors beyond annotation quality. Such time-pressure may also explain the somewhat lower number of quality tags for the earliest annotations compared to the latest annotations.

Figure 6. Lyric originality as a function of annotation proportional time rank. Apart from the first annotation, originality follows a “U shape”.

Finally, we consider how lyric originality relates to temporal annotation ordering. Again, for the most part, we see the familiar “U shape” (Fig. 6), indicating that the complex lyrics are annotated both at the beginning and end. However, the very first annotations tend to be on lyrics with lower originality. This may be due to users’ reluctance to annotate before there are any other annotations, leading them to annotate simpler lyrics to get the ball rolling.

4.2. The IQ Diamond

At a high level, annotation dynamics on Genius appear to be markedly different from Stack Overflow. Anderson et al. (2012) observed that user reputation decreases with later answers to a given question on Stack Overflow. From this, they develop a “reputation pyramid” model of user behavior, where new questions are first considered by high-reputation or experienced users before being considered by users with less expertise. On Genius, we see the same initial descent but then a curious ascent producing the “U shape”. On average, early annotators are experienced users who quickly make annotations of high quality and on more novel lyrics. After, users with less experience make annotations of typically lower quality on lyrics that are less novel. Finally, the late annotations are again made by experienced users on the remaining lyrics, which tend to be more original. This behavior suggests an IQ diamond model for Genius, in which song lyrics are first processed by high-IQ users who form the narrow top of the diamond, then the song opens to a broader set of users who form the wide middle of the diamond, and finally narrows to the experienced users once again.

To explain the IQ diamond pattern, we develop a model of utility for user annotation. The utility of annotating at any given time depends on the proportion of the song that is already annotated at that time. More annotation coverage impacts utility both positively and negatively — one can gain more IQ by annotating songs with more activity, but higher coverage limits the choice of lyrics that a user may annotate. Thus, the users’ utility functions may be modelled similarly to the utility of services with both (positive) network effects and (negative) congestion effects (Johari and Kumar, 2009). Related models have been employed for users on crowdsourcing systems (Chen et al., 2019).

More specifically, fix a song, and consider a population of n users labeled 1 to n. Let x be the vector whose ith entry x_i is the annotation coverage of the song by user i’s annotations, so x_i ≥ 0 and Σ_i x_i ≤ 1. Here we will refer to an annotation as an infinitesimal increase in coverage. Writing c_{-i} = Σ_{j ≠ i} x_j for the coverage contributed by users other than i, we model the expected utility that user i would derive from adding an annotation by

U_i(x) = α_i + f_i(c_{-i}) − g_i(c_{-i}),   (2)

where α_i, f_i, and g_i are such that

  • α_i is the expected a priori personal utility that user i derives from annotating a random lyric segment.

  • f_i(c) is a nondecreasing function measuring the expected positive network effect when a proportion c of the lyrics is covered by other users. The positive network effect arises because users tend to gain more IQ and have their contribution viewed by more people on songs that are more popular. Empirically, coverage is positively correlated with the number of song page views (Fig. 3).

  • g_i(c) is a nondecreasing function that measures the expected congestion effects when a proportion c of the lyrics is covered; lyrics that are already annotated cannot be annotated by user i.

For simplicity, we only consider two types of users: high-IQ (i = h) and low-IQ (i = l). For subsequent measurements, we consider high- and low-IQ users as those in the top third or bottom third in IQ over all users with at least 10 annotations.

Suppose that some user i has not yet annotated a song. Then c_{-i} equals the total annotation coverage of the song and works as a proxy for the proportional time rank in our infinitesimal setting. Assuming the likelihood that user i makes an annotation is proportional to their utility U_i, we would see that a user makes annotations at proportional time ranks corresponding to points at which their utility is high.

Figure 7. Distribution of proportional annotation time ranks for high-IQ (left) and low-IQ (right) users. The red bars are histogram bins where the height is the probability density. The black curves are our utility models that are fitted to the histogram bin midpoints (shown as yellow points).

We measure this empirically by simply considering the distribution of proportional time ranks of annotations made by a high- or low-IQ user (Fig. 7). Next, we fit the model in Eq. 2 to these approximate utility curves. To this end, we make some assumptions on f_i and g_i. When there are no annotations, there are no network or congestion effects, so f_i(0) = g_i(0) = 0. Both of these functions are nonnegative and nondecreasing. We assume that f_i is concave, which could be due to diminishing returns (Chen et al., 2019). We also assume that g_i is concave for the same reasons. One extremely simple class of concave functions is quadratics, so we fit f_i and g_i as quadratic functions, which under these assumptions must satisfy:

f_i(c) = β_i c − γ_i c²,   β_i ≥ 2γ_i ≥ 0,   (3)
g_i(c) = δ_i c − ε_i c²,   δ_i ≥ 2ε_i ≥ 0.   (4)

To determine these coefficients, we take the histogram bin midpoints m_1, …, m_B and heights h_1, …, h_B from the proportional time rank distribution (Fig. 7) and solve the linear least squares problem

minimize over α_i, β_i, γ_i, δ_i, ε_i:   Σ_{k=1}^{B} (α_i + f_i(m_k) − g_i(m_k) − h_k)²   (5)

subject to the constraints (3), (4), and α_i ≥ 0. We solve this problem for i ∈ {h, l}, the two sets of users. Table 1 shows the fitted coefficients and Fig. 7 shows the resulting curves, which match the empirical distribution.
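A sketch of this constrained fit is below. The histogram midpoints and heights would come from the empirical distributions in Fig. 7; the solver (SLSQP via scipy.optimize.minimize), the starting point, and the variable names are illustrative assumptions rather than the exact procedure behind Table 1.

```python
# Sketch: fit the utility model of Eqs. (2)-(5) to one group's histogram.
import numpy as np
from scipy.optimize import minimize

def fit_utility(midpoints: np.ndarray, heights: np.ndarray) -> np.ndarray:
    """Return fitted parameters (alpha, beta, gamma, delta, epsilon)."""
    def utility(params, c):
        alpha, beta, gamma, delta, eps = params
        f = beta * c - gamma * c ** 2   # network effects, Eq. (3)
        g = delta * c - eps * c ** 2    # congestion effects, Eq. (4)
        return alpha + f - g

    def objective(params):
        return np.sum((utility(params, midpoints) - heights) ** 2)  # Eq. (5)

    constraints = [
        {"type": "ineq", "fun": lambda p: p[0]},             # alpha >= 0
        {"type": "ineq", "fun": lambda p: p[2]},             # gamma >= 0
        {"type": "ineq", "fun": lambda p: p[1] - 2 * p[2]},  # beta >= 2*gamma
        {"type": "ineq", "fun": lambda p: p[4]},             # epsilon >= 0
        {"type": "ineq", "fun": lambda p: p[3] - 2 * p[4]},  # delta >= 2*epsilon
    ]
    # Note: f and g enter the objective only through their difference, so the
    # split between network and congestion terms is pinned down only by the
    # constraints; different feasible starting points can give different splits.
    result = minimize(objective, x0=np.ones(5), constraints=constraints,
                      method="SLSQP")
    return result.x
```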

             α_i     γ_i     β_i     ε_i     δ_i
high-IQ (h)  1.25    0.003   2.02    1.84    3.74
low-IQ (l)   1.06    0.79    1.83    0.04    1.44
Table 1. Fitted parameters of the utility functions f_i(c) = β_i c − γ_i c² and g_i(c) = δ_i c − ε_i c² for each user group.

The coefficients in Table 1 match the IQ diamond model. First, α_h > α_l, which is sensible as the a priori utility that high-IQ users receive for annotating should be higher. Indeed, high-IQ users may derive extra benefits due to their status in Genius’s social network or increased attention on other social media accounts linked to their Genius profile. Also, δ_h > β_h while β_l > δ_l. Since f_i and g_i vanish at zero coverage with marginal effects f_i′(0) = β_i and g_i′(0) = δ_i, these inequalities imply that when a song only has a few annotations, high-IQ users are more influenced by the loss in utility from congestion while low-IQ users are more positively influenced by network effects.

Since γ_h ≈ 0, network effects approximately scale linearly for high-IQ users, while the relatively large γ_l means they are most significant when there are few annotations for low-IQ users. This could arise if network effects for high-IQ users are due mostly to IQ gains from additional views and activity on a song. Recall that users earn IQ per upvote on their annotation, so the linearity of network effects may be due to the approximately linear positive relationship between song views and upvotes. On the other hand, network effects for low-IQ users might come from social factors. The larger marginal network effects felt for early annotations may be due to the desire of low-IQ users to achieve a baseline social validation that the song is worth annotating.

For congestion effects, ε_l ≈ 0. Thus, congestion approximately scales linearly for low-IQ users, while it is most significant when there are few annotations for high-IQ users. One could decompose the congestion effect as g_i = g_i^spec + g_i^gen, where g_i^spec is the expected utility lost due to lyrics that user i is specifically qualified to annotate having already been annotated, and g_i^gen is the utility lost due to other lyrics already having annotations. If g_i^gen is proportional to the amount of general knowledge that user i has to annotate lyrics, we can assume it is linear. Then by concavity of g_i, g_i^spec is also concave. With this decomposition, one sufficient condition for the observed inequalities δ_h > δ_l and ε_h > ε_l is that high-IQ users have both more specific knowledge and more general knowledge about lyrics than low-IQ users. The low value of ε_l suggests that low-IQ users have little specific knowledge.

4.3. Dynamics of Edits on an Annotation

The temporal dynamics of edits are quite different from that of annotations. Around 95% of annotations have 10 edits or fewer, so we directly study time ranks instead of proportional time ranks. We find that users often edit their own annotations, and users often make several consecutive edits in a row. Removing these edits gives qualitatively similar results, so we present results over all edits. Recall that in this section we consider the first edit to be the creation of an annotation (the activity studied in Section 4.1).

Figure 8. User and content statistics as a function of edit time rank. Each line corresponds to a different total number of edits for an annotation. Thus, each point along a line at time rank t is the mean user statistic for users making the tth edit, where the computation is taken over annotations with a fixed number of total edits. For user experience, mean user IQ (top left) and number of annotations made by a user (top right) both increase with edit time rank. For quality, the mean number of quality tags (bottom left) and mean length (bottom right) increase with edit time rank.

We first find that the experience of a user — as measured by both mean IQ and mean total number of annotations — increases as edit time rank increases, regardless of the number of edits on an annotation (Fig. 8, top row). This positive correlation could be explained by hesitance of users with less skill to make edits on the content of a user with more experience. The plots in the top row of Fig. 8 have the opposite trend of similar plots for user reputation on Stack Overflow (Anderson et al., 2012). Annotations also have higher quality with more edits, as measured by both mean number of quality tags and length (Fig. 8, bottom row); such a relationship is reasonable as edits are meant to improve content.

We see nuance in these positive relationships when comparing annotations with different numbers of total edits at a fixed time rank (i.e. points at a fixed x-axis value in Fig. 8). For a fixed edit time rank, the user making the edit tends to have higher IQ if there are fewer total edits of the annotation. However, the edits on annotations with fewer total edits have fewer quality tags and are shorter. We would expect this behavior in cases when an annotation initially has more quality tags and longer length, and the complexity may require more edits for the annotation to reach its final state.

5. Evolution of User Behavior

Having considered temporal dynamics of annotations and edits with respect to arrivals on a single song or annotation, we now analyze how user behavior changes over a user’s lifespan on Genius. By studying behavior for users of different IQ, we can better understand how the early and current behavior of an expert can be distinguished from that of other users. We use these ideas in Section 6 for early expert prediction. Similar analysis of user behavior evolution on Stack Overflow has proven useful for identifying experts based on early behavior (Movshovitz-Attias et al., 2013).

Figure 9. User behavior over time, stratified by high-IQ users (at least 100,000 IQ, blue), mid-IQ users (between 10,000 and 50,000 IQ, red), and all users (green). (Top row) High-IQ users make more first annotations on songs and first edits on annotations, especially early in their lifespan. (Middle row) High-IQ users make more use of quality tags on annotations, and mid- or high-IQ users tend to have longer annotations. (Bottom row) High-IQ users annotate more original lyrics and songs, even early in their lifespan.

To analyze how user behavior changes over time, we measure cumulative averages of properties of the annotations and edits that users make over their lifespans. We also stratify users into three levels based on the IQ that they have accumulated. These levels consist of users with at least 100,000 IQ (high-IQ users), users with between 10,000 and 50,000 IQ (mid-IQ), and a group with all users. We only include users with at least 10 annotations, and for each user consider activity over the first 750 days after their first annotation (in the case of annotation lifespan) or 750 days after their first edit (in the case of edit lifespan).
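As a sketch of how these lifespan curves can be computed from per-annotation records, the snippet below tracks the cumulative mean of one annotation property over each user’s first 750 days; the column names are illustrative assumptions about how the data is tabulated.

```python
# Sketch: cumulative mean of an annotation property over each user's lifespan.
import pandas as pd

def lifespan_cumulative_mean(annotations: pd.DataFrame,
                             prop: str = "n_quality_tags") -> pd.DataFrame:
    """annotations: one row per annotation with columns
    ['user', 'timestamp', prop], where 'timestamp' is a pandas datetime."""
    df = annotations.sort_values(["user", "timestamp"]).copy()
    first = df.groupby("user")["timestamp"].transform("min")
    df["days_since_first"] = (df["timestamp"] - first).dt.days
    df = df[df["days_since_first"] <= 750]          # first 750 days only
    df["cum_mean"] = (
        df.groupby("user")[prop]
          .expanding().mean()
          .reset_index(level=0, drop=True)          # align back to row index
    )
    return df[["user", "days_since_first", "cum_mean"]]
```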

In agreement with Figure 5 and the IQ diamond model in general, users with the highest IQ tend to make relatively more first annotations on songs (Fig. 9, top left). High-IQ users also make relatively more first edits on annotations early in their lifespan, although there is little difference with the general user population later in the lifespan (Fig. 9, top right). Thus, the positive relationship between IQ and edit time rank (Fig. 8, top left) is not likely caused by high-IQ users making relatively fewer early edits, but by later edits more likely coming from high-IQ users.

Quality of annotations in terms of quality tags and annotation length generally increases over a user’s lifespan (Fig. 9, middle row). High-IQ users use relatively more quality tags, especially early on; however, annotation length from high-IQ users is comparable to that of mid-IQ users. Again, annotation length is nuanced. Annotation length has a mostly positive relationship with proportional time rank (Fig. 5, bottom right), and the discrepancy may come from high-IQ users making more first annotations on songs (Fig. 9, top left). As hypothesized earlier, there may be time pressure to annotate faster, driving down annotation length. Still, mid- and high-IQ users create substantially longer annotations than the rest of the users on the site, so there is some notion of quality or expertise that is marked by annotation length. Finally, high-IQ users have a tendency to annotate relatively more original lyrics and songs (Fig. 9, bottom row).

As such information is informative for early detection of expertise, we reiterate the findings that certain properties of the early contributions of high-IQ users on the site distinguish them from other users. From the start, high-IQ users make a higher proportion of first annotations and edits, make higher quality annotations as measured by number of quality tags, and annotate more original lyrics and songs. These traits are likely beneficial to the Genius platform, and they appear to be inherent traits of high-IQ users upon entry to the site. In the next section, we use these traits to develop models for early prediction of the highest IQ users.

6. Early Prediction of Super Experts

Table 2. Predictors for classifying super experts and normal experts, derived from the analysis in Sections 4 and 5.

  • mean # of quality tags in first 15 annotations
  • mean time between first 15 annotations
  • # of first 15 annotations that are a song’s first
  • mean originality of songs for first 15 annotations
  • mean time between first 15 edits
  • # of first 15 edits that are an annotation’s first

Table 3. Bootstrapped mean coefficients and 95% confidence intervals of a logit model using the predictors in Table 2 for the outcome variable of super expert vs. normal expert.

Predictor                                             mean regression coeff.   95% CI
mean # of quality tags in first 15 annotations        0.8177                   (0.590, 1.072)
mean time between first 15 annotations                0.7942                   (0.548, 1.074)
# of first 15 annotations that are a song’s first     0.5166                   (0.349, 0.700)
mean originality of songs for first 15 annotations    0.2623                   (0.098, 0.435)
mean time between first 15 edits                      -0.3832                  (-0.586, -0.193)
# of first 15 edits that are an annotation’s first    0.2281                   (0.054, 0.409)
intercept                                             0.1127                   (-0.053, 0.283)

Table 4. Classification results for various feature subsets (columns: predictors, accuracy, AUC). Listed results are the mean and standard deviation over 1,000 random 75%/25% training/test splits. Guessing the most common test label has mean accuracy of 0.522; using all six predictors yields mean accuracy 0.673 and mean AUC 0.748 (see Section 6).

Given the user behaviors discussed in earlier sections, we now turn to early prediction of expertise based on user activity. To do this, we continue to use IQ as a proxy for expertise, as it is one simple metric that measures contributions of all types on the site, and we set up a classification problem that uses features derived from the first few annotations and edits of users.

First, we collected all users with at least 30 annotations and at least 30 edits. Of these users, we label the users in the highest third of IQ as “super experts,” as these users are above the 99.8th percentile of IQ over all users on Genius. We label the lowest third of IQ as “normal experts,” as these users still lie above the 93.7th percentile of IQ (having at least 30 annotations and edits leads to some accumulation of IQ). In total, we have 784 labeled users. On average, the labeled users have made 109 annotations and 537 edits.

We use this coarse prediction framework for several reasons. First, IQ is only a rough measurement of expertise. Second, our IQ data was scraped at different times, and we do not have data on the evolution of IQ over time for users. Still, since the distribution of contributions to the site is heavy-tailed (Fig. 2), the most active users contribute a large amount of content to the site, and we expect that our IQ splits are still meaningful.

Next, we analyze a logistic model for predicting super vs. normal experts from several content-based and edit-based predictors derived from our observations of user behavior in the prior sections, such as the fact that experts use quality tags, work on more original songs, and often make first edits (Table 2). These predictors are computed over the first 15 annotations and first 15 edits of the labeled users, and are normalized to have zero mean and unit variance. We use a bootstrapped logistic model with 10,000 resamples to estimate the mean and 95% confidence intervals of the regression coefficients (Table 3).
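A sketch of this bootstrap is below: resample the labeled users with replacement, refit the logistic model, and read off percentile intervals for the coefficients. The scikit-learn estimator (with its default regularization) and the array names are illustrative assumptions.

```python
# Sketch: bootstrapped coefficient estimates for the super- vs. normal-expert logit.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def bootstrap_logit(X: np.ndarray, y: np.ndarray,
                    n_boot: int = 10_000, seed: int = 0):
    """X: (n_users, n_predictors); y: 1 for super experts, 0 for normal experts."""
    rng = np.random.default_rng(seed)
    X = StandardScaler().fit_transform(X)   # zero mean, unit variance predictors
    coefs = np.zeros((n_boot, X.shape[1] + 1))
    for b in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))      # resample users
        model = LogisticRegression().fit(X[idx], y[idx])
        coefs[b] = np.concatenate([model.coef_.ravel(), model.intercept_])
    mean = coefs.mean(axis=0)
    lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0)  # 95% percentile intervals
    return mean, lo, hi
```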

The positive coefficients on the number of quality tags, the proportion of first annotations, the originality of annotated songs, and the proportion of first edits agree with our findings in Section 5 that these quantities are higher early in the lifespan of high-IQ users. These results also substantiate our IQ diamond model, which agrees with the importance of first annotations in detecting expertise. The fact that the mean time between annotations has a positive coefficient while the mean time between edits has a negative coefficient is surprising. Users with more time between their early annotations may face less time-pressure and may make higher quality annotations. On the other hand, users with less time between their edits may simply make more edits. If a user is making edits early in their lifespan, then they may have an eye for good contributions, which presumably provides a strong signal of expertise.

Finally, we evaluate these predictors in terms of test predictions. To do so, we randomly split the data into 75% for training and 25% for testing and average the test accuracy and AUC over 1,000 random splits (Table 4). Over the splits, this classifier attains a mean accuracy of 0.673 and mean AUC of 0.748, whereas a majority-class baseline guess yields a mean accuracy of 0.522. This substantial performance gain is remarkable given that we only use limited information from the first 15 annotations and edits.
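The evaluation loop can be sketched as follows, again with illustrative names and default scikit-learn settings:

```python
# Sketch: mean test accuracy and AUC over repeated random 75%/25% splits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(X: np.ndarray, y: np.ndarray, n_splits: int = 1000):
    accs, aucs = [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        model = LogisticRegression().fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.mean(accs), np.std(accs), np.mean(aucs), np.std(aucs)
```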

7. Discussion

Crowdsourced information platforms provide a variety of information for the world. Here, we have analyzed various aspects of the temporal dynamics of content and users on the Genius platform that collects and manages crowdsourced musical knowledge. Genius has a substantially different knowledge curation process compared to Question-and-Answer platforms such as Stack Overflow or formal authoritative sources like Wikipedia. In turn, we found that the content and user dynamics have markedly different behavior from these other well-studied platforms.

More specifically, our IQ diamond model of behavior stands in contrast to properties of the dynamics of user contribution on Stack Overflow, which has a similar reputation system. Our findings may be useful in mechanism design for platforms governed by dynamics similar to Genius. To encourage annotation coverage of a certain song, one could incentivize experts or eventual experts to make just a few annotations. Under the IQ diamond model, this will open the song to the bulk of users to create more annotations. If higher quality annotations are desired, one may incentivize experts to edit intermediate annotations appearing at the bottom of the “U-shaped” curve (middle of the diamond). Furthermore, some users may desire that a song be annotated but do not have the skills themselves. In this case, if they annotate some less original or “easy” lyrics, the induced network effects may incentivize experienced users to create high quality annotations.

Figure 10. There is a positive relationship between user PageRank scores in the Genius social network and the IQ of the user (left) or number of annotations made by the user (right). Each blue point represents a user, and red X’s are binned means. Many users have exactly 100 IQ, which is the amount awarded for adding a profile picture.

Studying user behavior over time stratified by eventual expertise revealed several traits of eventual experts that can be identified by early behavior. We used these findings to develop strong predictors of eventual experts. The set of predictors could be enhanced by social features, as Genius has an underlying social network for its users. We collected the social network data of 782,432 users and 1,777,351 directed “following” relationships amongst them, but we do not have temporal information on the social network that would be useful for early prediction. Still, we found that certain social features are strongly correlated with user expertise. For instance, Figure 10 shows that users with large PageRank scores have higher IQ and have made more annotations. Just using PageRank as a predictor for our experiments in Section 6 achieves a mean AUC of 0.972, and even just using the in-degree achieves a mean AUC of 0.977. While determining current expertise is often a useful task for crowdsourced-information sites, it is not a difficult task when we define expertise solely based on a public metric such as IQ.
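For reference, such a PageRank feature over the directed “following” network can be computed as sketched below; the assumption that edges point from follower to followed (so heavily followed users receive high scores), as well as the networkx call and damping factor, are illustrative choices.

```python
# Sketch: PageRank scores on the Genius "following" network as a predictor.
import networkx as nx

def pagerank_feature(follow_edges: list[tuple[str, str]]) -> dict[str, float]:
    """follow_edges: (follower, followed) pairs among Genius users."""
    G = nx.DiGraph()
    G.add_edges_from(follow_edges)
    # PageRank mass flows along edges, so users followed by many
    # well-connected users receive high scores.
    return nx.pagerank(G, alpha=0.85)
```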

Future Work.  There are many avenues for future work based on this Genius data or the models we have developed. For one, we could design experiments based on the actionable mechanism designs described above. As another example, one could use the Genius data to augment other music datasets (Brost et al., 2019; Bertin-Mahieux et al., 2011). Conversely, other music data may improve the analysis of the content on Genius. One may argue that Genius is somewhat limited by its focus on lyrics since the musical context of the lyrics is deeply important for both the analysis and experience of songs (Kehrer, 2016).

Moreover, there is plenty of data on Genius that we did not collect as well as collected data that we have not analyzed in detail. In particular, verified annotations (annotations written by artists) and forms of user contribution besides annotations and edits were crawled from the site but not studied. The rich linguistic information in the lyrics and annotations of this dataset can also be analyzed in more depth than we have done here; for instance, we do not consider the sequential and hierarchical structures in lyrics that have been used in music information retrieval (Tsaptsinos, 2017). These structures add yet another layer of depth to the organization of contributions and content on Genius that distinguishes it from other crowdsourced information sites.

Acknowledgements.
This research was supported in part by NSF Award DMS-1830274, ARO Award W911NF19-1-0057, ARO MURI, and JPMorgan Chase & Co.

References

  • Adamic et al. (2008) Lada A. Adamic, Jun Zhang, Eytan Bakshy, and Mark S. Ackerman. 2008. Knowledge Sharing and Yahoo Answers: Everyone Knows Something. In WWW '08 (Beijing, China). ACM, New York, NY, USA, 665–674. https://doi.org/10.1145/1367497.1367587
  • Al Qundus (2018) Jamal Al Qundus. 2018. Technical analysis of the social media platform genius. Technical Report. Freie Universität Berlin.
  • Al Qundus and Paschke (2018) Jamal Al Qundus and Adrian Paschke. 2018. Investigating the Effect of Attributes on User Trust in Social Media. In Database and Expert Systems Applications. Springer International Publishing, 278–288.
  • Almeida et al. (2007) Rodrigo B Almeida, Barzan Mozafari, and Junghoo Cho. 2007. On the Evolution of Wikipedia. In ICWSM.
  • Anderson et al. (2012) Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2012. Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow. In SIGKDD (Beijing, China). 9. https://doi.org/10.1145/2339530.2339665
  • Bertin-Mahieux et al. (2011) Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 2011. The Million Song Dataset. In ISMIR '11.
  • Beschastnikh et al. (2008) Ivan Beschastnikh, Travis Kriplean, and David W McDonald. 2008. Wikipedian Self-Governance in Action: Motivating the Policy Lens.. In ICWSM.
  • Blumenstock (2008) Joshua E. Blumenstock. 2008. Size Matters: Word Count as a Measure of Quality on Wikipedia. In WWW ’08 (Beijing, China). ACM, 1095–1096. https://doi.org/10.1145/1367497.1367673
  • Brost et al. (2019) Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The Music Streaming Sessions Dataset. In WWW ’19 (San Francisco, CA, USA). ACM, 2594–2600. https://doi.org/10.1145/3308558.3313641
  • Brown (2011) Adam R. Brown. 2011. Wikipedia as a Data Source for Political Scientists: Accuracy and Completeness of Coverage. PS: Political Science & Politics 44, 2 (2011), 339–343. https://doi.org/10.1017/S1049096511000199
  • Calefato et al. (2015) Fabio Calefato, Filippo Lanubile, Maria Concetta Marasciulo, and Nicole Novielli. 2015. Mining Successful Answers in Stack Overflow. In MSR ’15 (Florence, Italy). 4.
  • Chen et al. (2019) Yanjiao Chen, Xu Wang, Baochun Li, and Qian Zhang. 2019. An Incentive Mechanism for Crowdsourcing Systems with Network Effects. ACM Trans. Internet Technol. 19, 4, Article 49 (Sept. 2019), 21 pages. https://doi.org/10.1145/3347514
  • Danescu-Niculescu-Mizil et al. (2013) Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities. In WWW ’13 (Rio de Janeiro, Brazil). 307–318. https://doi.org/10.1145/2488388.2488416
  • Dawson (2018) Ted M. Dawson. 2018. Reinventing Genius in the .com Age: Austrian Rap Music and a New Way of Knowing. Journal of Austrian Studies (2018).
  • Ellis et al. (2015) Robert J. Ellis, Zhe Xing, Jiakun Fang, and Ye Wang. 2015. Quantifying Lexical Novelty in Song Lyrics. In ISMIR ’15.
  • Giles (2005) Jim Giles. 2005. Internet encyclopaedias go head to head. Nature 438, 7070 (Dec. 2005), 900–901. https://doi.org/10.1038/438900a
  • Gkotsis et al. (2014) George Gkotsis, Karen Stepanyan, Carlos Pedrinaci, John Domingue, and Maria Liakata. 2014. It’s All in the Content: State of the Art Best Answer Prediction Based on Discretisation of Shallow Linguistic Features. In WebSci ’14 (Bloomington, Indiana, USA). ACM, New York, NY, USA, 202–210. https://doi.org/10.1145/2615569.2615681
  • Johari and Kumar (2009) Ramesh Johari and Sunil Kumar. 2009. Congestible Services and Network Effects.
  • Jurgens and Lu (2012) David Jurgens and Tsai-Ching Lu. 2012. Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia. In ICWSM.
  • Kehrer (2016) Lauron Kehrer. 2016. Genius (formerly Rap Genius). Genius Media Group, Inc. genius.com. JSAM 10, 4 (2016), 518–520. https://doi.org/10.1017/S1752196316000444
  • Maity et al. (2015) Suman Kalyan Maity, Jot Sarup Singh Sahni, and Animesh Mukherjee. 2015. Analysis and prediction of question topic popularity in community Q&A sites: a case study of Quora. In ICWSM.
  • McAuley and Leskovec (2013) Julian John McAuley and Jure Leskovec. 2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW ’13. 897–908.
  • Mesgari et al. (2015) Mostafa Mesgari, Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and Arto Lanamäki. 2015. “The sum of all human knowledge”: A systematic review of scholarly research on the content of Wikipedia. JASIST 66, 2 (Dec. 2015), 219–245. https://doi.org/10.1002/asi.23172
  • Movshovitz-Attias et al. (2013) Dana Movshovitz-Attias, Yair Movshovitz-Attias, Peter Steenkiste, and Christos Faloutsos. 2013. Analysis of the Reputation System and User Contributions on a Question Answering Website: StackOverflow. In ASONAM ’13 (Niagara, Ontario, Canada). ACM, New York, NY, USA, 886–893. https://doi.org/10.1145/2492517.2500242
  • Pal et al. (2011) Aditya Pal, Rosta Farzan, Joseph A. Konstan, and Robert E. Kraut. 2011. Early Detection of Potential Experts in Question Answering Communities. In UMAP '11. Springer, Berlin, Heidelberg, 231–242.
  • Pal et al. (2012) Aditya Pal, F. Maxwell Harper, and Joseph A. Konstan. 2012. Exploring Question Selection Bias to Identify Experts and Potential Experts in Community Question Answering. ACM Trans. Inf. Syst. 30, 2, Article 10 (May 2012), 28 pages. https://doi.org/10.1145/2180868.2180872
  • Paranjape et al. (2017) Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. 2017. Motifs in Temporal Networks. In WSDM ’17 (Cambridge, United Kingdom). ACM, 601–610. https://doi.org/10.1145/3018661.3018731
  • Parnin et al. (2012) Chris Parnin, Christoph Treude, Lars Grammel, and Margaret-Anne Storey. 2012. Crowd documentation: Exploring the coverage and the dynamics of api discussions on stack overflow. Technical Report. Georgia Institute of Technology.
  • Patil and Lee (2015) Sumanth Patil and Kyumin Lee. 2015. Detecting experts on Quora: by their activity, quality of answers, linguistic characteristics and temporal behaviors. Social Network Analysis and Mining 6, 1 (Dec. 2015). https://doi.org/10.1007/s13278-015-0313-x
  • Posnett et al. (2012) Daryl Posnett, Eric Warburg, Premkumar Devanbu, and Vladimir Filkov. 2012. Mining stack exchange: Expertise is evident from initial contributions. In SocInfo ’12. IEEE, 199–204.
  • Rambsy (2018) Howard Rambsy. 2018. Becoming A Rap Genius. In The Routledge Companion to Media Studies and Digital Humanities. Routledge. https://doi.org/10.4324/9781315730479-36
  • Ravi et al. (2014) Sujith Ravi, Bo Pang, Vibhor Rastogi, and Ravi Kumar. 2014. Great question! question quality in community q&a. In ICWSM.
  • Samoilenko and Yasseri (2014) Anna Samoilenko and Taha Yasseri. 2014. The distorted mirror of Wikipedia: a quantitative analysis of Wikipedia coverage of academics. EPJ Data Science 3, 1 (Jan. 2014). https://doi.org/10.1140/epjds20
  • Stvilia et al. (2005) Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser. 2005. Assessing information quality of a community-based encyclopedia. In ICIQ '05. 442–454.
  • Suh et al. (2010) B. Suh, L. Hong, P. Pirolli, and E. H. Chi. 2010. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. In SocialCom '10. 177–184.
  • Tian et al. (2013) Qiongjie Tian, Peng Zhang, and Baoxin Li. 2013. Towards predicting the best answers in community-based question-answering services. In ICWSM.
  • Tsaptsinos (2017) Alexandros Tsaptsinos. 2017. Lyrics-based music genre classification using a hierarchical attention network. In ISMIR ’17.
  • van Dijk et al. (2015) David van Dijk, Manos Tsagkias, and Maarten de Rijke. 2015. Early detection of topical expertise in community question answering. In SIGIR.
  • Wang et al. (2013) Gang Wang, Konark Gill, Manish Mohanlal, Haitao Zheng, and Ben Y. Zhao. 2013. Wisdom in the Social Crowd: An Analysis of Quora. In WWW.
  • Zhang et al. (2007) Jun Zhang, Mark S. Ackerman, and Lada Adamic. 2007. Expertise Networks in Online Communities: Structure and Algorithms. In WWW.