Cornell Reddit Data v1.0 (released March 2015)

Distributed together with:

All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement
Chenhao Tan, Lillian Lee
In Proceedings of WWW, 2015

The paper, data, and associated materials can be found at:
http://chenhaot.com/pages/multi-community.html

If you use this data, please cite:

@inproceedings{tan+lee:15,
    author = {Chenhao Tan and Lillian Lee},
    title = {All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement},
    year = {2015},
    booktitle = {Proceedings of WWW}
}

Files description:

The following files contain the data described in much more detail in
Section 3 of the paper.

* reddit_all.gz (included in https://chenhaot.com/data/reddit_full_data.gz, 24G)

  Each line contains a JSON object returned from the Reddit API. Relevant
  fields include:

    subreddit:   the subreddit where the post was made
    author:      the author of the post
    created_utc: the timestamp when the post was created
    title:       the title of the post
    selftext:    the main text of the post, if it is a text post
    ups:         the "noisy" number of upvotes received by the post
    downs:       the "noisy" number of downvotes received by the post

  Note that the actual numbers of upvotes and downvotes are purposely
  inaccessible:
  http://www.reddit.com/r/announcements/comments/28hjga/reddit_changes_individual_updown_vote_counts_no/

* bots.txt

  A list of the bots listed on http://www.reddit.com/r/autowikibot/wiki/redditbots
  at the time the data was collected. Each line is a username.

* spammers.txt

  A list of spammers, determined, for users who posted more than 500 posts,
  by checking whether the user page had been deleted according to the
  Reddit API. Each line is a username.

* reddit_meta.gz (included in https://chenhaot.com/data/reddit_meta_data.gz, 1.4G)

  Because the full dataset is large (~24G), we also provide a smaller file
  containing only the metadata of each post. Each line is comma-separated,
  with the fields: author, subreddit, created_utc, number of comments, ups,
  and downs.

* README (this file)

Data collection details:

Relying primarily on RedditAnalytics, in February 2014 we collected all
posts ever submitted to Reddit since its inception, together with their
associated feedback values. We found that all posts from November 2011
were missing from the RedditAnalytics results. After confirming this with
Jason Baumgartner, we used the Reddit API's search function to fill in that
period
(http://www.reddit.com/r/redditdev/comments/1hpicu/whats_this_syntaxcloudsearch_do/).
The number of posts reported in the paper (76.6M) excludes posts from bots,
spammers, and "[deleted]" authors.

Please email any questions to: chenhao@chenhaot.com
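
Example usage (illustrative, not part of the released data):

The sketch below shows one possible way to stream posts out of
reddit_all.gz while excluding bots, spammers, and "[deleted]" authors, as
is done for the post counts reported in the paper. The field names
(subreddit, author, created_utc, ups, downs) come from the description
above; the function names and filtering code are illustrative assumptions,
not the pipeline used for the paper.

    # Sketch: iterate over reddit_all.gz, skipping bots/spammers/"[deleted]".
    import gzip
    import json

    def load_user_list(path):
        # One username per line (bots.txt or spammers.txt).
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    def iter_posts(posts_path, excluded_users):
        # Yield (author, subreddit, created_utc, ups, downs) for kept posts.
        with gzip.open(posts_path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                try:
                    post = json.loads(line)
                except ValueError:
                    continue  # skip malformed lines
                author = post.get("author")
                if author is None or author == "[deleted]" or author in excluded_users:
                    continue
                yield (author, post.get("subreddit"), post.get("created_utc"),
                       post.get("ups"), post.get("downs"))

    if __name__ == "__main__":
        excluded = load_user_list("bots.txt") | load_user_list("spammers.txt")
        for i, record in enumerate(iter_posts("reddit_all.gz", excluded)):
            print(record)
            if i >= 4:  # show only the first few records
                break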
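Similarly, a minimal sketch for reading the smaller metadata file, assuming
the comma-separated layout described above (author, subreddit, created_utc,
number of comments, ups, downs); everything else is illustrative, and the
values are kept as strings so you can cast them (e.g. with int()) as needed.

    # Sketch: iterate over reddit_meta.gz and yield one dict per post.
    import gzip

    def iter_meta(meta_path):
        with gzip.open(meta_path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                parts = line.rstrip("\n").split(",")
                if len(parts) != 6:
                    continue  # skip malformed lines
                author, subreddit, created_utc, num_comments, ups, downs = parts
                yield {
                    "author": author,
                    "subreddit": subreddit,
                    "created_utc": created_utc,
                    "num_comments": num_comments,
                    "ups": ups,
                    "downs": downs,
                }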