The Stanford MOOCPosts Data Set

Created by nine oDesk Consultants
under direction of
Akshay Agrawal and Andreas Paepcke

The Stanford MOOCPosts dataset contains 29,604 anonymized learner forum posts from eleven Stanford University public online classes. The purpose of this collection is to serve as a foundation for testing computational algorithms that process forum posts. You are free to use the corpus for academic studies. Please credit the corpus in any publications for which it proved useful.

The Corpus

Each post was manually coded along six dimensions (opinion, question, answer, sentiment, confusion, and urgency) by three independent coders. In addition, coders were asked to flag posts in which our automated anonymization program had failed and that therefore revealed a poster's identity. These posts were then manually anonymized. Intercoder reliability was computed, and separate 'gold' codings were derived for each of the dimensions. This procedure is described below.

The posts are organized into three sets of related courses:

Humanities/Sciences:

Course Name Number of Entries
GlobalHealth/WomensHealth/Winter2014 2,254 entries
HumanitiesScience/StatLearning/Winter2014 3,112 entries
HumanitiesScience/Stats216/Winter2014 341 entries
HumanitiesSciences/EP101/Environmental_Physiology 2,549 entries
HumanitiesSciences/Econ-1/Summer2014 1,584 entries
HumanitiesSciences/Econ1V/Summer2014 160 entries
Total: 10,000 entries

Medicine:

Course Name Number of Entries
Medicine/HRP258/Statistics_in_Medicine 3,321 entries
Medicine/MedStats/Summer2014 1,218 entries
Medicine/SciWrite/Fall2013 5,184 entries
Medicine/SURG210/Managing_Emergencies_What_Every_Doctor_Must_Know 279 entries
Total: 10,002 entries

Education:

Course Name Number of Entries
Education/EDUC115N/How_to_Learn_Math 10,000 entries

The number of entries is the number of posts that were presented to the coders. The final set is somewhat smaller because posts with missing data were removed from the coded results.
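The per-set totals can be checked with a few lines of Python. This is only an arithmetic sketch: the counts are copied from the tables above, and the variable names are illustrative.

```python
# Entry counts per course, copied from the three tables above.
ENTRIES = {
    "Humanities/Sciences": {
        "GlobalHealth/WomensHealth/Winter2014": 2254,
        "HumanitiesScience/StatLearning/Winter2014": 3112,
        "HumanitiesScience/Stats216/Winter2014": 341,
        "HumanitiesSciences/EP101/Environmental_Physiology": 2549,
        "HumanitiesSciences/Econ-1/Summer2014": 1584,
        "HumanitiesSciences/Econ1V/Summer2014": 160,
    },
    "Medicine": {
        "Medicine/HRP258/Statistics_in_Medicine": 3321,
        "Medicine/MedStats/Summer2014": 1218,
        "Medicine/SciWrite/Fall2013": 5184,
        "Medicine/SURG210/Managing_Emergencies_What_Every_Doctor_Must_Know": 279,
    },
    "Education": {
        "Education/EDUC115N/How_to_Learn_Math": 10000,
    },
}

# Posts presented to the coders, per course set and overall.
set_totals = {name: sum(courses.values()) for name, courses in ENTRIES.items()}
presented = sum(set_totals.values())

# Size of the final corpus, as stated in the introduction.
FINAL_POSTS = 29_604
removed = presented - FINAL_POSTS

print(set_totals)
print(presented, removed)
```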

Corpus Schema

ColumnName ValueDescription
Text Text of one post
Opinion(1/0) binary: post contains an opinion
Question(1/0) binary: post contains a question
Answer(1/0) binary: post contains an answer
Sentiment(1-7) Learner sentiment expressed in post: 1=negative; 7=positive; 4=neutral
Confusion(1-7) Learner degree of confusion expressed in post: 1=not confused; 7=very confused
Urgency(1-7) How urgent is it that instructor reads the post: 1=not urgent, 7=very urgent
CourseType One of Education, Humanities, and Medicine
forum_post_id Unique ID of the respective row's post in its original OpenEdX context
course_display_name Name of course in context of Stanford's online, free, public offerings
forum_uid Unique identifier of learner who posted the post
created_at Post date
post_type One of Comment or CommentThread; the latter is assigned to posts that originated a thread, while the former is assigned to all other posts
anonymous If True, poster appears to everyone as name 'anonymous'
anonymous_to_peers If True, poster appears under his/her own screen name to discussion moderator and instructor, but as 'anonymous' to everyone else.
up_count Number of post's up-votes
comment_thread_id ID of thread object
reads The total number of reads logged for the thread with ID comment_thread_id.
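A minimal sketch of reading rows that follow this schema, using only the Python standard library. It assumes the corpus is distributed as CSV with header names exactly as listed above, and that Likert gold scores may be fractional because they are averages of two coders. The sample row is made up for illustration and is not taken from the corpus.

```python
import csv
import io

BINARY_COLUMNS = ["Opinion(1/0)", "Question(1/0)", "Answer(1/0)"]
LIKERT_COLUMNS = ["Sentiment(1-7)", "Confusion(1-7)", "Urgency(1-7)"]


def parse_row(row):
    """Convert the coded columns of one CSV row and check their ranges."""
    parsed = dict(row)
    for col in BINARY_COLUMNS:
        value = int(float(row[col]))
        if value not in (0, 1):
            raise ValueError(f"{col} must be 0 or 1, got {row[col]!r}")
        parsed[col] = value
    for col in LIKERT_COLUMNS:
        value = float(row[col])
        if not 1.0 <= value <= 7.0:
            raise ValueError(f"{col} must lie in 1..7, got {row[col]!r}")
        parsed[col] = value
    return parsed


# Hypothetical sample with a subset of the columns (invented data).
SAMPLE = (
    "Text,Opinion(1/0),Question(1/0),Answer(1/0),"
    "Sentiment(1-7),Confusion(1-7),Urgency(1-7),CourseType,post_type,up_count\n"
    '"Where can I find the week 2 slides, please?",0,1,0,4.0,5.5,6.0,'
    "Education,CommentThread,2\n"
)

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0]["Question(1/0)"], rows[0]["Confusion(1-7)"])
```

To read an actual file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle; the file name depends on the download and is not assumed here.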

Corpus Demographics

The following table shows the self-reported gender and education distribution for all but one of the courses. Course HumanitiesSciences/Econ1V/Summer2014 was small, and its demographics are folded into course HumanitiesSciences/Econ-1/Summer2014. For convenience, an Excel version is available.

Column header GenderO stands for Other. Column EduNotCollected shows the number of learners who enrolled before the level of education was collected by the OpenEdX software.

GenderF GenderM GenderO GenderWithheld EduElementary EduJunHighSchool EduHighSchool EduAssocDeg EduBachelors EduMasters EduPhD EduOther EduNone EduWithheld EduNotCollected
GlobalHealth/WomensHealth/Winter2014 504964722190556689 2452278205027311261877
HumanitiesScience/StatLearning/Winter2014 107933712492555930130 252260315367218277824320 614703181
HumanitiesScience/Stats216/Winter2014 92502580060 16930012246
HumanitiesSciences/EP101/Environmental_Physiology 58316416311009413591768 5234329419889423224894 25
HumanitiesSciences/Econ-1[+V]/Summer2014 632114774531803504513392 681806571221291241441536 78
Medicine/HRP258/Statistics_in_Medicine 992515215652472292392036 5207306101934810353412098 52
Medicine/MedStats/Summer2014 64659132281246171471379 251449763992859207221030 63
Medicine/SciWrite/Fall2013 2100424414753402452983897 70112435196308463429462909 42
Medicine/SURG210/Managing_Emergencies_What... 438206710620 331513001058
Education/EDUC115N/How_to_Learn_Math 24565145637725711363182930 116313791185832279447832041 5

Procedures

Creating the final set involved several steps. First, the randomly chosen posts were anonymized via an automated facility. Then the 30,002 posts were distributed to oDesk consultants for coding. The returned sets were then combined into one gold set. These steps are described in more detail below.

Anonymization

The original forum posts were on the Web, visible to tens of thousands of learners who had enrolled in the respective course. Anyone, anywhere, was free to enroll without cost, and thereby enjoyed access to the material. We nevertheless removed personally identifiable information from this research set as best we could. Here is a description of our procedure.

Automated anonymization replaced identity-revealing words with a single word that indicates the type of revealing information being obscured. Example: when the anonymization algorithm detects a string that might be a telephone number, the string is replaced with <phoneRedac>. Some such fillers contain underscores; all are delimited by angle brackets. Except for rare cases of a full, two-part name being redacted, the word count of each post is thus the same as in the original, as long as the tokenizer underneath the word counter considers the terms in angle brackets as one token. Strings that are suspected to be zip codes are replaced by <zipRedac>, and email addresses are replaced by <emailRedac>. For uninteresting reasons, posters’ names are replaced by the more complex <nameRedac_<anon_screen_name_redacted>>.
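A word counter that keeps each redaction filler as one token could look like the following sketch. The regular expression is an assumption based on the filler forms named above (<phoneRedac>, <zipRedac>, <emailRedac>, and the nested name filler); it is not the tokenizer used to build the corpus, and the example sentence is invented.

```python
import re

# First alternative: an angle-bracket filler, optionally with one nested
# <...> part as in the name filler. Second alternative: an ordinary word,
# optionally with an apostrophe suffix such as "don't".
TOKEN_RE = re.compile(r"<\w+(?:<\w+>)?>|\w+(?:'\w+)?")


def tokenize(post):
    """Split a post into word tokens, keeping each filler whole."""
    return TOKEN_RE.findall(post)


example = (
    "Call me at <phoneRedac> or write to <emailRedac>, "
    "says <nameRedac_<anon_screen_name_redacted>>!"
)
tokens = tokenize(example)
print(len(tokens), tokens)
```

A whitespace-only split would also keep fillers whole in most posts, but it would glue trailing punctuation to them; the pattern above avoids that.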

The coders were asked to note any posts that still contained personally identifying information. A staff member went through all those posts and manually redacted them, following the automated procedure above. We ask that you respect the posters' anonymity and do not attempt to reconstruct redacted information.

Coding Process

Each course set was coded by three distinct, independent, paid coders. That is, three triplets of coders each worked on one set of roughly 10,000 posts. No coder worked on more than one course set. Each coder attempted to code every post in his or her particular set. All posts with malformed or missing scores in at least one coder’s spreadsheet were discarded. This elision accounts for the difference between the 29,604 posts in the final set and the original 30,002 posts.
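The discard rule can be sketched as follows for a single Likert dimension. The data layout (one dictionary per coder, mapping a post ID to the raw cell text) and the validity test are assumptions made for illustration; the exact checks applied to the coders' spreadsheets are not described here.

```python
def is_valid_likert(raw):
    """True if the raw cell holds an integer score from 1 to 7."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return False
    return 1 <= value <= 7


def keep_complete_posts(coder_sheets):
    """Return IDs of posts that every coder scored with a valid value.

    coder_sheets: one dict per coder, mapping post ID -> raw score text.
    """
    shared = set(coder_sheets[0])
    for sheet in coder_sheets[1:]:
        shared &= set(sheet)
    return sorted(
        post_id
        for post_id in shared
        if all(is_valid_likert(sheet[post_id]) for sheet in coder_sheets)
    )


# Invented example: p2 has an empty cell, p3 is out of range for one
# coder and missing for another, so only p1 survives.
coder_a = {"p1": "3", "p2": "", "p3": "5"}
coder_b = {"p1": "4", "p2": "2", "p3": "9"}
coder_c = {"p1": "3", "p2": "2"}

kept = keep_complete_posts([coder_a, coder_b, coder_c])
print(kept)
```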

Generating the Gold Sets

We employed the following heuristics in computing scores for the gold sets. For each course set and each Likert variable in that course set, we computed the Krippendorff alphas for all possible combinations of coders. In every case, we found that some combination of two coders achieved higher agreement than the combination of all three coders. For each variable we therefore picked the coder combination that achieved the highest agreement and discarded the scores from the coder not included in that combination. A given post’s gold score was computed as the unweighted average of the two scores assigned to it by the selected coder combination.

You can download CSV files containing the scores submitted by the optimal coder combination for each Likert variable, across all three course sets. Each row in these files corresponds to one row in the gold set. The two columns in each file list the scores given by the two most similar coders. The gold set contains the average of each such pair of scores. For example, file education-pair-confusion.csv lists, for each post, the confusion scores given by the two coders with the highest Krippendorff alpha on the confusion dimension. Note: you can use these files to compute your own inter-coder reliability score between the two coders.
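The pair selection and averaging can be sketched in Python as below. The alpha function is a plain implementation of Krippendorff's alpha for interval data with no missing scores, written for this illustration; it is not the ReCal tool, and the coder labels and scores are invented.

```python
from itertools import combinations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha, interval metric, complete data only.

    ratings: one list of scores per coder, all of the same length.
    Undefined (division by zero) if every score is identical.
    """
    m = len(ratings)                # number of coders
    n = m * len(ratings[0])         # total number of scores
    # Observed disagreement: squared differences within each post.
    d_o = 0.0
    for unit in zip(*ratings):
        d_o += sum((a - b) ** 2 for a in unit for b in unit) / (m - 1)
    d_o /= n
    # Expected disagreement: squared differences over all ordered
    # pairs of scores, regardless of post.
    values = [v for coder in ratings for v in coder]
    total = sum(values)
    pair_sum = 2 * n * sum(v * v for v in values) - 2 * total * total
    d_e = pair_sum / (n * (n - 1))
    return 1.0 - d_o / d_e


def gold_scores(scores_by_coder):
    """Pick the most agreeing coder pair and average its scores."""
    best_pair = max(
        combinations(sorted(scores_by_coder), 2),
        key=lambda pair: krippendorff_alpha_interval(
            [scores_by_coder[c] for c in pair]
        ),
    )
    first, second = (scores_by_coder[c] for c in best_pair)
    return best_pair, [(a + b) / 2 for a, b in zip(first, second)]


# Invented confusion scores from three hypothetical coders.
confusion = {"A": [1, 4, 7, 2], "B": [2, 4, 6, 2], "C": [7, 1, 1, 5]}
best_pair, gold = gold_scores(confusion)
alpha_ab = krippendorff_alpha_interval([confusion["A"], confusion["B"]])
print(best_pair, gold, round(alpha_ab, 4))
```

Because the gold score is an average of two integer scores, it can take half-point values such as 1.5 or 6.5.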

For each binary variable in each course set, post gold scores were chosen by majority vote across all three coders. CSV files with the scores submitted by all three coders for every binary variable, across all course sets, are available for download, as are the consensus files. Note: you can use these files to compute your own inter-coder reliability score between the three coders with respect to each binary quantity.
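With three coders and a 0/1 variable, the majority vote always has a winner. A minimal sketch, with invented post IDs and votes:

```python
def majority_vote(votes):
    """Return 1 if more than half of the 0/1 votes are 1, else 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0


# Hypothetical "Question" codes from three coders for three posts.
coder_votes = {
    "post_1": (1, 1, 0),
    "post_2": (0, 0, 1),
    "post_3": (1, 1, 1),
}

gold = {post_id: majority_vote(v) for post_id, v in coder_votes.items()}
print(gold)
```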

The spreadsheet titled Krippendorff enumerates the Krippendorff alphas for the CSVs of Likert and binary variables. Note that the spreadsheet contains two worksheets. Alphas were computed using ReCal, a free online tool. Likert variables were treated as interval data, while binary variables were treated as nominal.
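For the binary variables, the nominal form of Krippendorff's alpha counts any two differing codes as a disagreement of 1. The sketch below is an independent implementation for complete data, written for illustration; it is not ReCal, and the votes are invented.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha, nominal metric, complete data only.

    ratings: one list of codes per coder, all of the same length.
    Undefined (division by zero) if every code is identical.
    """
    m = len(ratings)                # number of coders
    n = m * len(ratings[0])         # total number of codes
    # Observed disagreement: ordered pairs of differing codes per post.
    observed = 0.0
    for unit in zip(*ratings):
        differing = sum(1 for a, b in permutations(unit, 2) if a != b)
        observed += differing / (m - 1)
    d_o = observed / n
    # Expected disagreement from the overall frequency of each code.
    counts = Counter(v for coder in ratings for v in coder)
    expected = sum(
        counts[c] * counts[k] for c in counts for k in counts if c != k
    )
    d_e = expected / (n * (n - 1))
    return 1.0 - d_o / d_e


# Invented 0/1 codes from three hypothetical coders on four posts.
votes = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
alpha = krippendorff_alpha_nominal(votes)
print(round(alpha, 4))
```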

Coder Instructions

The coder instructions were posted on oDesk. We hired nine colleagues for the task. During the first week of coding some questions arose, which we answered immediately. The most important clarification was that more than one binary dimension could be true for a single post. That is, a post could express both a question and an opinion.

Appendix: Corpus Learner Origins

The following tables list, for each course, the countries that contributed more than 100 learners to that course. Courses without a table below had no country that contributed more than 100 learners.

GlobalHealth/WomensHealth/Winter2014:
+----------------+------------------+
| country        | LearnerCountries |
+----------------+------------------+
| Australia      |              205 |
| Brazil         |              136 |
| Canada         |              310 |
| China          |              134 |
| France         |              145 |
| Germany        |              173 |
| India          |              254 |
| Reserved       |              107 |
| United Kingdom |              409 |
| United States  |             2973 |
+----------------+------------------+

HumanitiesScience/StatLearning/Winter2014:
+---------------------------------+------------------+
| country                         | LearnerCountries |
+---------------------------------+------------------+
| Argentina                       |              246 |
| Australia                       |             1312 |
| Austria                         |              632 |
| Bangladesh                      |              103 |
| Belarus                         |              103 |
| Belgium                         |              326 |
| Brazil                          |             1131 |
| Bulgaria                        |              151 |
| Canada                          |             1931 |
| Chile                           |              195 |
| China                           |             3794 |
| Colombia                        |              380 |
| Croatia (LOCAL Name: Hrvatska)  |              138 |
| Czech Republic                  |              324 |
| Denmark                         |              411 |
| Egypt                           |              407 |
| European Union                  |              175 |
| Finland                         |              307 |
| France                          |             1464 |
| Germany                         |             2327 |
| Greece                          |              386 |
| Hong Kong                       |              549 |
| Hungary                         |              168 |
| India                           |             5272 |
| Indonesia                       |              210 |
| Iran (ISLAMIC Republic Of)      |              319 |
| Ireland                         |              250 |
| Israel                          |              362 |
| Italy                           |              976 |
| Japan                           |              887 |
| Kenya                           |              138 |
| Korea Republic of               |              977 |
| Lithuania                       |              225 |
| Malaysia                        |              213 |
| Mexico                          |              535 |
| Netherlands                     |              910 |
| New Zealand                     |              246 |
| Nigeria                         |              116 |
| Norway                          |              600 |
| Pakistan                        |              314 |
| Peru                            |              164 |
| Philippines                     |              175 |
| Poland                          |              824 |
| Portugal                        |              308 |
| Reserved                        |             1356 |
| Romania                         |              316 |
| Russian Federation              |             1377 |
| Saudi Arabia                    |              167 |
| Serbia                          |              155 |
| Singapore                       |              888 |
| Slovakia (SLOVAK Republic)      |              141 |
| South Africa                    |              286 |
| Spain                           |             1692 |
| Sweden                          |              761 |
| Switzerland                     |              593 |
| Taiwan; Republic of China (ROC) |              447 |
| Thailand                        |              144 |
| Turkey                          |              273 |
| Ukraine                         |              307 |
| United Arab Emirates            |              147 |
| United Kingdom                  |             3140 |
| United States                   |            23138 |
| Venezuela                       |              128 |
| Viet Nam                        |              261 |
+---------------------------------+------------------+

HumanitiesSciences/EP101/Environmental_Physiology:
+--------------------+------------------+
| country            | LearnerCountries |
+--------------------+------------------+
| Australia          |              352 |
| Austria            |              136 |
| Brazil             |              315 |
| Canada             |              497 |
| China              |              433 |
| Colombia           |              136 |
| Denmark            |              116 |
| Egypt              |              165 |
| France             |              323 |
| Germany            |              472 |
| Greece             |              159 |
| Hong Kong          |              111 |
| India              |              858 |
| Italy              |              223 |
| Japan              |              165 |
| Korea Republic of  |              105 |
| Mexico             |              222 |
| Netherlands        |              180 |
| Norway             |              122 |
| Pakistan           |              127 |
| Poland             |              234 |
| Reserved           |              251 |
| Romania            |              207 |
| Russian Federation |              446 |
| Singapore          |              144 |
| Spain              |              288 |
| Sweden             |              138 |
| Switzerland        |              104 |
| Turkey             |              103 |
| Ukraine            |              183 |
| United Kingdom     |              710 |
| United States      |             5413 |
+--------------------+------------------+

HumanitiesSciences/Econ-1/Summer2014:
+---------------------------------+------------------+
| country                         | LearnerCountries |
+---------------------------------+------------------+
| Argentina                       |              124 |
| Australia                       |              722 |
| Austria                         |              213 |
| Belgium                         |              139 |
| Brazil                          |              799 |
| Canada                          |              703 |
| Chile                           |              134 |
| China                           |             1625 |
| Colombia                        |              251 |
| Czech Republic                  |              107 |
| Denmark                         |              156 |
| Egypt                           |              920 |
| France                          |              596 |
| Germany                         |              756 |
| Greece                          |              191 |
| Hong Kong                       |              291 |
| India                           |             2616 |
| Indonesia                       |              146 |
| Iran (ISLAMIC Republic Of)      |              143 |
| Ireland                         |              111 |
| Israel                          |              130 |
| Italy                           |              369 |
| Japan                           |              429 |
| Korea Republic of               |              279 |
| Malaysia                        |              177 |
| Mexico                          |              385 |
| Netherlands                     |              299 |
| Nigeria                         |              140 |
| Norway                          |              213 |
| Pakistan                        |              250 |
| Peru                            |              127 |
| Philippines                     |              143 |
| Poland                          |              277 |
| Portugal                        |              114 |
| Reserved                        |              667 |
| Romania                         |              201 |
| Russian Federation              |              548 |
| Saudi Arabia                    |              208 |
| Singapore                       |              425 |
| South Africa                    |              146 |
| Spain                           |              454 |
| Sweden                          |              253 |
| Switzerland                     |              207 |
| Taiwan; Republic of China (ROC) |              198 |
| Turkey                          |              200 |
| Ukraine                         |              187 |
| United Arab Emirates            |              166 |
| United Kingdom                  |             1343 |
| United States                   |             7737 |
| Viet Nam                        |              195 |
+---------------------------------+------------------+

Medicine/HRP258/Statistics_in_Medicine:
+---------------------------------+------------------+
| country                         | LearnerCountries |
+---------------------------------+------------------+
| Argentina                       |              118 |
| Australia                       |              805 |
| Austria                         |              257 |
| Belgium                         |              146 |
| Brazil                          |              485 |
| Canada                          |              925 |
| Chile                           |              108 |
| China                           |              917 |
| Colombia                        |              225 |
| Croatia (LOCAL Name: Hrvatska)  |              104 |
| Czech Republic                  |              109 |
| Denmark                         |              226 |
| Egypt                           |              695 |
| Finland                         |              119 |
| France                          |              649 |
| Germany                         |              936 |
| Greece                          |              255 |
| Hong Kong                       |              182 |
| India                           |             2127 |
| Iran (ISLAMIC Republic Of)      |              241 |
| Ireland                         |              141 |
| Israel                          |              129 |
| Italy                           |              504 |
| Japan                           |              342 |
| Kenya                           |              316 |
| Korea Republic of               |              217 |
| Lithuania                       |              133 |
| Malaysia                        |              142 |
| Mexico                          |              316 |
| Morocco                         |              104 |
| Netherlands                     |              407 |
| New Zealand                     |              103 |
| Nigeria                         |              155 |
| Norway                          |              347 |
| Pakistan                        |              412 |
| Poland                          |              376 |
| Portugal                        |              226 |
| Reserved                        |              419 |
| Romania                         |              231 |
| Russian Federation              |              465 |
| Saudi Arabia                    |              262 |
| Singapore                       |              246 |
| South Africa                    |              161 |
| Spain                           |              611 |
| Sweden                          |              386 |
| Switzerland                     |              244 |
| Taiwan; Republic of China (ROC) |              112 |
| Turkey                          |              105 |
| Ukraine                         |              102 |
| United Arab Emirates            |              106 |
| United Kingdom                  |             1627 |
| United States                   |            10403 |
| Viet Nam                        |              195 |
+---------------------------------+------------------+

Medicine/MedStats/Summer2014:
+---------------------------------+------------------+
| country                         | LearnerCountries |
+---------------------------------+------------------+
| Australia                       |              550 |
| Austria                         |              191 |
| Belgium                         |              121 |
| Brazil                          |              449 |
| Canada                          |              637 |
| China                           |              613 |
| Colombia                        |              177 |
| Czech Republic                  |              117 |
| Denmark                         |              246 |
| Egypt                           |             1064 |
| France                          |              489 |
| Germany                         |              642 |
| Greece                          |              160 |
| Hong Kong                       |              137 |
| India                           |             1405 |
| Iran (ISLAMIC Republic Of)      |              192 |
| Italy                           |              395 |
| Japan                           |              281 |
| Kenya                           |              164 |
| Korea Republic of               |              229 |
| Malaysia                        |              112 |
| Mexico                          |              289 |
| Netherlands                     |              354 |
| Nigeria                         |              171 |
| Norway                          |              229 |
| Pakistan                        |              272 |
| Poland                          |              330 |
| Portugal                        |              147 |
| Reserved                        |              562 |
| Romania                         |              228 |
| Russian Federation              |              373 |
| Saudi Arabia                    |              236 |
| Singapore                       |              161 |
| South Africa                    |              138 |
| Spain                           |              532 |
| Sweden                          |              283 |
| Switzerland                     |              180 |
| Taiwan; Republic of China (ROC) |              115 |
| Turkey                          |              107 |
| United Arab Emirates            |              109 |
| United Kingdom                  |             1242 |
| United States                   |             5954 |
| Viet Nam                        |              108 |
+---------------------------------+------------------+

Medicine/SURG210/Managing_Emergencies_What_Every_Doctor_Must_Know:
+---------------------------------+------------------+
| country                         | LearnerCountries |
+---------------------------------+------------------+
| Argentina                       |              392 |
| Australia                       |             1284 |
| Austria                         |              676 |
| Belgium                         |              318 |
| Brazil                          |             2119 |
| Bulgaria                        |              124 |
| Canada                          |             1489 |
| Chile                           |              331 |
| China                           |             2921 |
| Colombia                        |             1030 |
| Croatia (LOCAL Name: Hrvatska)  |              154 |
| Czech Republic                  |              256 |
| Denmark                         |              392 |
| Ecuador                         |              121 |
| Egypt                           |              886 |
| European Union                  |              109 |
| Finland                         |              293 |
| France                          |             1382 |
| Germany                         |             2502 |
| Greece                          |              469 |
| Hong Kong                       |              444 |
| Hungary                         |              153 |
| India                           |             3383 |
| Indonesia                       |              246 |
| Iran (ISLAMIC Republic Of)      |              750 |
| Ireland                         |              233 |
| Israel                          |              245 |
| Italy                           |             1139 |
| Japan                           |              742 |
| Kenya                           |              201 |
| Korea Republic of               |              585 |
| Lithuania                       |              225 |
| Malaysia                        |              301 |
| Mexico                          |              766 |
| Morocco                         |              139 |
| Netherlands                     |              917 |
| New Zealand                     |              269 |
| Nigeria                         |              201 |
| Norway                          |              708 |
| Pakistan                        |              615 |
| Peru                            |              230 |
| Philippines                     |              248 |
| Poland                          |             1667 |
| Portugal                        |              593 |
| Reserved                        |              626 |
| Romania                         |              428 |
| Russian Federation              |             1513 |
| Saudi Arabia                    |              398 |
| Serbia                          |              206 |
| Singapore                       |              408 |
| Slovakia (SLOVAK Republic)      |              163 |
| South Africa                    |              292 |
| Spain                           |             2329 |
| Sri Lanka                       |              112 |
| Sweden                          |              722 |
| Switzerland                     |              488 |
| Taiwan; Republic of China (ROC) |              311 |
| Thailand                        |              194 |
| Turkey                          |              351 |
| Ukraine                         |              475 |
| United Arab Emirates            |              168 |
| United Kingdom                  |             2583 |
| United States                   |            12938 |
| Venezuela                       |              106 |
| Viet Nam                        |              381 |
+---------------------------------+------------------+

Education/EDUC115N/How_to_Learn_Math:
+----------------------+------------------+
| country              | LearnerCountries |
+----------------------+------------------+
| Australia            |             1193 |
| Austria              |              196 |
| Brazil               |              418 |
| Canada               |             1765 |
| China                |              609 |
| Colombia             |              192 |
| Denmark              |              178 |
| Egypt                |              131 |
| Finland              |              106 |
| France               |              488 |
| Germany              |              711 |
| Greece               |              203 |
| Hong Kong            |              154 |
| India                |             1543 |
| Indonesia            |              121 |
| Ireland              |              122 |
| Israel               |              123 |
| Italy                |              336 |
| Japan                |              215 |
| Korea Republic of    |              132 |
| Malaysia             |              129 |
| Mexico               |              370 |
| Netherlands          |              170 |
| New Zealand          |              380 |
| Norway               |              181 |
| Pakistan             |              251 |
| Philippines          |              230 |
| Poland               |              204 |
| Reserved             |              326 |
| Romania              |              200 |
| Russian Federation   |              402 |
| Singapore            |              224 |
| South Africa         |              264 |
| Spain                |              371 |
| Sweden               |              346 |
| Switzerland          |              120 |
| Turkey               |              109 |
| Ukraine              |              115 |
| United Arab Emirates |              126 |
| United Kingdom       |             1965 |
| United States        |            23099 |
+----------------------+------------------+