Learning analytics research presents challenges for researchers embracing the principles of open science. Protecting student privacy is paramount, but progress in increasing scientific understanding and improving educational outcomes depends upon open, scalable, and replicable research. Findings have repeatedly been shown to depend on personal and demographic variables, so how can such data be used in a manner that is ethical and secure for all involved? This paper presents ongoing work on the MOOC Replication Framework (MORF), a big data repository and analysis environment for Massive Open Online Courses (MOOCs). We discuss MORF's approach to protecting student privacy, which allows researchers to use data without having direct access to it. Through an open API, documentation, and tightly controlled outputs, the framework enables secure, scalable research and facilitates collaboration, replication, and novel analyses. We also highlight ways in which MORF represents a solution template for issues surrounding privacy and security in the age of big data in education, as well as key challenges still to be tackled.
What is already known about this topic
- Personally Identifiable Information (PII) has many valid and important research uses in education.
- The ability to replicate or build on analyses is important to modern educational research and is usually enabled through data sharing.
- Data sharing generally does not involve PII in order to protect student privacy.
- MOOCs present a rich data source for education researchers to better understand online learning.
What this paper adds
- The MOOC replication framework (MORF) 2.1 is a new infrastructure that enables researchers to conduct analyses on student data without having direct access to the data, thus protecting student privacy.
- Detail of the MORF 2.1 structure and workflow.
Implications for practice and/or policy
- MORF 2.1 is available for use by practitioners and researchers, including for research with policy implications.
- The infrastructure and approach in MORF could be applied to other types of educational data.
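The highlights above describe an execute-against-data architecture: researcher code is brought to the data, and only screened outputs leave the platform. The sketch below is purely illustrative and assumes nothing about MORF's actual API or implementation (all class, function, and threshold names are hypothetical); it shows the general pattern in which analyses run against records the researcher can never read directly, and only aggregate results that pass a disclosure-control check are released.

```python
MIN_CELL_SIZE = 5  # hypothetical disclosure-control threshold


class SecureRepository:
    """Holds sensitive records; researchers never read them directly."""

    def __init__(self, records):
        self._records = records  # private: no accessor exposes raw rows

    def run_job(self, analysis_fn):
        """Run researcher-supplied code against the data server-side,
        then release the result only if it passes the output check."""
        result = analysis_fn(self._records)
        self._check_output(result)
        return result

    @staticmethod
    def _check_output(result):
        # "Tightly controlled outputs": only aggregate statistics whose
        # underlying cell counts meet a minimum size are released.
        if not isinstance(result, dict):
            raise ValueError("only aggregate summaries may be released")
        for name, (_value, n) in result.items():
            if n < MIN_CELL_SIZE:
                raise ValueError(f"cell '{name}' too small (n={n}) to release")


# Fake records standing in for MOOC clickstream/grade data.
records = [{"id": i, "completed": i % 3 == 0} for i in range(30)]
repo = SecureRepository(records)


def completion_rate(rows):
    # An aggregate analysis: returns (statistic, underlying cell count).
    done = [r for r in rows if r["completed"]]
    return {"completion_rate": (len(done) / len(rows), len(rows))}


print(repo.run_job(completion_rate))  # aggregate passes the output check
```

In this toy version the output check is a simple minimum cell size; a production platform would layer on review of submitted code, containerized execution, and human or automated vetting of outputs, but the researcher-facing contract is the same: submit an analysis, receive only approved aggregates.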
CONFLICT OF INTEREST
No conflict of interest (financial or non-financial) has been declared by the authors.
DATA AVAILABILITY STATEMENT
The data available in the MOOC Replication Framework can be used by external researchers through the framework under a data use agreement with the relevant university that oversees the data being accessed, together with evidence of approval of the research by an Institutional Review Board (or a comparable ethical oversight body, for example in uses of MORF for non-US data by non-US researchers).