Chapter 19: Predictive Modelling of Student Behavior Using Granular Large-Scale Action Data


students engaged in MOOCs, we ask whether general through MOOCs can be uncovered by modelling the behaviour of those who were ultimately successful in the course. Capturing the trends that successful students take through MOOCs can enable the development of automated recommendation systems so that struggling students can be given meaningful and take in a sequence of events as an input and generate a probability distribution over what event is likely to the recurrent neural network (RNN) model, which has traditionally been successful when applied to other generative and sequential tasks.
actions the student has taken in a MOOC. The purpose of such analysis would eventually be to create a system whereby an automated recommender could query the model to provide meaningful guidance on what action in other cases, it may be a recommendation to consult a resource from a previous lesson or enrichment material buried in a corner of the courseware unknown to the student. The models we are training are known as generative, in that they can be used to generate what actions the student has already taken. Actions can include things such as opening a lecture video, to a forum post. This research serves as a foundation for applying sequential, generative models towards with sequential data.
In the case of the English language, generative models understanding of how that language is structured. A simple but powerful model used in natural language distribution is learned over every possible sequence of n terms from the training set. Recently, recurrent neural networks (RNNs) have been used to perform Cernocky, & Khudanpur, 2010), where previously seen words are subsumed into a high dimensional continuous latent state. This latent state is a succinct numerical representation of all of the words previously seen in sentation to predict what words are likely to come generate candidate sentences and words to complete sentences.
In this work, rather than learning about the plausibility of sequences of words and sentences, the generative models will learn about the plausibility of sequences of actions undertaken by students in MOOC to generate recommendations for what the student
In the learning analytics community, there is related work where data generated by students, often in MOOC many different types of student-generated data, and there are many different types of prediction tasks. through a process of manual feature engineering. In our approach, feature representations are learned directly from the raw time series data. This approach does not help improve the correlation of MOOC resource usage with knowledge acquisition. In that work, the presence of student self-selection is a source of noise and confounders. In contrast, student selection becomes the signal in behavioural modelling. In Reddy, Labutogether via embedding. This embedding process maps assignments, student ability, and lesson effectiveness onto a low dimensional space. Such a process allows for lesson and assignment pathways to be suggested based on the model's current estimate of student ability. The work in this chapter also seeks to suggest learning pathways for students, but differs in that additional student behaviours, such as forum post accesses and lecture video viewings, are also included in the model. Additionally, different generative models are employed. log data from MOOCs. While this user clickstream resources involved in these interaction sequences.

RELATED WORK
from interactions in the forums (Oleksandra & Shane, student events at a more abstract level compared to these content-focused approaches. Much work has been done to assess the latent knowledge of students through models such as Bayesian knowledge tracing (BKT; Corbett & Anderson, 1994), course structure as a source of knowledge components. This type of modelling views the actions of students as learning opportunities to model student latent knowlthis chapter, though the work is related. Instead, our models focus on predicting the complement of this performance data, which is the behavioural data of the student. recurrent neural networks to create a continuous latent representation of students based on previously seen assessment results as they navigate online learning environments. In that work, recurrent neural networks shows that a deep learning approach can be used to represent student knowledge, with favourable accuracy predictions relative to shallow BKT. Such results, to BKT. The work in this chapter is related to the use of deep networks to represent students, but differs in that all types of student actions are considered rather than only the use of assessment actions.
the n-gram approach and a variant of the RNN known as the long short-term memory (LSTM) architecture (Hochreiter & Schmidhuber, 1997). These two both model sequences of data and provide a probability LSTM architectures and similar variants have recently -2015), in part due to its mutable memory that allows for the capture of long- and short-range dependencies in sequences. Since student learning behaviour can be represented as a sequence of actions from a successful learning. In previous work, modelling of student clickstream data has shown promise with methods such as n-gram models (Wen & Rosé, 2014).

DATASET
The dataset used in this chapter came from a Statistics BerkeleyX MOOC from Spring 2013. The original MOOC dataset contains 17 million events from around 31,000 students, where each event is a record of a user interacting with the MOOC in some way. These interactions include events such as navigating to a particular URL in the course, up-voting a forum thread, The data is processed so that each unique user has types of events are possible. Every row in the dataset action taken or the URL accessed by the student.
Thus, every unique user's set of actions is representunique kinds. Our recorded event history included students navigating to different pages of the course, and wiki pages. Within these pages, we also recorded the actions taken within the page, such as playing and pausing a video or checking a problem. We also In this rendition of our pre-processing, we record plicit association with the URL navigated to by the sequential event. Table 19.1 catalogs the different types of events present in the dataset as well as whether event or not. In our pre-processing, some of these accessing for these events. Some events are recorded only know that the action took place, but not which URL that action is tied to in the course. Note that any event that occurred fewer than 40 times in the origiand seq prev refer to events triggered when students select navigation buttons visible on the browser page. can be given sequences of arbitrary length.
Of the 31,000 students, 8,094 completed enough cation sometimes means that the student paid for a lion of the original 17 million events, with an average this chapter, as we chose to train the generative models under the hypothesis that the sequence of actions that a successful pattern of navigation for this MOOC.
Each row in the dataset contained relevant information ter, we do not consider time or other possibly relevant resource the student accesses or the action taken by the student. Events that occurred fewer than 40 times throughout the entire dataset were removed, as those tended to be rarely accessed discussion posts or user students navigating through the MOOC.
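The pre-processing described above (one sequence of events per student, with events seen fewer than 40 times removed) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `build_vocabulary` and `encode`, and the choice to reserve index 0, are assumptions made for the example.

```python
from collections import Counter

MIN_COUNT = 40  # frequency threshold stated in the text


def build_vocabulary(sequences, min_count=MIN_COUNT):
    """Map every sufficiently frequent event to a positive integer index."""
    counts = Counter(event for seq in sequences for event in seq)
    frequent = sorted(e for e, c in counts.items() if c >= min_count)
    # Index 0 is left unused (for example, for padding); this is an assumption.
    return {event: i + 1 for i, event in enumerate(frequent)}


def encode(sequences, vocab):
    """Drop rare events and replace the remaining ones with their indices."""
    return [[vocab[e] for e in seq if e in vocab] for seq in sequences]
```

Each encoded list is then one student's action history, in order, expressed as integer indices.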

METHODOLOGY
In this work, we investigate the use of two generative models, the recurrent neural network architecture, and the n-gram. In this section, we detail the architecture as the n-gram, are described afterwards.

Recurrent Neural Networks
Recurrent neural networks (RNNs) are a family of neural network models designed to handle arbitrary length sequential data. Recurrent neural networks work by keeping around a continuous, latent state that persists throughout the processing of a particular sequence. This latent state captures relevant information about the sequence so far, so that prediction at later parts latent state. As the name implies, RNNs employ the networks while also imposing a recurring latent state that persists between time steps. Keeping the latent state around between elements in an input sequence is what gives recurrent neural networks their sequential modelling power. In this work, each input into the RNN will be a granular student event from the MOOC dataset. The RNN is trained to predict the student's

LSTM Models
A popular variant of the RNN is the long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) architecture, which is thought to help RNNs learn that learn when to retain meaningful information latent state, allowing for meaningful long-term interactions to persist. LSTMs add additional gating when to clear and when to augment the latent state with useful information. Instead, each hidden state h_i is replaced by an LSTM cell unit, which contains additional gating parameters. Because of these gates, LSTMs have been found to train more effectively than Schmidhuber, & Cummins, 2000). The update equations for an LSTM are as follows:
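A commonly used formulation of the LSTM with forget gates is given here for reference. The notation is an assumption and may differ from the chapter's own: x_t is the embedded input at step t, h_t the hidden state, c_t the cell state, W, U, and b are learned weights and biases, σ is the logistic sigmoid, and ⊙ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here f_t decides how much of the previous cell state to clear, i_t how much new information to write, and o_t how much of the cell state to expose as the hidden state.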

LSTM Implementation
The generative LSTM models used in this chapter were library built on top of Theano (Bergstra et al., 2010; Bastien et al., 2012). The model takes each student and then consumes the embedded vector one at a time. The use of an embedding layer is common in natural language processing and language modelling multi-dimensional semantic space. An embedding layer is used here with the hypothesis that a similar mapping may occur for actions in the MOOC action action, given actions previously taken by the student. Back propagation through time (Werbos, 1988) is used Categorical cross entropy is used for calculating loss, and were added between LSTM layers as a method to curb of network edge weights for each batch of training data. In future work, it may be worthwhile to evaluate 2014). We have made a version of our pre-processing and LSTM model code public, 1 tracting only the navigational actions from the dataset.
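To make the loss concrete, the following sketch shows how a vector of output scores over the action space can be turned into a probability distribution with a softmax and then scored with categorical cross entropy against the action the student actually took. It is a plain-Python illustration under the assumption of a softmax output layer, not the chapter's Theano-based code; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution over actions."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, target_index):
    """Negative log-probability assigned to the action actually taken."""
    return -math.log(probs[target_index])
```

The loss is small when the model places high probability on the observed next action and grows as that probability approaches zero.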

LSTM Hyperparameter Search
As part of our initial investigation, we trained 24 LSTM models each with a different set of hyperparameters algorithm making a full pass through the data.The searched space of hyperparameters for our LSTM models is shown in Table 19.2.These hyperparameters were chosen for grid search based on previous work Schmidhuber, 2015).For the sake of time, we chose not to train 3-layer LSTM models with learning rates of where we used the results from the initial investigahyperparameter and training methods.

Cross Validation
To evaluate the predictive power of each model, 5-fold cross validation was used. Each model was trained on 80% of the data and then validated on the remaining actions was in a validation set once. For the LSTMs, the model held out 10% of its training data to serve as the hill climbing set to provide information about validation accuracy during the training process. Each row in the held out set consists of the entire sequence of actions a student took. The proportion of correct computed for each sequence of student actions. The proportions for an entire fold are averaged to generate the model's performance for that particular fold, LSTM model hyperparameter set.

1 https://github.com/CAHLR/mooc-behavior-case-study
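The evaluation protocol described above can be sketched in a few functions: a 5-fold split over students, a 10% hill-climbing hold-out taken from each training portion, and a fold score computed as the mean of per-sequence accuracies. The helper names and the decision to start predicting from the second action are assumptions made for this illustration; `predict_next` stands for any model that maps a history of actions to a predicted next action.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train, validation) index lists; each item is validated once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def split_hill_climbing(train_indices, fraction=0.1):
    """Hold out a fraction of the training indices for monitoring training."""
    cut = max(1, int(len(train_indices) * fraction))
    return train_indices[cut:], train_indices[:cut]


def sequence_accuracy(predict_next, seq):
    """Proportion of correct next-action predictions within one sequence."""
    hits = sum(predict_next(seq[:t]) == seq[t] for t in range(1, len(seq)))
    return hits / (len(seq) - 1)


def fold_accuracy(predict_next, sequences):
    """Mean per-sequence accuracy; assumes some sequence has 2+ actions."""
    scores = [sequence_accuracy(predict_next, s) for s in sequences if len(s) > 1]
    return sum(scores) / len(scores)
```

Averaging per sequence rather than per action means that every student contributes equally to a fold's score, regardless of how many actions they logged.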

Shallow Models
N-gram models are simple yet powerful probabilistic models that aim to capture the structure of sequences through the statistics of n-grams and are equivalent to (n-1)-order Markov chains.

$$P(x_i \mid x_{i-(n-1)}, \ldots, x_{i-1})$$

the previous n-1 states in the training set. N-gram models are both fast and simple to compute, and have n-grams relatively high parameter models that essentially assign a parameter per possible action in the action space.
For the n-gram models, we evaluated those where n ranged from 2 to 10, the largest of which corresponds to handle predictions in which the training set contained no observations, we employed backoff, a method that recursively falls back on the prediction of the largest n-gram that contains at least one observation. Our validation strategy was identical to the LSTM models, wherein the average cross-validation score of the same
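A minimal sketch of an n-gram predictor with backoff, in the sense described above, is shown below. It counts which action follows each context of up to n-1 previous actions and, at prediction time, uses the longest context that was observed in training. The class name `BackoffNGram` and its interface are illustrative assumptions, not the chapter's implementation.

```python
from collections import Counter, defaultdict


class BackoffNGram:
    """Predict the next action from the longest previously seen context."""

    def __init__(self, n):
        self.n = n
        # counts[context][next_action] -> number of observations
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for t in range(len(seq)):
                # Record contexts of every length from 0 up to n-1.
                for k in range(self.n):
                    if t - k < 0:
                        break
                    self.counts[tuple(seq[t - k:t])][seq[t]] += 1
        return self

    def predict(self, history):
        # Try the longest usable context first, then back off to shorter ones.
        for k in range(min(self.n - 1, len(history)), -1, -1):
            context = tuple(history[len(history) - k:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None  # nothing was observed during training
```

For example, after fitting a 3-gram model, a history ending in an unseen pair of actions falls back on the bigram context, and then on overall action frequencies if the last action was never seen either.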

Course Structure Models
We also included a number of alternative models aimed inspecting the sequences was that certain actions are repeated several times in a row. For this reason, it is important to know how well this assumption alone sequence, we evaluated the ability of the course syl-plished this by mapping course content pages to student page transitions in our action set, which yielded an overlap of 174 matches out of the total 300 items in the syllabus. Since we relied on matching action space, a small subset of possible overlapping actions were not mapped. Finally, we combined both models, wherein the current state was predicted as labus.
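The two heuristics mentioned above can be illustrated with very small functions: one that always predicts a repeat of the last action, and one that predicts the item following the current one in an ordered syllabus. The syllabus representation as a plain list and both function names are assumptions for this sketch; the chapter's exact mapping and its combined model are not reproduced here.

```python
def same_as_last(history):
    """Predict that the student repeats their most recent action."""
    return history[-1] if history else None


def next_in_syllabus(history, syllabus):
    """Predict the item that follows the current one in the syllabus order."""
    if not history or history[-1] not in syllabus:
        return None  # current action has no counterpart in the syllabus
    position = syllabus.index(history[-1])
    if position + 1 < len(syllabus):
        return syllabus[position + 1]
    return None  # already at the last syllabus item
```

Baselines of this kind are cheap to compute and give a reference point for judging how much a learned sequence model adds.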
In this section, we discuss the results from the previously mentioned LSTM models trained with different learning rates, number of hidden nodes per layer, and number of LSTM layers.Model success is determined through 5-fold cross validation and is related to how

LSTM Models
models computed after 10 iterations of training. For the models with a learning rate of .01, accuracy on the hill climbing sets generally peaked at iteration 10. For the models with the lower learning rates, it would be improve through more training. We chose to simply report results after 10 iterations instead to provide a snapshot of how well these models are performing model performance is unlikely to improve drastically over the .01 learning rate model performances in the rate is bolded for emphasis.
One downside of using LSTMs is that they require investigating the best hyperparameters to use, we chose to train additional models based only on a as opposed to just 10 epochs in the previous hypera large improvement over the previous results, where the new accuracy peaked at .7223 compared to .7093.
Figure 19.3 shows validation accuracy on the 10% hill-climbing hold out set during training by epoch for Each data point represents the average hill-climbing accuracy among all three learning rates for a particular layer and node count combination. Empirically, having a higher number of nodes is associated with a higher start with lower validation accuracies for a few epochs before approaching or surpassing the corresponding 1 10 epochs; clearly, for some parameter combinations, more epochs would result in a higher hill-climbing accuracies may start lower initially before improving over their lower-layer counterparts.

Course Structure Models
Model performance for the different course structure models is shown in

N-gram Models
performing models made predictions using either the previous 7 or 8 actions (8-gram and 9-gram respectively). Larger histories did not improve performance, were competitive with the LSTM models, although the best n-gram model performed worse than the best LSTM models. Table 19.7 shows the proportion made using 10-gram observations. Further, fewer than 1% of cases fell back on unigrams or bigrams to make lack of observations for larger gram patterns.
by successively larger n-grams.

Validating on Uncertified Students
2 layers) to predict actions on streams of data from we restricted analysis to students who had at least 30 LSTM model was able to predict actions correctly racy, compared to .7093 cross-validated accuracy for application in providing an automated suggestion framework to help guide students.
In this work, we approached the problem of modelling granular student action data by modelling all types of interactions within a MOOC. This differs in approach from previous work, which primarily focuses on modelling latent student knowledge using assessment performing LSTM model produced a cross-validation accuracy of .7223, which was an improvement over the best n-gram model accuracy of .7035: 210,000 more

Both the LSTM and the n-gram models have room for improvement. In particular, our n-gram models could techniques, which allow for better handling of unseen parameter grid search, more training time, longer action embeddings. Additionally, the signal-to-noise ratio in our dataset could be increased by removing less informative or redundant student actions, or adding additional tokens to represent time between actions.

CONTRIBUTIONS
The primary reason for applying deep learning models to large sets of student action data is to model student behaviour in MOOC settings, which leads to insights about how successful and unsuccessful students navigate through the course. These patterns can be leveraged to help in the creation of automated recommendation systems, wherein a struggling student can be provided with transition recommendations to view content based on their past behaviour and performance. To evaluate the possibility of such an application, we plan to test a recommendation system derived from our network against an undirected work should assess performance of similar models course-general patterns can be learned using a single model. The models proposed in this chapter maintain a computational model of behaviour. It was demonstrat-

the previous content page in the course respectively, while a seq goto represents a jump within a section to any other section within a chapter. then clicks on section 5 within the navigation bar student's sequence would be represented by five URL of chapter 2, section 1, the second to a play video navigation goto event. The model would be given a dict what should come after. The indices therefore represent the sequence of actions the student took.

Figure 19.1 shows a diagram of a simple RNN, where inputs would be student actions and outputs would equations below show the mathematical operations used on each of the parameters of the RNN model: h_t represents the continuous latent state. This latent

Figure 19.2 illustrates the anatomy of a cell, where the mentioned update equations for the LSTM: f_t, i_t, and o_t represent the gating mechanisms used by the LSTM

Figure 19.1. Simple recurrent neural network
performance within the range of the LSTM or n-gram results.

Figure 19.3. Average accuracy by epoch on hill climbing data, which comprised 10% of each training set.
computational model was able to detect these patterns, what can the model tell us about student behaviours the model tracks a hidden behavioural state for the and correlated with other attributes of the students known to present at that time. Future work will seek to open up this computational model of behaviour so that it may help inform our own understanding of the student condition.

This work was supported by a grant from the National

Table 19.1. Logged Event Types and their Sp

Table 19.2. models, as well as other course structure models, are validated through 5-fold cross validation.

Table 19.3. Table 19.4.
Table 19.5. Results suggest that many actions can be predicted from simple heuristics such as stationarity (same as last), or course content