# 5.2. The Dataset

OnTask relies on the existence of data about how learners interact in a given experience. This activity offers an artificially generated dataset to explore the functionality of the platform.

Download the zip file dataset.zip and unpack it into a folder on your personal computer. The folder should contain the following files: student_list.csv, midterm_results.csv, forum_participation.csv, blended_participation.csv and all_data.csv. These files have been derived from the previously described scenario.

This is the information contained in each file:

File student_list.csv

A file with information about 500 students and the following column names:

• Student id (SID),
• Identifier (an auxiliary field),
• email,
• Surname,
• GivenName,
• MiddleInitial,
• Full name,
• Gender,
• Course Code,
• Program, with one of the values FSCI, FEIT, FASS or SMED,
• Enrollment, with one of the values HECS, Local or International, and
• Attendance with values either Full Time or Part Time.

File midterm_results.csv

File with information about 461 students with the following columns:

• SID (with values identical to those in the previous file),
• email,
• Last Name,
• First Name,
• Columns Q01 to Q10 with the result of the 10 multiple choice questions (1 means correct, 0 means incorrect), and
• the column Total with the exam score (out of 100 points).

File forum_participation.csv

File with information about 500 students and their participation in the discussion forum of the course. The columns in this file are:

• SID (with values identical to those in the previous files),
• The columns Days online, Views, Contributions and Questions replicated four times for weeks 2-5 with the week number as suffix for the column name, and
• the accumulated values for Days online, Views, Contributions, and Questions without any suffix.

File blended_participation.csv

File with information about learner engagement with the videos, and with the questions complementing the videos, for weeks 2-5 in the course. The columns in this file are:

• SID (with values identical to those in the previous files),
• Columns with names Video_N_WM contain the percentage of video number N in week M that has been viewed. For example, Video_2_W4 is the percentage of the second video in Week 4 that has been viewed.
• Columns with names Questions_N_WM contain the percentage of questions from group N in Week M that have been answered. For example, Questions_1_W4 is the percentage of questions in block 1 from Week 4 that have been answered.
• Columns with names Correct_N_WM contain the percentage of questions from group N in Week M that have been correctly answered (and therefore a value no larger than the corresponding Questions_N_WM value).
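The naming convention in the list above can be decoded mechanically. The following is a minimal sketch, assuming the column names follow the patterns exactly as described; the function name `parse_column` and the regular expression are illustrative and are not part of OnTask.

```python
import re

# Pattern for the convention described above: <Kind>_<N>_W<M>.
# The three prefixes come from the column descriptions; the regex itself is an assumption.
COLUMN_PATTERN = re.compile(r"^(Video|Questions|Correct)_(\d+)_W(\d+)$")


def parse_column(name):
    """Return (kind, number, week) for a matching column name, else None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    kind, number, week = match.groups()
    return kind, int(number), int(week)


print(parse_column("Video_2_W4"))      # ('Video', 2, 4)
print(parse_column("Questions_1_W4"))  # ('Questions', 1, 4)
print(parse_column("SID"))             # None
```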

File all_data.csv

This file is simply the union of all the columns from the previous files.
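One way to picture this union is to combine the rows of the individual files by their shared SID value. The sketch below uses two tiny made-up tables instead of the real files, and the helper names `read_by_sid` and `merge_on_sid` are hypothetical. How all_data.csv was actually produced is not stated here, so treat this only as an illustration of the idea.

```python
import csv
import io

# Tiny made-up stand-ins for two of the files (not the real data).
STUDENT_LIST = "SID,email\n1,a@example.com\n2,b@example.com\n"
MIDTERM = "SID,Total\n1,80\n"


def read_by_sid(text):
    """Index the rows of a CSV text by their SID value."""
    return {row["SID"]: row for row in csv.DictReader(io.StringIO(text))}


def merge_on_sid(*tables):
    """Combine the columns of several SID-indexed tables, one row per SID."""
    merged = {}
    for table in tables:
        for sid, row in table.items():
            merged.setdefault(sid, {}).update(row)
    return merged


merged = merge_on_sid(read_by_sid(STUDENT_LIST), read_by_sid(MIDTERM))
for sid, row in merged.items():
    print(sid, row)
```

In this sketch the student with SID 2 has no midterm row, so the merged row simply lacks the Total column. The same situation can be expected in the real data, since midterm_results.csv covers 461 students while student_list.csv covers 500.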

## 5.2.1. Key columns

Each file in the dataset stores its data in Comma Separated Values (CSV) format. This format assumes that 1) the first line contains the names of the columns separated by commas, and 2) every line below it contains the data for one row, with the values also separated by commas. The format is used to store information with a table-like structure (rows and columns).
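As a small illustration of this layout, the sketch below reads a made-up CSV text with Python's standard `csv` module. The sample values are invented for the example and are not taken from the dataset.

```python
import csv
import io

# Made-up sample in the CSV layout described above (not taken from the dataset).
SAMPLE = "SID,Q01,Q02,Total\n1001,1,0,50\n1002,1,1,100\n"

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)   # first line: the column names
rows = list(reader)     # remaining lines: one row of values each

print(header)
for row in rows:
    # Pair each value with its column name.
    print(dict(zip(header, row)))
```

To try the same steps on one of the downloaded files, the `io.StringIO(SAMPLE)` object could be replaced with a file opened via `open(path, newline="")`, where `path` points to the unpacked file.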

One type of column is of particular interest in OnTask. If a column has a different value for each of the rows, it is called a key column. These columns are important because a single value unequivocally identifies one row in the table. For example, educational institutions typically assign a unique identifier (SID) to each student. If a table contains information about a set of students (one row per student) and one column holds the student ID, that column is a key column. If a procedure manipulating this table is given a student ID, that value uniquely identifies one row of the table.
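The definition above can be checked directly: a column is a key column when the number of distinct values equals the number of rows. The following is a minimal sketch on made-up data. The function `key_columns` is a hypothetical helper written for this example, not OnTask's own code, and it says nothing about how OnTask detects key columns internally.

```python
import csv
import io

# Made-up sample table (not taken from the dataset).
SAMPLE = (
    "SID,Program,Attendance\n"
    "1001,FSCI,Full Time\n"
    "1002,FEIT,Full Time\n"
    "1003,FSCI,Part Time\n"
)


def key_columns(text):
    """Return the names of the columns whose values are all different."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return []
    return [
        name
        for name in rows[0]
        if len({row[name] for row in rows}) == len(rows)
    ]


print(key_columns(SAMPLE))  # ['SID']

# Because SID is a key column, one SID value selects exactly one row.
rows_by_sid = {row["SID"]: row for row in csv.DictReader(io.StringIO(SAMPLE))}
print(rows_by_sid["1002"]["Program"])  # FEIT
```

Program and Attendance are not key columns in this sample because their values repeat across rows, so a value such as FSCI would match more than one student.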