Journal of Pathology Informatics

J Pathol Inform 2015,  6:28

Exploring viewing behavior data from whole slide images to predict correctness of students' answers during practical exams in oral pathology

1 Faculty of Computing, Poznan University of Technology, M. Sklodowska-Curie Square 5, 60-965, Poznan, Poland
2 Institute for Molecular Medicine Finland FIMM, University of Helsinki, P.O. Box 20, FN-00014, Helsinki, Finland
3 Department of Clinical Pathology, Poznan University of Medical Sciences, Przybyszewski Str. 49, 60-355 Poznan, Poland

Date of Submission: 26-Mar-2015
Date of Acceptance: 01-Apr-2015
Date of Web Publication: 03-Jun-2015

Correspondence Address:
Slawomir Walkowski
Faculty of Computing, Poznan University of Technology, M. Sklodowska-Curie Square 5, 60-965, Poznan

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/2153-3539.158057


The way users view whole slide images (WSI) can be tracked and analyzed. In particular, it can be useful to learn how medical students view WSIs during exams and how their viewing behavior is correlated with the correctness of the answers they give. We used a software-based view path tracking method that enabled gathering data about the viewing behavior of multiple simultaneous WSI users. This approach was implemented and applied during two practical exams in oral pathology in 2012 (88 students) and 2013 (91 students), which were based on questions with attached WSIs. Gathered data were visualized and analyzed in multiple ways. As part of an extended analysis, we tried to use machine learning approaches to predict the correctness of students' answers based on how they viewed WSIs. We compared the results of analyses for years 2012 and 2013, done for a single question, for student groups, and for a set of questions. The overall patterns were generally consistent across these two years. Moreover, viewing behavior data appeared to have some potential for predicting answers' correctness, and some outcomes of the machine learning approaches were in the right direction. However, general prediction results were not satisfactory in terms of precision and recall. Our work confirmed that the view path tracking method is useful for discovering the viewing behavior of students analyzing WSIs. It provided multiple useful insights in this area, and the general results of our analyses were consistent across the two exams. On the other hand, predicting answers' correctness appeared to be a difficult task: students' answers seem to be often unpredictable.

Keywords: Practical examination, view path tracking, viewing behavior, whole slide image

How to cite this article:
Walkowski S, Lundin M, Szymas J, Lundin J. Exploring viewing behavior data from whole slide images to predict correctness of students' answers during practical exams in oral pathology. J Pathol Inform 2015;6:28

How to cite this URL:
Walkowski S, Lundin M, Szymas J, Lundin J. Exploring viewing behavior data from whole slide images to predict correctness of students' answers during practical exams in oral pathology. J Pathol Inform [serial online] 2015 [cited 2021 Sep 17];6:28. Available from:

Background

Whole slide images (WSI) open up many new possibilities, and education is one area in which they are particularly useful. Digital representations of histological slides can be used not only in teaching but also in examination. If students view WSIs during an exam to answer the questions, we can track the way they actually navigate through the slides and the areas they look at. Gathering such data over multiple exams and analyzing it using advanced methods, such as machine learning, can provide interesting insights into students' viewing behavior and how it is correlated with the correctness of the answers they give. We can also try to predict whether a student will give a correct or incorrect answer based solely on the way he or she viewed a slide.

Viewing behavior on WSIs can be tracked in a variety of ways. Some methods involve eye movement tracking, [1] but they require specialized equipment, which does not scale well to tracking many students taking the exam at the same time. We use the software-based view path tracking method, which has already been presented. [2] The current paper extends the use of this method to collect more data, draw more general conclusions from an extended analysis, and use more advanced methods, such as machine learning, to attempt to predict answer correctness from data describing viewing patterns.

Methods

Practical exams in oral pathology at Poznan University of Medical Sciences in Poznan, Poland have been conducted with the use of WSIs since 2005. In the first few years, the possibilities of tracking students' WSI viewing behavior during the exams were limited due to the lack of a reasonably scalable tracking method. In 2012, we introduced the view path tracking method. [2] It is integrated with the WSI system (WebMicroscope, Fimmic Ltd, Helsinki, Finland), does not require any specialized equipment, and is based on records sent to the central database while the slides are viewed. Each record contains information about the WSI area (view field) displayed for a while on the student's monitor while the student was navigating through the WSI. Additionally, records contain data identifying the given student, the question, and the time when the fragment was displayed.
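
To make the structure of the tracking data more concrete, the following is a minimal sketch of how one view field record and a per-student view path could be represented. All field names, the normalized coordinate convention, and the `group_view_paths` helper are assumptions made for illustration; they are not taken from the WebMicroscope system or its database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewFieldRecord:
    """One displayed view field (hypothetical field names)."""
    student_id: str       # identifies the student
    question_id: int      # identifies the exam question (and its WSI)
    timestamp: float      # seconds since exam start (assumed unit)
    x: float              # left edge of the view field, as a fraction of slide width
    y: float              # top edge of the view field, as a fraction of slide height
    width: float          # view field width, as a fraction of slide width
    height: float         # view field height, as a fraction of slide height
    magnification: float  # magnification level at which the field was displayed


def group_view_paths(records):
    """Group records into one time-ordered view path per (student, question)."""
    paths = {}
    for rec in records:
        paths.setdefault((rec.student_id, rec.question_id), []).append(rec)
    for path in paths.values():
        path.sort(key=lambda r: r.timestamp)
    return paths


# Toy data: two students viewing the slide attached to question 7.
records = [
    ViewFieldRecord("s1", 7, 12.0, 0.40, 0.35, 0.20, 0.15, 10.0),
    ViewFieldRecord("s1", 7, 3.0, 0.00, 0.00, 1.00, 1.00, 1.0),
    ViewFieldRecord("s2", 7, 5.0, 0.10, 0.10, 0.50, 0.50, 2.5),
]
paths = group_view_paths(records)
```

Keying the paths by student and question mirrors the unit of analysis used later, where each answer is described by metrics computed from a single view path.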

We collected data from two exams, held in 2012 (88 students) and 2013 (91 students). Each student in each year answered 50 exam questions. The view path tracking method was applied to all students participating in the exams. This resulted in a total of about 130,000 view field records gathered during these two exams. More detailed numbers, split by year, can be found in the summary in [Table 1].
Table 1: Summary of the tracking data collected during two practical exams in oral pathology


Most of the WSIs which appeared in the exam questions in 2012 were also present in the 2013 exam, which made certain comparative analyses possible. As in the earlier paper, [2] the general analysis methods include generating visualizations (both static images and animations) and calculating measures. In terms of visualizations, drawing all students' view paths from one year on a single image and comparing it with the analogous drawing for another year seems to be a good way of comparing viewing patterns from year to year [Figure 1]. Similarly, calculated metrics can be aggregated for each year and compared side by side.
Figure 1: Comparing aggregated students' viewing patterns occurring year to year for a whole slide image with neurofibroma


In each year, students took classes in oral pathology in 6 groups. These groups were supervised by different teaching assistants, and the impact of these teachers on students' exam scores has been analyzed. [3] To see if there are any differences across metrics calculated for each group, we included group number as a dimension in one of the analyses in the current work.

Finally, we went beyond the general analysis of viewing behavior among students answering correctly and incorrectly. Our goal was to predict answer correctness based on the calculated metrics. To approach this task, we treated the prepared data as training and testing datasets in a typical binary classification problem, where the computed statistics are treated as features (attributes) and the correctness of an answer is the label. Since incorrect answers were much rarer than correct ones, it was convenient for us to focus on predicting incorrect answers. After looking at the correlation between metrics' values and answer incorrectness, we trained machine learning models and explored their prediction potential. We tried multiple types of models and eventually focused on two: decision trees and random forests. We used two software environments that offer implementations of these (and many other) models: R [4] and Weka. [5]

Results

Having data from two exams available, we wanted to check whether conclusions from the exam in year 2012 [2] hold true for the exam in 2013. One potential level of analysis is comparing how all students who answered the given question were viewing the WSI attached to this question. An example of a WSI for which viewing patterns are consistent from year to year is the case of well-differentiated papillary squamous cell carcinoma [Figure 2]. We can see that relations between the values of six metrics calculated for students answering correctly and incorrectly are consistent across two exam years. On the other hand, it can be noticed that the magnitude of differences has changed for some metrics (for example, the difference in number of view steps is larger in year 2013).
Figure 2: Average metrics values and visualization comparison for a question with a whole slide image containing well-differentiated papillary squamous cell carcinoma


Numbers aggregated within student groups, supervised by different teachers, were also compared. If we look at statistics calculated for students answering correctly and incorrectly in each group, we can notice that relations between average values of measures like number of view steps and viewing speed (expressed as number of view steps divided by viewing time) are mostly consistent across multiple student groups and exam years [Figure 3].
Figure 3: Average values of metrics aggregated for all questions within student groups, which were supervised by different teachers during the classes
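
As a rough illustration of the two measures discussed above, the sketch below computes the number of view steps and the viewing speed (view steps divided by viewing time) for one view path. The function name is hypothetical, and taking viewing time as the span between the first and last recorded view field is an assumption; the text does not state how viewing time was measured.

```python
def view_path_metrics(timestamps):
    """Return (view_steps, viewing_time, viewing_speed) for one view path.

    timestamps: times in seconds at which each view field was displayed.
    Assumption: viewing time is the span from the first to the last record.
    """
    ordered = sorted(timestamps)
    view_steps = len(ordered)
    viewing_time = ordered[-1] - ordered[0] if view_steps > 1 else 0.0
    # Viewing speed as defined in the text: view steps per unit of viewing time.
    viewing_speed = view_steps / viewing_time if viewing_time > 0 else 0.0
    return view_steps, viewing_time, viewing_speed


# Toy view path with four view fields over 20 seconds.
steps, duration, speed = view_path_metrics([0.0, 4.0, 10.0, 20.0])
```

For this toy path the function yields 4 view steps over 20 seconds, that is, 0.2 view steps per second.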


We also compared metrics aggregated across all questions and students, but separately for years 2012 and 2013. In this general comparison, we first limited the set of questions to those for which we had at least 3 correct and 3 incorrect answers (in the analyzed year). Then, we compared the number of questions for which the average metric value was higher for correct answers with the number of questions for which the average value was higher for incorrect answers. The results of the side-by-side comparison for the two exam years are presented in [Figure 4]. Although the magnitude differs, it can be seen that the relations between the counts for 2012 are preserved in the results for 2013, which confirms the general patterns observed. Students answering correctly tended to spend less time viewing the slide, go through fewer view fields but faster, focus more on the diagnostic area (region of interest), and use a lower magnification level, and the fragments they viewed were less dispersed.
Figure 4: General year-to-year comparison metrics for multiple questions. Questions with average measure value higher for students answering correctly/incorrectly are counted separately in green/red bars, respectively
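
The counting procedure behind this comparison can be sketched as follows. The function name and the input layout (one tuple per answer) are assumptions for illustration; only the filtering rule (at least 3 correct and 3 incorrect answers per question) and the comparison of averages come from the description above.

```python
from collections import defaultdict
from statistics import mean


def count_question_directions(answers, min_per_class=3):
    """Count questions by which group has the higher average metric value.

    answers: iterable of (question_id, is_correct, metric_value) for one year.
    Returns (n_higher_for_correct, n_higher_for_incorrect).
    """
    by_question = defaultdict(lambda: {True: [], False: []})
    for question_id, is_correct, value in answers:
        by_question[question_id][bool(is_correct)].append(value)

    higher_correct = higher_incorrect = 0
    for groups in by_question.values():
        # Skip questions with too few correct or too few incorrect answers.
        if len(groups[True]) < min_per_class or len(groups[False]) < min_per_class:
            continue
        avg_correct, avg_incorrect = mean(groups[True]), mean(groups[False])
        if avg_correct > avg_incorrect:
            higher_correct += 1
        elif avg_incorrect > avg_correct:
            higher_incorrect += 1
    return higher_correct, higher_incorrect


# Toy data: question 3 is dropped because it has only two incorrect answers.
answers = (
    [(1, True, v) for v in (10, 12, 11)] + [(1, False, v) for v in (20, 22, 25)]
    + [(2, True, v) for v in (30, 31, 32)] + [(2, False, v) for v in (15, 16, 17)]
    + [(3, True, v) for v in (5, 6, 7)] + [(3, False, v) for v in (50, 60)]
)
counts = count_question_directions(answers)
```

Running the same function once per metric and per year would give the pairs of counts that a chart like the one in [Figure 4] places side by side.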


The attempt to predict the correctness of students' answers, based on data about viewing behavior, was the most challenging task in this work. Based on the results of the above analysis, we expected the calculated measures to have some predictive potential. In [Figure 5], we put values of the 'number of view steps' and 'viewing speed' measures into buckets to show the total volume in each bucket (bars) together with the percentage of incorrect answers (line). These charts confirm some correlation, also consistent with the general analysis from [Figure 4]: when the number of view steps is high or the viewing speed is low, a larger fraction of answers is incorrect. However, this increased percentage occurs in buckets with relatively low volume (a small number of answers).
Figure 5: Analyzing correlation of values of selected metrics with the ratios of incorrect answers
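
The bucketing described above can be sketched in a few lines. The bucket boundaries, the function name, and the toy data are assumptions for illustration and do not reproduce the values shown in [Figure 5].

```python
def incorrect_ratio_by_bucket(values, incorrect_flags, edges):
    """Summarize volume and percentage of incorrect answers per bucket.

    edges: ascending boundaries; bucket i covers [edges[i], edges[i + 1]).
    Returns a list of (lower, upper, volume, pct_incorrect), where
    pct_incorrect is None for an empty bucket.
    """
    summary = []
    for lower, upper in zip(edges, edges[1:]):
        flags = [f for v, f in zip(values, incorrect_flags) if lower <= v < upper]
        volume = len(flags)
        pct = 100.0 * sum(flags) / volume if volume else None
        summary.append((lower, upper, volume, pct))
    return summary


# Toy data: number of view steps per answer and whether the answer was incorrect.
view_steps = [3, 5, 8, 12, 14, 25, 31, 40]
incorrect = [0, 0, 0, 0, 1, 0, 1, 1]
buckets = incorrect_ratio_by_bucket(view_steps, incorrect, [0, 10, 20, 50])
```

Reporting the volume next to the percentage matters here, because a high percentage of incorrect answers in a bucket with very few answers is weak evidence.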


One approach in the prediction experiment was to have a separate decision tree for each selected question, trained on 2012 exam data and tested on 2013 exam data, using 6 selected features. If the performance of these models was good, it would show that answers' correctness within a question can be predicted based solely on viewing patterns registered during a previous exam. This was not the case, and most decision trees trained this way could not correctly predict how students would answer the given question in 2013. However, [Figure 6] shows a tree that detected incorrect answers in 2013 reasonably well, as can be seen in the confusion matrix. Given the difficulty of the task, precision of 50% at recall of 100% is a good result.
Figure 6: A decision tree trained on data from 2012 exam and tested on data from 2013 exam, plus performance of this tree. Algorithm used: rpart from R; upsampling applied for unbalanced set
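
For readers less familiar with these measures, the sketch below shows how precision and recall follow from confusion matrix counts when incorrect answers are treated as the positive class. The counts used in the example are hypothetical and are not the values from [Figure 6]; they were chosen only to reproduce the same precision and recall.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall for the positive class (here: incorrect answers).

    true_pos:  incorrect answers flagged as incorrect
    false_pos: correct answers wrongly flagged as incorrect
    false_neg: incorrect answers that were missed
    """
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical counts: every incorrect answer is found (no false negatives),
# but half of the flagged answers were in fact correct.
precision, recall = precision_recall(true_pos=4, false_pos=4, false_neg=0)
```

With these counts the result is a precision of 0.5 at a recall of 1.0: no incorrect answer is missed, at the cost of one false alarm for every true detection.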


Finally, we used combined data (9214 instances) for all analyzed students and all questions, from both years, to prepare a general model that would predict correctness of any answer to any question. In this dataset, we extended the feature set to all 26 implemented measures. We also added two standardized versions of each measure for better generalization, resulting in 78 features in total. Then, we ran a 10-fold cross-validation experiment, in which we trained and evaluated random forest models (model settings: 200 trees, maximum depth of 5). Each prediction resulted in a value representing the probability of the given answer being incorrect. [Figure 7] shows the distribution of these values (bars - volume, blue line - predicted probability), together with a red line representing the actual percentages of incorrect answers, which ideally should be equal to predicted probabilities. It can be noticed that answers scored as highly probable of being incorrect are indeed more likely to be actually incorrect. However, if we want to detect most of the incorrect answers (i.e., increase recall by lowering the probability threshold), precision drops significantly, as presented in the confusion matrix in [Figure 7], generated for the probability threshold of 50%.
Figure 7: Performance of random forests trained and tested in a 10-fold cross-validation experiment, using combined data from 2012 and 2013 (and an extended feature set). Algorithm used: Random Forest from Weka: 200 trees, max depth 5; cost matrix applied for unbalanced set
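
The trade-off between precision and recall at different probability thresholds can be illustrated with a small sketch. The predicted probabilities and labels below are invented toy data, and the function name is hypothetical; the sketch does not reproduce the random forest or its results.

```python
def confusion_at_threshold(probabilities, actual_incorrect, threshold):
    """Return (tp, fp, fn, tn) when answers with a predicted probability of
    being incorrect >= threshold are flagged as incorrect."""
    tp = fp = fn = tn = 0
    for prob, actual in zip(probabilities, actual_incorrect):
        flagged = prob >= threshold
        if flagged and actual:
            tp += 1
        elif flagged and not actual:
            fp += 1
        elif not flagged and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy predictions: probability of being incorrect, and the actual outcome.
probs = [0.9, 0.7, 0.6, 0.45, 0.4, 0.3, 0.2, 0.1]
actual = [1, 1, 0, 1, 0, 0, 0, 0]
at_50 = confusion_at_threshold(probs, actual, 0.5)
at_40 = confusion_at_threshold(probs, actual, 0.4)
```

In this toy case, lowering the threshold from 0.5 to 0.4 removes the one missed incorrect answer (recall rises from 2/3 to 1) but adds a false alarm (precision falls from 2/3 to 3/5), which is the same kind of trade-off described above.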


There is an extra outcome from training a random forest model: a list of feature importance values, which estimate the prediction potential of each measure. To generate such a list, we fit a random forest model to the combined exam data from the 2 years. It showed that total viewing time (one raw and two standardized versions) and number of viewed fragments were among the top 5 most important features for predicting answer correctness. This is consistent with the general observation that students who view WSIs for a long time and go through many view fields tend to answer questions incorrectly.
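
The text does not specify how the two standardized versions of each measure were computed. Purely as an assumption for illustration, the sketch below standardizes a measure as a z-score within each question and within each student, which would turn one raw measure into three features (and 26 measures into 78). The function name and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within(values, group_keys):
    """Standardize each value against the mean and standard deviation of its group."""
    grouped = defaultdict(list)
    for value, key in zip(values, group_keys):
        grouped[key].append(value)
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in grouped.items()}

    result = []
    for value, key in zip(values, group_keys):
        mu, sigma = stats[key]
        # A group with no spread carries no information; map it to 0.
        result.append((value - mu) / sigma if sigma > 0 else 0.0)
    return result


# Toy data: total viewing time for three students on two questions.
viewing_time = [30.0, 50.0, 70.0, 10.0, 20.0, 30.0]
question_ids = [1, 1, 1, 2, 2, 2]
student_ids = ["a", "b", "c", "a", "b", "c"]

# ASSUMPTION: per-question and per-student z-scores as the two standardized versions.
by_question = zscore_within(viewing_time, question_ids)
by_student = zscore_within(viewing_time, student_ids)
features = list(zip(viewing_time, by_question, by_student))
```

Standardizing within a question removes differences in slide difficulty, while standardizing within a student removes individual viewing habits; either could plausibly help a single model generalize across questions.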

Conclusions

We confirmed the usefulness and scalability of the software-based view path tracking method for WSIs. It was enabled during two practical exams in oral pathology, and data about students' viewing behavior were successfully collected and processed. The presented method is implemented in a WSI viewing system and works in a way that is transparent to the users. As has been described, [2] this approach could also be applied to scenarios other than an exam in oral pathology.

The results demonstrate a variety of analyses that can be done using the data collected. Viewing patterns were discovered for students answering correctly and incorrectly. The overall metrics comparison shows consistency in the outcomes from two exam years and suggests that general viewing patterns are stable. However, the attempt to predict students' answers based on data about WSI viewing behavior appeared to be a difficult task. While some prediction results were in the expected direction, the outcome was not satisfactory in most cases, suggesting that students' answers are often unpredictable.

References

1. Krupinski EA, Tillack AA, Richter L, Henderson JT, Bhattacharyya AK, Scott KM, et al. Eye-movement study and human performance using telepathology virtual slides: Implications for medical education and differences with experience. Hum Pathol 2006;37:1543-56.
2. Walkowski S, Lundin M, Szymas J, Lundin J. Students' performance during practical examination on whole slide images using view path tracking. Diagn Pathol 2014;9:208.
3. Szymas J, Lundin M, Lundin J. Teachers' impact on dental students' exam scores in teaching pathology of the oral cavity using WSI. Diagn Pathol 2013;8 Suppl 1:S25.
4. The R Project for Statistical Computing. Available from: [Last accessed on 2015 Apr 24].
5. Weka 3: Data Mining Software in Java. Available from: [Last accessed on 2015 Apr 24].






