Document Type: Original Article
Authors
1. Department of Biosystem Engineering, Faculty of Agriculture, University of Tabriz, Tabriz, Iran
2. Department of Plant Protection, Faculty of Agriculture, University of Tabriz, Tabriz, Iran
DOI: 10.22034/jam.2026.68067.1332
Abstract
Introduction
The integration of artificial intelligence and machine vision into agriculture has opened new horizons for the early detection of pests and plant diseases. These technologies are particularly valuable in greenhouse environments, where rapid intervention is essential to minimize crop losses. Traditional approaches have predominantly relied on static image classification, which cannot capture the temporal dynamics of symptoms such as progressive leaf damage. In this context, this study explores the use of video-based deep learning for the detection of thrips-induced damage on cucumber leaves. By comparing the performance of two modern spatiotemporal models, MovieNet and SlowFast, the research aims to identify a reliable solution for real-time, accurate detection of pest damage in greenhouse conditions.
Materials and Methods
This study aimed to investigate the effectiveness of deep learning models, particularly convolutional neural networks, in detecting thrips-infested cucumber leaves using video-based input rather than still images. To this end, a dataset of 606 images was collected from the research greenhouse of the University of Tabriz, comprising 274 images of healthy leaves and 332 images of thrips-infested leaves. Given the widespread presence of this pest in the greenhouse, images were captured under natural conditions from various angles and distances to enhance data diversity and improve the model's generalization to real-world scenarios. Care was also taken to ensure similar conditions when capturing images of healthy leaves to maintain class balance.
To simulate temporal dynamics and enable video-based learning, the collected images were converted into short video clips: four randomly selected images were combined sequentially to form each video. The frame rate was deliberately set to a low value of 3 frames per second to facilitate meaningful temporal feature extraction. After this preprocessing step and the application of several offline data augmentation techniques, the final dataset comprised 909 videos, including 411 videos of healthy leaves and 498 videos of thrips-infested leaves.
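The grouping of still images into fixed-length clips can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the in-memory array representation, and the dummy 8×8 images are assumptions; in practice each clip would be written out as a 3 fps video file.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for a reproducible grouping

def build_clips(images, frames_per_clip=4):
    """Group randomly selected still images into fixed-length clips.

    images: list of (H, W, C) uint8 arrays, all the same size.
    Returns a list of (T, H, W, C) arrays with T = frames_per_clip,
    mimicking the paper's four-images-per-video construction.
    """
    order = rng.permutation(len(images))
    clips = []
    # Step through the shuffled indices in non-overlapping groups of four
    for i in range(0, len(order) - frames_per_clip + 1, frames_per_clip):
        group = [images[j] for j in order[i:i + frames_per_clip]]
        clips.append(np.stack(group, axis=0))
    return clips

# Example: 10 dummy 8x8 RGB "images" yield 2 clips of 4 frames each
imgs = [np.zeros((8, 8, 3), dtype=np.uint8) for _ in range(10)]
clips = build_clips(imgs)
```

A real pipeline would then encode each stacked array as a video at the chosen low frame rate (e.g. with OpenCV's `cv2.VideoWriter`), which is where the 3 fps setting would take effect.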
For the learning task, two deep spatiotemporal architectures were employed: MovieNet and SlowFast. Both models are known for their ability to capture motion and spatial patterns effectively. Prior to training, the video data were split into training (70%), validation (15%), and test (15%) sets using stratified sampling to preserve class distribution across subsets. All videos were resized and normalized according to the input requirements of the respective architectures. The models were trained with hyperparameters tuned to ensure effective convergence and to minimize overfitting.
Performance evaluation was conducted using common classification metrics, including accuracy, precision, recall, and F1-score, computed on the test set. Additionally, confusion matrices and training-validation loss curves were analyzed to further assess model behavior during training and generalization capability.
Results and Discussion
Training-validation loss curves highlighted key differences between the two models. In the case of MovieNet, the training and validation loss both decreased rapidly at the beginning, indicating effective learning. However, during later epochs, the validation loss diverged from the training loss, suggesting overfitting. The model achieved 100% test accuracy, but this was considered unreliable due to the relatively small test set and the model’s tendency to memorize rather than generalize. Conversely, SlowFast demonstrated fluctuating loss values during the initial training phases, possibly due to its more complex architecture and optimization strategy. Despite the instability early on, both training and validation losses eventually converged, indicating improved generalization. This model achieved a final test accuracy of 99.27%, with a test loss of 0.0425, reflecting strong performance.
Detailed class-wise evaluation revealed that the healthy leaf class achieved 98.41% precision and 100% recall, indicating that no healthy samples were misclassified. The thrips-damaged class recorded 100% precision and 98.67% recall, suggesting high detection accuracy with minimal false negatives. The overall F1-score, precision, and recall all exceeded 99%, confirming balanced and accurate performance across both classes.
The confusion matrix further validated these results. All 62 healthy samples were correctly classified, with zero misclassifications. Among the 75 thrips-damaged samples, 74 were correctly identified, with only one instance misclassified as healthy. This minimal error highlights the robustness of the SlowFast model in binary classification of pest damage.
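The reported class-wise figures follow directly from these confusion-matrix counts. As a check, the sketch below (the helper function is illustrative, not from the paper) recomputes them with thrips-damaged treated as the positive class:

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Precision, recall, and F1 for the positive class, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# SlowFast test-set counts: 62 healthy samples all correct; 74 of 75
# thrips-damaged samples correct, with 1 misclassified as healthy.
p, r, f1 = metrics_from_confusion(tp=74, fp=0, fn=1, tn=62)
accuracy = (74 + 62) / (74 + 62 + 1)
# p = 1.0 (100% precision for the thrips class)
# r = 74/75 ≈ 0.9867 (98.67% recall)
# accuracy = 136/137 ≈ 0.9927 (99.27%), matching the reported test accuracy
```

The single false negative is the only source of loss in every metric, which is why precision, recall, and F1 all sit above 99% overall.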
Conclusion
This research demonstrates the efficacy of video-based deep learning methods for detecting thrips damage on cucumber leaves in greenhouse environments. Unlike conventional static image approaches, video enables the capture of dynamic changes and subtle visual cues over time, enhancing model accuracy and reliability.
Between the two models tested, SlowFast outperformed MovieNet, providing superior generalization and higher classification accuracy without overfitting. Its architectural design, particularly the dual-pathway temporal processing and ResNet-50 backbone, enabled it to achieve a final test accuracy of 99.27% and excellent precision-recall balance across both classes.
This video-based approach demonstrated several key advantages over traditional image-based methods, including enhanced accuracy through the capture of temporal symptom progression, reduced misclassification caused by static noise, and improved pattern recognition in dynamic real-world scenarios. These strengths highlight the potential of video-based deep learning techniques for integration into intelligent monitoring systems in modern greenhouses, offering farmers the ability to detect and respond to pest infestations more promptly and effectively.
Future work should explore multi-class detection of various pests and diseases, as well as the incorporation of attention mechanisms or transformer-based video models to further improve accuracy. Additionally, developing mobile or cloud-based platforms for model deployment could make this technology more accessible for real-world agricultural applications.
Acknowledgement
The authors would like to thank the students working in the greenhouse for their cooperation and for allowing data collection during their research activities.