Modern plant breeding programs collect several data types, such as weather, images, and secondary or associated traits, in addition to the main trait (e.g., grain yield). Genomic data is high-dimensional and often crowds out smaller data types when the two are naively combined to explain the response variable. Methods are needed that can effectively combine data types of differing sizes to improve predictions. In this work, we develop a new three-step statistical method that predicts multi-category traits by combining three data types (genomic, weather, and secondary trait data), and we address the various challenges that arise in this problem. Compared to machine learning methods such as random forest and SVM, we reduced model complexity by over 90% at an 8% reduction in prediction accuracy.
Watch the video to learn more about our work and results! This work was presented at INFORMS 2021.