Journey from Text Mining to Neural Information Retrieval in Agricultural Data Science

Presented a paper at the International Conference on Big Data Applications in Agriculture (ICBAA2017)
5 & 6 December 2017

Agricultural data, like data in other disciplines, tends to suffer from a lack of common structure and centralization. Several international and local institutions spend years on research and leave behind large amounts of unstructured and semi-structured data. Structuring and centralizing data enables large-scale processing and comprehensive decision making. Recently, however, data centralization has lost some of its importance because of the exponential rate of data growth. Current advances in computer hardware and software have provided the capability for online processing of existing and future data.

This paper introduces our journey from text mining technology to neural information retrieval in agricultural data science. Text mining techniques that use regular expressions to retrieve knowledge from semi-structured data tend to be deterministic. We conducted an experiment on agronomic datasets totalling 691K words and extracted 13K local names, 1.7K synonyms and a taxonomy of 2.2K crops. The text mining techniques scored 97% accuracy, based on a verified 20% sample. However, analysing 691K words is a toy problem compared to analysing 3.4 billion words, as with the English Wikipedia corpus. Text mining on 691K words was possible partly because the data was recorded in a semi-structured format, although in many cases the format and structure were not followed. Text mining was necessary to transform the semi-structured information into understandable knowledge. Wikipedia, on the other hand, places no restriction on who collects and documents its data. Text mining techniques based on regular expressions may fail to analyse such huge and complex unstructured datasets.

Advances in artificial intelligence, along with advances in computer processing and storage power, give scientists extraordinary tools for data analysis and visualization that were not possible a few years ago. Neural information retrieval [Mitara, 2017] was used in our research to process a Wikipedia corpus and extract insights from agricultural data. Word2Vec [Mikolov, 2013] and FastText [Joulin, 2016] are two recent neural network approaches for text analysis, developed by Google Research and Facebook Research respectively.
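The regular-expression style of text mining described above can be illustrated with a minimal Python sketch. The record layout (`Crop: ...; Local names: ...; Synonyms: ...`), the sample lines and the function names are assumptions made for illustration only; the abstract does not show the actual format of the agronomic datasets or the patterns that were used. The sketch also shows why such rules are deterministic and brittle: a record that does not follow the expected layout is simply rejected.

```python
import re

# Hypothetical record layout, assumed for illustration only.
RECORD = re.compile(
    r"Crop:\s*(?P<crop>[^;]+);\s*"
    r"Local names:\s*(?P<local>[^;]*);\s*"
    r"Synonyms:\s*(?P<syn>[^;]*)",
    re.IGNORECASE,
)


def split_list(field):
    """Split a comma-separated field into trimmed, non-empty items."""
    return [item.strip() for item in field.split(",") if item.strip()]


def extract(lines):
    """Return (parsed records, lines that did not match the layout)."""
    parsed, rejected = [], []
    for line in lines:
        match = RECORD.search(line)
        if match is None:
            # The rule is deterministic: no match means no knowledge extracted.
            rejected.append(line)
            continue
        parsed.append({
            "crop": match.group("crop").strip(),
            "local_names": split_list(match.group("local")),
            "synonyms": split_list(match.group("syn")),
        })
    return parsed, rejected


# Illustrative sample lines, not taken from the paper's datasets.
SAMPLE = [
    "Crop: Bambara groundnut; Local names: nyimo, jugo bean; Synonyms: Voandzeia subterranea",
    "crop:  Winged bean ;local names: kacang botol; synonyms: ",
    "Moringa - local names unknown",
]

parsed, rejected = extract(SAMPLE)
for record in parsed:
    print(record)
print("rejected:", rejected)
```

The third sample line carries usable information but does not follow the layout, so it ends up in `rejected`. Handling every such deviation needs another hand-written rule, which is one reason this approach is hard to scale to a free-form corpus.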

Our experiment used a dynamic version of those tools and connected them to the CropBASE knowledge system to analyse 3.4 billion words from Wikipedia against our knowledge base. The analysis aimed to identify the relationships between common English crop names and different countries. Word2Vec and FastText transform the text corpus into a vector space (i.e. word embeddings), and the resulting vectors can be used to measure relationships between words. The results of our experiment show strong relationships between crop names and countries that are aligned with expert knowledge. The results can be used as an intermediate search tool for scientists to explore large text corpora.
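As a rough illustration of how word vectors can be used to measure relationships between words, the sketch below ranks country names by cosine similarity to a crop name. Cosine similarity is a common choice for comparing word embeddings, but the abstract does not state which measure was used, so treat it as an assumption. The three-dimensional vectors are invented for the example and are not results from the experiment; real Word2Vec or FastText vectors typically have hundreds of dimensions and are learned from the corpus.

```python
import math

# Toy vectors invented for illustration only; they are NOT trained
# embeddings and do not reflect any result reported in the paper.
EMBEDDINGS = {
    "rice": [0.9, 0.1, 0.2],
    "quinoa": [0.1, 0.9, 0.2],
    "thailand": [0.8, 0.2, 0.3],
    "peru": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_countries(crop, countries, embeddings):
    """Sort countries by similarity to the crop vector, most similar first."""
    scores = [(c, cosine(embeddings[crop], embeddings[c])) for c in countries]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for crop in ("rice", "quinoa"):
    ranking = rank_countries(crop, ["thailand", "peru"], EMBEDDINGS)
    print(crop, [(country, round(score, 3)) for country, score in ranking])
```

With trained embeddings, the same ranking step could be run for each crop name in a knowledge base against a list of country names, and the top-ranked pairs checked against expert knowledge.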