By Dr Alec Christie – Centre for Environmental Policy, Imperial College London
We’re facing a global biodiversity crisis, but also an evidence crisis. Making the right decisions to protect species and habitats is crucial, but it’s tough when scientific knowledge on conservation actions is scattered across thousands of scientific studies. Thankfully, dedicated researchers compile this evidence into resources like the Conservation Evidence database (www.conservationevidence.com), summarising what works and what doesn’t for conservation actions. Still, even with these summaries, finding the right information quickly can be a challenge for busy conservationists, meaning that such knowledge is still not being used to its full potential to conserve biodiversity.
Could Artificial Intelligence (AI) help? Tools like ChatGPT, which are built on Large Language Models (LLMs), are incredibly good at processing and generating text to answer questions and summarise documents. It seems like a perfect match – feed an LLM a database of evidence on conservation actions and ask it questions, right?
Not so fast. While LLMs are powerful, they can also make mistakes, invent facts (“hallucinate”), or reflect biases from their training data. Using them carelessly – or “out-of-the-box” – for vital conservation decisions could lead to misinformation and wasted effort, potentially even harming wildlife. So, the big question is: can we design AI systems that reliably help access conservation evidence, without these risks?
An Undergrad Takes on the Challenge
That’s what our team, led by Radhika Iyer, an undergraduate student at the University of Cambridge funded through the AI@Cam initiative and the UROP summer research scheme, set out to investigate. We wanted to see whether modern LLMs could accurately retrieve information from the Conservation Evidence database and answer simple questions, and how they stacked up against human experts. The paper is published in the journal PLOS ONE: https://doi.org/10.1371/journal.pone.0323563.
(Image generated using Chat-GPT-4-Omni)
Putting AI Through an Exam
Think of it like setting an exam. We used an LLM (Claude 3.5 Sonnet) to automatically generate thousands of multiple-choice questions based only on the information within the Conservation Evidence database. We created a large set of 1867 filtered questions to test different setups, and a smaller, carefully curated set of 45 questions that were checked by human researchers for clarity, accuracy, and representativeness.
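For the technically curious, here is a minimal sketch of how this kind of automatic question generation can be done with the Anthropic Python SDK. The prompt wording and the example evidence passage are illustrative placeholders, not the exact prompts or data used in our study.

```python
# Illustrative sketch only: generate one multiple-choice question from a single
# evidence summary using the Anthropic Python SDK. The prompt and passage are
# placeholders, not the exact prompts or data used in the study.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

evidence_passage = (
    "A replicated, controlled study found that installing nest boxes "
    "increased occupancy by cavity-nesting birds at restored sites."
)  # placeholder standing in for one Conservation Evidence summary

prompt = (
    "Using ONLY the passage below, write one multiple-choice question with "
    "four options (A-D) and state which option is correct.\n\n"
    f"Passage:\n{evidence_passage}"
)

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # the question-generation model used in the study
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)

print(response.content[0].text)  # generated question, options, and answer key
```

Questions generated this way can then be filtered automatically and, for the curated set, checked by hand before being used as exam items.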
Then, we gave this exam to ten different state-of-the-art LLMs at the time of doing the study:
- Llama 3.1 8B Instruct-Turbo (FP8)
- Llama 3.1 70B Instruct-Turbo (FP8)
- Gemma2 Instruct – 9B (BF16)
- Gemma2 Instruct – 27B (BF16)
- Mixtral 8x22B Instruct (BF16)
- Gemini 1.5 Flash (gemini-1.5-flash-001)
- Gemini 1.5 Pro (gemini-1.5-pro-001)
- Claude 3.5 Sonnet (claude-3-5-sonnet-20240620)
- GPT-4o (gpt-4o-2024-08-06)
- GPT-4o mini (gpt-4o-mini-2024-07-18).
We then tested them under various conditions:
Closed Book: The AI had to answer using only its pre-existing knowledge (like trying an exam without studying the textbook).
Open Book (Oracle): The AI was given the exact piece of text from the database that contained the answer (the source document).
Open Book (Confused): The AI was given the exact piece of text from the database that contained the answer AND another randomly selected piece of text to potentially ‘confuse’ it.
Open Book (Retrieval): This is the most realistic scenario. The AI had to first find the relevant information within the database using different search strategies before answering. We tested three common methods:
- Sparse retrieval: Like searching for keywords.
- Dense retrieval: Searching based on the meaning or semantics of the question.
- Hybrid retrieval: A combination of both keyword-based and meaning-based search, designed to get the best of both worlds (a simplified sketch of this approach follows this list).
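For readers who want a concrete picture, here is a minimal sketch of hybrid retrieval in Python, blending a keyword (BM25) score with an embedding-similarity score for each document. The libraries, the toy documents, and the equal weighting of the two scores are assumptions for illustration; they are not the exact retrieval setup used in our study.

```python
# Illustrative sketch of hybrid retrieval: blend a keyword (BM25) score with an
# embedding-similarity score, then rank documents by the combined score.
# Library choices and the 50/50 weighting are assumptions for illustration.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

documents = [
    "Installing nest boxes increased occupancy by cavity-nesting birds.",
    "Creating ponds benefited amphibian populations in farmland.",
    "Fencing reduced livestock trampling of seagrass beds.",
]  # placeholder evidence summaries, not real database entries

query = "Do nest boxes help birds?"

# Sparse (keyword) scores
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
sparse_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense (semantic) scores: cosine similarity of normalised embeddings
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(documents, normalize_embeddings=True)
query_vec = encoder.encode(query, normalize_embeddings=True)
dense_scores = doc_vecs @ query_vec

# Rescale each score set to [0, 1] so they can be blended fairly
def minmax(scores):
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

hybrid_scores = 0.5 * minmax(sparse_scores) + 0.5 * minmax(dense_scores)
best = int(np.argmax(hybrid_scores))
print(f"Top document: {documents[best]}")
```

The retrieved document(s) are then passed to the LLM alongside the question, so that it answers from the evidence rather than from memory alone.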
Finally, we asked six human experts from the Conservation Evidence team – people deeply familiar with the database – to take the same 45-question curated exam, finding the source document and answering each question. This gave us a crucial benchmark: could the best AI setup match expert performance?
The Results: AI Can Compete, But Setup is Everything
What did we find?
Expert-Level Performance is Possible: When using the best setup (the “hybrid” search strategy), several top LLMs (like GPT-4o and Llama 3.1 70B) answered the curated questions with accuracy comparable to, or even slightly exceeding, the average human expert. All tested LLMs did vastly better than just random guessing.
How AI Finds Information Matters MOST: The search strategy was critical. The hybrid approach significantly outperformed both keyword-only and meaning-only searches, both in finding the correct document and in helping the AI answer correctly. Its ability to find the right document was also on par with that of the human experts.
“Out-of-the-Box” is Risky: Without access to the database documents (the “closed book” test), AI performance dropped considerably. The models still showed some background conservation knowledge, but this varied considerably across fields of conservation, with no obvious patterns by taxonomic group or habitat. This confirms that relying on general LLM knowledge alone for specific evidence is unreliable.
Speed Advantage: LLMs provided answers almost instantly, whereas human experts took a median time of over two minutes per question.
AI is Improving Fast: We also tested an older model (GPT-3.5 Turbo) which performed much worse than all the newer LLMs, even smaller, cheaper ones. This shows rapid progress in AI capabilities.
Open-source models can match closed-source models: Among the top-performing models were both closed- and open-source models, suggesting that low-cost alternatives exist – particularly important for building decision-support systems in conservation, where funding is limited.
What Does This Mean for Conservation?
Our results are exciting because they show that carefully designed AI systems have the potential to act as expert-level assistants for accessing specific evidence from databases like Conservation Evidence. Imagine an intelligent search tool that quickly points conservationists to the most relevant evidence to address their specific problem.
However, our findings also come with a strong dose of caution. Simply plugging a question into a general chatbot is not the way to get reliable evidence-based answers. The setup – particularly how the system retrieves information – is crucial to avoid poor performance and misinformation.
This study is a first step, focusing on multiple-choice questions. Future research needs to explore how well these systems perform on more complex questions requiring nuanced, free-text answers. We expect performance to decline on such questions, and once we understand where that threshold lies, we can design systems that stay within reliable boundaries.
There are ethical considerations too, including: ensuring equitable access to AI expertise and resources; minimising environmental costs by using the lowest-cost, most efficient (ideally open-source) models that give acceptably high levels of performance; and avoiding over-reliance on AI at the expense of critical thinking and evaluation of evidence.
Looking Ahead
AI offers promising tools, but we must develop and use them responsibly. The next challenge is to see how AI performs with more complex, open-ended questions that require nuanced thinking and reasoning. The recent release of a series of AI models focused on reasoning provides an ideal testing ground for this work. There is also the potential to expand our approach to other databases in other fields and disciplines.
Whilst AI won’t replace the need for expert judgement and local knowledge in conservation, tools built carefully upon studies like ours could significantly speed up access to vital scientific evidence, helping conservationists make better-informed decisions for the future of biodiversity.
Acknowledgements
The work involved a multidisciplinary team: Radhika Iyer, Sam Reynolds, and William Sutherland (Department of Zoology, Cambridge), Alec Christie (Centre for Environmental Policy, Imperial College London), Sadiq Jaffer and Anil Madhavapeddy (Department of Computer Science, Cambridge). Radhika Iyer conducted the research as part of a summer undergraduate project at Cambridge, supported by the AI@Cam project and the UROP scheme, as well as an unrestricted donation from Tarides. Sadiq Jaffer was funded by an unrestricted donation from John Bernstein. Alec Christie was funded by an Imperial College Research Fellowship.