A dataset of insect sounds from 459 species for bioacoustic machine learning

Revolutionizing Insect Monitoring with AI: A New Dataset for Bioacoustic Machine Learning

This analysis explores InsectSet459, a dataset of insect sounds from 459 species that supports deep learning for biodiversity monitoring, despite remaining limits in data volume and diversity.


Executive Impact: Empowering Biodiversity Intelligence

The introduction of InsectSet459 expands the scope for AI in entomological research. With 226.6 hours of audio from 459 species, it supports the development and benchmarking of species classification models, which are needed to understand and address global insect population declines. The dataset addresses the current scarcity of monitoring information and gives ecologists and conservationists a scalable way to track species distribution and occurrence.

459 Insect Species Covered

226.6 Hours of Audio Data

72.2% Peak Classification Accuracy

Deep Analysis & Enterprise Applications

The findings are grouped into three topics: Dataset Curation, Technical Validation, and Usage Notes & Future Directions.

Dataset Curation

Details on how the InsectSet459 dataset was built, including data sources, deduplication, file formatting, and dataset splits for machine learning.

26,298 Total Audio Files

Dataset Curation Process

1. Download recordings from Xeno-Canto, iNaturalist, and BioAcoustica.
2. Deduplicate (SHA256 checksums, user uploads).
3. Keep species with 10 or more sound examples.
4. Trim audio files to a maximum of 2 minutes.
5. Standardize to WAV/MP3 and convert stereo to mono.
6. Split 60/20/20 into Train/Validation/Test.
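
The deduplication, filtering, and splitting steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `(species, audio_bytes)` record layout, the function names, and the per-species shuffling are all hypothetical, and only checksum-based deduplication is shown (the "user uploads" check is not).

```python
import hashlib
import random
from collections import defaultdict


def deduplicate(records):
    """Keep the first record seen for each distinct SHA-256 checksum.

    `records` is a list of (species, audio_bytes) pairs; this layout is
    an assumption made for the example.
    """
    seen, unique = set(), []
    for species, data in records:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((species, data))
    return unique


def filter_min_examples(records, min_examples=10):
    """Group records by species and drop species below the threshold."""
    by_species = defaultdict(list)
    for species, data in records:
        by_species[species].append(data)
    return {s: f for s, f in by_species.items() if len(f) >= min_examples}


def split_60_20_20(files, seed=0):
    """Shuffle one species' files and cut them into train/validation/test."""
    files = list(files)
    random.Random(seed).shuffle(files)
    n_train = round(0.6 * len(files))
    n_val = round(0.2 * len(files))
    return files[:n_train], files[n_train:n_train + n_val], files[n_train + n_val:]


# Made-up data: 10 distinct files plus one exact duplicate for species_a,
# and a single file for species_b (which falls below the threshold).
records = [("species_a", bytes([i])) for i in range(10)]
records.append(("species_a", bytes([0])))
records.append(("species_b", b"only one recording"))

kept = filter_min_examples(deduplicate(records))
train, val, test_files = split_60_20_20(kept["species_a"])
print(len(train), len(val), len(test_files))
```

Splitting within each species keeps rare classes represented in every subset, which matters when the threshold is as low as 10 recordings. Whether the published split was stratified this way is not stated here.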

Feature	InsectSet66 (Prior)	InsectSet459 (New)
Species Count	66	459
Total Audio Duration	24 hours	226.6 hours
Geographic Coverage	Limited	Heavily biased to Europe/N. America, improving
Ultrasonic Frequencies	Limited mention	Preserved where available (25% of data)
Weak Labels	Yes	Yes
File Segmentation	Pre-segmented (overlapping sections)	Continuous files, max 2 min trim

Technical Validation

Analysis of the deep learning models (EfficientNetv2 and PaSST) used to benchmark the dataset, including performance metrics and challenges.

57.5% Best F1 Score Achieved (PaSST)

Model	F1 Score (%)	Accuracy (%)	Notes
InsectEffNet	56.8	72.2	Based on EfficientNetv2-S, ImageNet21k pre-trained. Uses 44.1 kHz, 128 Mel bands.
PaSST	57.5	68.1	Transformer-based, uses 32 kHz, 128 Mel bands. Achieved slightly higher F1 score.
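
Accuracy sitting well above F1 in both rows is what one would expect on imbalanced data if F1 is averaged per class. The averaging scheme is not stated in this summary, so the macro average below is an assumption. The toy labels are made up and only illustrate how a model that ignores a rare class can still score high accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 8 recordings of a common species, 2 of a rare one,
# and a classifier that always predicts the common species.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10

print(accuracy(y_true, y_pred))  # 0.8
print(macro_f1(y_true, y_pred))  # about 0.444: the rare class contributes F1 = 0
```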

Challenge: Long-tail Distribution and Data Imbalance

The dataset exhibits a significant long-tail distribution, with many species having fewer than 25 recordings. This imbalance presents a major challenge for deep learning models, leading to much lower F1 scores for less-frequent categories. While class weighting was applied, more advanced data augmentation or additional data for rare species is needed. This is a common issue in ecological datasets, requiring robust solutions for real-world deployment.
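
The text says class weighting was applied but does not name the scheme. One common choice is inverse-frequency weighting, sketched here as an assumption: each class gets the weight N / (K * n_c), where N is the number of recordings, K the number of classes, and n_c the count for class c. The function names and toy counts are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return weight_c = N / (K * n_c) for every class c in `labels`."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}


def weighted_nll(prob_of_true_class, label, weights):
    """Negative log-likelihood of one example, scaled by its class weight."""
    return -weights[label] * math.log(prob_of_true_class)


# Toy long-tailed label set: 90 recordings of one species, 10 of another.
labels = ["common"] * 90 + ["rare"] * 10
weights = inverse_frequency_weights(labels)
print(weights["rare"], weights["common"])

# The same mistake costs nine times more on the rare class.
print(weighted_nll(0.5, "rare", weights) / weighted_nll(0.5, "common", weights))
```

Weighting rebalances the loss but adds no new information about rare species, which is consistent with the observation above that augmentation or more data is still needed.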

Opportunity: Multi-Sample-Rate Models for Ultrasonic Species

Approximately 25% of InsectSet459 contains ultrasonic frequencies. The benchmarking models generated spectrograms only up to the Nyquist limit of their input sample rates: about 22 kHz for InsectEffNet (44.1 kHz input) and 16 kHz for PaSST (32 kHz input). Content above those limits is discarded before the model sees it. This leaves a clear opportunity for future work on multi-sample-rate models, which could improve performance for species that sing primarily in the ultrasonic range.
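
The frequency ceilings quoted above follow from the Nyquist limit: a signal sampled at f_s can represent frequencies only up to f_s / 2. The short sketch below computes that limit and the share of a call's frequency band that survives at a given sample rate. The 20 to 40 kHz band is a made-up example, not a measurement from the dataset, and the helper names are hypothetical.

```python
def nyquist_khz(sample_rate_hz):
    """Highest representable frequency, in kHz, for a given sample rate."""
    return sample_rate_hz / 2 / 1000


def retained_fraction(band_khz, sample_rate_hz):
    """Share of the band [low, high] (in kHz) that lies below Nyquist."""
    low, high = band_khz
    if high <= low:
        raise ValueError("band must satisfy low < high")
    limit = nyquist_khz(sample_rate_hz)
    return max(0.0, min(high, limit) - low) / (high - low)


print(nyquist_khz(44_100))  # 22.05
print(nyquist_khz(32_000))  # 16.0

# Hypothetical ultrasonic call occupying 20 to 40 kHz.
call_band = (20.0, 40.0)
for rate in (32_000, 44_100, 96_000):
    print(rate, retained_fraction(call_band, rate))
```

For this hypothetical band, a 32 kHz model sees none of the call and a 44.1 kHz model sees only a sliver. That is the motivation for models that accept higher or multiple sample rates.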

Usage Notes & Future Directions

Recommendations for using InsectSet459, limitations, and potential avenues for future research and development in bioacoustic AI.

Strategic Use: Pre-training and Fine-tuning

InsectSet459 is ideal for pre-training deep learning models for insect sound recognition. Due to its broad species and sample-rate coverage, models pre-trained on this dataset can then be fine-tuned with smaller, more specific datasets (e.g., regional, taxonomic group-specific, or strongly labeled) to achieve high performance in targeted monitoring tasks. This approach leverages the dataset's diversity without requiring complete species coverage for every local deployment.

Weak Labels Focus on Classification, Not Event Detection

Leveraging Metadata: Location, Temperature, Background

The dataset's annotation file includes rich metadata such as geographic location, ambient temperature, and noted background species. This information can be leveraged to improve classifier performance by:

1. Limiting species predictions to sensible geographic ranges.
2. Incorporating temperature data to account for its influence on insect songs.
3. Utilizing background labels (where available) to refine models for complex real-world recordings.

Users combining datasets should also use observation records to prevent data leakage from duplicates.
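
Restricting predictions to plausible geographic ranges can be done as a post-processing step on classifier output. The sketch below is one simple way to do it, assuming each species has a latitude/longitude bounding box. The species names, boxes, and probabilities are invented for illustration, and real range data would usually be more detailed than a rectangle.

```python
def mask_by_range(probs, ranges, lat, lon):
    """Zero out species whose bounding box excludes (lat, lon), then renormalize.

    probs:  {species: probability} from a classifier
    ranges: {species: (lat_min, lat_max, lon_min, lon_max)}; species
            without an entry are left untouched.
    """
    kept = {}
    for species, p in probs.items():
        box = ranges.get(species)
        if box is None:
            kept[species] = p
            continue
        lat_min, lat_max, lon_min, lon_max = box
        inside = lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
        kept[species] = p if inside else 0.0
    total = sum(kept.values())
    if total == 0:
        # Every candidate was excluded: fall back to the raw scores.
        return dict(probs)
    return {s: p / total for s, p in kept.items()}


# Invented classifier output and ranges.
probs = {"species_a": 0.5, "species_b": 0.3, "species_c": 0.2}
ranges = {
    "species_a": (35.0, 60.0, -10.0, 30.0),
    "species_b": (25.0, 50.0, -125.0, -65.0),
}

# Recording made at latitude 48, longitude 2: species_b is out of range.
masked = mask_by_range(probs, ranges, lat=48.0, lon=2.0)
print(masked)
```

A hard mask like this trades recall for precision: a species recorded outside its known range would be suppressed, so a softer prior may be preferable where range data is uncertain.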


