Conference Paper Analysis

BEE: A HIGH-QUALITY CORPUS AND FULL-STACK SUITE TO UNLOCK ADVANCED FULLY OPEN MLLMS

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. A comprehensive ablation study further dissects the impact of our data curation process, revealing that each stage provides significant performance gains across a wide range of benchmarks. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.


Executive Impact & Key Findings

Bee-8B sets a new state of the art among fully open MLLMs, a result the authors attribute to high-quality data curation.

~15M High-Quality QA Pairs

Human-Preferred Reasoning

93.0% CountBench SOTA

67.0% MathVerse SOTA

Deep Analysis & Enterprise Applications


HoneyPipe: Our Data Curation Pipeline

HoneyPipe is a transparent, adaptable pipeline that turns raw data into a high-quality SFT dataset with dual-level CoT. It relies on model-based filtering and enrichment, which keeps the process scalable.

Data Aggregation & Deduplication → Noise & Irrelevance Filtering → Short CoT Enrichment → Fidelity Verification (Loop) → Long CoT Enrichment → Final Honey-Data-15M Corpus

Note: The "Fidelity Verification" stage involves a loop where instructions failing initial checks are routed to "Long CoT Enrichment" and then re-verified, ensuring high-quality outputs for complex problems.
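The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the HoneyPipe or DataStudio code: the `Sample` type, `run_pipeline`, and the four callables (`is_clean`, `short_cot`, `long_cot`, `is_faithful`) are hypothetical names, each callable stands in for a model call, and deduplication is reduced to an exact-match check for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Sample:
    question: str
    answer: str                      # original (reference) answer
    response: Optional[str] = None   # CoT-enriched response, once generated


def run_pipeline(
    samples: Iterable[Sample],
    is_clean: Callable[[Sample], bool],
    short_cot: Callable[[Sample], str],
    long_cot: Callable[[Sample], str],
    is_faithful: Callable[[Sample], bool],
) -> list[Sample]:
    """Hypothetical control flow; every callable stands in for a model call."""
    seen: set[tuple[str, str]] = set()
    kept: list[Sample] = []
    for s in samples:
        key = (s.question, s.answer)
        if key in seen:               # deduplication (exact match, for brevity)
            continue
        seen.add(key)
        if not is_clean(s):           # noise and irrelevance filtering
            continue
        s.response = short_cot(s)     # short CoT enrichment
        if not is_faithful(s):        # fidelity verification
            s.response = long_cot(s)  # failed check: route to long CoT
            if not is_faithful(s):    # re-verify; drop if still inconsistent
                continue
        kept.append(s)
    return kept
```

Passing the stages in as callables keeps the loop itself independent of whichever models perform the filtering, enrichment, and verification.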

Bee-8B Performance vs. Leading MLLMs

Bee-8B establishes a new state of the art (SOTA) among fully open MLLMs and is competitive with, and in some cases surpasses, recent semi-open models. All values are benchmark scores in percent.

Task	Benchmark	LLaVA-OneVision-7B*	Molmo-7B-D*	Qwen2.5-VL-7B†	Keye-VL-8B†	InternVL3.5-8B†	Bee-8B-SFT*	Bee-8B-RL*
General VQA	AI2D	81.4	81.0	84.3	86.7	84.0	83.8	85.3
General VQA	CountBench	-	84.8	74.1	78.0	-	90.5	93.0
General VQA	MMMU (val)	48.8	45.3	58.6	71.4	73.4	66.8	66.1
General VQA	MMStar	61.7	56.1	63.9	75.5	69.3	69.0	71.4
General VQA	MMVet	57.5	41.5	67.1	79.0	83.1	83.3	83.9
General VQA	RealWorldQA	66.3	70.7	68.5	67.7	67.5	70.1	73.1
Table & Chart & OCR	CharXiv (RQ)	-	-	42.5	45.4	44.4	55.3	57.3
Table & Chart & OCR	ChartQA (test)	80.0	84.1	87.3	86.3	86.7	86.7	86.1
Table & Chart & OCR	DocVQA (val)	-	-	95.5	88.5	87.2	87.0	87.0
Table & Chart & OCR	InfoVQA (val)	-	-	81.4	67.4	72.3	72.3	72.9
Table & Chart & OCR	OCRBench	62.2	65.6	86.4	85.1	84.0	83.1	82.5
Math & Reasoning	LogicVista	33.3	-	44.1	54.8	57.3	56.8	61.3
Math & Reasoning	MathVerse (vision_only)	26.2	4.2	25.1	59.8	61.5	61.9	67.0
Math & Reasoning	WeMath	20.9	-	35.2	60.7	57.0	55.0	59.8

(*) denotes fully open models; (†) denotes semi-open models. A dash (-) means no score is reported. Higher scores are better. In the original table, bold marks the best and second-best scores.

General VQA Capabilities

Bee-8B performs well across general visual question answering tasks. In the table above, Bee-8B-RL posts the top score on CountBench (93.0), MMVet (83.9), and RealWorldQA (73.1), while trailing InternVL3.5-8B on MMMU (66.1 vs. 73.4).

73.1% RealWorldQA Accuracy

Bee-8B-RL scores 73.1% on RealWorldQA, the highest in the comparison table; the next-best competing model, Molmo-7B-D, scores 70.7%.

Case Study: Complex Scene Interpretation (General with Short CoT)

User Question: Can you tell what the people are doing?

Context: Image of a street scene with a large group of people, traffic cones, and signs like "ONLY", "NO TURN ON RED".

Bee-8B Enriched Response Snippet: "The image shows a large group of people gathered in a public space, likely on a street or near a building. Here's what can be inferred about their activities: 1. Crowd Gathering: The people are standing in a large group... 2. Event or Parade: The presence of traffic cones, barricades... 3. Traffic Management: The traffic lights and signs... Overall, the people appear to be participating in or observing a public event, such as a parade, festival, or community gathering."


Case Study: Precise Object Counting & Analysis (Grounding & Counting with Short CoT)

User Question: Think out loud before giving the final conclusion. Is there a large blue cube made of the same material as the small sphere?

Context: Image of 3D geometric shapes (cylinders, cubes, spheres) with different colors and materials (matte/metallic).

Bee-8B Enriched Response Snippet: "To answer the question, let's analyze the image step by step: 1. Identify the objects in the image: ... 2. Examine the material of the small sphere: ... 3. Examine the material of the large blue cube: ... 4. Compare the materials: ... 5. Final conclusion: There is a large blue cube, and it appears to be made of the same material as the small red sphere."


Document, Chart, & OCR Performance

Bee-8B performs strongly on structured visual content such as documents, tables, and charts. Bee-8B-RL leads the comparison on CharXiv (RQ) at 57.3 and is competitive on ChartQA, but trails Qwen2.5-VL-7B on DocVQA, InfoVQA, and OCRBench.

86.7% ChartQA Accuracy

Bee-8B-SFT scores 86.7% on ChartQA (test), level with InternVL3.5-8B and just behind Qwen2.5-VL-7B at 87.3%.

Case Study: Multi-Dataset Chart Reasoning (Chart with Long CoT)

User Question: Which algorithm has the smallest accuracy summed across all the datasets? Please conclude your answer as Answer: xxx at the end if possible.

Context: Bar chart titled "Accuracy of algorithms on different datasets" showing "bird" and "poet" accuracy for "artery" and "steel" algorithms.

Bee-8B Enriched Response Snippet: "To determine which algorithm has the smallest accuracy summed across all datasets, we analyze the two algorithms (artery and steel) by calculating the total accuracy across both datasets (bird and poet) for each: 1. Artery algorithm: ... Sum of accuracies: 1 + 6 = 7. 2. Steel algorithm: ... Sum of accuracies: 9 + 8 = 17. Comparing the two sums (7 for artery and 17 for steel), the smaller sum belongs to the artery algorithm."
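The arithmetic in this response is easy to check. The snippet below is a small sketch that uses only the per-dataset values quoted in the response (1 and 6 for artery, 9 and 8 for steel); the variable names are illustrative, and the snippet does not assume which value belongs to which dataset.

```python
# Per-dataset accuracies as quoted in the response above.
accuracy = {"artery": [1, 6], "steel": [9, 8]}

# Sum each algorithm's accuracy across datasets, then pick the smallest total.
totals = {name: sum(values) for name, values in accuracy.items()}
smallest = min(totals, key=totals.get)

print(totals)                 # {'artery': 7, 'steel': 17}
print(f"Answer: {smallest}")  # Answer: artery
```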


Case Study: Detailed Document Comprehension (Document with Long CoT)

User Question: Which service has the next renewal date after Spotify?

Context: Image of a "Monthly Subscription Summary Report - October 2023" with renewal dates for Spotify, Netflix, Hulu, Adobe, and Amazon Prime.

Bee-8B Enriched Response Snippet: "To determine which service has the next renewal date after Spotify, we first identify Spotify's renewal date and then compare it with the renewal dates of other services: 1. Locate Spotify Premium's renewal date: November 1, 2023. 2. List the renewal dates of all other services: ... Identify the earliest date after November 1, 2023: November 7, 2023, which corresponds to Netflix Standard Plan."


Math & Reasoning Breakthroughs

Bee-8B shows its largest gains on math and reasoning benchmarks. Bee-8B-RL leads the comparison on LogicVista (61.3) and MathVerse (67.0) and narrowly trails Keye-VL-8B on WeMath (59.8 vs. 60.7).

67.0% MathVerse RL Score

Bee-8B-RL achieves a top-ranking 67.0% on MathVerse (vision_only), 5.5 percentage points above the best semi-open model in the table (InternVL3.5-8B at 61.5%).

Case Study: Advanced Geometric Problem Solving (STEM with Long CoT)

User Question: Given: In △ABC, the perpendicular bisectors of AB and AC intersect BC at points M and N, respectively. AB = 4, AC = 7, BC = 10. Find the perimeter of △AMN.

Context: Geometric diagram of triangle ABC with internal points M and N related to perpendicular bisectors and median CE.

Bee-8B Enriched Response Snippet: "To solve the problem, we use the property of perpendicular bisectors: any point on the perpendicular bisector of a segment is equidistant from the segment's two endpoints. Step 1: Analyze the Perpendicular Bisectors... Step 2: Express the Perimeter of △AMN... Step 3: Relate to BC... Answer: The perimeter of △AMN is 10."
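The answer follows from AM = BM and AN = CN, so the perimeter AM + MN + AN equals BM + MN + NC = BC = 10 whenever M lies between B and N on BC. As a numerical sanity check, the sketch below places the triangle in coordinates (B at the origin and C on the x-axis, which is a choice made here and not part of the original problem) and computes the perimeter directly.

```python
import math

AB, AC, BC = 4.0, 7.0, 10.0

# Place B at the origin and C at (BC, 0); locate A from the side lengths.
ax = (AB**2 + BC**2 - AC**2) / (2 * BC)
ay = math.sqrt(AB**2 - ax**2)

# M = (m, 0) satisfies MA = MB: m^2 = (m - ax)^2 + ay^2, so m = AB^2 / (2 * ax).
m = AB**2 / (2 * ax)
# N = (n, 0) satisfies NA = NC, which gives BC - n = AC^2 / (2 * (BC - ax)).
n = BC - AC**2 / (2 * (BC - ax))

AM = math.hypot(m - ax, ay)
AN = math.hypot(n - ax, ay)
MN = n - m  # positive here, i.e. the order along BC is B, M, N, C

perimeter = AM + MN + AN
print(round(perimeter, 6))  # 10.0
```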


Case Study: Robustness through Fidelity Checks (Factual Inaccuracy Example)

User Question: What is the radius of the circle O if the length of arc AB is 4*pi and the measure of angle AOB is 40 degrees?

Context: Diagram of a circle with center O, an arc AB, and angle AOB = 40 degrees.

Filtering Reason: "The new answer introduces a numerical value 18 that conflicts with the original precise answer 9." This case was filtered out during fidelity verification, which is how the pipeline keeps enriched responses consistent with the original answers and keeps conflicting samples out of the training data.
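To make the idea of a numerical conflict concrete, here is a deliberately simple, rule-based sketch. It is not the HoneyPipe fidelity check, which the source describes as model-based; the function name `conflicting_numbers` and the regular expression are assumptions made for illustration, and the function is meant to compare final answers only, not full reasoning traces.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def conflicting_numbers(original_answer: str, new_answer: str) -> list[str]:
    """Return numbers in new_answer that do not appear in original_answer.

    Hypothetical rule-based stand-in for a model-based fidelity check.
    Returns an empty list when the original answer contains no number.
    """
    original = {float(x) for x in NUMBER.findall(original_answer)}
    if not original:
        return []
    return [x for x in NUMBER.findall(new_answer) if float(x) not in original]


print(conflicting_numbers("9", "18"))  # ['18']
```

A non-empty result would flag the sample for rejection, mirroring the filtering reason quoted above.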



Your Implementation Roadmap

A phased approach to integrate Bee-8B into your enterprise, ensuring robust performance and measurable ROI.

Phase 1: Foundation & Alignment

Establish core vision-language capabilities through MLP warmup and comprehensive multimodal alignment using curated datasets.

Phase 2: Advanced Instruction Tuning

Leverage Honey-Data-15M for large-scale SFT, instilling complex reasoning and instruction-following skills, including dual-level CoT strategies.

Phase 3: Refinement & Optimization

Apply efficient refinement SFT on targeted subsets and Group Relative Policy Optimization (GRPO) to enhance response quality and reliability.

Phase 4: Customization & Deployment

Tailor Bee-8B to specific enterprise needs, fine-tuning on proprietary data and deploying within your secure infrastructure.
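The GRPO step in Phase 3 scores several sampled responses to the same prompt and rewards each one relative to its group. The sketch below shows only that group-relative normalization, under assumptions not stated in the source: rewards are plain floats, the population standard deviation is used, and `eps` is a small constant added to avoid division by zero. It omits the policy-gradient update itself and is not the authors' training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct responses get roughly +1, incorrect ones roughly -1
```

If every response in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal for the step.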

