Data-Juicer: A One-Stop Data Processing System for Large Language Models

Sep 5, 2023

Daoyuan Chen∗, Yilun Huang∗, Zhijian Ma∗, Hesen Chen∗, Xuchen Pan†, Ce Ge†, Dawei Gao†, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li‡, Bolin Ding‡, Jingren Zhou

Alibaba Group

ABSTRACT

The immense evolution in Large Language Models (LLMs) has underscored the importance of massive, diverse, and high-quality data. Despite this, existing open-source tools for LLM data processing remain limited and mostly tailored to specific datasets, with an emphasis on the reproducibility of released data over adaptability and usability, inhibiting potential applications. In response, we propose a one-stop, powerful yet flexible and user-friendly LLM data processing system named Data-Juicer. Our system innovatively offers over 50 built-in versatile operators and pluggable tools, which synergize modularity, composability, and extensibility dedicated to diverse LLM data processing needs. By incorporating visualized and automatic evaluation capabilities, Data-Juicer enables a timely feedback loop to accelerate data processing iterations and gain data insights. To enhance usability, Data-Juicer provides out-of-the-box components for users with various backgrounds, and fruitful data recipes (formulated as configuration files) for LLM pre-training and post-tuning usages. Further, we employ multi-facet system optimization and seamlessly integrate Data-Juicer with both LLM and distributed computing ecosystems, to enable efficient and scalable data processing. Empirical validation of the generated data recipes reveals considerable improvements in LLaMA performance for various pre-training and post-tuning cases, demonstrating up to 7.45% relative improvement of averaged score across 16 LLM benchmarks and 16.25% higher win rate using pair-wise GPT-4 evaluation. The system’s efficiency and scalability are also validated, supported by up to 88.7% reduction in single-machine processing time, 77.1% and 73.1% less memory and CPU usage respectively, and 7.91x processing acceleration when utilizing distributed computing ecosystems. Our system, refined data recipes, and multiple interactive tutorial demos are released and actively maintained at https://github.com/alibaba/data-juicer, calling for broader research centered on LLM data.

1        INTRODUCTION

Background and Motivation. Large Language Models (LLMs) [10, 19, 61, 62, 81, 84] have recently achieved unprecedented intelligence, enabling applications that would otherwise be infeasible due to unsatisfactory performance. As the “food” for LLMs, data plays a pivotal role in these exciting advancements [29, 54, 63, 97]. Generally speaking, LLMs are built by pre-training on large-scale general-purpose corpora and are post-tuned with specific-purpose data for alignment or downstream tasks. For pre-training data, a collection of diverse data types, including web text, dialogue, academic papers, code bases, and others, helps to develop a vast repository of knowledge and broad applicability [10, 50, 67]. Post-tuning data, which focuses more on helpfulness, further refines LLMs and aligns model behavior with human values [3, 42, 60]. However, as “garbage in, garbage out” suggests, the quality of input data has a direct impact on the quality of the derived models [39]. Devising an effective method for LLM data processing to enhance LLMs remains a sophisticated yet under-explored task, given the common challenges in processing both pre-training and post-tuning data, which require striking a delicate balance between data quality, data diversity, and data volume.

Unfortunately, only a few open-source projects contribute their training data and the corresponding processing code [23, 44], particularly in comparison to the numerous open-source projects on models and training infrastructures [7, 8, 20, 59, 72, 87, 99]. Such limited development of data processing obstructs progress in quantitatively understanding and enhancing LLMs from the perspective of data.

Challenges in LLM data processing. Existing studies concentrate on a restricted set of LLM datasets and lack the systematized, modular processing capability needed to handle diverse datasets and processing pipelines for LLMs. For instance, RedPajama [23] and BLOOM [72] mainly release heuristic processing scripts specific to several distinct pre-training datasets. The Alpaca [84] dataset targets the augmentation of diversity for LLaMA [87] post-tuning, whereas AlpaGasus [18] aims to filter out low-quality samples from Alpaca. These ad-hoc data processing practices may not transfer to new LLM datasets. In addition, existing works prioritize data reproducibility over user experience, thereby reducing adaptability for diverse user needs and alternative use cases such as assistant-tool utilization and data-insight exploration. In the meantime, performance optimization considerations are often bypassed, leaving significant room for improvement in managing burgeoning data volumes and facilitating cost-effective data processing. Comprehensive development of data processing systems tailored to the dynamic and emerging needs of LLM data remains largely unexplored, especially given the bloom of innovative LLM applications across different domains [86, 88, 96, 102].

Given these circumstances, this paper advocates for a one-stop data processing system that addresses these challenges, delivering versatile, flexible, user-friendly, and efficient data processing abilities to facilitate future LLM data processing and research.

Overview and Capabilities of the Proposed System. In this paper, we propose Data-Juicer, as depicted in Figure 1: a comprehensive, efficient, and scalable data processing system tailored for LLMs.

Figure 1: Overview of Data-Juicer.

The proposed system innovatively decouples the mixed components of existing LLM data processing, such as specific data types, auxiliary models, and downstream tasks, and fosters the abstraction and implementation of standardized and composable modules. Data-Juicer is versatile, featuring over 50 core built-in operators (OPs), including Formatters, Mappers, Filters, and Deduplicators, as well as an array of dedicated tools such as Analyzers, Visualizers, Quality Classifiers, and Reference LLMs. These tools, coupled with timely multi-dimensional automatic evaluation capabilities, support a feedback loop at multiple stages of data processing and LLM development, thereby promoting the production of valuable and diverse LLM data.

Shaping the User Experience. To meet diverse user backgrounds and needs, we also design Data-Juicer as an easy-to-use, highly flexible, and extensible system. Beginners can benefit from numerous ready-to-use datasets, data recipes (formulated as structured configuration files for data processing), and pluggable tools, supporting zero-code LLM data processing for various scenarios of LLM pre-training and post-tuning. Data-Juicer is also equipped with a flexible configuration module and a rich OP pool, which allow experienced users to simply modify built-in data recipes, reorganize the order of processing OPs, tune their hyper-parameters, and leverage the provided tools to meet lightweight customization requirements. Thanks to the well-designed standardization and modular extension, advanced users are empowered to conveniently register their own OPs and tools into Data-Juicer, facilitating quick engagement in flexible secondary development. Furthermore, we offer more than a dozen interactive tutorial demos implemented with Streamlit [78] to help users on their LLM data processing journey.

Seamless Integration, Refined Data and Superior Performance. Data-Juicer is built upon the Huggingface-datasets library [48], providing an elaborate unified intermediate representation of data and achieving targeted efficiency, space, and robustness optimizations through various techniques such as context management, OP fusion, caching, and checkpoint mechanisms. Furthermore, Data-Juicer seamlessly integrates with LLM ecosystems such as HELM [52] and Megatron-LM [77], and with distributed computing ecosystems such as Ray [58] and Beam [6], thus facilitating comprehensive LLM data processing and enhancing large-scale data processing capabilities.

Leveraging the proposed system, we have refined several open-source LLM datasets to provide numerous enhanced collections of pre-training and post-tuning data. These refined datasets are not only higher in quality but also more digestible by LLMs, leading to performance improvements. Empirical analysis showcases a relative improvement of up to 7.45% in averaged score across 16 LLM benchmarks using the refined pre-training data. Even with only 43% of the quantity of the compared data, we observe superior performance over state-of-the-art (SOTA) LLMs. Moreover, we achieve an average of 16.25% higher win rate using pair-wise GPT-4 evaluation on Data-Juicer’s post-tuning data than LLaMA models trained on competitive open data. We then demonstrate how Data-Juicer excels in user-friendly data-in-the-loop development workflows with great convenience and flexibility. Finally, we show the versatility of the system’s OPs and validate its superior time-space efficiency and scalability: up to 88.7% reduction in single-machine processing time, 77.1% and 73.1% savings in memory and CPU usage respectively, and 7.91x processing acceleration with the help of the integrated distributed computing ecosystems.

Contributions. We summarize our contributions as follows:

  • We propose a novel system for LLM data processing, Data-Juicer, which features decoupled modules and over 50 versatile OPs and tools. To help users easily dive into data quality and insights, Data-Juicer fosters a feedback loop with interactive visualizations and automatic evaluation capabilities.
  • As demonstrated by extensive empirical evidence, Data-Juicer produces numerous high-quality data recipes and exhibits superior usability, efficiency, and scalability, powered by optimized system design and integrated distributed computing ecosystems.
  • We integrate one-stop processing methodologies with user-centric interface designs. Data-Juicer eases access for various users, fosters engagement, and democratizes LLM data processing.
  • To promote future development and research, our system, data recipes, and tutorial demos are maintained and publicly accessible at https://github.com/alibaba/data-juicer, which we hope can help pave the way for next-generation LLM data production paradigms.

Outline. The subsequent sections describe Data-Juicer in detail. After discussing the background and related studies in Sec. 2, Sec. 3 elaborates on our design w.r.t. four major challenges and gives an overview of Data-Juicer. Sec. 4 outlines our OP pool as a response to the high heterogeneity of LLM data (Challenge 1). Sec. 5 delves into our formulation of timely feedback loops and visualization utilities for LLM data processing (Challenge 2). Sec. 6 details our repository of data recipes and pluggable tools that counteract usability and customization issues (Challenge 3). Sec. 7 expounds on the employed system optimization to tackle massive data volumes (Challenge 4). Sec. 8 presents an extensive empirical evaluation of the quality of data recipes and the performance of Data-Juicer in LLM data processing. Lastly, we draw a summary and discuss implications in Sec. 9.

2        BACKGROUND AND RELATED WORKS

2.1        Large Language Model (LLM) Data

Large Language Models (LLMs). Language modeling is a crucial component for achieving machine intelligence [57, 104]. In the last few years, this field has witnessed remarkable advancements, particularly with the emergence of the pre-training and post-tuning paradigms, where language models undergo an initial phase of training with a general-purpose corpus before being post-tuned with specific-purpose tasks [25, 64]. This procedure has yielded exceptional performance across a spectrum of natural language processing (NLP) tasks [47, 68].

Recently, taking advantage of the highly parallelizable nature of the self-supervised Transformer architecture, the scale of model parameters and training corpora for LLMs has been increased significantly [26, 61]. Meanwhile, LLMs have aroused considerable interest in the potential of artificial general intelligence [11, 12, 28, 34, 38, 93, 103]. While model-centric studies proliferate, how to better process LLM data remains an intricate domain yet to be fully explored, whether for pre-training or post-tuning data.

Pre-training Data. Pre-training serves as the foundation for LLM intelligence. By being trained on large amounts of high-quality data, LLMs can acquire elementary language comprehension and generation capabilities [33]. Aiming to elucidate the link between data and LLMs intuitively, let us consider a typical pre-training objective prevalent among mainstream LLMs. Given a token sequence $x = [x_1, \dots, x_i, \dots, x_n]$, where a token $x_i$ indicates a unit of text dependent on the used tokenizer, an LLM $\theta$ is trained to maximize the joint probability of tokens in the text as $\sum_{i=1}^{n} \log p(x_i \mid x_{<i})$, also termed causal (auto-regressive) language modeling. This allows $\theta$ to predict the probability of the next token by adhering to the inherent sequential ordering of the language.
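
To make the objective concrete, the following minimal sketch computes the causal language modeling objective for one tokenized sample; the `next_token_logprob` callable is a hypothetical stand-in for an actual model and is not part of Data-Juicer.

```python
import math

def causal_lm_log_likelihood(tokens, next_token_logprob):
    """Sum of log p(x_i | x_{<i}) over a token sequence.

    `next_token_logprob(prefix, token)` is a hypothetical callable standing in
    for the model's predicted log-probability of `token` given `prefix`.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        total += next_token_logprob(tokens[:i], token)
    return total

# Toy usage with a uniform "model" over a 1000-token vocabulary.
uniform = lambda prefix, token: math.log(1.0 / 1000)
print(causal_lm_log_likelihood([17, 42, 7], uniform))  # equals 3 * log(1/1000)
```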

Exploiting this unified yet simple modeling goal, researchers collect a large volume and diverse range of corpus data, which usually contains hundreds of billions or even trillions of tokens. After tokenization and pre-training, LLMs have succeeded in stimulating a wide range of advanced capabilities. General LLM pre-training data includes various types derived from web crawls [24, 63], dialogues or social media [101], book-length formal texts [32, 105], rigorous encyclopedias and academic texts [29, 94], structured coding texts [19, 50], and more texts from different kinds of domains with potential to be mined [51, 82, 98]. A challenge is nonetheless posed in the careful processing and formulation of pre-training data to filter noise, redundancy, irrelevance, and potentially toxic content [30, 54]. Dedicated pre-training data processing solutions that cater to more diverse demands and help craft higher-quality LLMs are still missing.

Post-tuning Data. Numerous studies have underscored that post-tuning – the process of refining pre-trained LLMs using a smaller, task-specific dataset – can further enhance or unlock additional capabilities of LLMs [36, 46, 91, 92]. Crucially, this process also paves the way for better aligning the behavior of these advanced models with human values and preferences [53, 60].

In this phase, though the data volume diminishes by orders of magnitude compared to the pre-training phase, the characteristics of the data are quite different [65]. Post-tuning data is typically formulated as a triplet (𝑥, 𝑦, 𝑠), where 𝑥 represents user questions or additional background, 𝑦 indicates the corresponding responses or complementary texts the LLM is expected to generate, and 𝑠 stands for instructions specific to the task the model is supposed to deal with, potentially accompanied by a few optional demonstrative samples to encourage in-context learning [10].
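
As a concrete illustration, one such sample might look like the dictionary below; the field names are our own and chosen only to mirror the (𝑥, 𝑦, 𝑠) triplet, since real post-tuning corpora use varied schemas.

```python
# One hypothetical post-tuning sample mirroring the (x, y, s) triplet.
sample = {
    "instruction": "Summarize the given paragraph in one sentence.",              # s: task instruction
    "input": "Large Language Models are trained on massive corpora ...",          # x: user question / background
    "output": "LLMs learn broad knowledge by pre-training on huge text corpora.",  # y: expected response
    "demonstrations": [],   # optional few-shot examples to encourage in-context learning
}
```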

The post-tuning data can be broadly categorized into four types: Instruction Fine-Tuning (IFT) datasets to enhance the instruction-following abilities of LLMs [62]; Supervised Fine-Tuning (SFT) datasets adapted from existing NLP benchmarks [4, 83]; Preference datasets with explicitly labelled human preference for pair-wise responses [5]; and Multi-Round Dialog (MRD) datasets to enhance the dialog ability with long contexts [85]. There are preliminary explorations emphasizing the importance of data diversity over volume for post-tuning data [21, 89]. Several studies also indicate that data types representing human values can potentially lead to degraded general performance, a phenomenon known as the “alignment tax” [62]. However, how to more effectively process post-tuning data to maximize its usefulness and minimize potential risks remains an open area for further investigation.

The Symbiotic Nature of Pre-training and Post-tuning Data. It is worth pointing out the analogous properties shared between these two types of data, which motivate our synergetic approach when bearing quality, diversity, and volume considerations in mind.

Specifically, the quality aspect of the text has been studied extensively in existing literature [54]. Efforts have been made to enhance aspects such as text structure, the soundness of arguments, contextual richness, writing correctness, comprehensiveness, levels of anonymization, and harmlessness. The widespread use of cleaning, deduplication, and anonymization processes in pre-training data, along with the habitual iteration over multiple epochs for Wikipedia-style data, illustrates the aforementioned pursuit [87]. Similarly, post-tuning data processing also employs filtering, deduplication, and detoxification strategies, aiming to enhance the user experience and the degree of aid offered by LLMs [18, 30].

Diversity is another shared property studied at length in both types of data. Mixing various types of data and finding suitable mixture weights to achieve appropriate diversity has been a primary concern in pre-training data processing works [97]. Analogously, post-tuning data efforts aim to increase multi-view diversity such as tuning tasks and expression styles, which further underscores this shared property [62, 69, 84].

In addition, the pursuit of quality and diversity tends to trade off with data volume, which is also reflected in these two types of data. Researchers have incessantly strived to empower LLMs with massive amounts of data, hoping to encapsulate as much human knowledge as possible. For instance, pre-training data volumes have surged to terabyte levels [61, 63], and post-tuning data volumes have grown from mere thousands to millions of samples [4, 90]. However, these initiatives also bring side effects into such large volumes of data, including heightened noise, potentially inferior quality, and increased bias, as well as a surge in data acquisition, processing, and training overheads, which necessitate additional data processing efforts.

2.2        Existing LLM Data Processing

LLM data processing is a nascent area still working towards common standards, and we aim to provide a pioneering system for the community. With a commitment to the open-source ethos, Data-Juicer caters to the increasing demand for versatile, flexible, user-friendly and efficient data processing solutions, the details of which will be described later. This contrasts with well-known LLMs that are largely closed-source in their data or data processing, such as the GPT derivatives [10, 19, 61, 76], LLaMA derivatives [17, 20, 80, 84, 87], and others [1, 16, 71, 96, 101].

While some progress has been made in the open-source LLM data processing landscape, as demonstrable by efforts from BLOOM [44], PromptSource [4] and RedPajama [23], they have not fully delivered the breadth of functionalities that Data-Juicer aims to bring to the forefront of the field.

Examining this from the perspective of the target datasets, existing works typically fixate on specific LLM data types and use cases, spanning the curation of specialized English sub-datasets for LLaMA pre-training [87], the assembly of multi-lingual corpora for pre-training [44], and crowdsourcing for post-tuning prompt data generation [4]. However, they lack the systematic and modular processing abilities required to proficiently manage a wide range of data, an area where Data-Juicer strives to push the boundaries. These limitations become especially apparent when handling varied data types, engaging in language transfer, or implementing particular data cleaning and transformations for LLM applications.

Moreover, usability and data insights are not optimally addressed by existing works. Most of these works only offer the processed dataset along with purpose-built processing code specific to those datasets, lacking ease-of-use considerations and supporting assistive toolkits. This hinders their adaptability for various users and alternative use cases. Users might face a daunting task when substituting data processing goals or conducting analyses due to a dearth of complementary data analytical capabilities. The comprehensive re-development of data processing tools and analytical methodologies, specifically tailored for LLM application scenarios, remains largely uncharted territory.

Furthermore, the focus of current works gravitates towards functionality rather than performance optimization, leaving large room for enhancement in efficiency, space management, scalability, and robustness. Noteworthy shortcomings include reliance on single-machine Python scripts, inappropriate handling of large-scale data, and poor data processing speeds due to the utilization of Python’s plain dict object.

3        DESIGN OF DATA-JUICER

3.1        Underlying Challenges

We first introduce the primary design and execution challenges inherent in tailoring a data processing system for LLMs as follows:

C1. High Heterogeneity of LLM Data. As introduced in Sec. 2.1, LLMs involve diverse developmental stages (e.g., pre-training and post-tuning) and highly extensive usages, including knowledge comprehension, dialog assistance, and even the pursuit of Artificial General Intelligence (AGI). As a result, they demand a vast spectrum of data types, originating from various formats and heterogeneous sources with numerous fine-grained sub-categories, such as web data (overwhelmingly user-generated electronic data), literature (ranging from science and the humanities to fiction), code repositories (covering hundreds of programming languages), and others. Addressing this heterogeneity calls for a powerful data processing system capable of handling variegated data types, formats, linguistic properties, and processing preferences. The primary challenge resides in creating highly adaptable components that retain versatile processing abilities.

C2. Timely Feedback for Data Processing. The development phase of LLMs is resource-intensive, requiring substantial GPU calculations on a weekly or even monthly basis to iterate through all the available data just once [77]. Establishing feedback loops in the early stages of data and LLM development can greatly assist in data quality validation, data insight exploration, and timely identification of aspects requiring enhancement, which collectively contribute towards cost reduction and accelerated development iterations. However, the design of a system infused with visualization utilities, evaluation tools, and feedback mechanisms, particularly considering data heterogeneity, volume, and complexity, poses a significant challenge.

C3. Usability and Customization. Building upon and exacerbated by the above two challenges, users of LLM data processing encompass a highly broad spectrum of workflows, deployment strategies, tool-chain preferences, and familiarity levels with data processing procedures and LLMs. This diversification demands an intricate reconsideration of usability and customization in systems designed for LLM data. Users typically develop numerous specific, small-scale procedural steps, particularly in the context of LLM data processing. This requires the inclusion and use of multiple upstream and downstream codes and tools that support tasks ranging from data management and transformation to analysis. The crux of the problem becomes twofold: devising a system to accommodate these detailed user-specific tendencies while proffering inclusive, layered, flexible interfaces for various user skill levels and needs. This means offering intuitive interfaces for beginners and permitting ease in workflow modifications. Additionally, the system should also provide experienced users with greater control and flexibility, ensuring smooth integration of existing and new operators and tools into their workflows without causing disruptions.

C4. Massive Data Volume. LLMs are usually trained on vast corpora, with data volumes stretching to an unprecedented magnitude of billions or even trillions of tokens. Efficient LLM data processing of this volume is critical but arduous due to high resource requisites, varied data storage methods, diverse data processing needs and intricate procedures. Crafting an efficient, robust, and scalable system becomes essential to ensuring data processing stability and facilitating the procedures and deliveries of processed data, trained weights, and solutions for LLMs.

Addressing the aforementioned challenges demands a thoughtful design that strikes a delicate balance. On one hand, it involves promoting advanced, feature-rich, efficient, and scalable systems. On the other hand, it requires extending user-friendly, readily adaptable, and flexibly customizable features in line with the specific needs of various users.

3.2        Architecture Overview

In response to the previously identified challenges, we strategically design Data-Juicer to process data and enhance its quality, making it more “juicy” and digestible for LLMs. This system, illustrated in a bottom-up view in Figure 1, is built upon a systematically structured foundation of versatile operators (OPs) and dedicated tools (highlighted by the green boxes). We achieve this by judiciously decoupling the data processing flow from dataset-specific, model-centric, and task-oriented aspects. This architecture, characterized by its modularity and composability, effectively addresses the challenge of data heterogeneity across various LLM data formats and types (detailed in Sec. 4).

As the yellow and orange boxes show, to shield users from underlying complexities, Data-Juicer employs a standardized and dynamically configurable module, facilitating end-to-end customizable data processing workflows. This approach fosters traceability and eases the modification of data recipes (as configuration files), thereby bolstering the reproducibility and comparability of various data processing tasks. Further, Data-Juicer enhances usability by offering and maintaining a collection of high-quality, pre-built data recipes tailored to meet diverse essential use cases such as LLM pre-training, post-tuning, and data exploration. The system enables cooperative automated evaluation of data and LLMs, as well as feedback loops involving multiple stages of LLM development through detailed visualization reports, fostering data processing and insight acquisition (described in Sec. 5).

For users with diverse backgrounds and needs (marked by the left three rectangle boxes), we have thoughtfully designed and provided a suite of resources and APIs that are optimized for differentiated adoption and user-friendly access. Such a user-centric and one-stop design strategy positions Data-Juicer as an adaptable, accessible, yet formidable toolkit in numerous data processing scenarios (detailed in Sec. 6).

Lastly, we meticulously optimize the system efficiency and scalability of Data-Juicer, implementing strategies such as OP fusion and data checkpoints (introduced in Sec. 7). Furthermore, Data-Juicer is designed to seamlessly integrate with the prevailing LLM and distributed computing ecosystems to promote interoperability (the right two circles). Thus, it allows for the utilization of existing software tools without disrupting useful yet established workflow protocols.

4        STANDARDIZED OPERATOR POOL

In addressing the heterogeneity of data types prevalent in LLMs (Challenge 1 in Sec. 3.1), we devise a standardized operator (OP) pool. As outlined in Table 1, the OPs are organized into four primary categories: Formatters, Mappers, Filters, and Deduplicators, which span diverse functions, inputs, processing levels, outputs, and application scenarios. Core principles of decoupling and composability guide their structuring, resulting in a varied yet standard set of procedures that contribute to flexibility and user interaction at multiple processing levels. This strategic implementation enhances reusability and reduces complexity, aiding a streamlined and decoupled data processing environment.

4.1        Unified Data Representation

We first introduce Formatter OPs, designed to unify diverse data sources into an intermediate data representation. Specifically, backed by Huggingface-datasets [48] and Apache Arrow [2], a framework offering column-oriented memory infrastructure, our system delivers a unified data interface that simplifies the process design for follow-up OPs. Our system supports numerous text input formats – txt, JSON, parquet, html, md, pdf, code files such as .py and .cpp, amongst others – and homogenizes them into a structured format composed of certain columns with nested access support, conceptually organized into three primary parts: “text”, “meta”, and “stats”. These parts respectively hold the raw textual data, metadata information (e.g., date and version), and statistical data that can be generated and consumed by Data-Juicer’s other OPs and tools. This interface works at either the text sample or dataset level, and is independent of the underlying in-memory or disk data layout, alleviating potential worries over heterogeneous data formats for OP developers.
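
A minimal sketch of this intermediate representation, built directly with Huggingface-datasets, is shown below; the concrete nested schema (field names and values) is illustrative rather than Data-Juicer's exact one.

```python
from datasets import Dataset

# Two toy samples organized by the three conceptual parts described above.
ds = Dataset.from_dict({
    "text":  ["raw document one ...", "raw document two ..."],
    "meta":  [{"date": "2023-01-01", "version": "v1"},
              {"date": "2023-02-01", "version": "v1"}],
    "stats": [{"num_words": 4}, {"num_words": 4}],
})

print(ds.features)            # column-oriented schema backed by Apache Arrow
print(ds[0]["meta"]["date"])  # nested access on the unified representation
```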

4.2        Versatile Data Processing

Next, we elaborate on the functionality of the versatile OP pool in Data-Juicer, which is pivotal to the comprehensive data processing tailored for LLMs. Besides the Formatter OPs, which play an essential role in unifying data formats and ensuring a consistent and efficient data flow throughout the processing pipeline, we now give more details about the other three types of data transformation OPs in Table 1.

Mappers facilitate crucial in-place text editing functionalities, necessary for single-sample or multi-sample level processing across diverse LLM data needs, such as modifying texts for pre-training and enhancing text diversity for post-tuning. They effectively handle tasks like the removal of specific headers, the rectification of messy code, and text enhancements.

Table 1: Overview of the operator (OP) pool in Data-Juicer, with a detailed list continuously maintained at the official documentation: https://github.com/alibaba/data-juicer/blob/main/docs/Operators.md.

| Category | Function | Input | Process Level | Output | OP Usage Examples |
| --- | --- | --- | --- | --- | --- |
| Formatters | Data format unifying | Dataset | Dataset | Dataset | Load and unify dataset-hub, txt, json, md, codes, html, pdf, docx, … |
| Mappers | In-place text editing | Sample | Single-sample; Multi-samples | Sample; Samples | Transform specified headers, textual elements; fix messy codes; enable text enhancement |
| Filters | Conditional text removing | Sample | Single-sample; Dataset | Boolean | Filter by meta-info, stats (e.g., lines count), model scores, external resources (e.g., flagged words) |
| Deduplicators | Duplication removing | Single or Paired Dataset | Dataset | Dataset | Compare with hash-based and vector-based deduplication methods |

Filters come into play by conditionally removing texts using individual-sample metrics, dataset-level statistics, or external resources like stop-word lists. In doing so, they eliminate unnecessary text samples, contributing significantly to data focus, cleanliness, and cost reduction for follow-up LLM training processes.

Deduplicators function at the dataset level, eliminating duplications that could potentially waste storage space and compromise efficiency. Moreover, as indicated by several studies [14, 41, 45], duplicate samples adversely affect both pre-training stability and the performance of LLMs. Meanwhile, deduplicators also help prevent unintentional leakage of training data into evaluation benchmarks, particularly for zero-shot or few-shot tasks like MMLU [35]. To ensure accurate detection and removal of duplication, we provide efficient and robust methods including hash-based and vector-based comparisons [9, 15, 73].

It is noteworthy that we have further decoupled the implementations of statistics computation and text processing for Filter and Deduplicator OPs, as depicted in Listing 1 in Appendix A.1. This segregation brings two key advantages. Firstly, it enables our dedicated analyzer-related tools (detailed in Sec. 6.2) to utilize the computed statistics for the entire dataset rather than a filtered subset; alternatively, users can generate fingerprints for specific partial samples. Secondly, this decoupling facilitates the effective re-use of the Dataset.map and Dataset.filter interfaces to undertake these separate processes in a streamlined manner.
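
Listing 1 itself is deferred to Appendix A.1; the following is only a schematic sketch of the described two-stage Filter interface, with class and field names assumed for illustration.

```python
class WordNumFilterSketch:
    """Schematic Filter mirroring the compute_stats / process decoupling."""

    def __init__(self, min_num=10, max_num=10_000):
        self.min_num = min_num
        self.max_num = max_num

    def compute_stats(self, sample):
        # Stage 1: only attach statistics, so Analyzer-style tools can consume
        # them over the whole dataset before any sample is dropped.
        stats = sample.setdefault("stats", {})
        stats["num_words"] = len(sample["text"].split())
        return sample

    def process(self, sample):
        # Stage 2: a boolean decision suitable for Dataset.filter.
        n = sample["stats"]["num_words"]
        return self.min_num <= n <= self.max_num

# Streamlined reuse of the two Huggingface-datasets interfaces:
#   dataset = dataset.map(op.compute_stats)
#   dataset = dataset.filter(op.process)
```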

4.3        Composability

Data-Juicer’s OPs serve as a testament to our system’s versatility. They enable users to effortlessly process a variety of data types in a composable and modular manner, showcasing Data-Juicer’s dedication to user adaptability and high-quality data production for LLMs. Besides the functions, inputs, outputs and processing levels summarized in Table 1, this composability is embedded in more facets, including the fields to be processed, OP hyper-parameters, and recommended use cases of each OP.

Each OP in Data-Juicer is designed to serve a distinct function and can be directed by users to process different text fields. For example, OP A could process the sample field “text.abstract”, while OP B could focus on “text.main_body”. By default, each OP processes the “text” field, which can be freely changed to other “meta”- or “stats”-related data fields according to users’ needs. This adaptability allows for immense flexibility by simultaneously applying OPs to different fields, enabling users to easily manipulate specific text snippets, such as removing GitHub code based on star counts. Moreover, these OPs establish a one-size-fits-all solution that encompasses a multitude of configurable parameters such as the number of tokens, filtering thresholds, auxiliary models, and much more. This adjustability of a single OP, combined with the composability of OP pipelines, empowers Data-Juicer to manage a spectrum of input, output, and processing granularities, contributing to its powerful processing abilities.

For usage combinations, OPs are labeled with typical usage scenarios. We maintain OP tags for general usage, LaTeX source files, programming code, financial data processing, language-specific processing (e.g., English and Chinese), and so on. These labels facilitate easy navigation and operation, underscoring our aim to blend simplicity with power in Data-Juicer’s architecture.

5        FEEDBACK-DRIVEN DATA PROCESSING

Figure 2: The feedback loop of Data-Juicer.

Addressing Challenge 2 outlined in Sec. 3.1, we incorporate a dynamic feedback loop into the data processing pipeline. This enables users to understand data patterns effectively by leveraging extensive visualization tools and automated tracking abilities. For holistic data and model processing cycles, we offer a one-stop interactive data visualization feature, permitting timely perception and swift iteration of data interventions, as demonstrated in Figure 2.

We will discuss the modeling of data processing feedback from a hyper-parameter optimization (HPO) perspective (Sec. 5.1), and then go through the mechanisms of the interactive visualization (Sec. 5.2) and the integration with LLM ecosystems (Sec. 5.3). The synergy of these techniques offers an efficient and effective solution to debug and dive into LLM data processing.

5.1        HPO for Data Processing

In Data-Juicer, we incorporate the concept of hyper-parameter optimization (HPO) into the data processing procedure. This is done by tying data-processing-specific hyper-parameters to a variety of feedback signals, including custom target metrics and visualization results. We enhance our system’s functionality by innovatively speeding up the data processing iteration through Checkpoint and Caching mechanisms, and by integrating an automated HPO tool.

5.1.1 Acceleration with Checkpoint and Caching. LLM data processing often necessitates frequent re-runs due to alterations in OP hyper-parameters and potential run failures, especially for massive datasets. Accordingly, we provide built-in checkpoint and caching management to foster resilient and reliable data processing. Based on a carefully organized directory structure, Data-Juicer automatically monitors every running process for configuration changes, and creates new files to safely store data and processing states whenever an error or exception occurs. While a checkpoint preserves the whole dataset and processing state, enabling complete recovery of the processing progress, the cache solely saves the dataset object for each OP and is better suited to smaller-scale adjustments as it reduces the overhead of pre-order caches. These techniques allow for swift recovery during system restarts or failures, reverting to the most recent optimal processing state stored in the checkpoints, thus mitigating processing redundancy and increasing feedback frequency.

Additionally, the proposed state-saving mechanism enables a flexible space-time trade-off at different feedback stages. Users have the option to save states after each OP in the data processing flow, ensuring minimal re-execution time at the cost of maximum storage overhead. Conversely, they could choose to only save the last OP’s checkpoint and cache, incurring minimal storage overhead but increased re-execution time, especially when needing to revert to early steps in the process.

To facilitate a good space-time trade-off, we further perform space complexity analysis for individual OPs, which aids in predicting peak space occupancy and guides us in determining how many checkpoints and caches to store based on the available space. By default, Data-Juicer actively monitors disk space usage at the start of data processing, and automatically determines if, and when, checkpoints and caches should be deployed. User-specified saving frequencies and rules are also supported. Consequently, strategic checkpoint and cache management reinforces both the resilience and efficiency of the feedback loop for LLM data processing. The detailed space usage analysis can be found in Appendix A.2.

5.1.2 Auto-HPO. We incorporate an automated HPO tool into Data-Juicer to streamline the search for good data processing hyper-parameters. To boost overall system efficiency and reduce search costs, we support advanced HPO algorithms such as Bayesian optimization [74], progressive early-stop strategies such as the Hyperband algorithm [49], and built-in LLM-oriented sampling strategies (detailed later in Sec. 6.2). Specifically, given a target metric, users are allowed to investigate the correlations and importance scores of specific hyper-parameters within a data processing configuration. We give an illustrative example as follows.

Example of Data Mixing with HPO.

Suppose we aim to find a good set of sampling weights for 𝑀 datasets to be mixed, where our search space is defined as 𝑤𝑖 ∈ [0, 1], 𝑖 ∈ [1, 𝑀]. The pipeline can be structured as follows:

  (1) We specify the target text fields across all 𝑀 datasets, and conduct the field unification process if necessary.
  (2) We leverage meta-tag filters to cater to specific usage scenarios; here we only include samples with the language tag “EN”.
  (3) A dataset D𝑚𝑖𝑥 is generated from the 𝑀 datasets, with mixture weights [𝑤𝑖] drawn by the HPO scheduler to maximize the target metric in step (5).
  (4) A pre-configured data processing pipeline, including de-duplication OPs, is executed on the mixed dataset, ensuring dataset cleanliness.
  (5) The target metric is calculated on D𝑚𝑖𝑥 as (𝑛/𝑁 + 𝑠), where 𝑁 is the total number of tokens across all 𝑀 datasets, and 𝑛 and 𝑠 are the number of tokens and the average quality score of D𝑚𝑖𝑥 (using the built-in GPT-3 quality classifier, detailed in Sec. 6.2), respectively.

The mixed dataset D𝑚𝑖𝑥 is iteratively refined by repeating steps (3)∼(5) to obtain a larger quantity and better quality of data.
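
The sketch below mirrors the pipeline above with Optuna standing in for the built-in HPO scheduler; the token counts, the quality scorer, and all numbers are toy placeholders rather than Data-Juicer internals.

```python
import optuna

token_counts = [1_200_000, 800_000, 500_000]   # toy token counts of the M datasets
N = sum(token_counts)

def quality_score(weights):
    # Placeholder for the average quality score s of D_mix from a quality classifier.
    fixed_scores = [0.7, 0.5, 0.9]
    total = sum(weights) or 1e-9
    return sum(w * q for w, q in zip(weights, fixed_scores)) / total

def objective(trial):
    # Step (3): draw mixture weights w_i in [0, 1] from the HPO scheduler.
    weights = [trial.suggest_float(f"w_{i}", 0.0, 1.0) for i in range(len(token_counts))]
    # Step (4) (de-duplication, etc.) is skipped in this toy sketch.
    # Step (5): target metric n/N + s.
    n = sum(w * c for w, c in zip(weights, token_counts))
    return n / N + quality_score(weights)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```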

Figure 3: The demonstration of HPO for data processing.

The HPO results provide a powerful means of visualizing and understanding data insights, as shown in Figure 3. The target metric could also combine other trade-off terms based on intrinsic data measures – for instance, toxicity or other quality scores predicted by auxiliary models – or even performance measures of LLMs such as training loss or benchmark scores (discussed later in Sec. 5.3).

5.2        Interactive Visualization

Interactive visualization is integral to multiple feedback stages of Data-Juicer. As Figure 4(a) demonstrates, users can visually track the effects of individual OPs w.r.t. the processed data samples and the impact of OP hyper-parameters. This is facilitated by an innovative built-in tool, Tracer, which records sample changes after each operation is applied. For example, Tracer presents discarded samples for Filters, pre- and post-editing differences for Mappers, and (near-)duplicate sample pairs for Deduplicators. Coupling this tracking ability with fruitful built-in sampling methods and visualization tools enhances users’ control over data processing and strengthens their understanding of and confidence in the process.

Figure 4: The interactive visualization of Data-Juicer.

Transitioning to the mid-term stage, Data-Juicer offers a comparative visualization of the data before and after the entire processing, as depicted in Figures 4(b) and 4(c). Aided by a built-in tool, Analyzer, Data-Juicer provides statistical analysis (counts, means, standard deviations, min/max, quantile points, etc.) to allow a deep understanding of the data. By default, the summary of per-sample statistics covers 13 dimensions and automatically displays histograms and box plots for each statistical variable, including diverse criteria like sample perplexity, word count, flagged-word percentage, and paragraph length, among others. Users also have the flexibility to adjust the observation dimensions for a bespoke visualization and data processing experience.
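
As a toy illustration of such per-sample statistics (word count only; the real Analyzer covers about 13 dimensions), assuming a pandas view of the dataset:

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "a very short sample",
    "a considerably longer sample with many more words than the first one",
    "medium length sample text here",
]})
df["word_count"] = df["text"].str.split().str.len()

# Counts, mean, std, min/max, and quantile points, as summarized by Analyzer.
print(df["word_count"].describe())
```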

5.3        Feedback with LLM Ecosystem Integration

During the late stages of our pipeline, we leverage the rich ecosystems of LLMs to support the seamless functioning of mainstream training libraries such as Megatron-LM [77], DeepSpeed [70], and HuggingFace-transformers [95]. With this integration, we can easily evaluate the performance of downstream models.

Notably, our system facilitates timely assessment of model abilities incorporating multiple dimensions. The system’s capability to swiftly identify potentially ineffective data and training enables us to terminate unwanted data and LLM processing in a timely manner. Since basing model performance purely on training loss provides a limited view, we support assessing the model’s abilities across various metrics or benchmarks and tracking shifts in target scores. As a result, we can discern whether continued training of a specific model and dataset is justifiable, assisting us in minimizing data processing and LLM training costs.

Specifically, Data-Juicer’s evaluator supports state-of-the-art LLM benchmarks such as HELM [52] and GPT-API-based evaluation [20], as well as the extension of customized evaluation benchmarks and tasks. We also make available Reference Models – model checkpoints bound to traceable training data in Data-Juicer, popular LLM architectures, training parameters, computation costs, and corresponding evaluation results. They facilitate effortless comparison among different training configurations, particularly for further research on diverse, iteratively developed data recipes.

Additionally, we support the dynamic expansion of evaluation metrics during the training process, allowing subsequent scaling predictions. By analyzing the trend of evaluation scores against the volume of training data, our system can predict a model’s capabilities after training with larger data volumes, as explored in GPT-4 [61]. For a balanced and straightforward evaluation, Data-Juicer supports leaderboard-style comparison by consolidating results from different target evaluation scenarios, such as rank averaging, normalized score averaging, or other customized strategies. The leaderboard scoring utility highlights the strengths and weaknesses of models, guiding subsequent iterations of data or LLM refinement.

5.4        Feedback Loop Showcase

The general feedback loop has been discussed above and is depicted in Figure 2. We now further expound on it by presenting a concrete development example, intertwining several previously mentioned tools to demonstrate the Data-in-the-LLMdev-Loop process, which results in improved LLM data.

As illustrated in Figure 5, we begin with a raw dataset and aim to refine it for better training or post-tuning of an LLM. The entire process flows as per the following steps:

  (1) Analyze the original dataset. We can opt to utilize an existing data recipe (a specific configuration file) or craft a new one based on prior understanding of the data processing needs. Our built-in Analyzer and Visualizer facilitate this process by computing more than a dozen measures, such as linguistic diversity and textual statistics, to generate a data probe. The pie plot within Figure 5 indicates the top 20 most common root verbs (inner circle) and their top 4 direct noun objects (outer circle) in the generated “text.instructions” field.
  (2) Refine parameters of the original recipe. Based on the data probe, we can identify the weaknesses of the original dataset, such as low diversity in expression or long-tail word-count statistics. We then refine the parameters in the recipe by adding/removing OPs or tightening/relaxing filter ranges. During refinement, we can easily observe the effect of each OP with the interactive visualization tools mentioned in Sec. 5.2.
  (3) Process the original dataset with the refined recipe. We then process the original dataset with the refined recipe using Data-Juicer, obtaining a refined dataset and several saved checkpoints for further adjustments.
  (4) Analyze the refined dataset. As in step (1), we analyze the refined dataset again to obtain a new data probe. Based on the statistics and visualization results, we assess the degree of improvement in data quality. If the refined data fails to meet our expectations, we revert to step (2) to manually adjust the data recipe or employ our HPO tool for automatic refinement (see Sec. 5.1).
  (5) Get LLMs with the refined dataset. We can then train or post-tune LLMs with the refined dataset and the mainstream training frameworks seamlessly integrated into Data-Juicer. During the training or fine-tuning process, our automatic evaluation tools offer timely, multi-view assessments of the LLMs, inspecting numerous metrics across multiple evaluation datasets. This allows us to halt the process prematurely if the refined data weakens LLM performance, thereby preventing unnecessary costs.

  (6) Collate results and compare with reference models. Finally, Data-Juicer automatically collates the evaluation results and compares them with reference models in the data leaderboard, providing a clear representation of the effects of data processing alone. Consequently, we derive either a superior LLM, which can be auto-registered as a reference model, or additional refining guidance from the LLM perspective to further enhance the data recipes.

Figure 5: The demonstration of data processing feedback of Data-Juicer.

Among these steps, steps (1) and (2) collectively create the innermost and smallest loop depicted in Figure 2. Steps (1)∼(4) together form the middle loop, while all steps drive the outermost and most thorough feedback loop. Users can also leverage other built-in tools or flexibly extend the loop to cater to their specific needs.

6        BOOSTING USABILITY WITH BUILT-INS

In response to the challenge of supporting diverse user customization preferences and disparity in technical expertise (Challenge 3 in Sec. 3.1), we provide a unified configuration paradigm, fruitful off-the-shelf config recipes for data processing, and extensive tools as introduced below.

6.1        Data Recipe: Configuring the Whole Process

Built upon Jsonargparse [40], we provide unified, flexible, easy-to-use, and powerful configuration capabilities in Data-Juicer. The system is engineered to automatically register configuration items for OPs and tools, and to accept varying sources of configurations, such as command-line entries, yaml and jsonnet files, environment variables, default hard-coded values, and mixtures of these for convenient incremental modification. Notably, we make the end-to-end data processing pipeline configurable, including the specified processing environment parameters, OP lists, tools used, and so on. This all-in-one configuration principle ensures reproducibility and traceability, simplifies changing specifications of data processing, and thereby facilitates the formation of configuration recipes for further reuse.

For example, users can easily build their own config files following two recommended methodologies: “subtraction” or “addition”. The “subtraction” approach starts from a pre-set configuration file containing all available OPs, tools, and their default parameters; users simply remove or re-order the OPs and adjust the parameters per their requirements. Conversely, the “addition” approach lets users build their configuration files from scratch, leveraging our extensive pre-built examples, which include more than 20 high-quality and diverse data recipes for pre-training, post-tuning, English, Chinese, and other scenarios. More statistics and quantitative analysis of certain recipes can be found in our experiments (Sec. 8.1).
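
For illustration, a pared-down recipe in the “subtraction” spirit might look like the YAML below; the keys and OP names are indicative only and should be checked against the official documentation rather than taken verbatim.

```yaml
# Illustrative data recipe sketch (field and OP names are indicative only).
project_name: 'demo-pretrain-en'
dataset_path: '/path/to/raw_dataset.jsonl'
export_path: '/path/to/refined_dataset.jsonl'

process:
  - whitespace_normalization_mapper:            # Mapper: in-place text cleanup
  - words_num_filter:                           # Filter: drop too-short/too-long samples
      min_num: 20
      max_num: 10000
  - document_simhash_deduplicator:              # Deduplicator: hash-based near-dedup
```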

6.2        Dedicated Pluggable Tools

To further enhance usability, facilitate system customization and augment users’ data handling capabilities, Data-Juicer includes an extensible collection of powerful dedicated tools that can be conveniently plugged into different stages of the LLM data processing.

Quality Classifier. As an illustrative example, we describe our text quality classifier for culling high-quality text from heterogeneous data sources like CommonCrawl. This tool is a reproduced model based on the closed-source GPT-3 quality scorer [10]. Moreover, we have expanded its applicability to Chinese text and various code types. Encapsulated as a callable pipeline, this tool gives users the freedom to adapt it to various other scenarios.

The functionality of the classifier is backed by PySpark’s standard Tokenizer or a SentencePiece model [43], along with HashingTF as the feature extractor. It then applies a binary logistic regression classifier to gauge the quality of a text. We empirically confirm that it can achieve high recall rates in appropriate domains and possesses a filtering ability comparable to the GPT-3 scorer. More experimental details can be found in Appendix B.1.
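
A hedged reconstruction of that pipeline shape is sketched below with standard PySpark components; the training rows, column names, and any thresholds are placeholders, not the reproduced GPT-3 scorer itself.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("quality-classifier-sketch").getOrCreate()

# Toy labeled data: 1.0 = high quality, 0.0 = low quality.
train_df = spark.createDataFrame(
    [("a well formed encyclopedia style paragraph about astronomy", 1.0),
     ("click here buy now cheap cheap cheap best price", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),   # or a SentencePiece tokenizer
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)

# The predicted `probability` column can serve as the text quality score for filtering.
model.transform(train_df).select("text", "probability").show(truncate=40)
```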

Enhanced Sampler for LLM data. In Data-Juicer, we have designed several advanced data sampling utilities specialized for large-scale data chunk handling in LLMs. Our solutions effectively streamline representative extraction, optimize processing time and resources, and meet the distinctive needs of LLM developers.

Our stratified sampling technique is noteworthy in this LLM data context. It capitalizes on information within metadata or statistical fields, thus accommodating varied selection metrics in crafting an effective data sample. To ensure a comprehensive yet flexible representation of the data corpus, we consider heterogeneous criteria such as document length, token count, the frequency of boolean predicates after conditional checks, and even linguistic diversity formulated via verb-noun pair occurrences. These dynamic criteria are tailored to distinct analytic needs and promote efficient data processing, seamlessly integrating with downstream OPs and tools.
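
The snippet below gives only the flavor of such metadata-aware stratified sampling, using pandas; the strata, field names, and sampling sizes are invented for illustration and are not Data-Juicer's defaults.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["..."] * 6,
    "meta_lang": ["EN", "EN", "EN", "EN", "ZH", "code"],
    "num_words": [120, 90, 3000, 45, 200, 600],
})

# Stratify on a metadata field so every sub-population stays represented in the probe.
probe = df.groupby("meta_lang").sample(n=1, random_state=0)
print(probe["meta_lang"].value_counts())
```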

Full Toolkit. As for other tools, readers can refer to Sec. 5 for an examination of multiple previously discussed tools, including analyzers (Sec. 5.2), evaluators and reference models (Sec. 5.3). We diligently maintain and evolve the toolkit in Data-Juicer, and make the full set publicly accessible.

6.3        User-Friendly Experiences in Data-Juicer

Data-Juicer is designed not just for functionality but also for adaptability, catering to an extensive user base with diverse expertise and skill sets. While abstracting the intricate system internals, we provide user-friendly interfaces and extensive customizable components. Accordingly, users can embark on zero-code data processing, engage in low-code customization, or delve into in-depth extensions for complex requirements.

  • Zero-Code Processing: For novice users, Data-Juicer supplies a series of ready-to-use data recipes and plug-in tools for immediate use. This requires no knowledge of advanced system architectures or OPs, as discussed in Sec. 6.1 and Sec. 6.2.
  • Low-Code Customization: Intermediate users can enjoy the flexibility to alter configurations, data, and external resources to suit their specific needs. They can readily reuse, combine, and edit built-in data configurations; customize quality classifiers and tokenizers; refine data based on our pre-developed recipes; or provide fresh links to auxiliary models or vocabularies from our unified, routinely updated public cloud drive.
  • Advanced Extension: Experienced users can easily introduce new OPs by deriving from base classes and implementing their specific “process()” and “compute_stats()” functions, as demonstrated in Listing 1 (Appendix A.1). This grants users an end-to-end view of the processing of a single sample, while Data-Juicer handles the nitty-gritty of configuration registration and efficiency optimization.

Additionally, Data-Juicer’s decoupled design facilitates the smooth incorporation of new tools for users at all stages of LLM data processing, ranging from novel visualization dimensions and evaluation datasets to pre- or post-processing scripts.

To enhance the system’s ease of adoption and use, apart from the comprehensive suite of pre-built data recipes (refer to Sec. 6), we provide a series of interactive demos, implemented in Streamlit, for varied profiles and scenarios. This hands-on learning approach is designed to enable users of varying skill levels to quickly familiarize themselves with and effectively use Data-Juicer.

7        MULTI-FACETED SYSTEM OPTIMIZATION

To handle large-scale data (Challenge 4 in Sec. 3.1), we employ a series of optimizations in Data-Juicer from various aspects.

Optimized Data Unification: Lazy Mode. We center our design principle for data unification (Sec. 4) on effectively harnessing laziness, thereby minimizing redundant data transfers and memory reallocations during format harmonization. Reflecting the capacity-generic nature of LLMs, the processed data often exhibits high heterogeneity, consisting of various text fields and meta fields. During the unification process, we allow users to either (a) keep the original field names, or (b) explicitly reconfigure the data into nested column names with dot-delimited references, such as “text.instruction” and “meta.language”. For case (a), we simply load the specified data from disk into in-memory “Dataset” objects, delaying potential field and type conflict resolution until runtime exceptions are triggered. For case (b), we utilize the rename interfaces of Huggingface-datasets and augment the data access logic to support automatic and efficient querying of nested fields through function closures. In doing so, our system hides tedious details and internally automates the unification process for users and developers, enabling flexible data access with hierarchical organization.
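
Case (b) can be pictured with a small function closure like the one below; this is a simplification of the described approach, and the real implementation wraps Huggingface-datasets' access logic.

```python
def nested_getter(dotted_key):
    """Return a closure that resolves a dot-delimited key such as 'meta.language'."""
    parts = dotted_key.split(".")

    def get(sample):
        value = sample
        for part in parts:
            value = value[part]
        return value

    return get

sample = {"text": {"instruction": "Translate to French.", "main_body": "..."},
          "meta": {"language": "EN"}}
get_lang = nested_getter("meta.language")
print(get_lang(sample))   # "EN"
```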

Optimized Computation: Context Management, Operator (OP) Fusion, and Reordering. To elevate computational efficiency in LLM data processing, we provide advanced context management, operator fusion, and operator reordering techniques. The context manager meticulously handles shared intermediate variables, such as segmented words, split lines, and other values derived from the original textual corpus, across different operators. It allows seamless reuse of these context variables across multiple operators, thereby mitigating the necessity for computationally expensive re-evaluations.

Building on the context manager, we further propose an operator fusion method. It first detects groups of fusible operators, i.e., operators that share the same contexts or computation sub-procedures; successive OPs within a group must be commutative with each other. It then amalgamates the fusible operators of each group into a single fused OP, which executes them together with a larger, localized view of the data. The contexts of each sample are cleaned up after each fused OP, so context management and operator fusion require little extra memory.

Because a fused OP takes longer to run than any of its individual constituents, we further introduce an operator reordering strategy that optimizes the execution sequence of the OP list after fusion. For example, exploiting the commutativity of Filters, we delay time-consuming OPs (such as fused Filters) and prioritize less expensive ones. As a result, the time-consuming OPs handle fewer samples, because the preceding operators have already filtered some out, which improves overall computational efficiency; a simplified sketch follows.
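The sketch below conveys the idea on a toy OP list: Filters that share the same context are fused, and cheaper commutative Filters are moved ahead of the (more expensive) fused OP. The cost numbers and the 'shared_context' tag are simplified assumptions, not the system’s actual scheduler.

def fuse_and_reorder(ops):
    # Fuse Filters that share a context (e.g., segmented words), then run the
    # cheaper commutative Filters first so the fused OP sees fewer samples.
    fusible = [op for op in ops if op.get('shared_context') == 'words']
    others = [op for op in ops if op not in fusible]
    fused = {
        'name': 'fused(' + '+'.join(op['name'] for op in fusible) + ')',
        'cost': sum(op['cost'] for op in fusible),
    }
    return sorted(others, key=lambda op: op['cost']) + [fused]


ops = [
    {'name': 'words_num_filter', 'cost': 5, 'shared_context': 'words'},
    {'name': 'stopwords_filter', 'cost': 7, 'shared_context': 'words'},
    {'name': 'text_length_filter', 'cost': 1},
    {'name': 'special_characters_filter', 'cost': 2},
]
print([op['name'] for op in fuse_and_reorder(ops)])
# ['text_length_filter', 'special_characters_filter', 'fused(words_num_filter+stopwords_filter)']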

The whole procedure of OP fusion is summarized in Figure 6. These amalgamation strategies serve dual purposes. First, they minimize redundant computation by eliminating repetitive yet shared calculations. Second, they reduce the overhead of initializing multiple processes by lowering the total count of processing operations, thus maintaining expeditious data processing routines.

Figure 6: The OP fusion procedure for an OP list.

Optimized Space Utilization: Caching OPs and Compression. Recognizing the inadequacies of the original cache management protocol in the Huggingface-datasets library, especially pertaining to the handling of non-serializable third-party models and functions in certain OPs, we designed a dedicated and simple hashing method for Data-Juicer that bypasses the serialization procedures of those non-serializable objects. This approach ensures successful hashing of each operator, resolves the caching issue for Data-Juicer, and permits Data-Juicer to leverage optimal cache management.

Furthermore, we incorporated the ability for users to activate advanced compression technologies, such as Zstandard (zstd) [22] and LZ4 [56], in Data-Juicer. It automatically compresses cache files after each OP and decompresses them back into normal cache files when the same OP is rerun with the same configuration. Compared with the processing time, the compression/decompression time is relatively negligible due to the high efficiency of these technologies. This feature substantially reduces the volume of cache data storage, facilitating the processing of larger datasets without compromising speed or stability.
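As an illustration of the kind of compression involved, the snippet below compresses and restores a single cache file with the python-zstandard bindings; it is a standalone sketch, not the hook Data-Juicer installs around Huggingface-datasets.

import zstandard as zstd


def compress_cache(path):
    # Compress a cache file to "<path>.zst" after an OP finishes.
    cctx = zstd.ZstdCompressor(level=3)
    with open(path, 'rb') as src, open(path + '.zst', 'wb') as dst:
        cctx.copy_stream(src, dst)
    return path + '.zst'


def decompress_cache(zst_path):
    # Restore the normal cache file before rerunning the same OP.
    dctx = zstd.ZstdDecompressor()
    out_path = zst_path[:-len('.zst')]
    with open(zst_path, 'rb') as src, open(out_path, 'wb') as dst:
        dctx.copy_stream(src, dst)
    return out_path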

Optimized Scalability: Distributed Data Processing. The volume of LLM training data can be extremely large, making it difficult to process with a single machine. To address the challenge, Data-Juicer synergistically meshes with distributed processing frameworks such as Ray [58], as well as big data systems including Apache Beam [6] and Apache Flink [13]. This integration enables distributed data processing and facilitates efficient management of large volumes of data. As a distinct contribution in this realm, we offer the ability to seamlessly translate a data processing pipeline running on a single node into a multi-node cluster, thereby harnessing the value of cluster computing resources and accelerating the processing and delivery iteration.
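From the user’s perspective, such a translation can look as simple as pointing the same per-sample logic at a cluster; the snippet below is a minimal Ray Data sketch with placeholder paths and a toy length filter, not a Data-Juicer recipe.

import ray

ray.init(address='auto')  # connect to an existing Ray cluster

ds = ray.data.read_json('/nas/redpajama/arxiv/*.jsonl')      # placeholder input path
ds = ds.filter(lambda row: 100 < len(row['text']) < 100000)  # toy per-sample length filter
ds.write_json('/nas/processed/arxiv')                        # placeholder output path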

In a nutshell, all of the optimizations described in this section enhance Data-Juicer’s ability, from multiple perspectives, to handle the vast amounts of data involved in LLMs, ensuring robust and efficient processing while minimizing resource requirements.

8        QUANTITATIVE EVALUATION

8.1        Can Data-Juicer Make Better Data Recipes?

The inherent significance of a valuable LLM data processing system lies not only in its comprehensive and flexible operability but also in its ability to generate high-quality data that is more readily “digestible” by LLMs. Data-Juicer offers dedicated and meticulous processing capabilities for various types of text data. This deviates from the traditional simplistic filtering approach, with the aim of generating rich datasets that contain a significant breadth of learnable information. Specifically, we examine the quality of Data-Juicer-generated recipes for both LLM pre-training and post-tuning data.

Figure 7: Evaluation results of reference models on different datasets with the same pre-training procedures.

8.1.1 Refined Pre-training Data Recipes. The pre-training data we produced consists solely of publicly available sources, exemplifying the core principles of transparency and reproducibility. Primarily extracted from RedPajama and the Pile, this dataset has undergone data merging and quality enhancement. For detailed statistics and processing information, please refer to Appendix B.2. To verify the quality of the dataset, we use refined recipes to pre-train LLMs utilizing the mainstream LLaMA architecture and assess the models’ performance across 16 core HELM tasks. For further training details and per-task evaluation, please refer to Appendix B.3.1 and B.4. The evaluation results are visualized in Figure 7, where we evaluated checkpoints throughout the pre-training process with an increasing number of tokens at 50B, 100B, and 150B. Notably, through fair comparisons with equivalent training tokens, LLMs pre-trained on Data-Juicer-recipes consistently outperformed those using only RedPajama or a simple union with the Pile, reinforcing the effectiveness of Data-Juicer.

Moreover, we compared the performant Data-Juicer models with several state-of-the-art baselines and summarized the results in Table 2. With only half the data volume (150B tokens), LLaMA-1.3B pre-trained on the Data-Juicer recipe outperformed Pythia-1.4B [7] (300B tokens), and even beat the highly competitive Falcon-1.3B [63] trained on 350B tokens. Notably, we further labeled 17 subsets from Alpaca-CoT (a collection of 39 public post-tuning datasets) with the “Instruction-Following Tuning (IFT)” tag and performed data merging and cleaning using Data-Juicer. Following the usual practice [99], we incorporated these large-volume IFT data into the pre-training phase and executed continuous training upon the checkpoint of Data-Juicer (RedPajama+Pile)-150B. As reflected in the last two rows of Table 2, Data-Juicer further gains a 4.9% relative improvement over the original Alpaca-CoT-IFT while utilizing only ∼30% of the data.

Table 2: The average score of the pre-trained LLMs on the 16 HELM core tasks. Individual task results and data recipes are detailed in Appendix B.4. “IFT” denotes the datasets tagged with “Instruction-Following Tuning” in our context.

Taken together, these findings underscore the potential of the Data-Juicer system to generate high-quality data and verify the excellence of Data-Juicer-recipes in terms of enhancing LLM performance while reducing LLM training costs.

8.1.2 Refined Post-tuning Data Recipes. For the Alpaca-CoT collection, besides the “IFT” tag validated in Table 2, we also labeled datasets within it with “Supervised Fine-Tuning (SFT)” for human alignment usage. To examine their quality, we used the SFT and EN tags to filter out 4 competitive subsets (Alpaca, GPTeacher, FastChat, and gpt4all) and generated two new datasets whose data volume is close to that of the original Alpaca. We then conducted post-tuning on the generated datasets based on the open-source mainstream English LLaMA-7B. More statistics and training details are in Appendix B.3.2.

For a thorough and comparative performance evaluation, we used the GPT-4 API for pairwise scoring and tallying of wins and ties. The results are consolidated in Table 3, from which we can see that LLMs post-tuned with Data-Juicer-recipes consistently perform well. Firstly, the model trained on Data-Juicer data achieves a 16.25% higher win rate than the LLM trained on the competitive open post-tuning dataset, Alpaca. Secondly, compared to the LLM trained on data randomly sampled from the same candidate subsets (SFT, EN, Random), our model still gains a 7.5% higher win rate, which again attests to the effectiveness of our sampling strategy and the quality of Data-Juicer-recipes for LLMs.

8.2        How Does Data-Juicer Perform and Scale?

8.2.1 End-to-End System Performance. To examine the end-to-end processing performance of Data-Juicer, we consider RedPajama as the state-of-the-art baseline and choose two data recipes with different data sizes and processing OPs. For a fair comparison, we use the official code repository of RedPajama and run Data-Juicer on the aligned recipes to process the whole Books and arXiv datasets.

Table 3: Results of pair-wise model comparisons using GPT-4 scoring. “SFT” and “EN” denote the Supervised Fine-Tuning and English text tags, respectively.

Figure 8: Comparison of stand-alone processing performance, the marker size is proportional to average CPU usage (↓).

We conduct multiple rounds of experiments on different numbers of processes (np=[32, 64, 128]) and monitor several core metrics, including processing time, average memory usage, and average CPU utilization. The monitored time is the wall-clock time of the whole processing pipeline. Average memory and CPU utilization are tracked and calculated by our resource monitoring tool. For more experimental details, please refer to Appendix B.3.3.

The experimental results are summarized in Figure 8. Notably, for all datasets and various numbers of processes, Data-Juicer requires on average 55.6% less processing time, 63.0% less memory, and 52.2% less CPU utilization. In particular, it saves at most 88.7% of processing time and 73.1% of CPU resources for the arXiv dataset compared with the baseline. Also, Data-Juicer takes up only 22.9% of the baseline’s memory when processing the Books dataset, mainly because the baseline’s processing procedure loads the whole dataset at once. Besides, it is worth noting that in Data-Juicer, the IO of reading and writing cache files, rather than CPU computation, is the actual bottleneck; thus Data-Juicer can process datasets faster with fewer CPU and memory resources. In a nutshell, Data-Juicer achieves better end-to-end processing performance than the existing method in multiple aspects.

8.2.2 Effect of Context Management, OP Fusion, and Reordering. As introduced in Sec. 7, our system utilizes dedicated optimization to minimize redundant computations, thereby saving processing time. To examine the effect of these optimization strategies, we prepared three test datasets of varied sizes and sample counts. Each dataset goes through the same processing recipe, which includes 14 OPs (5 Mappers, 8 Filters, and 1 Deduplicator), with 5 of these OPs being fusible. We conduct comparison experiments with 4 processes, except for the largest dataset, where we utilize 50 processes to assess whether these techniques remain effective at larger scales.

The experimental results are shown in Figure 9. Here, both the normalized and actual time consumption for each experimental setup are indicated. In this figure, ‘#p’ denotes the number of processes used in processing the dataset. The results signify that our optimization strategy effectively saves up to 24.91% of the total time for the entire process and saves at most 42.04% of time for those fusible OPs. In addition, the findings showcase that the optimization performs efficiently regardless of variations in dataset sizes or the number of processes utilized.

Figure 9: Time comparison before and after OP fusion.

8.2.3 System Scalability. To verify the enhanced scalability of our system (as detailed in Sec. 7), we carry out a series of experiments to measure data processing times across multiple servers. Specifically, we adopt the StackExchange and arXiv datasets from RedPajama; their total sizes are 65GB and 140GB in jsonl format, respectively. We compare the performance of Data-Juicer on Ray, Data-Juicer on Beam (using the Flink backend), and the original Data-Juicer in these tests. More details about the implementation and experimental platforms are in Appendix B.3.4.

The experiment results are illustrated in Figure 10. Notably, thanks to various optimizations, our original system outperforms both Ray and Beam in the single server scenario. Moreover, as the number of nodes increases, the processing time of our system on Ray decreases proportionally (up to 87.4% and 84.6% time reduction on StackExchange and arXiv respectively), demonstrating its effective scalability across multiple servers.

Nonetheless, the processing time of Data-Juicer on Beam remains almost unchanged as the number of nodes increases. Upon further investigation of the processing workflow, we found that the limited scalability of Data-Juicer on Beam is primarily constrained by the data loading component of Beam, which leads to a dominant file loading time ratio and requires substantial development changes for adaptation and further performance optimization.

Figure 10: Scalability of Data-Juicer.

9        CONCLUSIONS AND FUTURE WORKS

To conclude, the development and deployment of Data-Juicer reflect an inventive step forward in the field of LLM data processing. By offering a user-centric, versatile, and efficient solution, Data-Juicer effectively addresses the existing limitations of open-source tools, which lean towards data reproducibility at the expense of adaptability and usability. The innovative decoupling of traditionally linked components fosters greater abstraction and modularity, and the organic arrangement of over 50 built-in operators, dedicated tools, and abundant data recipes serves diverse LLM pre-training and post-tuning needs. Beyond facilitating automatic, multi-dimensional evaluation, Data-Juicer is carefully optimized and seamlessly integrated with both LLM and distributed computing ecosystems. Empirical validation bears witness to substantial improvements in LLM performance using Data-Juicer’s data recipes, and demonstrates advances in space-time efficiency and scalability. As such, Data-Juicer stands as a compelling addition to the toolkit for LLM data processing, which we hope can shed light on more data-centric, user-friendly, and broader research and developments in the field.

Future Works. In this pioneering exploration of incorporating Hyper-Parameter Optimization into LLM data processing, we demonstrate the potential of connecting data quality and LLM performance with data processing hyper-parameters. This is an area with extensive room for further investigation, which we plan to delve into more comprehensively.

Our research faced limitations in available resources and time. Regarding pre-training data, the validation of data quality was carried out at a scale of 1.3B model parameters. Going forward, we plan to validate larger scales, such as 3B and 7B, to tap the potentially greater value within the data.

Additionally, further operator optimization strategies, improvements in distributed computing efficiency, and support for cloud platforms such as Ali Cloud are the next important steps in our journey.

REFERENCES

  • Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance. (2023).
  • Apache Arrow. 2023. https://arrow.apache.org/
  • Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A General Language Assistant as a Laboratory for Alignment. CoRR abs/2112.00861 (2021).
  • Stephen H. Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M. Saiful Bari, Thibault Févry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged Saeed AlShaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir R. Radev, Mike Tian-Jian Jiang, and Alexander M. Rush. 2022. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. In ACL (demo). 93–104.
  • Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022).
  • Apache Beam. 2023. https://beam.apache.org/
  • Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. In ICML, Vol. 202. 2397–2430.
  • Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. CoRR abs/2204.06745 (2022).
  • Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. 2000. Min-Wise Independent Permutations. J. Comput. System Sci. 60, 3 (2000), 630–659.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS.
  • Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. CoRR abs/2303.12712 (2023).
  • Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large Language Models as Tool Makers. CoRR abs/2305.17126 (2023).
  • Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull. 38, 4 (2015).
  • Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, and Chiyuan Zhang. 2023. Quantifying Memorization Across Neural Language Models. In ICLR.
  • Moses S. Charikar. 2002. Similarity Estimation Techniques from Rounding Algorithms. In STOC. 380–388.
  • ChatGLM2-6B . 2023. https://github.com/THUDM/ChatGLM2-6B
  • ChatLLaMA. 2023. https://github.com/nebuly-ai/nebuly/tree/main/optimization/chatllama
  • Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023. AlpaGasus: Training A Better Alpaca with Fewer Data. CoRR abs/2307.08701 (2023).
  • Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021).
  • Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://vicuna.lmsys.org
  • Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling Instruction-Finetuned Language Models. CoRR abs/2210.11416 (2022).
  • Yann Collet and Murray Kucherawy. 2021. Zstandard Compression and the ’application/zstd’ Media Type. RFC 8878.
  • Together Computer. 2023. RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset. https://github.com/togethercomputer/RedPajama-Data
  • Common Crawl. 2023. https://commoncrawl.org/
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). 4171–4186.
  • Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2022. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. In ICML. 5547–5569.
  • EleutherAI. 2023. Pythia-1.4B. https://huggingface.co/EleutherAI/pythia-1.4b
  • Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. 2023. Towards Revealing the Mystery behind Chain of Thought: a Theoretical Perspective. CoRR abs/2305.15408 (2023).
  • Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR abs/2101.00027 (2021).
  • Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In EMNLP (Findings). 3356–3369.
  • Xinyang Geng and Hao Liu. 2023. OpenLLaMA: An Open Reproduction of LLaMA. https://github.com/openlm-research/open_llama
  • Project Gutenberg. 2023. https://www.gutenberg.org/
  • Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, and Jun Zhu. 2021. Pre- trained models: Past, present and future. AI Open 2 (2021), 225–250.
  • Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings. CoRR abs/2305.11554 (2023).
  • Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In ICLR.
  • Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. CoRR abs/2212.09689 (2022).
  • Technology Innovation Institute. 2023. Falcon-RW-1B. https://huggingface.co/tiiuae/falcon-rw-1b
  • Gautier Izacard, Patrick S. H. Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot Learning with Retrieval Augmented Language Models. CoRR abs/2208.03299 (2022).
  • Abhinav Jain, Hima Patel, Lokesh Nagalapatti, Nitin Gupta, Sameep Mehta, Shanmukha Guttula, Shashank Mujumdar, Shazia Afzal, Ruhi Sharma Mittal, and Vitobha Munigala. 2020. Overview and importance of data quality for machine learning tasks. In KDD. 3561–3562.
  • jsonargparse. 2023. https://github.com/omni-us/jsonargparse
  • Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. In ICML. 10697–10707.
  • Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. 2023. OpenAssistant Conversations – Democratizing Large Language Model Alignment. CoRR abs/2304.07327 (2023).
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In EMNLP (Demonstration).
  • Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Vil- lanova del Moral, Teven Le Scao, Leandro von Werra, Chenghao Mou, Ed- uardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Sasko, Quentin Lhoest, Angelina McMillan-Major, Gérard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, So- maieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber, Manuel Muñoz, Jian Zhu, Daniel van Strien, Zaid Alyafeai, Khalid Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa, Kyle Lo, Manan Dey, Pe- dro Ortiz Suarez, Aaron Gokaslan, Shamik Bose, David Ifeoluwa Adelani, Long Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Margaret Mitchell, Alexandra Sasha Luccioni, and Yacine Jernite. 2022. The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. In NeurIPS.
  • Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In ACL (1). 8424–8445.
  • Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In EMNLP (1). 3045–3059.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL. 7871–7880.
  • Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Sasko, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Pa- try, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément De- langue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander M. Rush, and Thomas Wolf. 2021. Datasets: A Community Library for Natural Language Processing. In EMNLP (Demos). 175–184.
  • Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18 (2017), 185:1–185:52.
  • Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Ko- cetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier De- haene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Car- los Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! CoRR abs/2305.06161 (2023).
  • Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, and You Zhang. 2023. ChatDoc- tor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge. CoRR abs/2303.14070 (2023).
  • Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaud- hary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2022. Holistic Evaluation of Language Models. CoRR abs/2211.09110 (2022).
  • Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. CoRR abs/2303.16634 (2023).
  • Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, and Daphne Ippolito. 2023. A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. CoRR abs/2305.13169 (2023).
  • Ilya Loshchilov and Frank Hutter. 2017. Fixing Weight Decay Regularization in Adam. CoRR abs/1711.05101 (2017).
  • LZ4. 2023. https://www.lz4.org/
  • Kamil Malinka, Martin Peresíni, Anton Firc, Ondrej Hujnak, and Filip Janus. 2023. On the Educational Impact of ChatGPT: Is Artificial Intelligence Ready to Obtain a University Degree? CoRR abs/2303.11146 (2023).
  • Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In OSDI. 561–577.
  • Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In ICLR.
  • OpenAI. 2022. Our approach to alignment research. OpenAI Blog (August 2022).
  • OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).
  • Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
  • Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. CoRR abs/2306.01116 (2023).
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL-HLT. 2227–2237.
  • Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2023. Reasoning with Language Model Prompting: A Survey. arXiv:2212.09597 [cs.CL]
  • Qingyi Si and Zheng Lin. 2023. Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface. https://github.com/PhoebusSi/alpaca-CoT
  • Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. (2020), 140:1–140:67.
  • Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD. 3505–3506.
  • Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, and Jun Yao. 2023. PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. CoRR abs/2303.10845 (2023).
  • Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muen- nighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jer- nite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al. 2022. BLOOM: A 176B- Parameter Open-Access Multilingual Language Model. CoRR abs/2211.05100 (2022).
  • Omid Shahmirzadi, Adam Lugowski, and Kenneth Younge. 2019. Text similarity in vector space models: a comparative study. In ICMLA. 659–666.
  • Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2015), 148–175.
  • Noam Shazeer. 2020. GLU Variants Improve Transformer. CoRR abs/2002.05202 (2020).
  • Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint arXiv:2303.17580 (2023).
  • Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs/1909.08053 (2019).
  • Streamlit. 2023. https://streamlit.io/
  • Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. 2021. RoFormer: Enhanced Transformer with Rotary Position Embedding. CoRR abs/2104.09864 (2021).
  • Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. 2023. PandaGPT: One Model To Instruction-Follow Them All. CoRR abs/2305.16355 (2023).
  • Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. CoRR abs/2107.02137 (2021).
  • Zhongxiang Sun. 2023. A Short Survey of Viewing Large Language Models in Legal Aspect. CoRR abs/2303.09136 (2023).
  • Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. 2022. MVP: Multi-task Supervised Pre-training for Natural Language Generation. CoRR abs/2206.12131 (2022).
  • Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca
  • Ryoko AI Team. 2023. ShareGPT52K. https://huggingface.co/datasets/RyokoAI/ShareGPT52K
  • Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang. 2023. Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding. CoRR abs/2305.12031 (2023).
  • Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023).
  • Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. CoRR abs/2304.06975 (2023).
  • Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amir- reza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super- NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. In EMNLP. 5085–5109.
  • Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models are Zero-Shot Learners. In ICLR.
  • Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. CoRR abs/2206.07682 (2022).
  • Jerry W. Wei, Le Hou, Andrew K. Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. 2023. Symbol tuning improves in-context learning in language models. CoRR abs/2305.08298 (2023).
  • Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, and Wenjuan Han. 2023. Zero-Shot Information Extraction via Chatting with ChatGPT. CoRR abs/2302.10205 (2023).
  • Wikipedia. 2023. https://en.wikipedia.org/wiki/Main_Page
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP (Demos). 38–45.
  • Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David S. Rosenberg, and Gideon Mann. 2023. BloombergGPT: A Large Language Model for Finance. CoRR abs/2303.17564 (2023).
  • Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, and Adams Wei Yu. 2023. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. CoRR abs/2305.10429 (2023).
  • Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open- Source Financial Large Language Models. CoRR abs/2306.06031 (2023).
  • Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. 2022. GLM-130B: An Open Bilingual Pre-trained Model. CoRR abs/2210.02414 (2022).
  • Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. In NeurIPS. 12360–12371.
  • Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. CoRR abs/2205.01068 (2022).
  • Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2023. XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters. CoRR abs/2305.12002 (2023).
  • Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. CoRR abs/2303.18223 (2023).
  • Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. CoRR abs/2304.06364 (2023).
  • Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In ICCV. 19–27.

APPENDIX OF DATA-JUICER: A ONE-STOP DATA PROCESSING SYSTEM FOR LARGE LANGUAGE MODELS

A       ADDITIONAL DETAILS OF DATA-JUICER

A.1       Base Classes of OPs in Data-Juicer

We illustrate the core base classes of operators (OPs) in Data-Juicer in Listing 1.

Listing 1: The illustration of OP base classes in Data-Juicer.
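Since the listing body is not reproduced in this text, the following hedged reconstruction conveys its gist; the exact signatures and method names in the released code may differ.

class Mapper:
    """Edits a single sample and returns it."""

    def process(self, sample):
        raise NotImplementedError


class Filter:
    """Computes per-sample statistics, then decides whether to keep the sample."""

    def compute_stats(self, sample):
        raise NotImplementedError

    def process(self, sample):
        raise NotImplementedError


class Deduplicator:
    """Computes hashes over samples and removes duplicates across the dataset."""

    def compute_hash(self, sample):  # method name is an assumption
        raise NotImplementedError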

A.2 Theoretical Analysis of Space Usage for Caches and Checkpoints

Caches are generated by several functions of Dataset, such as map and filter. Generally, caches can be categorized into cache data and indices. The total size of a set of indices is very small, so we ignore it in the space usage analysis. In contrast, the size of the cache data is nearly the same as that of the input dataset. Here we assume that cache data and checkpoints are each the same size as the input dataset, and that one cache data file exists for the original dataset after it is loaded.

Assume that there are M Mappers, F Filters, and D Deduplicators in the processing configuration, and that the size of the original dataset is S. The detailed analysis for cache mode and checkpoint mode is given below.

Space Usage of Cache Mode. Caches are generated after each OP. Each Mapper, Filter, and Deduplicator generates one set of cache data. In addition, the first Filter generates an extra set of cache data because a new column for storing statistics is added to the dataset. Therefore, the total disk space usage of caches is:

Space[cache_mode] = (1 + M + F + I(F > 0) + D) × S

where I(·) is the indicator function, which returns 1 when its argument is true and 0 otherwise.

Space Usage of Checkpoint Mode. Checkpoints are only generated when an exception or error occurs. However, caches are still stored even after disabling the cache mode, due to the behavior of Dataset. We therefore clean up older caches after each OP. The detailed cleanup pipeline is: 1) OP_i finishes; 2) caches for OP_i are generated; 3) caches for OP_{i-1} are cleaned up. Thus, theoretically, at most two sets of caches exist at the same time (in step 2). Considering the cache of the original dataset, the peak disk space usage of caches in checkpoint mode is:

Space[checkpoint_mode] = 3 × S
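As a concrete example, consider the 14-OP recipe used in Sec. 8.2.2, which contains 5 Mappers, 8 Filters, and 1 Deduplicator. Cache mode then occupies (1 + 5 + 8 + 1 + 1) × S = 16 × S of disk space, whereas checkpoint mode peaks at 3 × S regardless of the number of OPs.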

B        ADDITIONAL NUMERICAL RESULTS

Table 4: Evaluation results of three types of quality classifiers.

Table 5: Comparison of keeping ratio on CommonCrawl.

B.1        Quality Classifier

Firstly, we show how we reproduce the GPT-3 quality classifier and achieve comparable performance.

We follow the training procedure of the quality classifier in GPT-3 [10], which used a logistic regression classifier with features from PySpark's standard tokenizer and HashingTF. Based on this, we expand the training pipeline to Chinese text and various code types. The training details are listed in Table 6, where the keeping methods include (a minimal sketch of the pipeline follows the list below):

  • label: doc_score > 0.5
  • pareto [10]: doc_score > 1 − np.random.pareto(α), with α = 9
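A minimal sketch of this training pipeline with PySpark ML is given below; the column names and data path are placeholders rather than the exact configuration behind Data-Juicer's classifiers.

import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('/path/to/labeled_docs')  # placeholder; columns: text, label

pipeline = Pipeline(stages=[
    Tokenizer(inputCol='text', outputCol='words'),
    HashingTF(inputCol='words', outputCol='features'),
    LogisticRegression(featuresCol='features', labelCol='label'),
])
model = pipeline.fit(df)

# The two keeping rules above, with doc_score = predicted probability of "high quality":
doc_score = 0.8
keep_by_label = doc_score > 0.5
keep_by_pareto = doc_score > 1 - np.random.pareto(9)  # pareto rule with alpha = 9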

We split these datasets into training and evaluation splits with a ratio of 4:1. Classifiers trained on the training split are then evaluated on the evaluation split. Experimental results are shown in Table 4. As we can see, the reproduced GPT-3 classifier and its Chinese version perform well, except for the Code version. We speculate that the current positive/negative splitting method for the Code quality classifier might not be a good choice, and we leave this issue to future research.

Besides, we compare the keeping ratios obtained when using these classifiers to re-sample CommonCrawl, between the original GPT-3 quality classifier and our reproduced classifiers, as shown in Table 5. The keeping ratio of the original GPT-3 quality classifier is estimated from the data sizes before and after filtering reported in the GPT-3 paper [10]. We can see that the keeping ratios of our reproduced GPT-3 quality classifiers are basically aligned with the original one.

Table 6: Training configuration of 3 types of quality classifiers.

B.2        Data Recipes

For pre-training data, we acquired a vast amount of raw textual corpora primarily following the procedural guidelines of RedPajama [23] and the Pile [29]. The common subsets were merged and subjected to Data-Juicer refinements. The resultant data recipe is presented in Table 7, which covers 15 prominent components. We use the SentencePiece [43] tokenizer as implemented in GPT-NeoX-20B [8] to prepare text and report the counted number of tokens. The sampling proportion is the normalization of token numbers, except for Books and Wikipedia, which undergo 2 and 2.5 epochs respectively, to enhance the weighting of high-quality corpora.

Table 7: Statistics of Data-Juicer’s pre-training data.

For post-tuning data, we merge and refine tens of Alpaca-CoT datasets. Each dataset can be categorized by language into English / Chinese / Multilingual; by usage into multi-round dialog / instruction fine-tuning / supervised fine-tuning / preference; by task type into multi-task / task-specific; and by generation method into human-generated / self-instruct / mixed / collection of datasets. The detailed numbers of datasets for each category are presented in Table 8.

Table 8: Statistics of Data-Juicer post-tuning data used in our experiments. ∗These tags are newly added by Data-Juicer compared to the original tag sets of Alpaca-CoT [66].

More information about these datasets can be found on the Data-Juicer recipes page of our repository.

B.3        Experiments Details

B.3.1 Models and Training For Pre-training Data. We adhere to the official paper [87] and leverage the open-source implementation [31] to build standard LLaMA models. Concretely, we apply RMSNorm [100], the SwiGLU activation [75], and rotary positional embedding [79] on a decoder-only transformer architecture. The LLaMA-1.3B model is composed of 24 transformer layers, each with 16 self-attention heads and 2048 bottleneck units.
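For reference, these dimensions could be expressed with HuggingFace's LlamaConfig as sketched below; only the hidden size, layer count, and head count come from the text, while the feed-forward width is an assumed LLaMA-style value and the remaining fields are left at library defaults.

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2048,         # 2048 bottleneck units
    num_hidden_layers=24,     # 24 transformer layers
    num_attention_heads=16,   # 16 self-attention heads per layer
    intermediate_size=5504,   # assumed LLaMA-style FFN width for roughly 1.3B parameters
)
model = LlamaForCausalLM(config)
print(f'{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters')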

LLMs are pre-trained using the AdamW optimizer [55] with hyper-parameters β1 = 0.9 and β2 = 0.95. For LLaMA-1.3B, the learning rate gradually increases to 2e-5 over 1% warm-up steps and finally decays to 10% of its peak through a cosine schedule. The weight decay is set to 0.1 and the gradient ℓ2-norm is clipped to 1.0.
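Under the stated hyper-parameters, the optimizer and schedule could be set up roughly as follows in plain PyTorch; the total step count and the stand-in model are placeholders.

import math
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the LLaMA model
total_steps = 100000            # placeholder
warmup_steps = int(0.01 * total_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # linear warm-up to the peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))  # cosine decay from 1.0 to 0.1

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()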

B.3.2 Models and Training of Post-Tuning Data. For post-tuning, we choose LLaMA-7B as our base model and fine-tune it for 3 epochs. We follow the hyper-parameter settings in Alpaca [84]. Specifically, the optimizer is AdamW with a learning rate of 2e-5, a global batch size of 256, and a weight decay of 0. The learning rate follows a cosine schedule with 3% initial warm-up steps.

B.3.3 System Performance Experiments. The end-to-end processing experiments mentioned in Sec. 8.2.1 are all conducted on the same machine with 128 cores of Intel(R) Xeon(R) Platinum 8369B and about 990GB of memory. Before these experiments, the original datasets, third-party models, and other assets are prepared in advance for both the baselines and Data-Juicer, and the intermediate cache files are cleaned after every complete run of Data-Juicer. After processing, the result dataset is exported to the local SSD using the same number of processes as in processing.

As for the resource monitoring tool, it is implemented based on the psutil library. It samples the CPU utilization of all CPUs and the memory usage of all related processes every second during the processing pipeline. We then compute the average CPU utilization ratio by summing the per-process CPU utilization ratios and dividing by the number of processes used in each experiment. Finally, we aggregate all samples and compute the average CPU utilization ratio and memory usage over time.
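A bare-bones version of such a monitor, assuming the psutil library and a known list of worker PIDs, might look like this:

import time
import psutil


def monitor(pids, duration_s=60, interval_s=1.0):
    # Once per second, sum CPU utilization and resident memory over the workers.
    procs = [psutil.Process(pid) for pid in pids]
    cpu_samples, mem_samples = [], []
    for _ in range(int(duration_s / interval_s)):
        cpu = sum(p.cpu_percent(interval=None) for p in procs)  # % across all workers
        mem = sum(p.memory_info().rss for p in procs)           # resident memory in bytes
        cpu_samples.append(cpu)
        mem_samples.append(mem)
        time.sleep(interval_s)
    # Average CPU utilization per process and average memory usage over time.
    return (sum(cpu_samples) / len(cpu_samples) / len(procs),
            sum(mem_samples) / len(mem_samples))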

B.3.4 Scalability. Our experiments are performed on a platform comprising 16 servers, each equipped with a 64-core Intel(R) Xeon(R) Platinum CPU (mix of 8269CY and 8163 models) and 512 GB of memory. The network bandwidth shared among these servers is 20 Gbps. We utilize NAS storage to house both the raw data and the processed results.

We consider the two baselines as follows:

  • Data-Juicer on Ray: We implement a Ray [58] executor for Data-Juicer, which only adapts the underlying interfaces of HuggingFace-datasets to Ray Datasets, while all OPs of Data-Juicer remain unchanged. This implies that user code based on our native Python version can be seamlessly migrated from a single-machine setup to distributed computing environments.
  • Data-Juicer on Beam: This method is based on Apache Beam with the Apache Flink Runner. When compared to the Ray version, the Beam version requires additional code development to meet the demands of the Beam data processing pipeline. This includes the adaptations of several OPs and the replacement of the Formatter/Exporter with counterparts in Beam.

B.4        Per-Task Evaluation

For a thorough and consolidated assessment, we have summarized the individual scores of evaluated LLMs on the 16 core HELM assessment tasks in Table 9.

Table 9: Evaluation results on 16 core tasks of HELM benchmark.

Task                              Falcon-1.3B   Pythia-1.4B   LLaMA-1.3B (Data-Juicer)   LLaMA-1.3B (Data-Juicer IFT)
MMLU                              24.7          26.0          25.9                       27.0
BoolQ                             63.0          56.0          49.0                       56.0
NarrativeQA                       32.1          31.5          38.2                       49.9
NaturalQuestions (closed-book)    10.7          10.5          10.1                       11.2
NaturalQuestions (open-book)      50.0          49.8          45.9                       54.3
QuAC                              24.3          26.5          26.0                       21.7
HellaSwag                         67.0          57.0          56.0                       52.0
OpenbookQA                        44.0          34.0          40.0                       43.0
TruthfulQA                        19.0          21.0          33.0                       33.0
MS MARCO (regular)                16.8          12.9          11.2                       12.1
MS MARCO (TREC)                   33.5          27.4          26.9                       28.1
IMDB                              55.0          84.0          80.0                       84.0
XSUM                              5.7           6.5           5.2                        5.3
CNN/DailyMail                     4.0           8.4           7.8                        11.1
CivilComments                     49.4          49.7          50.1                       50.0
RAFT                              44.3          42.3          42.1                       49.3