
AN EXPLORATION OF IN-CONTEXT LEARNING FOR SPEECH LANGUAGE MODEL


Ming-Hao Hsu1, Kai-Wei Chang1, Shan-Wen Li2, Hung-yi Lee1

National Taiwan University1, Meta AI2

ABSTRACT

Ever since the development of GPT-3 in the natural language processing (NLP) field, in-context learning (ICL) has played an important role in utilizing large language models (LLMs). By presenting utterance-label demonstrations to the LM at the input, the LM can accomplish few-shot learning without relying on gradient descent or requiring explicit modification of its parameters. This enables the LM to learn and adapt in a black-box manner. Despite the success of ICL in NLP, little work has explored the possibility of ICL in speech processing. This study proposes the first exploration of ICL with a speech LM without text supervision. We first show that the current speech LM does not have the ICL capability. With the proposed warmup training, the speech LM can then perform ICL on unseen tasks. In this work, we verify the feasibility of ICL for speech LM on speech classification tasks.

Index Terms— In-context learning, speech language model, prompt tuning, few-shot learning, speech classification

1.    INTRODUCTION

Large language models (LLMs) [1, 2] have gained significant attention in recent years. With the development of LLMs like GPT-3 [3], researchers have discovered the potential for performing in-context learning (ICL) [3, 4]. ICL is a technique that enables LMs to learn new tasks from a small number of demonstrations presented at the input of the LM. Formally, we consider a set of data points, denoted as x_i, along with their corresponding labels, denoted as y_i. Additionally, we have a target data point, x_t, for which we want the LLM to make an inference. To achieve this, we prepend the demonstrations, consisting of the data points and their labels, to the input sequence as follows: [x_1, y_1, x_2, y_2, ..., x_n, y_n, x_t]. By learning the analogy between the data points and their labels, the LLM is capable of directly predicting the label of x_t. It is important to note that throughout this learning process, the LLM remains fixed, and no backward gradient pass is involved. Instead, the LLM relies solely on the input demonstrations to acquire knowledge.
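To make the interleaving concrete, the following Python snippet is a minimal sketch of how an ICL input sequence could be assembled; the demonstration pairs, the target, and the frozen LM call are placeholders for illustration and are not part of the original paper.

```python
# Minimal sketch: assembling an in-context learning input sequence.
# The demonstration pairs, the target, and any downstream LM call are
# placeholders for illustration only.
def build_icl_input(demonstrations, target):
    """Interleave (x_i, y_i) pairs and append the target x_t."""
    sequence = []
    for x_i, y_i in demonstrations:
        sequence.extend([x_i, y_i])
    sequence.append(target)  # [x_1, y_1, ..., x_n, y_n, x_t]
    return sequence

# Toy text example: a frozen LM would be asked to continue this sequence
# with the label of the target; no gradients or parameter updates are involved.
demos = [("great movie", "positive"), ("boring plot", "negative")]
prompt = build_icl_input(demos, "what a masterpiece")
```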

ICL, being a new paradigm for LLM, offers several advantages. First, ICL simplifies the integration of human knowledge into the LM by providing demonstrations. This process

Fig. 1: This figure illustrates the framework of the proposed approach, where warmup training utilizes a variety of training tasks to instil in-context learning abilities in the speech language model. This allows the model to utilize task demonstrations effectively to tackle novel tasks.

resembles the analogy reasoning process of humans [1]. Second, ICL incorporates demonstrations at the input, eliminating the need for backpropagation and gradient flow to establish connections between data points and their labels. As a result, computational costs are reduced. Third, LLMs are often released as a service in real-world applications [3, 5, 6, 7]. ICL is particularly suitable for LM deployment since it only modifies the input, allowing LLMs to adapt to new tasks defined by the users [5]. Overall, in the field of natural language processing (NLP), ICL has emerged as a powerful paradigm for utilizing LLMs. However, despite the recent advancements in large speech language models, there is a notable lack of research on ICL in the domain of speech processing.

In recent years, several speech language models (speech LMs) have emerged. Speech LMs quantize speech representations into discrete speech tokens and engage in the next token prediction pre-training task, akin to language models in the NLP field. Demonstrating robust capabilities, these speech LMs can generate novel speech unconditionally or conditioned on specific speech segments. Notable examples include the generative spoken language model (GSLM) [8], pGSLM [9], AudioLM [10], TWIST [11], and others.

In this paper, we first examine the ICL ability of the current largest and open-sourced generative speech LM, GSLM [8]. We observe that GSLM fails to comprehend the provided speech-label pairs, indicating a lack of capability to perform ICL. To build a speech LM with ICL ability, we propose conducting a simple warmup training with prompt tuning [12] on a set of training tasks to enable the speech LM to understand the demonstrations and make predictions.

The experimental results indicate that GSLM, when subjected to warmup training, demonstrates the capability to perform ICL not only on seen tasks but also, surprisingly, on unseen tasks. It surpasses the random guessing baseline for all tasks and is comparable to the linear classifier in specific tasks. It is worth noting that, in this paper, we aim to show the feasibility of ICL for speech LM, not to outperform the current state-of-the-art methods. Our contributions are as follows:

  • We investigate the in-context learning capability of the existing speech LM and identify its limitations in this regard.
  • We introduce warmup training for a speech LM, making it the first speech LM able to perform in-context learning effectively.
  • We empirically demonstrate that the speech LM can effectively learn and adapt to unseen tasks through ICL and achieve non-trivial results, surpassing the performance of a random sampling baseline.

2.    RELATED WORKS

2.1.    In-context Learning

In recent years, there has been a growing interest in large language models (LLMs). One notable LLM is GPT-3 [3], which has introduced a new paradigm known as in-context learning (ICL) [4]. ICL does not involve gradient descent and parameter updates during learning. It relies on the demonstrations given at the input to guide the model’s prediction. While some LLMs have shown the ability for ICL, a disparity persists between the LM’s pre-training task and ICL. Therefore, some studies attempted to perform warmup training [4] to enhance the ICL capability in a supervised [13] or self-supervised [14, 15] manner. However, it is worth mentioning that these efforts have primarily concentrated on LLMs in the NLP domain. Despite the development of several speech-based LMs in recent years, there has been limited exploration of ICL on speech LMs.

In speech processing, one relevant work is WavPrompt [16]. It integrates an audio encoder and a text LM (specifically, GPT-2 [17]). This model is then pre-trained using paired data with an automatic speech recognition (ASR) task, allowing the model to make predictions based on task demonstrations that incorporate both speech and text question-answer pairs. Another recent work utilizing ICL for speech processing is presented in [18]. This work uses speech-transcription pairs as demonstrations, which are then input into Whisper [19] for test-time adaptation in ASR. However, both [16] and [18] focus on limited speech processing tasks, spoken language understanding (SLU) and ASR, and utilize models pre-trained on speech-text paired data. In contrast, our work emphasizes the use of speech LM for diverse tasks without relying on text supervision.

2.2.    Speech Language Models

With the advancements in self-supervised speech models [20], including CPC [21], wav2vec 2.0 [22], HuBERT [23], and w2v-BERT [24], training speech LMs based on these informative representations has become promising. Generative Spoken Language Modeling (GSLM) [8] is a pioneering work. GSLM proposed performing speech quantization on these speech representations to obtain discrete speech tokens. A generative LM is then trained using these speech tokens, and it has shown promising results in generating novel speech conditionally or unconditionally.

Besides GSLM, other speech LMs have been proposed by training larger models on larger datasets [11] or by incorporating neural speech codecs [25, 26, 27] to provide richer information for language modeling [10]. For this research, we have chosen to adopt GSLM as our backbone model because of its availability as an open-source generative speech LM. We believe that as diverse speech LMs continue to advance, the significance of ICL behavior will become increasingly evident, mirroring the observations made in the field of natural language processing (NLP) [28, 29].

VALL-E [30, 31] also claims to demonstrate ICL capability. However, we would like to emphasize the differences between our work and VALL-E. Firstly, VALL-E is a text-to-speech model trained in a language modeling manner with text supervision, while our work focuses on textless generative speech LMs. Secondly, the ICL capability in VALL-E is to learn the acoustic conditions of a given speech segment rather than performing a new task that hasn’t been encountered during the pre-training stage. VALL-E consistently performs the text-to-speech task, whereas our work focuses on the typical ICL scenario in an LM and is expected to learn the analogy of the demonstrations, with particular emphasis on whether ICL is feasible for unseen tasks.

2.3.    Prompting Speech Language Models

In this paper, we apply prompt tuning to a speech LM to perform warmup training for ICL. This approach is inspired by the recent works [32, 33], which have shown that, with a small set of trainable parameters, prompts can effectively modify the behavior of a pre-trained model. SpeechPrompt v2 [34] further demonstrates the ability of prompt tuning for transfer learning in a wide range of speech classification tasks. In this work, rather than directly performing speech classification tasks with prompt tuning, we adopt prompt tuning to empower the speech LM with ICL ability. The speech LM then performs ICL on unseen tasks without any further parameter updates.

Model reprogramming [35, 36] is another method similar to prompting and can be used for task adaptation. It involves applying a task-specific transformation to the input data, allowing the pre-trained model to adapt to a new task. For example, in [37], model reprogramming is applied to an acoustic model to adapt it to low-resource spoken command recognition (SCR). Additionally, [38, 39] use reprogramming for domain adaptation in dialect identification and speech recognition. Although the aforementioned works can apply prompting or reprogramming methods to adapt a model to new tasks, they do not perform ICL and require further training to adapt to new tasks.

3.    METHOD

This paper first investigates the ICL ability of the pre-trained GSLM [8], the largest open-sourced generative speech LM. We focus on simple speech classification tasks: we provide speech-label pairs as demonstrations at the input (an example is shown in Fig. 2) and let the model predict the label of a target utterance.

Our findings indicate that the current GSLM does not possess ICL ability. As shown in Table 1, “w/o Warmup” is the performance of directly applying ICL on GSLM without warmup training. “Random” is the performance of randomly guessing a label for the speech classification tasks. We find that applying ICL directly to GSLM yields results worse than random guessing.

To address this limitation and build a speech LM with ICL ability, we propose conducting warmup training on GSLM. For the warmup training, we utilize a set of training tasks denoted as T_train to enhance GSLM’s ICL capability. Specifically, we employ the prompt tuning [12, 32] method. The choice of applying prompt tuning is deliberate, and other tuning methods can also be adopted in warmup training, as discussed in previous works [4]. However, we choose prompt tuning for two reasons: (1) Preliminary experiments revealed that fine-tuning the entire model for warmup training is unstable and might lead to inferior ICL performance. (2) Prompt tuning involves prepending prompt vectors at the input side while keeping the pre-trained GSLM fixed. This preserves the generative capability of the pre-trained GSLM, which is beneficial for future applications.

In the following sections, we outline our approach to conducting warmup training with prompt tuning on the set of training tasks T_train and evaluate the ICL capability of the model on the training tasks T_train (seen tasks) and testing tasks T_test (unseen tasks).

3.1.    Warmup Training

Given a speech LM M that performs next token prediction on discrete speech tokens x:

x_{t+1} = M(x_1, x_2, ..., x_t),                    (1)

where t is the timestep, we first collect a set of training tasks T_train to perform warmup training. Each task T_i in T_train uses its own dataset. To form one training data point for ICL warmup training, we conduct the following procedure:

(1) We randomly sample n utterances and their corresponding labels from a training task as demonstrations. (2) Following GSLM [8], the utterances are first encoded into discrete unit sequences x_1, x_2, ..., x_n. Additionally, we randomly map the labels to discrete units using a verbalizer, generating labels for the demonstrations y_1, y_2, ..., y_n. The verbalizer [12] serves as a mapping table, a construct that can also be defined through human effort or heuristic methods [32]. This mapping bridges the gap between the labels of downstream tasks and the pre-trained LM’s vocabulary. (3) Each unit sequence is then truncated or padded to the same utterance length L, yielding x̃_1, x̃_2, ..., x̃_n. We found this step critical as it provides a standardized format to the speech LM and simplifies the training. (4) The input data is constructed as

X = [x̃_1, ⟨s⟩, y_1, ⟨s⟩, ..., x̃_n, ⟨s⟩, y_n, ⟨s⟩, x̃_t, ⟨s⟩],       (2)

where ⟨s⟩ is a trainable separation token in GSLM’s vocabulary, and x̃_t is the target utterance whose label we want the model to predict. During warmup training, we randomly sample the target utterance from the demonstrations, that is, x̃_t ∈ {x̃_1, x̃_2, ..., x̃_n}. The model then learns to compare the target utterance and the demonstrations in order to predict the correct label. We find this step simple but effective since it simplifies the training objective. The model is tasked with comparing the target utterance with each demonstration and outputting the corresponding label. The learned behavior benefits ICL in the next stage. In a preliminary study, we found it necessary to introduce this step: not duplicating a demonstration as the target makes the training unstable. The underlying mechanism and ways to improve warmup training remain future work.

Fig. 2: The figure illustrates the in-context learning method in our work. The model predicts the label of the target utterance conditioned on demonstrations and the prompts. During the whole process, the backbone GSLM is fixed.
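As a concrete illustration of steps (1)-(4), the following Python sketch assembles one warmup training example. The unit encoder, verbalizer, pad unit, and separator symbol are stand-in assumptions, not GSLM's actual tokenization details.

```python
import random

# Hypothetical placeholders: a pad unit, a separator symbol, a unit encoder,
# and a verbalizer mapping task labels to units are all assumed here.
PAD_UNIT, SEP = 0, "<s>"

def fit_length(units, L):
    """Truncate or pad a discrete-unit sequence to a fixed length L."""
    return units[:L] + [PAD_UNIT] * max(0, L - len(units))

def make_warmup_example(task_data, encode_units, verbalizer, n=4, L=50):
    # (1) sample n (utterance, label) demonstrations from one training task
    demos = random.sample(task_data, n)
    # (2) encode utterances into discrete units; map labels via the verbalizer
    demos = [(encode_units(x), verbalizer[y]) for x, y in demos]
    # (3) truncate or pad every unit sequence to the same length L
    demos = [(fit_length(units, L), label) for units, label in demos]
    # during warmup, the target is duplicated from the demonstrations
    target_units, target_label = random.choice(demos)
    # (4) interleave demonstrations, separators, and the target (Eq. (2))
    X = []
    for units, label in demos:
        X += units + [SEP, label, SEP]
    X += target_units + [SEP]
    return X, target_label  # the model learns to predict target_label
```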

We follow SpeechPrompt v2 [34] to perform deep prompt tuning to guide the model in making predictions based on the demonstrations. Prompt tuning methods involve training a set of prompt vectors P that are prepended at the input of the speech LM. The speech LM then makes the prediction conditioned on the demonstrations X and the prompt vectors P. We then apply a cross entropy (CE) loss between the model prediction and the ground truth label of the target utterance y_t to optimize the prompts:

L = CE(M(X; P), y_t)                     (3)
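The PyTorch sketch below shows a simplified, input-level variant of this training step: only the prompt vectors receive gradients while the backbone stays frozen. The paper itself follows SpeechPrompt's deep prompt tuning, and the interfaces (embedding table, backbone signature, dimensions) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Simplified sketch of prompt tuning for warmup training. Only the prompt
# vectors are trainable; the pre-trained speech LM stays frozen. This is an
# input-level variant; the paper uses deep prompt tuning following SpeechPrompt.
class PromptTunedSpeechLM(nn.Module):
    def __init__(self, backbone, embed, d_model, prompt_len=5):
        super().__init__()
        self.backbone = backbone  # frozen pre-trained speech LM (returns logits)
        self.embed = embed        # frozen token-embedding table
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                  # (B, T, D)
        p = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([p, x], dim=1))  # prepend the prompts

def warmup_step(model, optimizer, token_ids, target_label_id):
    logits = model(token_ids)[:, -1, :]   # prediction at the final position
    loss = nn.functional.cross_entropy(logits, target_label_id)  # Eq. (3)
    optimizer.zero_grad()
    loss.backward()                       # gradients reach only the prompts
    optimizer.step()
    return loss.item()
```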

3.2.    In-context Learning

After completing the warmup training, the model is ready to perform ICL on the training tasks T_train (seen tasks) and testing tasks T_test (unseen tasks). The process of preparing demonstrations during this stage is similar to the warmup stage, as described in Eq. (2). However, in the ICL stage, the target utterance x̃_t is no longer included in the demonstrations. Instead, its corresponding label y_t ∈ {y_1, y_2, ..., y_n} should appear among the demonstration labels, enabling the model to make predictions based on analogies.
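Mirroring the warmup sketch above, the ICL-stage input differs only in that the target utterance is held out from the demonstrations; the helpers (fit_length, SEP, encode_units, verbalizer) are the same hypothetical ones as before.

```python
# ICL-stage input construction, reusing the hypothetical helpers from the
# warmup sketch (fit_length, SEP, encode_units, verbalizer).
def make_icl_example(demos, target_utt, encode_units, verbalizer, L=50):
    demos = [(fit_length(encode_units(x), L), verbalizer[y]) for x, y in demos]
    X = []
    for units, label in demos:
        X += units + [SEP, label, SEP]
    # unlike warmup, the target is NOT one of the demonstrations; its true
    # label, however, must appear among the demonstration labels
    X += fit_length(encode_units(target_utt), L) + [SEP]
    return X
```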

4.    EXPERIMENTAL SETUP

4.1.    Tasks and Datasets

We evaluate our proposed ICL method on a diverse set of speech classification tasks spanning 8 datasets, including speech command recognition (SCR), fake speech detection (FSD), emotion recognition (ER), language identification (LID), and sarcasm detection (SD). The datasets also involve varying languages, accents, domains, and label distributions, allowing a comprehensive assessment. We select two groups of training and testing tasks (see Table 1). These tasks are combined in ways that introduce variety, for instance, tasks with different numbers of categories or different types of tasks. This approach enables us to evaluate our model’s performance in ICL and gauge the impact of the training tasks on performance across a range of testing tasks. Our attention is specifically directed toward the model’s capacity to utilize learned knowledge during training when faced with a new task. This capability is essential for many practical uses in a multitude of real-world situations.

We’ve chosen particular datasets for both training and testing, as depicted in Table 1. Our aim for each training dataset group is to generate a balanced dataset to avoid any bias. To meet this goal, we collect 10,000 data points, each including 4 demonstrations from the training tasks. The same number of data points is ensured for every task, providing a balanced dataset in which each task gets an equal proportion of demonstrations. If we neglected to do this and simply used the entire datasets, the larger ones would dominate a significant portion of the combined data, resulting in a skewed training set.
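A minimal sketch of this balancing step is shown below, assuming each training task provides a list of (utterance, label) pairs; the totals follow the numbers above, while the data structures and function names are illustrative assumptions.

```python
import random

# Minimal sketch of the balanced warmup-set construction: every training task
# contributes the same number of ICL data points so that large datasets do not
# dominate. Data structures and function names are illustrative assumptions.
def build_balanced_warmup_set(train_tasks, total_points=10_000, n_demos=4):
    """train_tasks: dict mapping task name -> list of (utterance, label) pairs."""
    per_task = total_points // len(train_tasks)
    warmup_set = []
    for task_name, dataset in train_tasks.items():
        for _ in range(per_task):
            demos = random.sample(dataset, n_demos)
            warmup_set.append((task_name, demos))
    random.shuffle(warmup_set)
    return warmup_set
```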

4.2.    Implementation Details

We adopt GSLM [8] as our backbone speech LM. Specifically, the GSLM is trained on top of discrete units encoded by the HuBERT [23] SSL speech model and a K-means clustering algorithm with 100 clusters.
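The discretization pipeline can be sketched as follows, with a placeholder HuBERT feature extractor and scikit-learn's KMeans standing in for the actual GSLM tooling; the choice of feature layer is omitted, and the run-length collapsing of repeated units follows GSLM's reported practice.

```python
from sklearn.cluster import KMeans

# Sketch of the unit-extraction pipeline: frame-level SSL features are
# clustered with K-means (100 clusters) and each frame is replaced by its
# cluster index. `hubert_features` is a placeholder for a pre-trained HuBERT
# feature extractor.
def fit_quantizer(feature_matrix, n_clusters=100):
    """feature_matrix: (num_frames, feat_dim) array of pooled HuBERT features."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit(feature_matrix)

def encode_units(waveform, hubert_features, quantizer):
    feats = hubert_features(waveform)      # (num_frames, feat_dim)
    units = quantizer.predict(feats)       # one cluster id per frame
    # GSLM collapses runs of identical units into a single token
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```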

In warmup training, we conduct prompt tuning with a prompt length of 5. This approach introduces a small fraction of trainable parameters, specifically less than 0.1% of the speech LM’s 150 million parameters, simplifying the learning process. As described in Section 3, our approach enforces a fixed length for utterances. In the primary experiment in Sec. 5.1, we fix the utterance length L to 50. This standardization ensures consistent utterance lengths across multiple datasets and simplifies the training. We also investigate the impact of varying utterance lengths in Sec. 5.4, specifically testing lengths of 10, 30, and 50 units. We provide four demonstrations and one target utterance in both warmup training and the ICL stage.

Table 1: This table displays the accuracy for ICL with warmup training (w/ Warmup), random guessing (Random), ICL without warmup training (w/o Warmup), and SVC on seen tasks (T_train) and unseen tasks (T_test) for the two task groups. The results demonstrate that warmup training enables effective in-context learning, outperforming the Random and w/o Warmup baselines on diverse speech tasks.

Given the limited research on ICL for speech LMs, we compare our proposed method with three baselines: (1) random sampling, (2) ICL on GSLM without warmup training, and (3) a support vector classifier (SVC). The random sampling method makes predictions by selecting labels at random from those present in the demonstrations. The SVC is trained on the provided demonstrations to make predictions. To assess reliability on a test set, we repeat the experiments five times and compute the mean accuracy alongside its standard deviation, offering a more balanced evaluation of the model’s performance.
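The two non-LM baselines can be sketched as below. The bag-of-units histogram used to featurize utterances for the SVC is an assumption for illustration, since the paper does not specify the SVC's input features.

```python
import random
import statistics
from collections import Counter
from sklearn.svm import SVC

# Sketch of the random-sampling and SVC baselines. The bag-of-units histogram
# used as the SVC feature is an assumption for illustration.
def unit_histogram(units, vocab_size=100):
    counts = Counter(units)
    return [counts[u] for u in range(vocab_size)]

def random_baseline(demo_labels):
    """Predict by sampling a label present in the demonstrations."""
    return random.choice(demo_labels)

def svc_baseline(demo_units, demo_labels, target_units):
    """Train an SVC on the few provided demonstrations, then classify the target."""
    clf = SVC().fit([unit_histogram(u) for u in demo_units], demo_labels)
    return clf.predict([unit_histogram(target_units)])[0]

def mean_and_std(accuracies):
    """Aggregate accuracy over repeated runs (five in the paper)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```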

5.    RESULTS

5.1.    Main Result

In Table 1, we provide a detailed performance comparison of our proposed ICL with warmup training (w/ Warmup), random guessing (Random), ICL without warmup training (w/o Warmup), and a support vector classifier (SVC). This performance comparison is given across two different task groups (Group 1 and Group 2) for both training tasks (T_train) and testing tasks (T_test).

The results show that ICL with warmup training consistently outperforms the Random and w/o Warmup baselines. In certain instances, it also outperforms SVC. In Group 1, when it comes to testing tasks (tasks that were not previously seen by GSLM), our approach demonstrates strong performance on the Arabic SC dataset, achieving a score of 40.9. This performance surpasses both the Random and w/o Warmup methods by a considerable margin. However, it falls short of the performance achieved by SVC, which boasts a score of 50.8. Likewise, on the ASVspoof dataset, our method achieves a commendable score of 51.9, slightly surpassing the Random approach and significantly outperforming the w/o Warmup method. Nevertheless, our method still lags behind SVC, which attains an impressive score of 89.7 in this context. Regarding the IEMOCAP dataset, our method excels by surpassing other baseline methods, including SVC. For the training tasks, the w/ Warmup method continues to outperform the other three methods. Particularly on Google SC v2 and Lithuanian SC, it achieves scores of 79.6 and 80.5, substantially higher than the other three methods.

Group 2 displays a consistent trend, with the w/ Warmup method consistently outperforming the other methods in various situations. For instance, when considering the Google SC v2 dataset during test tasks, the w/ Warmup method attains a substantial score of 48.0. This score significantly surpasses both the Random and w/o Warmup methods and slightly edges out SVC in terms of performance. In training tasks, on the Dysarthric Mandarin SC dataset, our method shows a performance of 56.0, substantially outperforming the other methods. These results underline the efficacy of our proposed warmup training method in ICL, as it consistently outperforms the w/o Warmup and Random baselines across a diverse range of speech classification tasks. This success opens up new paths for future studies and holds potential for more improvements in this field.

Table 2: Guessing rate comparison between different methods on seen and unseen tasks. The results demonstrate that warmup training significantly improves the model’s ability to make predictions based on the provided demonstrations.

Fig. 3: The figure illustrates, while predicting, how attention weights are distributed across different positions in each layer. Initially, the model primarily attends to the demonstrations within the first two layers. It then gradually shifts its focus to the target in layers 3 through 7, and finally directs attention towards labels in the demonstrations in the final layers.

5.2.    Guessing Rate Analysis

Warmup training primarily aims to equip the model with the ability to identify and compare demonstrations and to comprehend the target task. To investigate whether, during ICL, the model predicts the label based on the demonstrations, we study the guessing rate of the model. The guessing rate is the probability that the model predicts a label that is used in the demonstrations. A higher rate indicates that the model effectively performs ICL by making predictions derived from the provided demonstrations.
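Concretely, the metric can be computed as in the sketch below, where a prediction counts toward the guessing rate if it matches any label shown in the demonstrations; variable names are illustrative.

```python
# Sketch of the guessing-rate metric: the fraction of test examples whose
# prediction is one of the labels that appear in the demonstrations,
# regardless of whether it is the correct label for the target.
def guessing_rate(predictions, demo_label_sets):
    hits = sum(pred in labels for pred, labels in zip(predictions, demo_label_sets))
    return hits / len(predictions)

# e.g. guessing_rate(["yes", "up", "no"],
#                    [{"yes", "no"}, {"yes", "no"}, {"yes", "no"}]) -> 2/3,
# since "up" never appears as a demonstration label.
```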

Table 3: Accuracy comparison across different utterance lengths. An appropriate utterance length (e.g. 30-50 units) provides the best performance by balancing information and differentiation between examples.

Table 4: Accuracy comparison between prompt tuning and fine-tuning the whole model in warmup training. Fine-tuning the whole model causes instability for ICL.

Table 2 provides a detailed comparison of guessing rates between models with and without warmup training for two distinct groups, including both seen and unseen tasks. Without warmup training, the model shows a low guessing rate in both Group 1 and Group 2. However, with warmup training, the guessing rates dramatically increase to over 90%.

These results underscore that warmup training equips the model with superior capabilities for both seen and unseen tasks during the ICL stage, with guessing rates exceeding 90%. This showcases that the trained prompts in warmup training effectively guide the model to consider demonstrations and make predictions.

5.3.    Attention Map Analysis

We further examine the model’s behavior while predicting the label in ICL. The attention map in the model’s attention layers during the execution of ICL is depicted in Figure 3. The figure reveals that the initial two layers mainly concentrate on the demonstrations. The focus then shifts to the target utterance in the middle layers (3rd to 7th) and finally shifts to the demonstrations’ labels in the last layers (8th to 12th). Also, the model’s continued attention to the prompts in these scenarios highlights the utility of warmup training in ICL. From the studies in this section, we can see that the warmup training effectively steers the model to perform ICL for unseen tasks.
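One way to produce such a summary is to aggregate, per layer, the attention that the final (predicting) position assigns to each region of the input (prompts, demonstrations, labels, target). The sketch below assumes per-layer attention tensors of shape (heads, seq_len, seq_len) and user-defined region boundaries; it is an illustrative aggregation, not the paper's exact analysis code.

```python
import torch

# Sketch of the aggregation behind an attention-map analysis: for each layer,
# sum the attention mass that the last (predicting) position places on each
# region. Shapes and region boundaries are assumptions.
def attention_by_region(attn_maps, regions):
    """attn_maps: list of (num_heads, seq_len, seq_len) tensors, one per layer.
    regions: dict mapping region name -> list of token positions."""
    summary = []
    for layer_attn in attn_maps:
        last_row = layer_attn.mean(dim=0)[-1]  # head-averaged attention of the last position
        summary.append({name: last_row[pos].sum().item()
                        for name, pos in regions.items()})
    return summary
```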

5.4.    Utterance Length Analysis

As shown in Table 3, the model’s performance is influenced by the number of units in each utterance. If the length of the utterance is too long, although it might hold sufficient information, the GSLM could struggle with modeling such long sequences as reported in [32]. On the other hand, if utterances are too short, the model may lack the necessary information, leading to random guessing.

5.5.    Fine-tuning Analysis

As illustrated in Table 4, fine-tuning shows unstable performance in ICL. This could be due to the relatively small size of GSLM, making it prone to overfitting on the training datasets. As a result, we have chosen prompt tuning as our primary method because of its more reliable performance.

6.    CONCLUSION AND FUTURE WORKS

This paper presents the first successful application of in-context learning (ICL) to a speech LM. We initially investigated the limitations of the current speech LM in performing ICL. With the proposed warmup training, the generative spoken language model (GSLM) demonstrates the ability to perform ICL on seen tasks and successfully achieves non-trivial results on unseen tasks. This paper does not aim to achieve competitive performance with ICL on speech LM but to show its feasibility. We are also aware that the capacity of the backbone GSLM is restricted: GPT-3 contains 175B parameters, while GSLM has only about 150M parameters, which might limit the ICL performance. Future work includes investigating ICL on more diverse speech LMs and developing more effective warmup strategies for ICL.

7.    REFERENCES

  • [1]    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [2]    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
  • [3]    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [4]    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui, “A survey on in-context learning,” arXiv preprint arXiv:2301.00234, 2022.
  • [5]    Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu, “Black-box tuning for language-model-as-a-service,” in International Conference on Machine Learning. PMLR, 2022, pp. 20841–20855.
  • [6]    OpenAI,  “Introducing ChatGPT,” 2022, https://openai.com/blog/chatgpt.
  • [7]    OpenAI, “Gpt-4 technical report,” 2023.
  • [8]    Kushal Lakhotia et al., “On generative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021.
  • [9]    Eugene Kharitonov et al., “Text-free prosody-aware generative spoken language modeling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8666–8681.
  • [10]    Zalán Borsos et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [11]    Michael Hassid et al., “Textually pretrained speech language models,” arXiv preprint arXiv:2305.13009, 2023.
  • [12]    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
  • [13]    Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi, “Metaicl: Learning to learn in context,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2791–2809.
  • [14]    Mingda Chen, Jingfei Du, Ramakanth Pasunuru, Todor Mihaylov, Srini Iyer, Veselin Stoyanov, and Zornitsa Kozareva, “Improving in-context few-shot learning via self-supervised training,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 3558–3573.
  • [15]    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang, “Pre-training to learn in context,” in ACL (1). 2023, pp. 4849–4870, Association for Computational Linguistics.
  • [16]   Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, and Mark Hasegawa-Johnson, “Wavprompt: Towards few-shot spoken language understanding with frozen language models,” in INTERSPEECH. 2022, pp. 2738–2742, ISCA.
  • [17]   Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, pp. 9, 2019.
  • [18]   Siyin Wang, Chao-Han Huck Yang, Ji Wu, and Chao Zhang, “Can whisper perform speech-based in-context learning,” arXiv preprint arXiv:2309.07081, 2023.
  • [19]    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
  • [20]   Abdelrahman Mohamed et al., “Self-supervised speech representation learning: A review,” IEEE Journal of Selected Topics in Signal Processing, 2022.
  • [21]    Aaron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [22]    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, 2020.
  • [23]    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [24]    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu, “W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250.
  • [25]   Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
  • [26]    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” 2022.
  • [27]    Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, and Alexander Richard, “AudioDec: An open-source streaming high-fidelity neural audio codec,” in ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  • [28]    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022.
  • [29]    Taylor Webb, Keith J Holyoak, and Hongjing Lu, “Emergent analogical reasoning in large language models,” arXiv preprint arXiv:2212.09196, 2022.
  • [30]    Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei, “Neural codec language models are zero-shot text to speech synthesizers,” 2023.
  • [31]    Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei, “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” 2023.
  • [32]    Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, and Hung-yi Lee, “An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks,” in Proc. Interspeech 2022, 2022, pp. 5005–5009.
  • [33]    Haibin Wu, Kai-Wei Chang, Yuan-Kuei Wu, and Hung-yi Lee, “Speechgen: Unlocking the generative power of speech language models with prompts,” 2023.
  • [34]    Kai-Wei Chang, Yu-Kai Wang, Hua Shen, Iu-thing Kang, Wei-Cheng Tseng, Shang-Wen Li, and Hung-yi Lee, “SpeechPrompt v2: Prompt tuning for speech classification tasks,” 2023.
  • [35]   Gamaleldin F. Elsayed, Ian J. Goodfellow, and Jascha Sohl-Dickstein, “Adversarial reprogramming of neural networks,” in ICLR (Poster). 2019, OpenReview.net.
  • [36]    Pin-Yu Chen, “Model reprogramming: Resource-efficient cross-domain machine learning,” CoRR, vol. abs/2202.10629, 2022.
  • [37]    Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, and Yu Tsao, “Neural model reprogramming with similarity based mapping for low-resource spoken command classification,” arXiv preprint arXiv:2110.03894, 2021.
  • [38]    Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A Kiani, David Gomez-Cabrero, and Jesper N Tegner, “A parameter-efficient learning approach to arabic dialect identification with pre-trained general-purpose speech model,” arXiv preprint arXiv:2305.11244, 2023.
  • [39]    Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N Sainath, and Trevor Strohman, “From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition,” arXiv preprint arXiv:2301.07851, 2023.
  • [40]    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?,” 2022.
  • [41]   Lina Tarek Benamer and Osama AS Alkishriwo, “Database for arabic speech commands recognition,” in CEST, 2020.
  • [42]    Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi H Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, and Kong Aik Lee, “Asvspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech,” IEEE Trans. Biom. Behav. Identity Sci., vol. 3, no. 2, pp. 252–265, 2021.
  • [43]   Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008.
  • [44]    Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” 2018.
  • [45]    Aliaksei Kolesau and Dmitrij Šešok, “Unsupervised pre-training for voice activation,” Applied Sciences, vol. 10, no. 23, pp. 8643, 2020.
  • [46]   Yu-Yi Lin, Wei-Zhong Zheng, Wei Chung Chu, Ji-Yan Han, Ying-Hsiu Hung, Guan-Min Ho, Chia-Yuan Chang, and Ying-Hui Lai, “A speech command control-based recognition system for dysarthric patients based on deep learning technology,” Applied Sciences, 2021.
  • [47]    Ken MacLean, “Voxforge,” Available: http://www.voxforge.org/home.
  • [48]    Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria, “Towards multimodal sarcasm detection (an _obviously_ perfect paper),” in Proceedings of the 57th Conference of ACL, 2019, pp. 4619–4629.