InstructSpeech: Following Speech Editing Instructions via Large Language Models

Anonymous Authors

Abstract. Instruction-guided speech editing aims to follow a user’s natural language instruction to manipulate the semantic and acoustic attributes of speech. In this work, we construct triplet paired data (instruction, input speech, output speech) to alleviate data scarcity and train a multi-task large language model named InstructSpeech. To address the challenge of accurately executing the user’s instructions, we 1) introduce learned task embeddings with a fine-tuned Flan-T5-XL to guide the generation process towards the correct generative task; 2) include an extensive and diverse set of speech editing and speech processing tasks to enhance model capabilities; 3) investigate chain-of-thought reasoning for free-form semantic content editing; and 4) propose a hierarchical adapter that updates only a small portion of parameters for generalization to new tasks. To assess instruction-guided speech editing in greater depth, we introduce a benchmark evaluation with contrastive instruction-speech pretraining (CISP) to test speech quality and the faithfulness of instruction-speech alignment. Experimental results demonstrate that InstructSpeech achieves state-of-the-art results in eleven tasks, for the first time unlocking the ability to edit the acoustic and semantic attributes of speech following a user’s instruction.
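To make the data layout and the alignment idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `EditTriplet` class and its field names are hypothetical, and scoring instruction-speech alignment by cosine similarity between two embedding vectors is an assumption borrowed from common contrastive pretraining setups. This page does not specify how CISP computes its score.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical container for one (instruction, input speech, output speech) example."""
    instruction: str
    input_speech: Sequence[float]   # placeholder for a waveform or acoustic tokens
    output_speech: Sequence[float]  # placeholder for the edited speech


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example: the embeddings below are made-up numbers, not model outputs.
triplet = EditTriplet(
    instruction="Speak faster.",
    input_speech=[0.0, 0.1, 0.2],
    output_speech=[0.0, 0.2, 0.1],
)
instruction_embedding = [0.9, 0.1]
matching_speech_embedding = [0.8, 0.2]
mismatched_speech_embedding = [0.1, 0.9]

aligned_score = cosine_similarity(instruction_embedding, matching_speech_embedding)
mismatched_score = cosine_similarity(instruction_embedding, mismatched_speech_embedding)
```

Under this assumption, an edited utterance that follows the instruction should receive a higher score than one that does not, which is the kind of ranking an instruction-speech alignment metric is meant to capture.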

Acoustic attribute editing - Style

In this section, we provide generated audio samples for style editing, alongside other systems, on the acoustic attribute editing task.

Instruction | Before Edit | Prompt | YourTTS Result | Result (ours)

Acoustic attribute editing - Energy

In this section, we provide generated audio samples for energy editing, alongside other systems, on the acoustic attribute editing task.

Instruction | Transcription | Before Edit | Result (ours)

Acoustic attribute editing - Speed

In this section, we provide generated audio samples for speed editing, alongside other systems, on the acoustic attribute editing task.

Instruction | Transcription | Before Edit | Base Result (ours) | Medium Result (ours) | Target Result (ours)

Acoustic attribute editing - Emotion

In this section, we provide generated audio samples for emotion editing, alongside other systems, on the acoustic attribute editing task.

Instruction | Before Edit | Prompt | YourTTS Result | Base Result (ours) | Medium Result (ours) | Target Result (ours)

Region-based semantic editing: Add

In this section, we provide generated audio samples for the add task in region-based semantic editing.

Input Transcription | Target Transcription | Before Edit | Speechedit Result | A3T Result | Base Result (ours) | Medium Result (ours) | Large Result (ours)

Region-based semantic editing: Delete

In this section, we provide generated audio samples for the delete task in region-based semantic editing.

Input Transcription | Target Transcription | Before Edit | Speechedit Result | A3T Result | Base Result (ours) | Medium Result (ours) | Large Result (ours)

Region-based semantic editing: Edit

In this section, we provide generated audio samples for the edit task in region-based semantic editing.

Input Transcription | Target Transcription | Before Edit | Speechedit Result | A3T Result | Base Result (ours) | Medium Result (ours) | Large Result (ours)

Free-form semantic editing: Add

In this section, we provide generated audio samples for the add task in free-form semantic editing.

Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result | A3T Result | Base Result (ours) | Medium Result (ours) | Large Result (ours)

Free-form semantic editing: Delete

In this section, we provide generated audio samples for the delete task in free-form semantic editing.

Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result | A3T Result | Base Result (ours) | Medium Result (ours) | Large Result (ours)

Free-form semantic editing: Replace

In this section, we provide generated audio samples for the replace task in free-form semantic editing.

Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result | A3T Result | Base Result (ours) | Medium Result (ours) | Large Result (ours)

Multi-turn editing

In this section, we provide generated audio samples from multi-turn editing.

Before Edit | Emotion | Style | Speed | Energy | Semantic