On September 15, the 2020 Tencent AI Lab Rhino-Bird Focused Research Program Conference took place. Associate Professor Zheng Hai-Tao of the Tsinghua SIGS Division of Information Science and Technology received the Innovation in Technology Award for his project “Incorporating Multi-Type and Multi-Structure Knowledge into Pretrained Language Model.” SIGS Associate Research Fellow Wu Zhiyong’s project, “Fine-grained duration and pitch modeling for high-quality singing voice synthesis,” was given the Outstanding Project Award.
"Incorporating Multi-Type and Multi-Structure Knowledge into Pretrained Language Model" is a joint project between Tsinghua University and Tencent AI Lab that explores, from several perspectives, how pretrained language models understand and generate complex language knowledge. Its results have been presented at top natural language processing and artificial intelligence conferences, including ACL, EMNLP, and AAAI. The project is innovative in several ways.
First, the project integrates lexical and syntactic linguistic knowledge and improves the language model's adversarial robustness and semantic sensitivity through contrastive learning. Specifically, the proposed method, Contrastive LearnIng with semantic Negative Examples (CLINE), constructs semantic negative examples in an unsupervised manner to improve robustness against semantic adversarial attacks. By comparing examples with similar and opposite semantics, the model can effectively perceive the semantic changes caused by small perturbations.
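The idea of comparing a sentence with a semantically similar variant and a semantically opposite variant can be illustrated with a small contrastive loss. The sketch below is not the CLINE implementation: the function names, the use of cosine similarity, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negative, temperature=0.1):
    """InfoNCE-style loss with one positive and one negative (hypothetical form).

    The loss is small when the anchor is closer to the semantically similar
    example than to the semantically opposite one.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    return -math.log(pos / (pos + neg))

# Toy embeddings (assumed values): the anchor sentence, a variant with
# similar meaning, and a variant whose meaning was flipped by a small edit.
anchor = [1.0, 0.0]
similar = [0.9, 0.1]
opposite = [0.0, 1.0]

well_separated = contrastive_loss(anchor, similar, opposite)
confused = contrastive_loss(anchor, opposite, similar)
print(well_separated < confused)
```

Minimizing such a loss pushes the representation of a sentence toward its meaning-preserving variant and away from its meaning-changing variant, which is the behavior the paragraph above describes.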
Second, the project team designed a variational generative clustering algorithm for latent semantic learning and topic mining from reader comments. Mining the topic knowledge in reader comments improves the understanding and generation ability of the language model for complex concepts.
Third, the project improves the model's fine-grained semantic recognition ability by automatically constructing grayscale data. Specifically, the method employs off-the-shelf response retrieval models and response generation models as automatic grayscale data generators. Trained on the constructed grayscale data with multi-level ranking objectives, a matching model learns to capture more fine-grained differences in context-response relevance, and the train-test discrepancy in distractor strength is reduced.
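A multi-level ranking objective of this kind can be sketched with pairwise hinge losses. The example below assumes a three-level ordering (ground-truth response above grayscale response above random negative), a shared margin, and hypothetical function names; the article does not specify the exact form used in the project.

```python
def hinge(higher, lower, margin):
    """Penalty when `higher` does not exceed `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))

def multi_level_ranking_loss(score_true, score_gray, score_random, margin=0.2):
    """Sum of pairwise hinge losses over an assumed three-level ordering:
    ground truth > grayscale > random negative."""
    return (
        hinge(score_true, score_gray, margin)
        + hinge(score_gray, score_random, margin)
        + hinge(score_true, score_random, margin)
    )

# Matching scores (assumed values) for one context and three candidate responses.
print(multi_level_ranking_loss(0.9, 0.5, 0.1))  # correctly ordered candidates
print(multi_level_ranking_loss(0.5, 0.9, 0.1))  # grayscale scored above ground truth
```

With only ground-truth and random responses, a matching model sees easy distractors during training; the intermediate grayscale level gives it harder distractors to rank, which is how the train-test discrepancy in distractor strength is reduced.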
Tsinghua University Knowledge Engineering Laboratory (THUKE), founded by Associate Professor Zheng Hai-Tao, has been working with Tencent AI Lab for many years and has achieved many research outcomes. In the project "Incorporating Multi-Type and Multi-Structure Knowledge into Pretrained Language Model," THUKE and Tencent AI Lab jointly trained 2 interns, published 4 papers at top natural language processing conferences, and applied for 2 invention patents. The project outcomes have a variety of applications, such as malicious text detection, auxiliary writing systems, and chatbots.
"Incorporating Multi-Type and Multi-Structure Knowledge into Pretrained Language Model" project framework
In the project “Fine-grained duration and pitch modeling for high-quality singing voice synthesis,” Associate Research Fellow Wu Zhiyong’s team proposes an end-to-end neural network-based method that incorporates information on context, phonemes, and the singer to accurately model the duration and pitch of a song. Compared with speech synthesis, the pitch and duration of each phoneme vary in more complex ways in singing voice synthesis.
To produce intelligible, on-key, and expressive singing voices, the project proposes an accurate duration and pitch modeling method based on the FastSpeech speech synthesis model. First, the team designed a novel duration prediction module that takes the speaker’s characteristics into account, replacing the original duration predictor in FastSpeech. Second, to improve the accuracy and controllability of the pitch, the project proposes a new pitch prediction module that predicts the f0 (or log f0) and the voiced/unvoiced (V/UV) flags simultaneously. Third, instead of a Mel-spectrogram, the model predicts the Mel-generalized cepstrum and the band aperiodicity as its outputs and uses the WORLD vocoder to generate the final singing voices. Experimental results indicate that the duration model's average prediction error per phoneme on the test set is less than 14 frames. The team also tested various designs of the pitch control method and conducted comparative experiments to validate the effectiveness of pitch prediction when ornaments (e.g. appoggiatura, slide, trill) are involved.
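Two of the quantities mentioned above can be made concrete with a short sketch: turning predicted log f0 values and voiced/unvoiced flags into a pitch contour, and measuring duration error in frames. The function names, the 0.5 threshold, and the convention of marking unvoiced frames with an f0 of 0 are assumptions for illustration and are not taken from the project.

```python
import math

def f0_contour(log_f0, vuv_prob, threshold=0.5):
    """Combine per-frame log-f0 predictions with voiced/unvoiced probabilities.

    Frames whose voicing probability is below `threshold` are treated as
    unvoiced and given an f0 of 0.0 (an assumed convention).
    """
    return [
        math.exp(lf) if p >= threshold else 0.0
        for lf, p in zip(log_f0, vuv_prob)
    ]

def mean_abs_frame_error(predicted, reference):
    """Average absolute difference between predicted and reference phoneme
    durations, both given in frames."""
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(reference)

# Assumed toy values: two frames of pitch and three phoneme durations.
contour = f0_contour([math.log(220.0), math.log(440.0)], [0.9, 0.1])
error = mean_abs_frame_error([10, 20, 30], [12, 18, 30])
print(contour, error)
```

Predicting the voicing flag separately from the pitch value lets the model keep a smooth f0 estimate while still marking where no pitch should be produced, for example during unvoiced consonants.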
Compared with existing methods, the project improves duration and pitch modeling and increases the controllability of singing voice synthesis, producing a more natural and fluent synthesized singing voice overall.
"Fine-grained duration and pitch modeling for high-quality singing voice synthesis" project framework
Associate Research Fellow Wu Zhiyong founded the Human-Computer Speech Interaction Lab at Tsinghua University (THUHCSI). The lab’s team has collaborated with the Tencent AI Lab for many years and has achieved considerable research success. Their project “Research on controllable end-to-end expressive visual speech synthesis technology” received the Distinguished Project Award at the 2019 Tencent AI Lab Rhino-Bird Focused Research Program.
Source: Division of Information Science and Technology
Editor: A.S.