"By training the model on a large-scale collection of online videos, we are able to capture correlations between speech and visual signals, such as mouth movements and facial expressions, which can then be used to separate the speech of one person in a video from another, or to separate speech from background sounds. This technology not only achieves state-of-the-art results in speech separation and enhancement (a noticeable 1.5dB improvement over audio-only models), but in particular, can improve the results over audio-only processing when there are multiple people speaking, as the visual cues in the video help determine who is saying what."
Speech enhancement to roll out on a wider scale?
As things stand, you need 10,000 subscribers to access YouTube Stories creation, but that might change in the future. Plus, once it is fully developed, speech enhancement might well roll out on a wider scale. It's a case of 'watch this space', or maybe 'listen to it' anyway.