Main Feature Extraction Technology

High-resolution frequency extraction improves the pre-processing step, which in turn can improve the overall performance and efficiency of a speech-related artificial intelligence system.

Deep learning-based speech recognition technology using a high-resolution spectrogram

Given a voice signal, a high-resolution spectrogram is extracted. Speech recognition is then performed on finer time units of the spectrogram by extracting either 1) the probability distribution of phonemes in each time unit or 2) phoneme time intervals, each of which consists of several time units. Because fewer time units contain more than one phoneme, the accuracy of speech recognition can increase.
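The two outputs described above are related: per-unit phoneme distributions can be collapsed into phoneme time intervals. The following is a minimal sketch of that step, not the recognizer itself. The phoneme inventory, the probabilities, the 10 ms time unit, and the function name `frames_to_intervals` are all hypothetical values chosen for illustration.

```python
from itertools import groupby


def frames_to_intervals(frame_probs, phonemes, frame_ms):
    """Collapse per-time-unit phoneme distributions into
    (phoneme, start_ms, end_ms) intervals."""
    # Pick the most probable phoneme in each time unit.
    best = [phonemes[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]

    intervals = []
    start = 0
    # Merge runs of consecutive time units that share the same phoneme.
    for label, run in groupby(best):
        length = len(list(run))
        intervals.append((label, start * frame_ms, (start + length) * frame_ms))
        start += length
    return intervals


# Hypothetical inventory and made-up probabilities, one row per 10 ms unit.
PHONEMES = ["r", "i", "k"]
FRAME_PROBS = [
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.15, 0.80],
]
INTERVALS = frames_to_intervals(FRAME_PROBS, PHONEMES, 10)
print(INTERVALS)
```

With a finer time unit, each row covers less audio, so a single row is less likely to have its probability mass split between two phonemes.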

The image below is an example that shows the probability distributions of phonemes in a unit of time. If the unit of time is large, there may be more than one phoneme in the time unit with a high probability of occurrence. The smaller the unit of time, the smaller the probability of having two phonemes in one unit of time.

[Figure: phoneme probability distributions per unit of time. Panels: Low-Resolution Spectrogram, High-Resolution Spectrogram]
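The effect of the time unit can be checked with simple arithmetic. The sketch below counts how many time units contain a phoneme boundary, and therefore more than one phoneme, for a coarse and a fine unit. The boundary positions, the utterance length, and the unit sizes are hypothetical numbers, not measurements.

```python
def mixed_unit_fraction(boundaries_ms, total_ms, unit_ms):
    """Fraction of time units that contain a phoneme boundary strictly
    inside them, i.e. units in which two phonemes co-occur."""
    n_units = total_ms // unit_ms
    mixed = 0
    for i in range(n_units):
        lo, hi = i * unit_ms, (i + 1) * unit_ms
        if any(lo < b < hi for b in boundaries_ms):
            mixed += 1
    return mixed / n_units


# Hypothetical phoneme boundaries (ms) in a 400 ms utterance.
BOUNDARIES_MS = [57, 131, 188, 263, 338]

COARSE = mixed_unit_fraction(BOUNDARIES_MS, 400, 50)  # 50 ms time unit
FINE = mixed_unit_fraction(BOUNDARIES_MS, 400, 5)     # 5 ms time unit
print(COARSE, FINE)
```

In this toy setup, 5 of the 8 coarse units contain two phonemes, against 5 of the 80 fine units.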

A large unit of time resembles trying to recognize a word whose syllables are spoken simultaneously by different people. For example, take the word "recognition" and four people who each pronounce one of the syllables "re", "cog", "ni", and "tion". If they speak in sequence, the result sounds like "recognition". If they speak simultaneously, the word becomes difficult to make out. Using a high-resolution spectrogram reduces this kind of overlap, so the accuracy of speech recognition can be improved.

Deep learning-based speaker recognition technology using a high-resolution spectrogram

A high-resolution spectrogram allows precise extraction of voice pitch (which distinguishes one human voice from another), syllable duration, and changes in voice tone. Performing speaker recognition with this precisely extracted information can therefore be expected to improve its accuracy.

The pitch of a voice is determined by the fundamental frequency of the voice. As shown in the figure below, the fundamental frequency of a voice is generally higher for women than for men and higher for children than for adults.

Using a high-resolution spectrogram, the fundamental frequency and tone changes of a voice can be extracted precisely. The advantage over the conventional method is especially evident in a noisy environment.
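For orientation, the sketch below estimates a fundamental frequency with plain autocorrelation, a common conventional baseline. It is not the DJ transform method covered by the patents listed below, and it does not reproduce the high-resolution spectrogram. The synthetic tones, the 8 kHz sample rate, the 60 to 500 Hz search range, and the function names are assumptions made for illustration.

```python
import math


def estimate_f0(samples, sample_rate, f_min=60.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) as the lag with the
    largest raw autocorrelation inside the allowed pitch range."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Raw (unnormalized) autocorrelation; longer lags sum fewer terms,
        # which favors the shortest period over its multiples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def harmonic_tone(f0, sample_rate, duration_sec):
    """Synthetic voiced-like tone: fundamental plus two weaker harmonics."""
    n = int(sample_rate * duration_sec)
    return [
        sum(math.sin(2 * math.pi * k * f0 * t / sample_rate) / k
            for k in (1, 2, 3))
        for t in range(n)
    ]


SAMPLE_RATE = 8000
LOW = estimate_f0(harmonic_tone(120.0, SAMPLE_RATE, 0.1), SAMPLE_RATE)
HIGH = estimate_f0(harmonic_tone(220.0, SAMPLE_RATE, 0.1), SAMPLE_RATE)
print(round(LOW, 1), round(HIGH, 1))
```

Because the lag is an integer number of samples, this baseline can only return frequencies of the form sample_rate / lag, so its estimates land near, not exactly on, 120 Hz and 220 Hz. That quantization is one reason finer frequency resolution matters for pitch extraction.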

Algorithm & Application patent

Jurisdiction | Patent title | Date | Application (registration) number
KR | Frequency extraction method by DJ transformation | 2019.01.11 | 10-2019-0003620
PCT | Frequency extraction method by DJ transformation | 2019.11.26 | PCT/KR2019/016347
KR | Fundamental frequency extraction method based on DJ transformation | 2020.10.05 | 10-2164306
KR | Method of extracting pure tones constituting compound tones | 2020.07.21 | 10-2020-0089961
PCT | Fundamental frequency extraction method based on DJ transformation | 2020.11.12 | PCT/KR2020/015910
PCT | Method of extracting pure tones constituting compound tones | 2021.02.10 | PCT/KR2021/001807
US | Frequency extraction method using DJ transform | 2021.02.12 | US17/268,444
US | Fundamental frequency extraction method using DJ transform | 2021.04.23 | US17/288,459