Summary

This dissertation has described the principal applications of multi-microphone arrays in speech technology, in particular in speaker recognition and in the diarization of recordings. Its main goal was to extend the diarization systems described earlier in the literature [58, 78], which simultaneously exploit spectral-feature information and speaker-position information (MFCC-TDOA). As in those works, modeling based on Gaussian mixtures was employed. The novelty is the dynamic adjustment of the weighting between the individual information streams according to the acoustic conditions. The results presented in the dissertation show that the proportion of incorrectly labeled frames (the diarization error rate, DER) drops by up to 30% relative to a system with a fixed weighting between the information streams. The author has demonstrated that in difficult acoustic conditions it is more advantageous to rely to a greater extent on the spectral features (MFCC); as the SNR improves, the position-related features (TDOA) can be given progressively more weight.
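
For reference, DER is used here in the standard NIST Rich Transcription sense [26]: the fraction of the scored time that is missed, falsely detected as speech, or attributed to the wrong speaker,

$$\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ error}}}{T_{\mathrm{scored\ speech}}}.$$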

Thanks to the fusion of spectral and speaker-position information with a dynamically varying weighting, proposed by the author, the results obtained with the previously described MFCC-TDOA algorithm were improved considerably. The author's algorithm was also shown to be more robust to interference than classical solutions. These results are of particular interest given the ever wider use of multi-microphone arrays in mobile devices, which by their nature do not operate under stationary noise conditions. Importantly, the presented results come from experiments carried out in acoustically untreated rooms, very similar to those in which the proposed algorithms are ultimately intended to operate. The use of a multi-microphone array, not previously studied in this context, is therefore fully justified, above all from a deployment perspective: given the low cost of acoustic sensors, increasing their number in electronic devices is economically viable.
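
As an illustration only, the following Python sketch shows one possible shape of such an SNR-driven stream fusion. The logistic mapping from SNR to the TDOA-stream weight, and all names and parameter values, are hypothetical choices of mine rather than the exact scheme developed in the dissertation; the weighted log-likelihood combination itself follows the MFCC-TDOA fusion literature [5, 79].

```python
import numpy as np

def tdoa_weight(snr_db, midpoint_db=15.0, slope=0.3):
    """Map an SNR estimate (dB) to the TDOA-stream weight in [0, 1].

    Hypothetical logistic mapping: in poor SNR almost all weight goes to
    the MFCC stream; as conditions improve, the TDOA stream gains weight.
    """
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))

def fused_log_likelihood(ll_mfcc, ll_tdoa, snr_db):
    """Frame-wise weighted sum of the per-stream GMM log-likelihoods.

    ll_mfcc, ll_tdoa: arrays of shape (n_frames, n_clusters) with the
    log-likelihood of each frame under each speaker-cluster GMM.
    """
    w = tdoa_weight(snr_db)  # weight of the TDOA stream
    return (1.0 - w) * ll_mfcc + w * ll_tdoa

# Toy usage: the same frames are assigned leaning on MFCC at 5 dB SNR
# and leaning on TDOA at 25 dB SNR.
rng = np.random.default_rng(0)
ll_mfcc = rng.standard_normal((100, 4))  # placeholder log-likelihoods
ll_tdoa = rng.standard_normal((100, 4))
labels_noisy = fused_log_likelihood(ll_mfcc, ll_tdoa, 5.0).argmax(axis=1)
labels_clean = fused_log_likelihood(ll_mfcc, ll_tdoa, 25.0).argmax(axis=1)
```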

Falling prices of microphones manufactured in MEMS technology mean that mobile devices are being equipped with ever more microphones. According to a 2014 report by IHS [12], total sales of such sensors rose from 1.9 billion units in 2013 to 2.6 billion units in 2014, and the report forecasts that annual microphone sales will reach 5.4 billion units by 2017. This trend suggests that algorithmic solutions built on multi-microphone systems will find increasingly wide application in the coming years.

The presented solution will be developed further, both algorithmically and toward deployment, resulting in a demonstration prototype. The decision-fusion methods will certainly be refined further so that the algorithm responds even more effectively to changing acoustic conditions. The influence of the number of microphones in the array on system performance will be examined more thoroughly, and other microphone-placement topologies, including random placement, will be investigated as part of this work. Methods other than GMMs will be tested for modeling speaker positions. Further gains in performance may be obtained by using other speaker-recognition algorithms in the fusion (e.g. ones based on i-vectors) and other decision-fusion methods (e.g. ones based on deep neural networks).

The results presented in the dissertation show that in difficult acoustic conditions such a solution can reduce the DER by more than 10%. Such a good result will certainly motivate further work on applying more advanced beamforming algorithms (e.g. LCMV, GSC) as the input stage of the whole system.
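
For orientation, the sketch below shows the simplest member of this family, a delay-and-sum beamformer, of the kind used as a diarization front end in [6]; in practice the steering delays would come from TDOA estimates. The implementation details (frequency-domain steering, the helper names, the toy signal) are illustrative assumptions of mine, not the dissertation's code; the LCMV/GSC designs mentioned above would replace the uniform averaging with constrained adaptive weights.

```python
import numpy as np

def delay_and_sum(signals, advances, fs):
    """Minimal delay-and-sum beamformer with frequency-domain steering.

    signals:  array (n_mics, n_samples) of microphone signals.
    advances: per-microphone time advances in seconds used to re-align
              the signals; a microphone that receives the wavefront
              D seconds late gets advance D (e.g. from TDOA estimates).
    fs:       sampling rate in Hz.
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # Shift theorem: multiplying by exp(+j*2*pi*f*tau) advances by tau.
    steering = np.exp(2j * np.pi * freqs[None, :]
                      * np.asarray(advances)[:, None])
    return np.fft.irfft((spectra * steering).mean(axis=0), n=n_samples)

# Toy usage: two microphones, the second receiving the source 1 ms late.
fs = 16000
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440 * t)
mics = np.stack([source, np.roll(source, int(0.001 * fs))])
enhanced = delay_and_sum(mics, [0.0, 0.001], fs)
```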

It is also planned to extend the algorithm so that the training phase no longer requires marking the moments at which individual speakers stop talking. Such a solution will require unsupervised machine-learning techniques.

Bibliography

[1] V. Agrawal and Y. Lo. Distribution of sidelobe level in random arrays. Proceedings of the IEEE, 57(10):1764–1765, Oct 1969.

[2] J. Ajmera, G. Lathoud, and I. McCowan. Clustering and segmenting speakers and their locations in meetings. Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, pages I–605–8 vol. 1, 2004.

[3] H. Altinçay and M. Demirekler. Speaker identification by combining multiple classifiers using Dempster-Shafer theory of evidence. Speech Communication, 41(4):531–547, 2003.

[4] V. Alvarado and H. Silverman. Experimental results showing the effects of optimal spacing between elements of a linear microphone array. In Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on, pages 837–840 vol. 2, Apr 1990.

[5] X. Anguera, C. Wooters, J. Pardo, and J. Hernando. Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV–241–IV–244, April 2007.

[6] X. Anguera, C. Wooters, and J. Hernando. Acoustic beamforming for speaker diarization of meetings. Audio, Speech, and Language Processing, IEEE Transactions on, 15(7):2011–2022, Sept 2007.

[7] J. Benesty, J. Chen, and Y. Huang. Microphone Array Signal Processing. Springer Topics in Signal Processing. Springer Berlin Heidelberg, 2008.

[8] L. Besacier, J. Bonastre, and C. Fredouille. Localization and selection of speaker-specific information with statistical modeling. Speech Communication, 31(2–3):89–106, 2000.

[9] L. Besacier and J.-F. Bonastre. Subband architecture for automatic speaker recognition. Signal Processing, 80(7):1245–1259, 2000.

[10] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[11] J. Bitzer, K. Simmer, and K. D. Kammeyer. Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, volume 5, pages 2965–2968 vol. 5, 1999.

[12] M. Boustany. MEMS Microphones Report - 2014. IHS Technology, 2014.

[13] C. S. Burrus and T. W. Parks. DFT/FFT and Convolution Algorithms: Theory and Implementation. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1991.

[14] G. C. Carter, A. H. Nuttall, and P. Cable. The smoothed coherence transform. Proceedings of the IEEE, 61(10):1497–1498, Oct 1973.

[15] B. Champagne, S. Bedard, and A. Stephenne. Performance of time-delay estimation in the presence of room reverberation. Speech and Audio Processing, IEEE Transactions on, 4(2):148–152, 1996.

[16] C. Che, Q. Lin, J. Pearson, B. de Vries, and J. L. Flanagan. Microphone arrays and neural networks for robust speech recognition. Human Language Technologies, 1994.

[17] K. Chen, L. Wang, and H. Chi. Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification. IJPRAI, 11(3):417–445, 1997.

[18] J. J. Christensen and J. Hald. Beamforming. Brüel & Kjær Technical Review, (1):1–48, 2004.

[19] T. Conte and A. Wolfe. Noise cancellation for phone conversation, Sept. 2 2014. US Patent 8,824,666.

[20] R. I. Damper and J. E. Higgins. Improving speaker identification in noise by subband processing and decision fusion. Pattern Recognition Letters, 24(13):2167–2173, 2003.

[21] A. Davis, S. Nordholm, and R. Togneri. Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold. Audio, Speech, and Language Processing, IEEE Transactions on, 14(2):412–424, March 2006.

[22] J. R. Deller, Jr., J. G. Proakis, and J. H. Hansen. Discrete Time Processing of Speech Signals. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 1993.

[23] G. DeMuth. Frequency domain beamforming techniques. Proc. IEEE Int. Conference on Acoustics, Speech, Signal Processing, 2:713–715, 1977.

[24] J. DiBiase, H. Silverman, and M. Brandstein. Robust localization in reverberant rooms. In M. Brandstein and D. Ward, editors, Microphone Arrays: Signal Processing Techniques and Applications, chapter 8. Springer, 2001.

[25] H. Do, H. Silverman, and Y. Yu. A real-time SRP-PHAT source location implementation using stochastic region contraction (SRC) on a large-aperture microphone array. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 1, pages I–121–I–124, April 2007.

[26] J. G. Fiscus, J. Ajot, and J. S. Garofolo. The rich transcription 2007 meeting recognition evaluation. In Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8-11, 2007, Revised Selected Papers, pages 373–389, 2007.

[27] G. Friedland, H. Hung, and C. Yeo. Multi-modal speaker diarization of real-world meetings using compressed-domain video features. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4069–4072, April 2009.

[28] O. L. Frost, III. An algorithm for linearly constrained adaptive array processing. Proceedings of the IEEE, 60(8):926–935, Aug 1972.

[29] S. Gannot and I. Cohen. Speech enhancement based on the general transfer function GSC and postfiltering. Speech and Audio Processing, IEEE Transactions on, 12(6):561–571, Nov 2004.

[30] J. Gałka, M. Grzywacz, and R. Samborski. Playback attack detection for text-dependent speaker verification over telephone channels. Speech Communication, 67:143–153, 2015.

[31] S. Gelfand. Hearing: An Introduction to Psychological and Physiological Acoustics. Marcel Dekker, 2004.

[32] A. Gilloire and M. Vetterli. Adaptive filtering in sub-bands. In Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on, pages 1572–1575 vol. 3, Apr 1988.

[33] A. Gilloire and M. Vetterli. Adaptive filtering in subbands with critical sampling: analysis, experiments, and application to acoustic echo cancellation. Signal Processing, IEEE Transactions on, 40(8):1862–1875, Aug 1992.

[34] S. Golan, S. Gannot, and I. Cohen. Performance analysis of a randomly spaced wireless microphone array. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 121–124, May 2011.

[35] T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiát, M. Lincoln, and V. Wan. The AMI meeting transcription system. In Proc. NIST Rich Transcription 2006 Spring Meeting Recognition Evaluation Workshop, page 12. National Institute of Standards and Technology, 2006.

[36] J. Harrington and S. Cassidy. Techniques in Speech Acoustics. Kluwer Academic Publishers, Foris, Dordrecht, 1999. ISBN: 0-7923-5731-0.

[37] J. Haykin, H. Silverman, and M. Brandstein. Robust localization in reverberant rooms. In M. Brandstein and D. Ward, editors, Microphone Arrays: Signal Processing Techniques and Applications, chapter 8. Springer, 2001.

[38] S. Haykin. Adaptive Filter Theory. Prentice-Hall, 1986.

[39] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., 87(4):1738–1752, Apr. 1990.

[40] D. H. Johnson and D. E. Dudgeon. Array Signal Processing: Concepts and Techniques. Simon & Schuster, 1992.

[41] M. Kajala and M. Hämäläinen. Broadband beamforming optimization for speech enhancement in noisy environments. In Applications of Signal Processing to Audio and Acoustics, 1999 IEEE Workshop on, pages 19–22, 1999.

[42] F. Khalil, J. P. Jullien, and A. Gilloire. Microphone array for sound pickup in teleconference systems. J. Audio Eng. Soc, 42(9):691–700, 1994.

[43] T. Kinnunen and H. Li. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52(1):12–40, 2010.

[44] C. Knapp and G. C. Carter. The generalized correlation method for estimation of time delay. Acoustics, Speech and Signal Processing, IEEE Transactions on, 24(4):320–327, Aug 1976.

[45] I. Kodrasi, T. Rohdenburg, and S. Doclo. Microphone position optimization for planar superdirective beamforming. In ICASSP, pages 109–112. IEEE, 2011.

[46] B. Kollmeier, T. Brand, and B. Meyer. Perception of speech and sound. In Springer Handbook of Speech Processing. Springer-Verlag, Berlin Heidelberg, 2008.

[47] G. Lathoud and I. McCowan. Location based speaker segmentation. Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on, pages 621–624, 2003.

[48] N. Levinson. The Wiener RMS (root mean square) error criterion in filter design and prediction. J. Math. Phys., 25(4):261–278, 1947.

[49] Q. Lin, E.-E. Jan, and J. Flanagan. Microphone arrays and speaker identification. Speech and Audio Processing, IEEE Transactions on, 2(4):622–629, Oct 1994.

[50] W. Liu and S. Weiss. Wideband Beamforming: Concepts and Techniques. John Wiley & Sons, 2010.

[51] L. J. Ziomek. Fundamentals of Acoustic Field Theory and Space-Time Signal Processing. CRC Press, 1995.

[52] Y. Lo. A mathematical theory of antenna arrays with randomly spaced elements. Antennas and Propagation, IEEE Transactions on, 12(3):257–268, May 1964.

[53] J. H. Mathews and K. K. Fink. Numerical Methods Using Matlab. Prentice-Hall Inc., 2004.

[54] I. A. McCowan. Microphone arrays: A tutorial. 2001.

[55] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.

[56] H. Nyquist. Certain topics in telegraph transmission theory. Transactions of the AIEE, 47:617–644, 1928.

[57] J. Pardo, X. Anguera, and C. Wooters. Speaker diarization for multiple-distant-microphone meetings using several sources of information. Computers, IEEE Transactions on, 56(9):1212–1224, Sept 2007.

[58] J. Pardo, R. Barra-Chicote, R. San-Segundo, R. de Cordoba, and B. Martinez-Gonzalez. Speaker diarization features: The UPM contribution to the RT09 evaluation. Audio, Speech, and Language Processing, IEEE Transactions on, 20(2):426–435, Feb 2012.

[59] D. Reynolds and R. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. Speech and Audio Processing, IEEE Transactions on, 3(1):72–83, Jan 1995.

[60] D. Reynolds and P. Torres-Carrasquillo. Approaches and applications of audio diarization. In Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on, volume 5, pages v/953–v/956 vol. 5, March 2005.

[61] D. A. Reynolds, G. R. Doddington, M. A. Przybocki, and A. F. Martin. The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective. Speech Commun., 31(2–3):225–254, June 2000.

[62] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41, 2000.

[63] P. R. Roth. Effective measurements using digital signal analysis. Spectrum, IEEE, 8(4):62–70, April 1971.

[64] Y. Rui and D. Florencio. Time delay estimation in the presence of correlated noise and reverberation. In Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, volume 2, pages ii–133–6 vol. 2, May 2004.

[65] R. Samborski and M. Ziółko. Speaker localization in conferencing systems employing phase features and wavelet transform. 2012 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 333–337, 2013.

[66] R. Samborski, M. Ziółko, B. Ziółko, and J. Gałka. Speech extraction from jammed signals in dual-microphone systems. Proc. IASTED International Conference on Signal Processing, Pattern Recognition and Applications, 2010.

[67] R. Samborski, M. Ziółko, B. Ziółko, and J. Gałka. Wiener filtration for speech extraction from the intentionally corrupted signals. 2010 IEEE International Symposium on Industrial Electronics, pages 1698–1701, 2010.

[68] Y. Sasaki, S. Kagami, and H. Mizoguchi. Multiple sound source mapping for a mobile robot by self-motion triangulation. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 380–385, Oct 2006.

[69] M. Seltzer and R. Stern. Subband likelihood-maximizing beamforming for speech recognition in reverberant environments. IEEE Transactions on Audio, Speech, and Language Processing, 14:2109–2121, 2006.

[70] J. Sherman and W. J. Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics, 21(4):620–624, Dec. 1950.

[71] S. Shum, N. Dehak, R. Dehak, and J. Glass. Unsupervised methods for speaker diarization: An integrated and iterative approach. Audio, Speech, and Language Processing, IEEE Transactions on, 21(10):2015–2028, Oct 2013.

[72] J. Shynk. Frequency-domain and multirate adaptive filtering. Signal Processing Magazine, IEEE, 9(1):14–37, Jan 1992.

[73] H. F. Silverman, Y. Yu, and J. M. Sachar. Performance of real-time source-location estimators for a large-aperture microphone array. IEEE Trans. Speech Audio Process., 13:593–606, 2005.

[74] B. D. Steinberg. Principles of Aperture and Array System Design: Including Random and Adaptive Arrays. Wiley, New York, 1976.

[75] H. Sun, B. Ma, S. Z. K. Khine, and H. Li. Speaker diarization system for RT07 and RT09 meeting room audio. In ICASSP, pages 4982–4985. IEEE, 2010.

[76] S. Tranter and D. Reynolds. An overview of automatic speaker diarisation systems. IEEE Transactions on Audio, Speech, and Language Processing, pages 1557–1565, 2006.

[77] S. V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2006.

[78] D. Vijayasenan and F. Valente. DiarTK: An open source toolkit for research in multistream speaker diarization and its application to meetings recordings. In INTERSPEECH, pages 2170–2173. ISCA, 2012.

[79] D. Vijayasenan, F. Valente, and H. Bourlard. An information theoretic combination of MFCC and TDOA features for speaker diarization. Audio, Speech, and Language Processing, IEEE Transactions on, 19(2):431–438, Feb 2011.

[80] D. Vijayasenan, F. Valente, and P. Motlicek. Multistream speaker diarization through information bottleneck system outputs combination. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4420–4423, May 2011.

[81] B. Widrow, J. R. Glover, Jr., J. McCool, J. Kaunitz, C. Williams, R. Hearn, J. Zeidler, E. Dong, Jr., and R. Goodlin. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12):1692–1716, Dec 1975.

[82] B. Widrow and S. D. Stearns. Adaptive Signal Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1985.

[83] N. Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. The MIT Press, 1964.

[84] T. Yamada, A. Tawari, and M. Trivedi. In-vehicle speaker recognition using independent vector analysis. Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on, pages 1753–1758, 2012.

[85] C. Zhang, D. Florencio, D. Ba, and Z. Zhang. Maximum likelihood sound source localization and beamforming for directional microphone arrays in distributed meetings. IEEE Transactions on Multimedia, 10(3):538–548, April 2008.

[86] T. Zieliński. Cyfrowe przetwarzanie sygnałów. Wydawnictwa Komunikacji i Łączności, Warszawa, 2009.

[87] M. Ziółko, J. Gałka, B. Ziółko, T. Jadczyk, D. Skurzok, and J. Wicijowski. Automatic speech recognition system based on wavelet analysis. In ICSC, pages 450–451. IEEE Computer Society, 2010.

[88] M. Ziółko, R. Samborski, J. Gałka, and B. Ziółko. Wavelet-Fourier analysis for speaker recognition. 17th National Conference on Applications of Mathematics in Biology and Medicine, pages 129–134, 2011.

[89] D. Zotkin and R. Duraiswami. Accelerated speech source localization via a hierarchical search of steered response power. Speech and Audio Processing, IEEE Transactions on, 12(5):499–508, Sept 2004.
