A Pitch Detection Algorithm
A pitch detection algorithm (PDA) estimates the pitch or fundamental frequency of a quasiperiodic signal, usually a digital recording of speech or of a musical note or tone. The estimate can be made in the time domain or in the frequency domain. PDAs are used in various contexts (e.g. phonetics, music information retrieval, speech coding, musical performance systems), so different demands may be placed upon the algorithm. There is as yet no single ideal PDA, and a variety of algorithms exist, most falling broadly into the classes given below.[1]
Time-domain approaches
In the time domain, a PDA typically estimates the period of the quasiperiodic signal, then inverts that value to give the frequency. One simple approach is to measure the distance between zero-crossing points of the signal (i.e. the zero-crossing rate). This does not work well with complex waveforms composed of multiple sine waves with differing periods. Nevertheless, zero-crossing can be a useful measure in some cases, for example in speech applications where a single source is assumed, and the algorithm's simplicity makes it cheap to implement.
More sophisticated approaches compare segments of the signal with other segments offset by a trial period to find a match. AMDF (average magnitude difference function), ASMDF (average squared mean difference function), and similar autocorrelation-style algorithms work this way. These algorithms can give quite accurate results for highly periodic signals. However, they have false-detection problems (often "octave errors"), can cope badly with noisy signals (depending on the implementation) and, in their basic implementations, do not deal well with polyphonic sounds (which involve multiple musical notes of different pitches).
Current time-domain pitch detection algorithms tend to build upon the basic methods referred to above, with additional refinements to bring the performance more in line with a human assessment of pitch. For example, the YIN algorithm[2] and the MPM algorithm[3] are both based upon autocorrelation.
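The segment-comparison idea described above can be sketched with the AMDF. This is only a minimal illustration on a synthetic tone, not a production pitch tracker; the sample rate, the test tone and the 120 to 500 Hz search range are assumptions chosen for the example, and only base MATLAB functions are used.

```matlab
% Sketch: estimate the pitch of a synthetic tone with the AMDF.
% All names and parameter values are illustrative assumptions.
fs = 8000;                        % sample rate in Hz (assumed)
f0 = 200;                         % fundamental of the test tone in Hz (assumed)
n  = (0:1023)';                   % sample indices
x  = sin(2*pi*f0*n/fs) + 0.5*sin(2*pi*2*f0*n/fs);   % tone plus one harmonic

minlag = floor(fs/500);           % shortest trial period (500 Hz upper limit)
maxlag = ceil(fs/120);            % longest trial period (120 Hz lower limit)
N = length(x) - maxlag;           % number of samples compared at every lag

amdf = zeros(maxlag, 1);
for lag = minlag:maxlag
    % Mean absolute difference between the signal and a shifted copy.
    amdf(lag) = mean(abs(x(1:N) - x(1+lag:N+lag)));
end

% The deepest valley marks the best-matching trial period.
[~, k] = min(amdf(minlag:maxlag));
period = minlag + k - 1;          % period in samples
pitch  = fs / period;             % invert the period to get the frequency
```

The search range matters: a shift of two periods (80 samples here) matches the signal just as well as a shift of one, so widening the range to include it would admit an equally deep valley, which is how octave errors arise.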
Frequency-domain approaches
In the frequency domain, polyphonic detection is possible, usually by using the fast Fourier transform (FFT) to convert the signal to a frequency spectrum. This requires more processing power as the desired accuracy increases, although the efficiency of the FFT makes it fast enough for many purposes. Popular frequency-domain algorithms include the harmonic product spectrum;[4][5] cepstral analysis;[6] maximum likelihood, which attempts to match the frequency-domain characteristics to pre-defined frequency maps (useful for detecting the pitch of fixed-tuning instruments); and the detection of peaks due to harmonic series.[7]
To improve on the pitch estimate derived from the discrete Fourier spectrum, techniques such as spectral reassignment (phase-based) or Grandke interpolation (magnitude-based) can be used to go beyond the bin-spacing precision provided by the FFT analysis.
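As a rough illustration of magnitude-based interpolation, the sketch below refines an FFT peak using the ratio of the peak bin to its larger neighbour. It assumes a Hann window and the commonly cited form delta = (2*alpha - 1)/(alpha + 1), which follows from the shape of the Hann main lobe for long windows. The test tone, sample rate and FFT length are assumptions made for the example.

```matlab
% Sketch: refine an FFT peak with magnitude-based (Grandke-style) interpolation.
% Assumes a Hann window; all names and values are illustrative.
fs = 8000;                        % sample rate in Hz (assumed)
N  = 1024;                        % FFT length, bin spacing fs/N = 7.8125 Hz
f0 = 203.7;                       % test frequency that falls between two bins
n  = (0:N-1)';
w  = 0.5 - 0.5*cos(2*pi*n/N);     % Hann window built from base functions
X  = abs(fft(sin(2*pi*f0*n/fs) .* w));
X  = X(1:N/2);                    % keep the non-negative frequencies

[~, k] = max(X);                  % 1-based index; X(k) is bin k-1
% (This sketch assumes the peak is not the first or last bin.)
if X(k+1) >= X(k-1)
    alpha = X(k+1) / X(k);        % ratio of larger neighbour to peak
    delta = (2*alpha - 1) / (alpha + 1);
else
    alpha = X(k-1) / X(k);
    delta = -(2*alpha - 1) / (alpha + 1);
end

coarse = (k - 1) * fs / N;          % estimate limited to the bin grid
fine   = (k - 1 + delta) * fs / N;  % interpolated estimate
```

Here the coarse estimate is stuck on the nearest bin, roughly half a hertz away from the test frequency, while the interpolated estimate lands much closer to it.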
Overview
Linear predictive coding (LPC) starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (voiced sounds), with occasional added hissing and popping sounds (sibilants and plosive sounds). Although apparently crude, this model is actually a close approximation of the reality of speech production. The glottis (the space between the vocal folds) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances; these give rise to formants, or enhanced frequency bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.
LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the signal that remains after the filtered modeled signal is subtracted is called the residue. The numbers that describe the intensity and frequency of the buzz, the formants, and the residue signal can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: it uses the buzz parameters and the residue to create a source signal, uses the formants to create a filter (which represents the tube), and runs the source through the filter, resulting in speech. Because speech signals vary with time, this process is done on short chunks of the speech signal, called frames; generally 30 to 50 frames per second give intelligible speech with good compression.
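The analysis and synthesis steps described above can be sketched on a single synthetic frame. This is a minimal illustration rather than a codec: the "tube" is an assumed two-pole resonator, the "buzz" is an assumed 100 Hz impulse train, and the filter is estimated with the autocorrelation method and the Levinson-Durbin recursion, one common way of fitting an LPC filter. Only base MATLAB functions are used.

```matlab
% Sketch: LPC analysis and resynthesis of one synthetic frame.
% All names and values are illustrative assumptions.
fs = 8000;                                   % sample rate in Hz (assumed)
N  = 240;                                    % one 30 ms frame
excitation = zeros(N, 1);
excitation(1:80:end) = 1;                    % 100 Hz impulse train (the buzz)
a_true = [1 -1.3 0.9];                       % two-pole resonator (the tube)
x = filter(1, a_true, excitation);           % synthetic "speech" frame

% Analysis, step 1: autocorrelation of the frame for lags 0..p.
p = 2;                                       % predictor order
r = zeros(p+1, 1);
for k = 0:p
    r(k+1) = sum(x(1:N-k) .* x(1+k:N));
end

% Analysis, step 2: Levinson-Durbin recursion for the filter coefficients.
a = 1;                                       % grows to [1; a1; ...; ap]
e = r(1);                                    % prediction error energy
for m = 1:p
    kref = -(r(m+1:-1:2)' * a) / e;          % reflection coefficient
    a = [a; 0] + kref * [0; flipud(a)];
    e = e * (1 - kref^2);
end

% The pole angle of the estimated filter gives the resonance (formant).
rts = roots(a);
formant = abs(angle(rts(1))) * fs / (2*pi);  % in Hz

% Inverse filtering removes the resonance and leaves the residue.
residue = filter(a, 1, x);

% Synthesis: run the residue back through the all-pole filter.
y = filter(1, a, residue);
```

In this setup the residue comes out close to the original impulse train, and filtering it through the estimated all-pole filter reproduces the frame, which mirrors the analyze, transmit and resynthesize cycle in the text.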
Applications
LPC is generally used for speech analysis and resynthesis. It is used as a form of voice compression by phone companies, for example in the GSM standard. It is also used for secure wireless, where voice must be digitized, encrypted and sent over a narrow voice channel; an early example of this is the US government's Navajo I. LPC synthesis can be used to construct vocoders in which musical instruments serve as the excitation signal for the time-varying filter estimated from a singer's speech. This is somewhat popular in electronic music. Paul Lansky made the well-known computer music piece notjustmoreidlechatter using linear predictive coding.[1] A 10th-order LPC was used in the popular 1980s Speak & Spell educational toy. Waveform ROM in some digital sample-based music synthesizers made by Yamaha Corporation may be compressed using the LPC algorithm. LPC predictors are used in Shorten, MPEG-4 ALS, FLAC, and other lossless audio codecs.
Pitch detection
% Frequency-domain pitch detection using the harmonic product spectrum (HPS).
% f_y = pitch_detec(x, window, hop, xformlength)
%   x            input signal (column vector)
%   window       window length in samples
%   hop          hop size in samples
%   xformlength  FFT length
function f_y = pitch_detec(x, window, hop, xformlength)

% Windowing input signal
numwinds = ceil((size(x,1) - window)/hop) + 1;
windstart = 1;
h = 1;
for windnum = 1:numwinds
    % First fetch the samples to be used in the current window,
    % zero-padding if necessary for the last window.
    if windnum ~= numwinds
        windx = x(windstart:windstart + window - 1);
    else
        windx = x(windstart:size(x,1));
        windx(size(windx,1) + 1:window) = 0;
    end

    % Apply the Hanning window function to the samples.
    windx = windx .* hanning(window);

    % STFT: convert from the time domain to the frequency domain
    % using an FFT of length xformlength.
    f_x = fft(windx, xformlength);

    % HPS: keep the magnitude of the first half of the spectrum.
    f_x = f_x(1 : size(f_x,1) / 2);
    f_x = abs(f_x);

    % HPS, part I: downsampling
    f_x2 = ones(length(f_x), 1);
    f_x3 = ones(length(f_x), 1);
    f_x4 = ones(length(f_x), 1);
    % f_x5 = ones(length(f_x), 1);
    for i = 1:floor((length(f_x)-1)/2)
        f_x2(i,1) = (f_x(2*i,1) + f_x((2*i)+1,1))/2;
    end
    for i = 1:floor((length(f_x)-2)/3)
        f_x3(i,1) = (f_x(3*i,1) + f_x((3*i)+1,1) + f_x((3*i)+2,1))/3;
    end
    for i = 1:floor((length(f_x)-3)/4)
        f_x4(i,1) = (f_x(4*i,1) + f_x((4*i)+1,1) + f_x((4*i)+2,1) + f_x((4*i)+3,1))/4;
    end
    % for i = 1:floor((length(f_x)-4)/5)
    %     f_x5(i,1) = (f_x(5*i,1) + f_x((5*i)+1,1) + f_x((5*i)+2,1) + f_x((5*i)+3,1) + f_x((5*i)+4,1))/5;
    % end

    % HPS, part II: calculate product.
    % Only the spectrum and its 2x-downsampled copy are multiplied;
    % f_x3 and f_x4 are computed but not used.
    f_ym = (1*f_x) .* (1.0*f_x2); % .* (1*f_x3)

    % HPS, part III: find the index of the maximum
    % (the last one, if the maximum occurs more than once).
    index = find(f_ym == max(f_ym), 1, 'last');

    % Convert that to a frequency (assumes a 44100 Hz sample rate).
    f_y(h) = (index / xformlength) * 44100;

    % Post-processing: discard estimates above 600 Hz.
    if f_y(h) > 600
        f_y(h) = 0;
    end

    % Don't forget to increment the windstart pointer.
    windstart = windstart + hop;
    h = h + 1;
end

% Return the estimates as a column vector.
f_y = abs(f_y(:));
Pitch determination
function [target, change] = correctfreq(w, allowed, avglength)
% [target, change] = correctfreq(w, allowed, avglength)
% Computes the target frequencies and relative change for each detected
% pitch given in w.
%   w          is the vector of pitches calculated for each window.
%   allowed    is a vector of allowed notes in the FIRST octave.
%   avglength  is the length of the smoothing window, set to zero if no
%              smoothing is required.
% Example scale:
%   55.0000 58.2700 61.7400 65.4100 69.3000 73.4200 77.7800 82.4100 87.3100

% Generate the possible frequencies
possible = zeros(5 * size(allowed,1),1);
for i = 1 : 5
    for j = 1 : size(allowed, 1)
        possible((i-1) * size(allowed, 1) + j, 1) = allowed(j,1) * (2^i);
    end
end

% Take the logarithm
index = dsearchn([0; log(possible)], log(w));
possible = [0 ; possible];
for i = 1 : size(w, 1)
    target(i, 1) = possible(index(i, 1), 1);
end

if(avglength ~= 0)
    change = target - w;
    change = filter(ones(1,avglength)/avglength,1, change);
    target = w + change;
else
    change = 0;