Automatic Tuning System For Polyphonic Sound
Motivation. Plug-ins and other computer software already exist that can tune a melodic line, whether a human voice or an instrument such as a guitar or violin, either in real time or by post-processing. However, there is still no commercial polyphonic tuning system. Consider a recording studio where the input is a sequence of guitar chords. By some twist of fate, some of the chords are out of tune and cannot be recorded again; yet we want to keep them, because their absence would leave a gap in the musical piece. The solution lies in computer-aided tuning of the chord or chords in question, and the system presented here is intended as an aid for exactly this situation. A similar need arises when improving the tuning of a song that is to be remastered: in many remastering cases the separate tracks, needed to handle each line individually, are no longer available. The tuning system helps in this case as well.
Description of the solution. The key to the solution offered by this system lies in working in the frequency domain, because tuning polyphony directly in the time domain would be, a priori, almost impossible. We show an approximation by generating a clean, ideal chord consisting only of the fundamental tones and their exponentially attenuated harmonics. The block diagram is presented below, followed by a breakdown and description of each stage:
The out-of-tune polyphonic sound is first picked up by the Fundamental Tones Estimator block. This first stage determines the fundamental tones of the polyphony by studying the maxima of its spectrum. To this end, let the out-of-tune chord in the time domain be a set of N samples x(n); we obtain its spectrum by the DFT operation:

X(k) = Σ_{n=0}^{N−1} x(n) · e^{−j2πkn/N}

We then work with the modulus of the transform (recall that the relevant sound information lies in the modulus of the spectrum, not in the phase). Therefore, we calculate the modulus samples (for simplicity we assume that the DFT also has N samples) as:

|X(k)| = √(Re²{X(k)} + Im²{X(k)}),   k = 0, 1, …, N − 1
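As a minimal sketch of these two operations (assuming NumPy and a 1-second frame so that each DFT bin spans one hertz; the function name is our own):

```python
import numpy as np

def spectrum_modulus(x):
    """Return |X(k)|, k = 0..N-1: the DFT modulus of the chord samples x(n)."""
    X = np.fft.fft(x)                      # X(k) = sum_n x(n)·e^{-j2πkn/N}
    return np.sqrt(X.real**2 + X.imag**2)  # sqrt(Re²{X(k)} + Im²{X(k)})

# Example: a pure 440 Hz tone over a 1-second frame peaks at bin 440
fs = 4096
t = np.arange(fs) / fs
mag = spectrum_modulus(np.cos(2 * np.pi * 440 * t))
peak = int(np.argmax(mag[: fs // 2]))      # only the first half carries new information
```

Only bins 0 to N/2 need to be scanned, since the spectrum of a real signal is conjugate-symmetric.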
Once the modulus is calculated, we obtain a representative set of positions that are maxima. These maxima must correspond to the peaks of the tones and harmonics that make up the out-of-tune chord. Since we do not know the number of fundamental tones a priori (nor should we need to), we can take the positions of a wide range of maxima, although, with experience, a compromise can be reached to maximize efficiency, so that the program fulfils its mission in the shortest possible time. Through a traversal-and-comparison algorithm over all the positions of the maxima obtained, we classify the fundamental tones and discard the harmonics. A harmonic is discarded by examining the ratio r between the frequency of the potential harmonic and that of the potential fundamental tone: if a maximum corresponds to a harmonic, this ratio should ideally be a natural number k. Since there may be some deviation, a threshold ε, determined from a stochastic database, is used to discern between a harmonic and a non-harmonic peak: a peak is discarded as a harmonic when |r − k| < ε for some natural k ≥ 2. In principle, ε = 0.05 is more than enough. Once all the fundamental tones have been estimated, the stage returns a vector of fundamental tones (which are assumed to be out of tune).
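The ratio test can be sketched as follows (a simplification: we assume the peak positions have already been extracted, and we scan them from low to high so that fundamentals are seen before their harmonics; names are our own):

```python
def discard_harmonics(peak_freqs, eps=0.05):
    """Keep only peaks that are not (near-)integer multiples of a lower kept peak.

    peak_freqs: candidate peak frequencies in Hz, in any order.
    eps: maximum deviation |r - round(r)| for the ratio test (0.05 in the text).
    """
    fundamentals = []
    for f in sorted(peak_freqs):            # lower peaks first: fundamentals precede harmonics
        is_harmonic = False
        for f0 in fundamentals:
            r = f / f0                      # ratio potential-harmonic / potential-fundamental
            k = round(r)
            if k >= 2 and abs(r - k) < eps:
                is_harmonic = True          # close to a natural multiple -> harmonic of f0
                break
        if not is_harmonic:
            fundamentals.append(f)
    return fundamentals

# Peaks of a detuned A minor chord plus a few of their harmonics:
v = discard_harmonics([435, 870, 1305, 520, 1040, 650, 1950])
```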
Then, the Linear Classifier block determines, by a linear decision, the fundamental tone to which each of the component notes of the polyphonic sound should be tuned. It does this by first adapting the exponential frequency scale to a linear one by means of a logarithmic function. Recall that human auditory perception is not linear with frequency; the feeling of linearity comes from the exponential growth of frequency. We draw on the following expression:

f(m, n) = 440 · 2^[(n − 10)/12 + (m − 4)]

where log(2) refers throughout to the natural logarithm of 2, and n corresponds to the following association:

n    1  2   3  4   5  6  7   8  9   10  11  12
note C  C#  D  D#  E  F  F#  G  G#  A   A#  B

Of course, to finish characterizing the note we need to specify the octave in which it is located, which is the responsibility of the variable m. Since C1 lies at 32.7 Hz, we express the octave, m, as a function of the frequency f as:

m = ⌊log(f/32.7) / log(2)⌋ + 1

Consequently, solving f(m, n) = f for n, the classifier function, which depends only on f, is:

n(f) = 10 + 12 · [log(f/440)/log(2) − (m − 4)]
     = 10 + 12 · [log(f/440)/log(2) − ⌊log(f/32.7)/log(2)⌋ + 3]
The function n(f) feeds a linear decision-maker, so that, by rounding n(f) to the nearest integer, we can select the note which, while respecting auditory perception, is closest to the out-of-tune note. (When n(f) rounds to 13, the note wraps around to n = 1 of the next octave, m + 1.) It follows immediately that if n(f) = 1, 2, …, 12 exactly, the note is perfectly in tune and needs no tuning by spectrum displacement.
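A sketch of the classifier under the expressions above (pure Python; the function name is our own):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def classify(f):
    """Map a frequency f (Hz) to the nearest tempered-scale note (n, m).

    n = 1..12 indexes C..B as in the table above; m is the octave, with C1 at 32.7 Hz.
    Returns (n, m, name).
    """
    m = math.floor(math.log(f / 32.7) / math.log(2)) + 1            # octave
    n = round(10 + 12 * (math.log(f / 440) / math.log(2) - (m - 4)))
    if n == 13:                      # wrap to C of the next octave
        n, m = 1, m + 1
    return n, m, f"{NOTE_NAMES[n - 1]}{m}"

# The 450 Hz example from the text rounds to n = 10, i.e. an A4:
note_450 = classify(450)
```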
The Fundamental Tones Estimator sends the vector of tones to the Linear Classifier. For example, imagine that one of the fundamental tones was estimated at 450 Hz (f = 450 Hz). Applying the expression deduced above:

n = 10 + 12 · log(450/440)/log(2) = 10 + 0.3891 = 10.3891
Since 0.3891 < 0.5, n rounds to 10, which corresponds to an A4 (at 440 Hz). The decision table can be modified so that the pitch does not snap to distances of a half-tone but to specific harmonies instead; this is a mere particularization of the above. Note also that we treat a C# as a Db, although this is not strictly orthodox in practice.

The Filter block is responsible for filtering the entire spectrum of the out-of-tune chord, keeping only the components associated with each tone, taken separately, as estimated by the Fundamental Tones Estimator block. This separation is what allows us to simplify the problem, leading subsequently to a simple single-note tuning by spectrum displacement and, afterwards, to the addition of all the tuned samples in the time domain, rebuilding the polyphony. A filter h_i(n) is generated for each of the estimated frequencies. This filter is simply the superposition, in the time domain, of the estimated fundamental tone and all its audible harmonics, with a dispersion of 2 Hz around each, so as to try not to alter the timbre of the instrument. Applying the DFT operation described above, we obtain the filter response in the frequency domain, so that the convolution of the filter with the chord in the time domain becomes the product of the chord spectrum and the filter response in the frequency domain, selecting only one note of the polyphonic sound:

y_i(n) = x(n) * h_i(n)   ⇔   Y_i(k) = DFT{x(n) * h_i(n)} = X(k) · H_i(k)
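Equivalently, with one DFT bin per hertz (as in the prototype), the filter can be sketched directly as a frequency-domain mask H_i(k) with ±2 Hz passbands around the fundamental and its harmonics (names and structure are our own):

```python
import numpy as np

def note_mask(n_bins, f0, n_harmonics=20, dispersion=2):
    """Frequency-domain mask H_i(k) passing f0 and its harmonics, +/- dispersion Hz.

    Assumes one DFT bin per hertz (a 1-second frame), so bin k sits at k Hz.
    The mirrored bins above n_bins/2 are included so the masked spectrum
    stays conjugate-symmetric (real time-domain signal).
    """
    H = np.zeros(n_bins)
    for h in range(1, n_harmonics + 1):
        fc = int(round(h * f0))
        if fc >= n_bins // 2:
            break                                    # beyond the audible half
        lo, hi = fc - dispersion, fc + dispersion + 1
        H[lo:hi] = 1.0
        H[n_bins - hi + 1 : n_bins - lo + 1] = 1.0   # mirror (negative frequencies)
    return H

# Mask selecting only the 435 Hz note (and its harmonics) of the chord:
H = note_mask(44100, 435)
```

Multiplying the chord spectrum X(k) by this mask, Y_i(k) = X(k) · H_i(k), isolates one note of the polyphony.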
Therefore, this block operates on each of the M estimated fundamental frequencies, returning the spectra Y_i(k) of the component notes, where, in consequence, i = 1, 2, …, M. Now, using the Linear Classifier, the Displacement Spectrum Tuner refines each of the component notes from the spectrum provided by the Filter. This block uses the estimated fundamental frequency f_est and the actual or ideal frequency f_ideal to define the parameter α = f_ideal / f_est.
We take the time-scaling property of the Fourier transform, which tells us:

F{x(αt)} = (1/|α|) · X(f/α)

so that the tuned spectrum of the ith note, Y_i'(k), becomes:

Y_i'(k) = (1/|α|) · Y_i(k/α)
Here the sample rates k/α should be rounded to the nearest integer (the prototype works with one sample per hertz, although its performance would improve greatly with a higher resolution, simply by scaling the above results). For the parameter α, the estimated frequency comes directly from the Fundamental Tones Estimator, while the ideal frequency is calculated using the expression above:

f(m, n) = 440 · 2^[(n − 10)/12 + (m − 4)]

where n is the rounded parameter returned by the Linear Classifier and m refers to the octave of the note in question, in the terms already mentioned:

m = ⌊log(f/32.7) / log(2)⌋ + 1
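That displacement, with its rounding of k/α to integer bins, can be sketched as follows (assuming one bin per hertz and a real input signal; the helper name is our own):

```python
import numpy as np

def displace_spectrum(Y, alpha):
    """Tuned spectrum Y'(k) = (1/|alpha|)·Y(k/alpha), k/alpha rounded to the nearest bin.

    Only the positive-frequency half is displaced; the negative half is rebuilt
    by conjugate symmetry so the inverse DFT stays (essentially) real.
    """
    N = len(Y)
    half = N // 2 + 1
    Yp = np.zeros(N, dtype=complex)
    for k in range(half):
        src = int(round(k / alpha))                   # k/alpha rounded to an integer bin
        if src < half:
            Yp[k] = Y[src] / abs(alpha)
    Yp[N // 2] = Yp[N // 2].real                      # Nyquist bin must stay real
    Yp[half:] = np.conj(Yp[1 : N - half + 1][::-1])   # mirror into negative frequencies
    return Yp

# Example: a 435 Hz tone (1-second frame) displaced with alpha = 440/435 peaks at bin 440
N = 4096
t = np.arange(N) / N
Yp = displace_spectrum(np.fft.fft(np.cos(2 * np.pi * 435 * t)), 440 / 435)
peak = int(np.argmax(np.abs(Yp[: N // 2])))
```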
Once the ith component note of the sound has been tuned by spectrum displacement, it is converted by the inverse DFT and passed to the Adder Buffer in Time Domain block:

y_i'(n) = (1/N) · Σ_{k=0}^{N−1} Y_i'(k) · e^{j2πkn/N}
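A sketch of this final step, taking the list of tuned spectra from the previous stage and superimposing their inverse transforms (names are our own):

```python
import numpy as np

def adder_buffer(tuned_spectra):
    """Inverse-DFT each tuned spectrum Y_i'(k) and superimpose the results."""
    notes = [np.fft.ifft(Y).real for Y in tuned_spectra]  # y_i'(n); tiny imaginary residue dropped
    return np.sum(notes, axis=0)                          # the tuned polyphonic sound

# Two already-tuned tones recombine into a two-note polyphony
N = 1024
t = np.arange(N) / N
a = np.cos(2 * np.pi * 100 * t)
b = np.cos(2 * np.pi * 200 * t)
y = adder_buffer([np.fft.fft(a), np.fft.fft(b)])
```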
Here all the notes wait until every ingredient is ready; when that happens, they are superimposed to compose the tuned polyphonic sound. Here is an example of the system's operation. Suppose we have an A minor chord built on the 4th octave (A ideally at 440 Hz): A-C-E. However, the fundamental tones of the component notes of the chord lie at 435 Hz (for the A at 440 Hz), 520 Hz (for the C at 523.25 Hz) and 650 Hz (for the E at 659.26 Hz), so the chord is out of tune. The following figure shows a set of 1,000 samples of the chord:
The polyphonic sound is sampled at 44,100 Hz (audio quality) and comprises, apart from the three basic tones, 20 harmonics per fundamental tone, whose amplitudes undergo exponential decay. The estimator block then calculates the fundamental tones from the DFT modulus of the chord between 0 and fs/2. As shown in Figure 3, this phase implements an algorithm that selects the positions of the maxima corresponding to the frequencies of the fundamental tones. For simplicity, each sample corresponds to one hertz, so the fundamental frequency of each of the component notes of the chord can be read directly on the horizontal axis:
v = (435, 520, 650).
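This example can be reproduced end to end with a short sketch. The amplitudes, the decay rate (0.5 per harmonic) and the peak threshold are our own assumptions, since the text does not fix them; only the 44,100 Hz rate, the 20 harmonics and the exponential decay come from the description above:

```python
import numpy as np

FS = 44100                                    # 1-second frame -> one DFT bin per hertz

def detuned_chord(fundamentals, n_harmonics=20, decay=0.5):
    """Each tone is its fundamental plus 20 exponentially decaying harmonics."""
    t = np.arange(FS) / FS
    x = np.zeros(FS)
    for f0 in fundamentals:
        for h in range(1, n_harmonics + 1):
            x += decay ** (h - 1) * np.cos(2 * np.pi * h * f0 * t)
    return x

mag = np.abs(np.fft.rfft(detuned_chord([435, 520, 650])))  # modulus between 0 and fs/2

# Local maxima above a threshold, scanned low to high; harmonics removed by the ratio test
peaks = [k for k in range(1, FS // 2)
         if mag[k] > 0.2 * mag.max() and mag[k] >= mag[k - 1] and mag[k] >= mag[k + 1]]
v = []
for f in peaks:
    if not any(abs(f / f0 - round(f / f0)) < 0.05 and round(f / f0) >= 2 for f0 in v):
        v.append(f)
```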
Now, for each detected note, we proceed as follows. The filtering stage, knowing the fundamental frequency of the ith note, with i = 1, 2, …, M (in this case M = 3), generates the time-domain filter described above. The following sequence of figures collects 1,000 samples from each of the temporal responses of the filters according to the vector v:
Figure 4. Time domain response of the filter for the note whose fundamental frequency is at 435Hz.
Figure 5. Time domain response of the filter for the note whose fundamental frequency is at 520Hz.
Figure 6. Time domain response of the filter for the note whose fundamental frequency is at 650Hz.
For each one of the filters generated previously, we take the product of its frequency response with the spectrum of the out-of-tune chord, which returns, in this case, three spectra (since M = 3) corresponding to the component notes of the polyphonic sound taken separately. In parallel, the Linear Classifier uses the estimated frequency of each fundamental tone to determine the note to which each component of the polyphony should be tuned, as described. That is:

n(435) = 10 + 12 · [log(435/440)/log(2) − (4 − 4)] = 9.8022
n(520) = 10 + 12 · [log(520/440)/log(2) − (4 − 4)] = 12.8921
n(650) = 10 + 12 · [log(650/440)/log(2) − (5 − 4)] = 4.7552

We make a simple classification based on linear distance minimization in the space of musical notes: 9.8022 rounds to n = 10 in octave 4 (an A), 12.8921 rounds to 13, which wraps to n = 1 of octave 5 (a C), and 4.7552 rounds to n = 5 in octave 5 (an E). The component notes are therefore A4-C5-E5, as can be gleaned from the table presented above. The actual (ideal) frequencies of the notes are, therefore:

f(4, 10) = 440 · 2^[(10 − 10)/12 + (4 − 4)] = 440 Hz
f(5, 1) = 440 · 2^[(1 − 10)/12 + (5 − 4)] = 523.25 Hz
f(5, 5) = 440 · 2^[(5 − 10)/12 + (5 − 4)] = 659.26 Hz
All these data are collected by the Displacement Spectrum Tuner block. Thus, for each of the component notes of the chord, we calculate the parameter α:
α₁ = 440/435 = 1.011494253
α₂ = 523.25/520 = 1.006250000
α₃ = 659.26/650 = 1.014246154

These are the temporal scaling parameters motivated by the need to change the pitch of each of the out-of-tune notes. Consider that the fundamental tone of each of the component notes of the polyphony can be approximated by the function cos(2πft); therefore, to shift this tone to the ideal frequency f_ideal, we need to scale its argument so that, with α = f_ideal/f, cos(2πf · αt) = cos(2πf_ideal · t). We match this idea with the time-scaling property of the Fourier transform exposed previously. Then, for each of the component-note spectra of the chord coming from the filtering stage, we apply:

Y_i'(k) = (1/|α|) · Y_i(k/α)
where Y_i(k) is the frequency response of the ith note of the polyphony. Below is the sequence of tuned spectral components:
The last stage of the block diagram is responsible for adding the inverse transforms (time-domain samples) of each of the spectra of the component notes of the chord, resulting in the tuned polyphonic sound. The following chart compares the first 1,000 samples of the out-of-tune chord (in blue) with the first 1,000 samples of the tuned chord (in red). Note that the filter introduces some distortion (just a matter of further "tuning" the system).
Figure 10. Chord in tune vs. chord out of tune in time domain.