Automatic Detection of Music Notes by the MFT Transform

Abstract
The goal of the project is to develop an algorithm that takes a piano music file as input and transcribes it automatically. The output is a list of notes, each with the following parameters:

1. Onset Time
2. Pitch Value
3. Amplitude
4. Duration

The algorithm converts an audio file into a MIDI file, which contains the list of notes. Musicians can easily use the MIDI format for analyzing, editing, reconstructing or printing these notes.

The problem
Onset Time = the exact moment at which a note begins.

Pitch = the basic (fundamental) frequency of the periodic note.

Every piano key has its own basic frequency.
For example, A4 has a basic frequency of 440 Hz.
The pitch of the next piano key is given by $440 \cdot 2^{1/12} \approx 466.16$ Hz.
A note = a sequence of frequencies $f_0, 2f_0, 3f_0, \dots$, where $f_0$ is the note pitch.
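
The relations above can be sketched in a few lines of Python. This is a minimal illustration assuming equal temperament (each key is a factor of 2^(1/12) above the previous one); the function names `key_frequency` and `partials` are hypothetical and do not come from the project.

```python
A4_FREQ_HZ = 440.0
SEMITONE_RATIO = 2.0 ** (1.0 / 12.0)  # equal-temperament step between adjacent keys


def key_frequency(semitones_from_a4: int) -> float:
    """Basic frequency of the key that lies the given number of keys above A4."""
    return A4_FREQ_HZ * SEMITONE_RATIO ** semitones_from_a4


def partials(f0: float, count: int) -> list:
    """The first `count` frequencies f0, 2*f0, 3*f0, ... that make up a note."""
    return [k * f0 for k in range(1, count + 1)]


if __name__ == "__main__":
    print(round(key_frequency(1), 2))  # 466.16
    print(partials(440.0, 3))          # [440.0, 880.0, 1320.0]
```

Negative arguments to `key_frequency` give the keys below A4, and twelve keys up doubles the frequency (one octave).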

Duration = the time interval between a note's Onset Time and its disappearance.

Since the human ear is very sensitive to small differences in the Pitch and Onset Time of a note, special care must be taken in determining their values.
Assumption: the played notes are connected (legato), meaning that each new note has its Onset Time while the previous one is fading out.

A time-frequency representation is needed to determine both the Onset Times and the Pitches (basic frequencies). The music file we dealt with was played on piano only, but it may contain polyphonic music (several notes can be played at the same time).

The solution
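We used the STFT to obtain a joint time-frequency representation. In discrete time, the STFT (Short-Time Fourier Transform) is given, in one common convention, by:

$$X(n, k) = \sum_{m} x(m)\, h(m - n)\, e^{-j 2\pi k m / N}$$

where x(n) is the signal, h(n) is the analysis window, n is the time index, k is the frequency index and N is the DFT length.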

The STFT is linear, invertible, invariant to shifts in time and frequency, and practical because the analysis window h(n) is bounded.

Every column of the STFT matrix is the DFT of x(n) multiplied by the time window; this is how we built the STFT matrix.
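
The column-by-column construction can be sketched as follows. This is a minimal pure-Python illustration, not the project's MATLAB code: the Hann window, the hop size and the names `hann`, `dft` and `stft_magnitude` are assumptions made for the example, and a direct O(N^2) DFT is used instead of an FFT to keep the sketch dependency-free.

```python
import cmath
import math


def hann(length):
    """Symmetric Hann window (an assumed choice of analysis window h(n))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1)) for i in range(length)]


def dft(frame):
    """Direct DFT of one frame; slow but dependency-free."""
    n = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def stft_magnitude(x, win_len, hop):
    """Return a list of columns; each column holds |DFT| of one windowed frame.

    Only the non-negative frequency bins 0 .. win_len // 2 are kept.
    """
    window = hann(win_len)
    columns = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = [x[start + i] * window[i] for i in range(win_len)]
        spectrum = dft(frame)
        columns.append([abs(c) for c in spectrum[: win_len // 2 + 1]])
    return columns


if __name__ == "__main__":
    # A sinusoid that completes 8 cycles per 64-sample window.
    signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
    cols = stft_magnitude(signal, win_len=64, hop=32)
    peak_bins = [max(range(len(c)), key=lambda k: c[k]) for c in cols]
    print(len(cols), peak_bins)
```

Each entry of the returned list is one time column, so `cols[t][k]` is the magnitude at time frame `t` and frequency bin `k`.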

The problem is that a single time-frequency resolution is not enough: according to the uncertainty principle, good accuracy in time comes at the cost of poor accuracy in frequency, and vice versa.

We used several resolutions and combined the information from each of them. This is the definition of the MFT (Multiresolution Fourier Transform).

Each level of the MFT is an ordinary STFT, computed each time with a different resolution.
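
The trade-off between the levels can be made concrete with a little arithmetic. For a window of N samples at sampling rate fs, the window spans N/fs seconds and the DFT bins are fs/N Hz apart, so the product of the two is always 1. The sketch below tabulates this; the sampling rate and window lengths are illustrative assumptions, since the values used in the project are not stated, and `level_resolutions` is a hypothetical helper name.

```python
def level_resolutions(sample_rate_hz, window_lengths):
    """Time span and frequency-bin spacing of each MFT level (one STFT per window length)."""
    return [
        {
            "window": n,
            "time_res_s": n / sample_rate_hz,   # how long one window lasts
            "freq_res_hz": sample_rate_hz / n,  # spacing between DFT bins
        }
        for n in window_lengths
    ]


if __name__ == "__main__":
    # Assumed values for illustration only.
    for level in level_resolutions(8000, [256, 1024, 4096]):
        print(level)
```

Short windows localize an onset well in time but blur neighbouring pitches; long windows separate pitches but smear the onset, which is why the levels are combined.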

Onset Time Detection Algorithm
Onset Times are marked by a sudden rise in the MFT coefficients.

If we take the time derivative of these coefficients, we expect to get high values in the columns where new notes begin.

The sum over each column of the derivative is a function of time, called the C vector. The local maxima of C are the candidate Onset Times.

The maxima caused by noise are removed by a global threshold, and the maxima caused by the disappearance of a note are removed using energy considerations.
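
A minimal sketch of this step for a single resolution level is given below. Several details are assumptions made for the example and are not taken from the project: the time derivative is approximated by a first difference between adjacent columns, the global threshold is a fixed fraction of the largest C value, and the "energy consideration" is implemented as keeping only maxima where the column energy rises. The names `c_vector` and `candidate_onsets` are hypothetical.

```python
def c_vector(columns):
    """Sum over frequency of the absolute first difference between adjacent columns."""
    c = [0.0]  # no predecessor for the first column
    for prev, cur in zip(columns, columns[1:]):
        c.append(sum(abs(b - a) for a, b in zip(prev, cur)))
    return c


def candidate_onsets(columns, threshold_ratio=0.3):
    """Indices of columns that look like note onsets.

    columns[t][k] is the magnitude at time frame t and frequency bin k.
    """
    c = c_vector(columns)
    energy = [sum(v * v for v in col) for col in columns]
    threshold = threshold_ratio * max(c)  # assumed form of the global threshold
    onsets = []
    for t in range(1, len(c) - 1):
        is_peak = c[t] > c[t - 1] and c[t] >= c[t + 1]
        energy_rises = energy[t] > energy[t - 1]  # rejects maxima caused by a note ending
        if is_peak and c[t] > threshold and energy_rises:
            onsets.append(t)
    return onsets


if __name__ == "__main__":
    # Two silent frames, a note in bin 1 for three frames, then silence again.
    cols = [[0, 0, 0], [0, 0, 0], [0, 5, 0], [0, 5, 0], [0, 5, 0], [0, 0, 0], [0, 0, 0]]
    print(c_vector(cols))          # [0.0, 0, 5, 0, 0, 5, 0]
    print(candidate_onsets(cols))  # [2]
```

In the toy input both the start and the end of the note produce a maximum in C, but only the start survives the energy test.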

The level combination is then performed over all the C vectors:

Every maximum found in one resolution is searched for in its neighborhood in the next resolution, until we have the most accurate Onset Times.
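
One possible reading of this coarse-to-fine search is sketched below. It assumes that each level is described by its hop size in samples together with its C vector, that levels are ordered from coarse to fine time resolution, and that the search neighbourhood in the finer level covers about one coarse frame on each side. These choices, and the name `refine_onset`, are assumptions for illustration; the project's exact rule is not given.

```python
def refine_onset(coarse_frame, levels, search_radius=1):
    """Refine an onset found in the coarsest level through the finer levels.

    levels: list of (hop_in_samples, c_vector), coarse to fine.
    coarse_frame: index of the maximum in the first level's C vector.
    Returns the refined onset position in samples.
    """
    prev_hop, _ = levels[0]
    sample = coarse_frame * prev_hop
    for hop, c in levels[1:]:
        # Map the current estimate to a frame index of the finer level.
        centre = min(len(c) - 1, round(sample / hop))
        radius = search_radius * max(1, prev_hop // hop)
        lo = max(0, centre - radius)
        hi = min(len(c) - 1, centre + radius)
        # Take the largest C value in the neighbourhood as the refined onset.
        best = max(range(lo, hi + 1), key=lambda t: c[t])
        sample = best * hop
        prev_hop = hop
    return sample


if __name__ == "__main__":
    levels = [
        (8, [0, 1, 9, 2, 0]),
        (4, [0, 0, 0, 1, 3, 8, 2, 0, 0, 0]),
        (2, [0] * 9 + [1, 2, 9, 3] + [0] * 7),
    ]
    print(refine_onset(2, levels))  # 22
```

In the example the estimate moves from sample 16 at the coarsest level to 20 and finally to 22 as the time resolution improves.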

Pitch Detection Algorithm
Now that we have the exact Onset Times, we can deal with every interval separately.

For each interval:

1. Compute the FFT.
2. Find the local maxima, to get the Pitches with their partials.
3. Scan over the maxima to find their harmonics.

The frequencies that have at least 4 harmonics are counted as basic frequencies, and these are the Pitches of the notes played in that interval.
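
The harmonic-counting rule can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the fundamental itself is counted as the first harmonic, a peak matches a harmonic if it lies within a 2% relative tolerance, and only the first ten multiples are checked. The names `local_maxima`, `count_harmonics` and `detect_pitches` are hypothetical.

```python
def local_maxima(spectrum, min_height):
    """Indices of spectrum bins that are local maxima above min_height."""
    return [
        k
        for k in range(1, len(spectrum) - 1)
        if spectrum[k] > min_height
        and spectrum[k] > spectrum[k - 1]
        and spectrum[k] >= spectrum[k + 1]
    ]


def count_harmonics(f0, peak_freqs, max_harmonic=10, rel_tol=0.02):
    """Number of multiples h * f0 (h = 1 .. max_harmonic) that coincide with a peak."""
    count = 0
    for h in range(1, max_harmonic + 1):
        target = h * f0
        if any(abs(f - target) <= rel_tol * target for f in peak_freqs):
            count += 1
    return count


def detect_pitches(peak_freqs, min_harmonics=4):
    """Peak frequencies supported by at least min_harmonics harmonics."""
    return [
        f0
        for f0 in sorted(peak_freqs)
        if count_harmonics(f0, peak_freqs) >= min_harmonics
    ]


if __name__ == "__main__":
    # Peaks of two simultaneous notes: 220 Hz and 330 Hz, four partials each.
    peaks = [220.0, 330.0, 440.0, 660.0, 880.0, 990.0, 1320.0]
    print(detect_pitches(peaks))  # [220.0, 330.0]
```

With this simple rule a strong partial could itself pass the test when enough of its own multiples are present, so a real implementation would need an extra check for that case.
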
Duration is derived from the Onset Times, under the assumption that the signal is played legato. Amplitude is less important, and is taken as constant.
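
Assembling the final note list under the legato assumption might look like the sketch below: each note lasts until the next Onset Time, and the last one lasts until the end of the signal. The constant amplitude value of 64, the dictionary layout and the names `midi_number` and `build_notes` are assumptions for illustration. The conversion from frequency to MIDI note number uses the standard relation in which A4 (440 Hz) is note 69.

```python
import math


def midi_number(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def build_notes(onsets_s, pitches_per_interval, end_time_s, amplitude=64):
    """Build the note list: onset, pitch, amplitude and duration for every note.

    onsets_s: Onset Times in seconds, ascending.
    pitches_per_interval: one list of pitches (Hz) per onset.
    """
    boundaries = list(onsets_s) + [end_time_s]
    notes = []
    for i, onset in enumerate(onsets_s):
        duration = boundaries[i + 1] - onset  # legato: lasts until the next onset
        for pitch in pitches_per_interval[i]:
            notes.append(
                {
                    "onset": onset,
                    "pitch": midi_number(pitch),
                    "amplitude": amplitude,  # assumed constant value
                    "duration": duration,
                }
            )
    return notes


if __name__ == "__main__":
    for note in build_notes([0.0, 0.5], [[440.0], [440.0, 880.0]], end_time_s=1.5):
        print(note)
```

A list in this form maps directly onto MIDI note-on and note-off events, which is the output format described in the abstract.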

Results
The signal contains 1, 2, 1, 2, 1 notes.

Five Onset Times were found.

Tools
MATLAB

Conclusions
The algorithm finds the Onset Times and Pitch values of all the played notes with high accuracy, provided that the signal is played legato. The algorithm also handles polyphonic music, where several notes are played concurrently.

The different resolutions used in the MFT must be matched to the specific signal, according to its sampling rate.

Acknowledgment
We would like to thank our supervisor, Alex Kobzantsev, for his help. We also want to thank the Ollendorff Minerva Center Fund which supports the laboratories.