Introduction to Digital Signal Processing

Table of Contents
- How Do Microphones Work?
- Mechanical Wave Energy & Sound Waves
- Electrical Energy & Audio Signals
- Waveforms
- Amplitude
- Frequency
- How Sound Works
- Harmonics
- Additive Synthesis
- Signals
- How is sound stored in the computer?
- Types of Signals
- Analog-to-Digital Conversion
- Digital-to-Analog Conversion
- Other characteristics of signal
- Fundamental Frequency
- What Audio Formats are Used?
- Spectrograms
- Fourier Transform
- Discrete Fourier Transform
- Discrete Fourier Transform Errors & Short-time Fourier Transform
- Fast Fourier Transform
- Spectrogram Generation
- Spectrogram Alternatives
- Cepstrum
- Features
- MFCCs
- Down-sampling the log-spectrum
- Mel-scale
- MFCCs
- Sources

How Do Microphones Work?

Microphones function as transducers: they convert mechanical wave energy (sound) into electrical energy (an audio signal).

Mechanical Wave Energy & Sound Waves

Mechanical wave energy is the energy carried by the oscillation of matter in a medium (mechanical wave).

A sound wave is a type of mechanical wave caused by the disturbance of particles within an elastic medium, such as gas (air), liquid (water), or solid.
- Audible sound waves oscillate within a range of roughly 20 Hz to 20 kHz.
- Infrasound occurs below 20 Hz (inaudible to humans), while ultrasound occurs above 20 kHz.

Electrical Energy & Audio Signals

Electrical energy refers to electric potential energy.
- In modern times, we harvest electrical energy and convert it into other types of energy.
- An audio signal is an electrical signal that represents sound in the form of electrical energy.
  - Measured as AC voltages in millivolts (RMS) or in decibels relative to voltage (dBV or dBu).

Waveforms


A waveform is a graph that shows the displacement of air molecules over time as a sound wave travels.
- X-axis represents time.
- Y-axis measures displacement of air molecules:
  - NB: this displacement measures the sound wave's loudness.
    - E.g.: A lightly strummed guitar string only vibrates slightly, causing a small displacement. If you pull the string back by an inch and release, it vibrates more, causing a larger displacement and a louder sound.
- The waveform graph above shows one complete oscillation of a sound wave.
  - It starts by displacing the air molecules positively (+1) and then negatively (-1).

Amplitude

Waveforms are abstract representations of sound waves.
- While real sound waves may displace air molecules by nanometers, we use abstract measurements for waveforms.

Amplitude measures how much a molecule is displaced from its resting position.
- We measure from 0 (silence) to 1 (maximum displacement).
- A waveform's amplitude controls the maximum displacement.
- The higher the amplitude, the louder the sound; the lower the amplitude, the quieter the sound.

Frequency

In periodic waveforms, frequency is key:
- Frequency measures how many times a waveform repeats within a given time.
- The common unit for frequency is Hertz (Hz), representing the number of repetitions per second.
  - E.g.: a waveform that repeats twice within a 1-second interval oscillates at 2 Hz.
- Frequency is closely related to pitch:
  - E.g.: when a singer sings an "A4" note, their vocal folds vibrate at 440 Hz. When singing "C5" (3 semitones higher), they vibrate at about 523 Hz.
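
The relation between semitones and frequency in the A4/C5 example can be sketched as follows. This assumes twelve-tone equal temperament, where each semitone multiplies the frequency by $2^{1/12}$; the text does not name the tuning system, and `note_frequency` is a hypothetical helper introduced only for illustration.

```python
def note_frequency(semitones_from_a4: int, a4_hz: float = 440.0) -> float:
    """Frequency of the note lying `semitones_from_a4` semitones above A4.

    Assumes equal temperament: one semitone is a factor of 2 ** (1 / 12).
    Negative values give notes below A4.
    """
    return a4_hz * 2 ** (semitones_from_a4 / 12)


a4 = note_frequency(0)   # A4 itself
c5 = note_frequency(3)   # three semitones above A4
print(round(a4, 2), round(c5, 2))  # 440.0 523.25
```

Twelve semitones up doubles the frequency (one octave), which is why `note_frequency(12)` gives 880 Hz.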

Not all sounds are periodic, though:
- White noise contains all audible frequencies, distributed uniformly.

How Sound Works

The air around us contains molecules. When an object vibrates, it causes nearby molecules to vibrate. These molecules impact neighboring ones, propagating the wave outward from the source until its amplitude (volume) fades with distance. The vibration moves through air molecules like a chain reaction, eventually reaching your ear, where your brain interprets it as sound.
- NB: the air molecules do not move across space; they only vibrate.

On Earth, sound travels primarily through air, but it can also travel through water or solid ground (like the rumble of an earthquake). The farther the molecules move with each pulse, the higher the amplitude (volume) of the sound. The faster they vibrate, the higher the frequency (pitch).

Harmonics

The shape of a waveform describes how the displacement changes over time.
- The sine waveform is known as the fundamental waveform because it is pure and has no harmonics.

When a waveform has "side effect" frequencies, we call them harmonics.
- Harmonics are additional frequencies that certain waveforms produce.
  - E.g.: triangle waveforms only have odd harmonics.
  - E.g.: square waveforms have the same harmonics as triangle waveforms, but their harmonics do not diminish as much with increasing frequency:
    - NB: a perfect square wave cannot exist in nature because molecules cannot instantly "teleport" from +1 to -1. We can only approximate it.
  - E.g.: sawtooth waveforms contain frequencies at every multiple of the fundamental frequency.
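
A minimal sketch of how these harmonic recipes build the three waveforms, using their standard Fourier series. The function names and the number of terms are assumptions chosen for illustration; amplitudes are normalized so that the ideal waveforms peak at 1.

```python
import math


def square_approx(t, f0_hz, n_terms):
    """Square wave: odd harmonics only, each weighted by 1/k."""
    total = 0.0
    for i in range(n_terms):
        k = 2 * i + 1  # 1, 3, 5, ...
        total += math.sin(2 * math.pi * k * f0_hz * t) / k
    return 4 / math.pi * total


def triangle_approx(t, f0_hz, n_terms):
    """Triangle wave: odd harmonics only, weighted by 1/k^2 with alternating sign."""
    total = 0.0
    for i in range(n_terms):
        k = 2 * i + 1
        total += (-1) ** i * math.sin(2 * math.pi * k * f0_hz * t) / k ** 2
    return 8 / math.pi ** 2 * total


def sawtooth_approx(t, f0_hz, n_terms):
    """Sawtooth wave: every harmonic (odd and even), each weighted by 1/k."""
    total = 0.0
    for k in range(1, n_terms + 1):
        total += (-1) ** (k + 1) * math.sin(2 * math.pi * k * f0_hz * t) / k
    return 2 / math.pi * total


# Evaluate each approximation a quarter of the way through a 1 Hz cycle.
quarter = 0.25
square_value = square_approx(quarter, 1.0, 50)       # close to 1.0
triangle_value = triangle_approx(quarter, 1.0, 50)   # close to 1.0
sawtooth_value = sawtooth_approx(quarter, 1.0, 100)  # close to 0.5
```

The triangle series converges much faster than the square series because its harmonics fall off as 1/k² rather than 1/k, which is the "do not diminish as much" difference noted for square waves.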

Additive Synthesis

A surprising fact about waveforms: complex waveforms can be built by adding sine waves together.

Something counter-intuitive about waveform addition is that it does not always make the resulting sound louder. To demonstrate this more clearly, we have to learn about another waveform property: phase.
- Phase is the amount of offset applied to a wave, measured in degrees.
  - Adding a wave to a copy of itself shifted by 180° cancels both out, leaving silence.
  - This is exactly how noise-cancelling headphones work: they pick up the ambient noise and play back an inverted copy of it.
    - NB: this process is imperfect: real noise is not as simple or consistent as sine waves.
      - E.g.: it can be remarkably effective in areas with consistent low-frequency noise, like airplanes or subways.
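
A small sketch of phase cancellation under idealized conditions: two sine waves of the same frequency and amplitude, added once in phase and once 180° apart. The 5 Hz tone and 1000 Hz sampling grid are arbitrary illustrative choices.

```python
import math


def sine(t, freq_hz, phase_deg=0.0, amplitude=1.0):
    """Value of a sine wave at time t (seconds) with a phase offset in degrees."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t + math.radians(phase_deg))


times = [n / 1000 for n in range(1000)]  # one second, 1000 points

# Same phase: the waves reinforce each other and the peak doubles.
in_phase = [sine(t, 5) + sine(t, 5, phase_deg=0) for t in times]

# 180 degrees apart: each sample is added to its own negative.
cancelled = [sine(t, 5) + sine(t, 5, phase_deg=180) for t in times]

peak_in_phase = max(abs(v) for v in in_phase)
peak_cancelled = max(abs(v) for v in cancelled)
```

Here `peak_in_phase` is about 2 and `peak_cancelled` is zero up to floating-point rounding, so adding waveforms can make the result louder or quieter depending on phase.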

Signals

Sound is air pressure fluctuations that microphones convert into electrical signals.

How is sound stored in the computer?

In order to represent a sound wave in a way computers can manipulate and work with, the sound has to be converted into a digital form. This process is called analog to digital (A/D) conversion.

Types of Signals

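Signals fall into four categories, depending on whether time and amplitude are each continuous or discrete:
- Continuous-time, continuous-amplitude (analog).
- Continuous-time, discrete-amplitude.
- Discrete-time, continuous-amplitude (sampled).
- Discrete-time, discrete-amplitude (digital).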

Analog-to-Digital Conversion

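Process of converting an analog signal to a digital signal.

The A/D conversion process has two main steps:
- Sampling: the signal's amplitude is measured at regular time intervals.
- Quantization: each measured amplitude is rounded to the nearest of a finite set of levels.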
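
A minimal sketch of A/D conversion, assuming the usual two steps of sampling and uniform quantization. The helper names, the 8 kHz rate, and the 16-bit depth are illustrative assumptions, not values taken from the text above.

```python
import math


def sample(signal, sample_rate_hz, duration_s):
    """Measure a continuous signal (a function of time) at regular intervals."""
    n_samples = round(sample_rate_hz * duration_s)
    return [signal(i / sample_rate_hz) for i in range(n_samples)]


def quantize(samples, bit_depth):
    """Round amplitudes in [-1, 1] to signed integer codes of the given bit depth."""
    max_code = 2 ** (bit_depth - 1) - 1  # 32767 for 16 bits
    codes = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # anything outside the range is clipped
        codes.append(round(clipped * max_code))
    return codes


def tone(t):
    """A 440 Hz sine wave standing in for the analog input."""
    return math.sin(2 * math.pi * 440 * t)


samples = sample(tone, sample_rate_hz=8000, duration_s=0.01)  # 80 samples
codes = quantize(samples, bit_depth=16)
```

Each entry of `codes` is what would be stored for one channel; a stereo recording would store one such code per channel at every sampling instant.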

Position of each audio source within the audio signal is called a channel.
- Each channel contains a sample indicating the amplitude of the audio being produced by that source at a given moment in time.
  - E.g.: in mono sound, there is only one audio source while in stereo sound, there are two audio sources: one speaker on the left, and one on the right. Each of these is represented by one channel, and the number of channels contained in the audio signal is called the channel count.

Signal Sampling: Nyquist-Shannon Theorem

Theory

The higher the sampling rate (sampling frequency), the more accurately the signal is stored.
- E.g.: a much higher sampling rate is needed for a signal rich in high-frequency components, such as music, than for a slowly varying signal, such as the output of a gas-chromatograph detector.

problem: a high sampling rate produces a large volume of data to be stored.
- Q: What is the minimum sampling rate for a given type of signal that will not distort the underlying information and will allow its accurate reconstruction?
  - A (Nyquist-Shannon theorem): the sampling rate $f_s$ must be more than twice the highest frequency component $f_{max}$ present in the signal, i.e. $f_s > 2 f_{max}$.
  - E.g.: for digitization of sound, a sampling rate of about 8 kHz is sufficient for telephony, since the normal human voice does not contain an appreciable amount of frequency components higher than 3–4 kHz.
  - E.g.: for digitization of music, a sampling rate of about 40 kHz is needed, since frequency components of about 15–20 kHz are common and needed for fidelity of sound reconstruction.
  - For digital recording of music on CD, a sampling rate of 44.1 kHz is commonly used.

Prior to sampling, the signal must pass through a low-pass filter that removes all components higher than $f_s / 2$, thus preventing the "contamination" of the stored signal by their aliased frequencies.
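
The effect of sampling too slowly can be sketched numerically. The example below uses assumed illustrative values (a 7 Hz sine sampled at 10 Hz) and shows that its samples are indistinguishable, up to sign, from those of a 3 Hz sine; `alias_frequency` is a hypothetical helper.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Samples of sin(2*pi*f*t) taken at t = n / sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency, within [0, sample_rate / 2], after sampling."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


sample_rate = 10                       # Hz, so the limit is 5 Hz
too_high = sample_sine(7, sample_rate, 20)
low = sample_sine(3, sample_rate, 20)

apparent = alias_frequency(7, sample_rate)  # 3

# The 7 Hz samples equal the 3 Hz samples with inverted sign.
max_difference = max(abs(a + b) for a, b in zip(too_high, low))
```

Once sampled, nothing in the data distinguishes the two tones, which is why the filtering has to happen before the sampler rather than after it.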

Digital-to-Analog Conversion

Process of converting a digital signal into an analog signal.

This is needed whenever we want to play back sound.

The continuous signal is reconstructed by interpolating between the stored samples.

Other characteristics of signal

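Formally, a discrete-time signal is represented as a sequence of samples $x[n]$, $n = 0, 1, \dots, N-1$.

Power of a signal is the average of its squared samples: $P = \frac{1}{N} \sum_{n=0}^{N-1} x[n]^2$.

Energy of a signal corresponds to the total magnitude of the signal: $E = \sum_{n=0}^{N-1} x[n]^2$.
- Roughly corresponds to how loud the signal is.
- In practice, estimated over a short window of samples.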
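
A short sketch of windowed energy estimation, assuming the common definitions of energy as the sum of squared samples and average power as energy divided by the number of samples. The window and hop sizes are arbitrary illustrative values.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def average_power(x):
    """Energy divided by the number of samples."""
    return energy(x) / len(x)


def windowed_power(x, window_size, hop_size):
    """Average power of successive windows, one value per window."""
    return [average_power(x[start:start + window_size])
            for start in range(0, len(x) - window_size + 1, hop_size)]


# A quiet tone (amplitude 0.1) followed by a loud one (amplitude 1.0).
quiet = [0.1 * math.sin(2 * math.pi * n / 50) for n in range(500)]
loud = [1.0 * math.sin(2 * math.pi * n / 50) for n in range(500)]
powers = windowed_power(quiet + loud, window_size=100, hop_size=100)
```

The first five values of `powers` are about 0.005 and the last five about 0.5, so the windowed estimate tracks how loudness changes over time, which a single whole-signal number cannot do.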

Fundamental Frequency

Fundamental frequency, $F_0$, of a speech signal refers to the approximate frequency of the (quasi-)periodic structure of voiced speech signals.
- Oscillation originates from the vocal folds, which oscillate in the airflow when appropriately tensed.
- Oscillation originates from an organic structure.
  - Jitter — amount of variation in period length.
  - Shimmer — amount of variation in amplitude.

$F_0$ is typically not stationary and changes constantly within a sentence.
- Lies roughly in the range from 80 Hz to 450 Hz, where males have lower voices than females and children.
- $F_0$ of an individual speaker depends primarily on the length of the vocal folds.

$F_0$ is closely related to pitch, which is defined as our perception of fundamental frequency.
- $F_0$ describes the actual physical phenomenon.
- Pitch describes how our ears and brains interpret the signal in terms of periodicity.
- E.g.: a voice signal could have an $F_0$ of 100 Hz. If we then apply a high-pass filter to remove all signal components below 450 Hz, that would remove the actual fundamental frequency. The lowest remaining periodic component would be 500 Hz, which corresponds to the fifth harmonic of the original $F_0$. However, a human listener would typically still perceive a pitch of 100 Hz, even though that component no longer exists. The brain reconstructs the fundamental from the upper harmonics. This well-known phenomenon is still not completely understood.

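If $F_0$ is the fundamental frequency, then the length of a single period in seconds is $T = 1 / F_0$.
- Example: for $F_0 = 100$ Hz, $T = 1/100$ s $= 10$ ms.
  - NB: the magnitude spectrum of such a signal then has a periodic comb structure, with peaks at integer multiples of $F_0$.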
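
The period length and the comb structure can be sketched with simple arithmetic. The values below (a 100 Hz fundamental and a 450 Hz high-pass cutoff) are assumptions chosen for illustration, and the helpers are hypothetical; the "filter" here just drops list entries rather than processing audio.

```python
def period_seconds(f0_hz):
    """Length of one period of a signal with fundamental frequency f0_hz."""
    return 1.0 / f0_hz


def harmonics(f0_hz, max_hz):
    """Comb of frequencies at integer multiples of f0_hz, up to max_hz."""
    return [k * f0_hz for k in range(1, max_hz // f0_hz + 1)]


def high_pass(components_hz, cutoff_hz):
    """Idealized high-pass filter: keep only components at or above the cutoff."""
    return [f for f in components_hz if f >= cutoff_hz]


def comb_spacing(components_hz):
    """Smallest gap between neighbouring components."""
    return min(b - a for a, b in zip(components_hz, components_hz[1:]))


f0 = 100
period = period_seconds(f0)          # 0.01 s
comb = harmonics(f0, 1000)           # 100, 200, ..., 1000
remaining = high_pass(comb, 450)     # 500, 600, ..., 1000
spacing = comb_spacing(remaining)    # still 100
```

After filtering, the lowest component is 500 Hz, but the teeth of the comb are still 100 Hz apart. That spacing is one plausible cue for why the perceived pitch can stay at the original fundamental.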

What Audio Formats are Used?

https://en.wikipedia.org/wiki/Audio_file_format

Non-compressed formats: WAV, AIFF, etc.

Lossless compression: FLAC, ALAC, etc.

Lossy compression: MP3, Opus, etc.

Spectrograms


Spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.
- Common format is a graph with:
  - Two geometric dimensions: one axis represents time, and the other axis represents frequency.
  - Third dimension indicates the amplitude of a particular frequency at a particular time.
    - Represented by the intensity or color of each point in the image.

It is bad to work with sound in raw format because:

Fourier Transform

Spectrograms may be created from a time-domain signal using the Fourier transform (FT).

Discrete Fourier Transform

Creating a spectrogram is a digital process.
- Let $x(t)$ be the continuous signal which is the source of the data.
- Let the $N$ samples be denoted $x[0], x[1], \dots, x[N-1]$.
Example #1

Let the continuous signal be
- E.g.: , , , .

Example #2

Consider the following signal with the frequency of :

Discrete Fourier Transform Errors & Short-time Fourier Transform

The DFT is only an approximation of the continuous Fourier transform, since it provides values only for a finite set of frequencies.

Aliasing

If the initial samples are not spaced closely enough to represent the high-frequency components present in the underlying function, then the DFT values will be corrupted by aliasing.

Q: How to cope with aliasing?
- A: either sample faster, or pass the signal through a low-pass (anti-aliasing) filter before sampling.

Leakage

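FT requires integration over the interval $(-\infty, +\infty)$ or over an integer number of cycles of the waveform.
- E.g.: consider the case of an input signal which is a sinusoid with a fractional number of cycles in the $N$ data samples: the abrupt truncation spreads ("leaks") its energy into neighbouring frequency bins.

Most sequences of real data are much more complicated than the sinusoidal sequences that we considered, so leakage is reduced by multiplying the samples by a window function $w[n]$, $n = 0, \dots, N-1$.
- NB:
  - $w[n] = 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N-1}\right)$ is the Hamming window.
  - $w[n] = 0.5 - 0.5 \cos\left(\frac{2 \pi n}{N-1}\right)$ is the Hanning window.
- These window functions taper the samples towards zero values at both endpoints, which reduces the discontinuity at the edges of the frame and hence the leakage.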
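
A sketch of leakage and its reduction by windowing, assuming the common symmetric forms of the two windows. The test signal (10.5 cycles in 64 samples) and the choice of bin 25 as a "far away" bin are illustrative assumptions.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def hanning(n_samples):
    """Hanning window: 0.5 - 0.5 * cos(2*pi*n / (N - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def magnitude_spectrum(x):
    """Magnitudes of the first N/2 DFT bins, computed directly."""
    n_samples = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                    for n in range(n_samples)))
            for k in range(n_samples // 2)]


def leakage_ratio(x, far_bin):
    """Magnitude at a bin far from the tone, relative to the spectral peak."""
    mags = magnitude_spectrum(x)
    return mags[far_bin] / max(mags)


N = 64
tone = [math.sin(2 * math.pi * 10.5 * n / N) for n in range(N)]  # fractional cycles
windowed = [s * w for s, w in zip(tone, hanning(N))]

rect_ratio = leakage_ratio(tone, far_bin=25)
hann_ratio = leakage_ratio(windowed, far_bin=25)
```

`hann_ratio` comes out far smaller than `rect_ratio`: the window trades a slightly wider main peak for much less energy leaking into distant bins. Note that the Hamming window ends at 0.08 rather than exactly zero.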

Fast Fourier Transform

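Time taken to evaluate a DFT on a digital computer depends on the number of multiplications involved; direct evaluation of an $N$-point DFT needs on the order of $N^2$ of them.

Highly efficient computer algorithms for computing the DFT have been developed since the mid-1960s. These are known as fast Fourier transform (FFT) algorithms and need only on the order of $N \log_2 N$ multiplications.
- It is easy to see that, in the direct evaluation, the same values are calculated many times as the computation proceeds; the FFT avoids this repetition.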
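
One way to avoid the repeated work is the recursive radix-2 decimation-in-time scheme sketched below. It is only one of several FFT variants, is written for clarity rather than speed, and requires the length to be a power of two; the direct `dft` is included purely for comparison.

```python
import cmath


def dft(x):
    """Direct DFT, on the order of N^2 multiplications."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n_samples = len(x)
    if n_samples == 1:
        return [complex(x[0])]
    if n_samples % 2 != 0:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    result = [0j] * n_samples
    for k in range(n_samples // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n_samples) * odd[k]
        # Each half-size result is reused for two output bins.
        result[k] = even[k] + twiddle
        result[k + n_samples // 2] = even[k] - twiddle
    return result


data = [1, 2, 3, 4, 0, -1, -2, -3]
fast = fft(data)
slow = dft(data)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

Both functions return the same spectrum up to rounding; the saving comes from computing each half-size transform once and combining it into two outputs.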

Spectrogram Generation

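Algorithm: split the signal into short, usually overlapping, frames; multiply each frame by a window function; compute the FFT of each frame; and draw the resulting magnitude spectra side by side as the columns of an image.
- The higher you go in a column, the higher the frequencies represented, and vice versa.
- The brightness of the color indicates how strong a particular frequency is at that moment.
  - E.g.: a bright (yellow) region at the bottom represents a bass sound, while one higher up represents a high-pitched sound.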
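
A minimal short-time Fourier transform sketch that produces the columns of such an image as lists of magnitudes. It assumes a Hanning window and overlapping frames, and uses a direct DFT for brevity where a real implementation would use an FFT; frame size, hop size, and the test tone are illustrative choices.

```python
import cmath
import math


def hanning(n_samples):
    """Hanning window used to taper each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def frame_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins (0 .. N/2) of one frame."""
    n_samples = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                    for n in range(n_samples)))
            for k in range(n_samples // 2 + 1)]


def spectrogram(signal, frame_size, hop_size):
    """One column of magnitudes per frame; column index is time, row index is frequency."""
    window = hanning(frame_size)
    columns = []
    for start in range(0, len(signal) - frame_size + 1, hop_size):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        columns.append(frame_magnitudes(frame))
    return columns


sample_rate = 64                                   # Hz
tone = [math.sin(2 * math.pi * 8 * n / sample_rate) for n in range(256)]
columns = spectrogram(tone, frame_size=32, hop_size=16)

bin_width_hz = sample_rate / 32                    # 2 Hz per row
peak_rows = [col.index(max(col)) for col in columns]
```

For this steady 8 Hz tone every column peaks in row 4 (4 × 2 Hz), which would show up as a single horizontal line in the image.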

Spectrogram Alternatives

A spectrogram effectively visualizes many pertinent features of speech signals.

The logarithmic spectrum is a much more accessible representation:
- It is not only more visual; more importantly, the logarithm roughly approximates the sensitivity of the ear, so logarithmic spectra can be used to assess the auditory importance of spectral features.
  - The logarithmic spectrum visualizes spectral content such that the magnitude of values is approximately uniform throughout the spectrum.
- The only exception is zeros and other very small values in the magnitude spectrum, whose logarithm is a very large negative number (or undefined for an exact zero).
  - Q: How to deal with zeros?
    - A: a common remedy is to add a small positive constant to the magnitudes, or to clamp them to a small floor value, before taking the logarithm.
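
One common way to handle zeros, sketched here as an assumption rather than as the approach the notes prescribe, is to clamp magnitudes to a small floor before taking the logarithm. The floor value and the function name are illustrative.

```python
import math


def log_spectrum_db(magnitudes, floor=1e-10):
    """Magnitude spectrum in decibels, with values below `floor` clamped to it."""
    return [20 * math.log10(max(m, floor)) for m in magnitudes]


spectrum_db = log_spectrum_db([1.0, 0.0, 10.0])
```

The zero maps to a large but finite negative value (about -200 dB for this floor) instead of raising an error, while the other magnitudes are unaffected.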

Cepstrum

Log-spectrum reveals a rich structural composition of the analyzed signal.

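Cepstrum is a tool for investigating periodic structures in frequency spectra. It is obtained by taking the inverse Fourier transform of the logarithm of the magnitude spectrum.

It is worth repeating that the cepstrum involves two time-frequency transforms: one to obtain the spectrum, and a second one applied to its logarithm.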
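
A sketch of the real cepstrum, assuming the common definition as the inverse DFT of the log-magnitude spectrum. The test signal (an impulse followed by a half-strength echo 8 samples later) is an illustrative choice that gives the spectrum a periodic ripple.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct DFT, or inverse DFT (with 1/N scaling) when inverse=True."""
    n_samples = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / n_samples)
               for n in range(n_samples))
           for k in range(n_samples)]
    return [v / n_samples for v in out] if inverse else out


def real_cepstrum(x, floor=1e-12):
    """Inverse DFT of the log-magnitude spectrum (two transforms in total)."""
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(x)]
    return [v.real for v in dft(log_magnitude, inverse=True)]


N = 64
delay = 8
signal = [0.0] * N
signal[0] = 1.0        # direct sound
signal[delay] = 0.5    # echo, 8 samples later

cepstrum = real_cepstrum(signal)
peak_index = max(range(1, N // 2 + 1), key=lambda n: cepstrum[n])
```

The echo makes the log-spectrum ripple periodically, and the second transform turns that ripple into a peak at index 8, the echo delay. The same mechanism exposes the pitch period of voiced speech.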

Features

Cepstrum has two features:

MFCCs

Mel-Frequency Cepstral Coefficients

Cepstrum is good for extracting envelope and $F_0$-information.

Down-sampling the log-spectrum

Envelope information is about the slowly-varying shape of the log-spectrum.

problem: the power spectrum can sometimes have arbitrarily small values, which become very large negative values in the log-spectrum.

solution: apply smoothing in the power spectrum before taking the logarithm.
- Could use an FIR filter or, more generally, a triangular weighting shape.
- Achieves this with a low number of coefficients.
- Downside: the resulting information does not reflect the importance of features for humans.

Mel-scale

To improve the representation, we can include more information about auditory perception into the model.
- By introducing information about human perception, we focus the model on that part of the information which human listeners would find important.

Log-spectrum already takes into account perceptual sensitivity on the magnitude axis, by expressing magnitudes on the logarithmic-axis. The other dimension is then the frequency axis.

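Above about 500 Hz, increasingly large intervals are judged by listeners to produce equal pitch increments.
- Popular formula to convert $f$ hertz into $m$ mels is $m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$.
  - Corresponding inverse expression is $f = 700 \left(10^{m / 2595} - 1\right)$.

By taking points spaced uniformly on the mel scale and mapping them back to hertz with the above formula, we can find frequencies whose perceptual distance is equal.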
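
The conversion can be sketched with the widely used formula $m = 2595 \log_{10}(1 + f/700)$ and its inverse. The frequency range and number of points are illustrative, and `mel_spaced_frequencies` is a hypothetical helper.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in hertz to mels."""
    return 2595 * math.log10(1 + f_hz / 700)


def mel_to_hz(mel):
    """Inverse conversion, from mels back to hertz."""
    return 700 * (10 ** (mel / 2595) - 1)


def mel_spaced_frequencies(low_hz, high_hz, n_points):
    """Frequencies in hertz that are equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_points - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_points)]


points_hz = mel_spaced_frequencies(0, 8000, 10)
gaps_hz = [b - a for a, b in zip(points_hz, points_hz[1:])]
```

The values in `gaps_hz` grow steadily: points that are equally far apart perceptually sit closer together at low frequencies and further apart at high frequencies. With this formula 1000 Hz corresponds to roughly 1000 mels.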

Mel-envelope clearly models lower frequencies accurately, which is also where the all-important formants reside. That is, accuracy is concentrated on the important part, which is good. Higher frequencies are modelled more coarsely, but there is usually not much energy there anyway, so that is acceptable.

MFCCs

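A remaining issue with the log-mel-spectrum, however, is that neighbouring samples are highly correlated.

problem: we cannot reduce the resolution further, because we would then start losing accuracy of the formants.
- Say we have a time-signal which has correlation over time. By taking the DCT, we obtain the spectrum of the signal, where samples are reasonably uncorrelated. The same idea can be applied to the log-mel-spectrum.
  - Algorithm: compute the power spectrum of a windowed frame, apply the mel-spaced triangular filters, take the logarithm, and take the DCT of the result.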
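
A compact sketch of this pipeline for a single frame: power spectrum, triangular mel filters, logarithm, then DCT-II. The filter count, number of coefficients, and log floor are assumed typical values, windowing is left out to keep the block short, and real implementations differ in normalization details.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595 * math.log10(1 + f_hz / 700)


def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)


def power_spectrum(frame):
    """Squared magnitudes of bins 0 .. N/2, via a direct DFT."""
    n = len(frame)
    return [abs(sum(frame[i] * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for i in range(1, n_filters + 1):
        low, centre, high = edges_hz[i - 1], edges_hz[i], edges_hz[i + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            f = b * bin_hz
            if low < f <= centre:
                row.append((f - low) / (centre - low))    # rising slope
            elif centre < f < high:
                row.append((high - f) / (high - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(x):
    """Unnormalized DCT-II, used here to decorrelate the log-mel energies."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13, floor=1e-10):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]


frame = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(256)]
coefficients = mfcc(frame, sample_rate=8000)
```

Keeping only the first few DCT coefficients retains the smooth envelope and discards the fine detail, which is where the compactness of the representation comes from.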

MFCC is an abstract domain, which contains information about the spectral envelope of the speech signal.
- Not easy to interpret visually.
- Since it is designed to resemble perception on both the magnitude and frequency axes, and to be roughly uncorrelated, it is efficient for computation.

Beneficial properties of the MFCCs include:

Some of the issues with the MFCC include:

Sources

https://github.com/markovka17/dla

https://www.khanacademy.org/test-prep/mcat/physical-processes/sound/a/sound-is-a-longitudinal-wave

https://is.muni.cz/el/1433/jaro2012/PA190/um/Slides_02.pdf

https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Audio_concepts

https://www.robots.ox.ac.uk/~sjrob/Teaching/SP/l7.pdf

https://www.cs.brandeis.edu/~cs136a/CS136a_docs/KishorePrahallad_CMU_mfcc.pdf

http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
