Speech Analysis

Frame-based analysis, spectrograms, pitch trackers, voice activity detection, level measurement, and psychoacoustic metrics.

Framing and time-frequency

v_enframe

V_ENFRAME - Split signal into (overlapping) frames: one per row.

v_enframe(
    x, win=None, hop=None, m="", fs=1
) -> tuple[ndarray, ndarray, ndarray]

Split signal up into (overlapping) frames: one per row.

Parameters:

Name Type Description Default
x array_like

Input signal (1-D).

required
win int or array_like

Window or window length in samples. Default is len(x).

None
hop int or float

Frame increment in samples. If < 1, fraction of window length. Default is window length (non-overlapping).

None
m str

Mode string:
  'z' - zero pad final frame
  'r' - reflect last few samples for final frame
  'A' - t output as centre of mass
  'E' - t output as centre of energy
  'f' - 1-sided DFT on each frame
  'F' - 2-sided DFT on each frame
  'p' - 1-sided power spectrum
  'P' - 2-sided power spectrum
  'a' - scale window to give unity gain with overlap-add
  's' - scale so power is preserved
  'S' - scale so total energy is preserved
  'd' - make 's'/'S' give power/energy per Hz

''
fs float

Sample frequency (only needed for 'd' option). Default is 1.

1

Returns:

Name Type Description
f ndarray

Enframed data, one frame per row.

t ndarray

Fractional time in samples at the centre of each frame. First sample is index 1 (MATLAB convention).

w ndarray

Window function used, including scaling.

Source code in pyvoicebox/v_enframe.py
def v_enframe(x, win=None, hop=None, m='', fs=1) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Split signal up into (overlapping) frames: one per row.

    Parameters
    ----------
    x : array_like
        Input signal (1-D).
    win : int or array_like, optional
        Window or window length in samples. Default is len(x).
    hop : int or float, optional
        Frame increment in samples. If < 1, fraction of window length.
        Default is window length (non-overlapping).
    m : str, optional
        Mode string:
          'z' - zero pad final frame
          'r' - reflect last few samples for final frame
          'A' - t output as centre of mass
          'E' - t output as centre of energy
          'f' - 1-sided DFT on each frame
          'F' - 2-sided DFT on each frame
          'p' - 1-sided power spectrum
          'P' - 2-sided power spectrum
          'a' - scale window to give unity gain with overlap-add
          's' - scale so power is preserved
          'S' - scale so total energy is preserved
          'd' - make 's'/'S' give power/energy per Hz
    fs : float, optional
        Sample frequency (only needed for 'd' option). Default is 1.

    Returns
    -------
    f : ndarray
        Enframed data, one frame per row.
    t : ndarray
        Fractional time in samples at the centre of each frame.
        First sample is index 1 (MATLAB convention).
    w : ndarray
        Window function used, including scaling.
    """
    x = np.asarray(x, dtype=float).ravel()
    nx = len(x)

    if win is None:
        win = nx
    win = np.asarray(win, dtype=float).ravel()

    nwin = len(win)
    if nwin == 1:
        lw = int(win[0])
        w = np.ones(lw)
    else:
        lw = nwin
        w = win.copy()

    if hop is None:
        hop = lw
    elif hop < 1:
        hop = int(round(lw * hop))
    else:
        hop = int(hop)

    # Window scaling
    wsc = 1.0
    if 'a' in m:
        wsc = np.sqrt(hop / np.dot(w, w))
    elif 'd' in m:
        if 's' in m:
            wsc = np.sqrt(1.0 / (np.dot(w, w) * fs))
        elif 'S' in m:
            wsc = np.sqrt(hop / np.dot(w, w)) / fs
    else:
        if 's' in m:
            wsc = np.sqrt(1.0 / (np.dot(w, w) * lw))
        elif 'S' in m:
            wsc = np.sqrt(hop / (np.dot(w, w) * lw))

    nli = nx - lw + hop
    nf = max(int(np.fix(nli / hop)), 0)  # number of full frames
    na = nli - hop * nf + (nf == 0) * (lw - hop)  # samples left over
    fx = ('z' in m or 'r' in m) and na > 0  # need extra row

    f = np.zeros((nf + int(fx), lw))
    indf = hop * np.arange(nf)  # (nf,)
    inds = np.arange(lw)        # (lw,)

    if fx:
        # Build index matrix for full frames
        idx = indf[:, np.newaxis] + inds[np.newaxis, :]  # (nf, lw)
        if nf > 0:
            f[:nf, :] = x[idx]
        if 'r' in m:
            ix = 1 + np.mod(nf * hop + np.arange(lw), 2 * nx)
            # MATLAB: x(ix+(ix>nx).*(2*nx+1-2*ix)), 1-based
            ix0 = ix.copy()
            mask = ix0 > nx
            ix0[mask] = 2 * nx + 1 - ix0[mask]
            f[nf, :] = x[ix0.astype(int) - 1]  # convert to 0-based
        else:
            rem_samples = nx - nf * hop
            f[nf, :rem_samples] = x[nf * hop:nx]
        nf = f.shape[0]
    else:
        if nf > 0:
            idx = indf[:, np.newaxis] + inds[np.newaxis, :]
            f[:] = x[idx]

    w = wsc * w
    if nwin > 1:
        f = f * w[np.newaxis, :]
    else:
        f = wsc * f

    # Power spectrum or DFT
    ml = m.lower()
    if 'p' in ml:
        f = np.fft.fft(f, axis=1)
        f = np.real(f * np.conj(f))
        if 'p' in m:  # 1-sided
            imx = int(np.fix((lw + 1) / 2))
            f[:, 1:imx] = f[:, 1:imx] + f[:, lw - 1:lw - imx:-1]
            f = f[:, :int(np.fix(lw / 2)) + 1]
    elif 'f' in ml:
        f = np.fft.fft(f, axis=1)
        if 'f' in m:  # 1-sided
            f = f[:, :int(np.fix(lw / 2)) + 1]

    # Time output
    if 'E' in m:
        t0 = np.sum(np.arange(1, lw + 1) * w ** 2) / np.sum(w ** 2)
    elif 'A' in m:
        t0 = np.sum(np.arange(1, lw + 1) * w) / np.sum(w)
    else:
        t0 = (1 + lw) / 2.0
    t = t0 + hop * np.arange(nf)

    return f, t, w
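As a minimal standalone sketch of the framing step (plain numpy with illustrative names; v_enframe adds the padding, scaling and DFT modes on top of this):

```python
import numpy as np

def enframe_sketch(x, win, hop):
    """Split x into overlapping windowed frames, one per row (sketch only)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(win, dtype=float)
    lw = len(w)
    nf = max((len(x) - lw) // hop + 1, 0)   # number of full frames
    # index matrix: row i holds the sample indices of frame i
    idx = hop * np.arange(nf)[:, None] + np.arange(lw)[None, :]
    return x[idx] * w                        # window applied to each row

f = enframe_sketch(np.arange(10.0), np.ones(4), 2)  # 50% overlap
# rows start at samples 0, 2, 4, 6 -> shape (4, 4)
```

Vectorised fancy indexing replaces an explicit per-frame loop; the real function builds the same index matrix before handling the final partial frame.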

v_overlapadd

V_OVERLAPADD - Join overlapping frames together.

v_overlapadd(f, win=None, inc=None) -> ndarray

Join overlapping frames together.

Parameters:

Name Type Description Default
f (ndarray, shape(nr, nw))

Frames to be added together, one frame per row.

required
win array_like or dict

Window function to multiply each frame, or saved state dict. If omitted, a rectangular window is used.

None
inc int

Time increment (in samples) between successive frames. Default is nw.

None

Returns:

Name Type Description
x ndarray

Output signal of length nw + (nr-1)*inc.

Source code in pyvoicebox/v_overlapadd.py
def v_overlapadd(f, win=None, inc=None) -> np.ndarray:
    """Join overlapping frames together.

    Parameters
    ----------
    f : ndarray, shape (nr, nw)
        Frames to be added together, one frame per row.
    win : array_like or dict, optional
        Window function to multiply each frame, or saved state dict.
        If omitted, a rectangular window is used.
    inc : int, optional
        Time increment (in samples) between successive frames.
        Default is nw.

    Returns
    -------
    x : ndarray
        Output signal of length nw + (nr-1)*inc.
    """
    f = np.asarray(f, dtype=float)
    if f.ndim == 1:
        f = f.reshape(1, -1)
    nr, nf = f.shape

    if win is None:
        if inc is None:
            inc = nf
        w = None
    elif isinstance(win, dict):
        w = win['w']
        inc = win['inc']
        xx = win['xx']
    else:
        win = np.asarray(win, dtype=float).ravel()
        if inc is None:
            if len(win) == 1 and win[0] == int(win[0]):
                inc = int(win[0])
                w = None
            else:
                inc = nf
                w = win
                if len(w) != nf:
                    raise ValueError('window length does not match frame size')
                if np.all(w == 1):
                    w = None
        else:
            w = win
            if len(w) != nf:
                raise ValueError('window length does not match frame size')
            if np.all(w == 1):
                w = None

    if not isinstance(win, dict):
        xx = None

    nb = int(np.ceil(nf / inc))  # number of overlap buffers
    no = nf + (nr - 1) * inc     # total output length

    z = np.zeros((no, nb))

    for i in range(nr):
        buf_idx = i % nb
        start = i * inc
        frame = f[i, :] * w if w is not None else f[i, :]
        z[start:start + nf, buf_idx] += frame

    x = np.sum(z, axis=1)

    if xx is not None and len(xx) > 0:
        x[:len(xx)] += xx

    return x
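The inverse of framing can be sketched the same way (illustrative names; without v_overlapadd's window handling and chunk state): each frame is added into the output at multiples of the increment.

```python
import numpy as np

def overlapadd_sketch(f, inc):
    """Sum frames (one per row) into an output signal at offsets of inc samples."""
    nr, nw = f.shape
    x = np.zeros(nw + (nr - 1) * inc)
    for i in range(nr):
        x[i * inc:i * inc + nw] += f[i]   # accumulate each frame in place
    return x

f = np.ones((3, 4))
x_cat = overlapadd_sketch(f, 4)   # non-overlapping: plain concatenation
x_half = overlapadd_sketch(f, 2)  # 50% overlap: interior samples sum to 2
```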

v_fram2wav

V_FRAM2WAV - Convert frame values to a continuous waveform.

v_fram2wav(x, tt, mode='l') -> tuple[ndarray, ndarray]

Convert frame values to a continuous waveform.

Parameters:

Name Type Description Default
x (ndarray, shape(nf) or (nf, p))

Input signal: one row per frame.

required
tt (ndarray, shape(nf, 2) or (nf, 3))

Frame specifications. Each row: [start_sample, end_sample, flag]. flag=1 for start of new spurt. If tt has only 2 columns, spurts are auto-detected from gaps.

required
mode str

'z' for zero-order hold, 'l' for linear interpolation (default).

'l'

Returns:

Name Type Description
w (ndarray, shape(n, p))

Interpolated waveform of length n = tt[-1, 1].

s (ndarray, shape(ns, 2))

Start and end sample numbers of each spurt.

Source code in pyvoicebox/v_fram2wav.py
def v_fram2wav(x, tt, mode='l') -> tuple[np.ndarray, np.ndarray]:
    """Convert frame values to a continuous waveform.

    Parameters
    ----------
    x : ndarray, shape (nf,) or (nf, p)
        Input signal: one row per frame.
    tt : ndarray, shape (nf, 2) or (nf, 3)
        Frame specifications. Each row: [start_sample, end_sample, flag].
        flag=1 for start of new spurt. If tt has only 2 columns, spurts
        are auto-detected from gaps.
    mode : str, optional
        'z' for zero-order hold, 'l' for linear interpolation (default).

    Returns
    -------
    w : ndarray, shape (n, p)
        Interpolated waveform of length n = tt[-1, 1].
    s : ndarray, shape (ns, 2)
        Start and end sample numbers of each spurt.
    """
    x = np.asarray(x, dtype=float)
    tt = np.asarray(tt, dtype=float)

    if x.ndim == 1:
        x = x[:, np.newaxis]

    nf, p = x.shape
    n = int(np.round(tt[-1, 1]))
    w = np.full((n, p), np.nan)
    nt = tt.shape[1]

    # MATLAB 1-based indexing: ceil(tt(:,1)) and floor(tt(:,2))
    ix1 = np.ceil(tt[:, 0]).astype(int)   # start of frame sample (1-based)
    ix2 = np.floor(tt[:, 1]).astype(int)  # end of frame sample (1-based)

    # Determine spurt boundaries
    if nt > 2:
        ty = (tt[:, 2] > 0).astype(float)
    else:
        ty = np.zeros(nf)
        ty[1:] = (ix1[1:] > ix2[:-1] + 1).astype(float)

    ty[0] = 1  # first frame always starts a spurt
    # NaN always ends previous spurt
    nan_mask = np.any(np.isnan(x), axis=1)
    ty[nan_mask] = 1
    # NaN always forces a new spurt for the frame after
    nan_idx = np.where(nan_mask[:nf - 1])[0]
    ty[nan_idx + 1] = 1

    ty = ty.astype(float)
    # Encode: bit 0 = starts a spurt, bit 1 = ends a spurt
    # MATLAB: ty(1:end-1) = ty(1:end-1) + 2*ty(2:end)
    ty[:nf - 1] = ty[:nf - 1] + 2 * ty[1:]
    ty[nf - 1] = ty[nf - 1] + 2  # last frame always ends a spurt

    nx = ix2 - ix1 + 1

    if 'z' in mode:
        # Zero-order hold
        for i in range(nf):
            if nx[i] > 0:
                # Convert to 0-based
                w[ix1[i] - 1:ix2[i], :] = np.tile(x[i, :], (nx[i], 1))
    else:
        # Linear interpolation (default)
        ttm = (tt[:, 0] + tt[:, 1]) / 2.0  # mid point of frame (1-based)
        ixm = np.floor(ttm).astype(int)     # end of first half (1-based)

        for i in range(nf):
            if nx[i] > 0:
                tyi = int(ty[i])
                if tyi == 3:  # isolated frame: zero-order hold
                    w[ix1[i] - 1:ix2[i], :] = np.tile(x[i, :], (nx[i], 1))
                else:
                    nxm = ixm[i] - ix1[i] + 1
                    if nxm > 0:
                        if tyi == 1:  # start of spurt
                            grad = (x[i + 1, :] - x[i, :]) / (ttm[i + 1] - ttm[i])
                        else:
                            grad = (x[i, :] - x[i - 1, :]) / (ttm[i] - ttm[i - 1])
                        samples = np.arange(ix1[i], ixm[i] + 1)  # 1-based
                        w[ix1[i] - 1:ixm[i], :] = (
                            np.tile(x[i, :], (nxm, 1)) +
                            (samples[:, np.newaxis] - ttm[i]) * grad[np.newaxis, :]
                        )
                    if nx[i] > nxm:
                        if tyi == 2:  # end of spurt
                            grad = (x[i, :] - x[i - 1, :]) / (ttm[i] - ttm[i - 1])
                        else:
                            grad = (x[i + 1, :] - x[i, :]) / (ttm[i + 1] - ttm[i])
                        n_remaining = ix2[i] - ixm[i]
                        samples = np.arange(ixm[i] + 1, ix2[i] + 1)  # 1-based
                        w[ixm[i]:ix2[i], :] = (
                            np.tile(x[i, :], (n_remaining, 1)) +
                            (samples[:, np.newaxis] - ttm[i]) * grad[np.newaxis, :]
                        )

    # Sort out spurt positions
    ty_clean = ty.copy()
    ty_clean[nan_mask] = 0

    # Find starts (bit 0 set) and ends (bit 1 set)
    starts = np.where(np.bitwise_and(ty_clean.astype(int), 1) > 0)[0]
    ends = np.where(np.bitwise_and(ty_clean.astype(int), 2) > 0)[0]

    s = np.column_stack([ix1[starts], ix2[ends]])

    if p == 1:
        w = w.ravel()

    return w, s
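For a single continuous spurt, the linear-interpolation mode is essentially interpolation of the frame values at the frame midpoints; np.interp reproduces that core behaviour (the numbers below are illustrative, and unlike v_fram2wav, np.interp clamps at the ends instead of extrapolating along the edge gradient, with no spurt or NaN handling):

```python
import numpy as np

frame_vals = np.array([100.0, 110.0, 120.0])     # e.g. one pitch value per frame
tt = np.array([[1, 80], [81, 160], [161, 240]])  # [start, end] sample per frame
mid = tt.mean(axis=1)                            # frame midpoints (1-based)
n = int(tt[-1, 1])                               # output length in samples
w = np.interp(np.arange(1, n + 1), mid, frame_vals)
# w ramps linearly from 100 to 120 between the first and last midpoints
```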

v_stftw

V_STFTW - Short-time Fourier Transform.

v_stftw(x, nw, m='', ov=2, nt=None) -> tuple[ndarray, dict]

Convert time-domain signal to time-frequency domain using STFT.

Parameters:

Name Type Description Default
x array_like

Input signal.

required
nw int

Window length (rounded up to multiple of ov).

required
m str

Mode string including window code.

''
ov int

Overlap factor. Default 2.

2
nt int

DFT length. Default nw.

None

Returns:

Name Type Description
y ndarray

STFT output (frames x frequencies).

so dict

Structure for inverse transformation.

Source code in pyvoicebox/v_stftw.py
def v_stftw(x, nw, m='', ov=2, nt=None) -> tuple[np.ndarray, dict]:
    """Convert time-domain signal to time-frequency domain using STFT.

    Parameters
    ----------
    x : array_like
        Input signal.
    nw : int
        Window length (rounded up to multiple of ov).
    m : str, optional
        Mode string including window code.
    ov : int, optional
        Overlap factor. Default 2.
    nt : int, optional
        DFT length. Default nw.

    Returns
    -------
    y : ndarray
        STFT output (frames x frequencies).
    so : dict
        Structure for inverse transformation.
    """
    x = np.asarray(x, dtype=float).ravel()
    nx = len(x)

    # Round nw up to multiple of ov
    nw = int(np.ceil(nw / ov) * ov)
    nh = nw // ov  # hop size

    if nt is None:
        nt = nw

    # Choose window
    if 'n' in m:
        wa = np.hanning(nw + 1)[:nw]  # Hann
    elif 'c' in m:
        wa = np.cos(np.pi * (np.arange(nw) + 0.5) / nw)
    elif 'R' in m:
        wa = np.ones(nw)
    elif 'M' in m or (ov == 2 and 'm' not in m):
        # sqrt Hamming
        wa = np.sqrt(np.hamming(nw))
    else:
        wa = np.hamming(nw)

    # Pad signal
    if 'e' in m:
        # Pad beginning and end
        pad_start = nw - nh
        pad_end = nw - nh
        x = np.concatenate([np.zeros(pad_start), x, np.zeros(pad_end)])
        nx = len(x)

    # Pad final frame
    if 'z' in m:
        pad = nh - (nx % nh) if nx % nh != 0 else 0
        x = np.concatenate([x, np.zeros(pad)])
    else:
        # Pad with reflected data (the default when 'z' is not given)
        pad = nh - (nx % nh) if nx % nh != 0 else 0
        if pad > 0:
            x = np.concatenate([x, x[-1:-(pad + 1):-1]])

    nx = len(x)
    nframes = max(0, (nx - nw) // nh + 1)
    nf = nt // 2 + 1

    y = np.zeros((nframes, nf), dtype=complex)
    for i in range(nframes):
        start = i * nh
        frame = x[start:start + nw] * wa
        if nt > nw:
            frame = np.concatenate([frame, np.zeros(nt - nw)])
        Y = np.fft.rfft(frame, nt)
        y[i, :] = Y

    so = {
        'nt': nt,
        'nh': nh,
        'nw': nw,
        'wa': wa,
        'ov': ov,
    }

    return y, so
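Stripped of the padding modes and the state dict, the analysis loop above amounts to the following (standalone numpy sketch using the default sqrt-Hamming window for ov=2):

```python
import numpy as np

def stft_sketch(x, nw, ov=2):
    """One rfft row per frame; hop nh = nw // ov, sqrt-Hamming analysis window."""
    nh = nw // ov
    wa = np.sqrt(np.hamming(nw))
    nframes = max(0, (len(x) - nw) // nh + 1)
    y = np.empty((nframes, nw // 2 + 1), dtype=complex)
    for i in range(nframes):
        y[i] = np.fft.rfft(x[i * nh:i * nh + nw] * wa)
    return y, wa

x = np.random.default_rng(0).standard_normal(1024)
y, wa = stft_sketch(x, 128)  # 64-sample hop -> 15 frames of 65 frequency bins
```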

v_istftw

V_ISTFTW - Inverse Short-time Fourier Transform.

v_istftw(y, so, io=None) -> ndarray

Convert time-frequency domain back to time domain using inverse STFT.

Parameters:

Name Type Description Default
y array_like

STFT data (frames x frequencies).

required
so dict

Structure from v_stftw containing window and transform parameters.

required
io dict

State from previous call for chunked processing.

None

Returns:

Name Type Description
z ndarray

Reconstructed time-domain signal.

Source code in pyvoicebox/v_istftw.py
def v_istftw(y, so, io=None) -> np.ndarray:
    """Convert time-frequency domain back to time domain using inverse STFT.

    Parameters
    ----------
    y : array_like
        STFT data (frames x frequencies).
    so : dict
        Structure from v_stftw containing window and transform parameters.
    io : dict, optional
        State from previous call for chunked processing.

    Returns
    -------
    z : ndarray
        Reconstructed time-domain signal.
    """
    y = np.asarray(y)
    nframes = y.shape[0]
    nt = so['nt']
    nh = so['nh']
    nw = so['nw']
    wa = so['wa']

    # Synthesis window (same as analysis for now)
    ws = wa.copy()

    # Output length
    out_len = (nframes - 1) * nh + nw
    z = np.zeros(out_len)
    w_sum = np.zeros(out_len)

    for i in range(nframes):
        # Inverse FFT
        frame = np.fft.irfft(y[i, :], nt)[:nw]
        frame *= ws

        start = i * nh
        z[start:start + nw] += frame
        w_sum[start:start + nw] += wa * ws

    # Normalize by window sum
    nz = w_sum > 1e-10
    z[nz] /= w_sum[nz]

    return z
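The synthesis side can be sketched standalone too. Because each inverse-transformed frame equals the original frame times the analysis window wa, multiplying by the synthesis window ws = wa and dividing by the accumulated wa*ws envelope recovers the covered samples exactly, whatever the window shape (sketch only, no chunk state):

```python
import numpy as np

def istft_sketch(y, wa, nh):
    """irfft each row, window, overlap-add, divide by the window-energy sum."""
    nframes = y.shape[0]
    nw = len(wa)
    z = np.zeros((nframes - 1) * nh + nw)
    wsum = np.zeros_like(z)
    for i in range(nframes):
        z[i * nh:i * nh + nw] += np.fft.irfft(y[i], nw) * wa
        wsum[i * nh:i * nh + nw] += wa * wa
    ok = wsum > 1e-10            # avoid dividing where no window energy landed
    z[ok] /= wsum[ok]
    return z

nw, nh = 128, 64
wa = np.sqrt(np.hamming(nw))
x = np.random.default_rng(1).standard_normal(1000)
nframes = (len(x) - nw) // nh + 1
y = np.array([np.fft.rfft(x[i * nh:i * nh + nw] * wa) for i in range(nframes)])
z = istft_sketch(y, wa, nh)  # matches x over the covered samples
```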

v_filtbankm

V_FILTBANKM - General filterbank matrix (mel/bark/erb/linear).

v_filtbankm(
    p, n, fs, fl=0, fh=None, w=""
) -> tuple[ndarray, ndarray]

Determine matrix for a filterbank with various frequency scales.

Simplified implementation supporting mel, bark, erb, linear scales.

Parameters:

Name Type Description Default
p int

Number of filters in filterbank.

required
n int

Length of DFT.

required
fs float

Sample rate in Hz.

required
fl float

Low frequency edge in Hz. Default 0.

0
fh float

High frequency edge in Hz. Default fs/2.

None
w str

Options: 'm'=mel, 'b'=bark, 'e'=erb, 'f'=linear [default].

''

Returns:

Name Type Description
x (ndarray, shape(p, 1 + n // 2))

Filterbank matrix.

cf (ndarray, shape(p))

Filter center frequencies in Hz.

Source code in pyvoicebox/v_filtbankm.py
def v_filtbankm(p, n, fs, fl=0, fh=None, w='') -> tuple[np.ndarray, np.ndarray]:
    """Determine matrix for a filterbank with various frequency scales.

    Simplified implementation supporting mel, bark, erb, linear scales.

    Parameters
    ----------
    p : int
        Number of filters in filterbank.
    n : int
        Length of DFT.
    fs : float
        Sample rate in Hz.
    fl : float, optional
        Low frequency edge in Hz. Default 0.
    fh : float, optional
        High frequency edge in Hz. Default fs/2.
    w : str, optional
        Options: 'm'=mel, 'b'=bark, 'e'=erb, 'f'=linear [default].

    Returns
    -------
    x : ndarray, shape (p, 1+n//2)
        Filterbank matrix.
    cf : ndarray, shape (p,)
        Filter center frequencies in Hz.
    """
    if fh is None:
        fh = fs / 2.0

    nf = 1 + n // 2  # number of frequency bins
    fax = np.arange(nf) * fs / n  # frequency axis

    # Choose frequency scale
    if 'm' in w:
        frq2scale = lambda x: v_frq2mel(x)[0]
        scale2frq = lambda x: v_mel2frq(x)[0]
    elif 'b' in w:
        frq2scale = lambda x: v_frq2bark(x)[0]
        scale2frq = lambda x: v_bark2frq(x)[0]
    elif 'e' in w:
        frq2scale = lambda x: v_frq2erb(x)[0]
        scale2frq = lambda x: v_erb2frq(x)[0]
    else:
        frq2scale = lambda x: x
        scale2frq = lambda x: x

    # Convert frequency limits to chosen scale
    fl_s = frq2scale(fl)
    fh_s = frq2scale(fh)

    # Create p+2 equally spaced points in the chosen scale (including edges)
    cf_s = np.linspace(fl_s, fh_s, p + 2)
    cf_hz = scale2frq(cf_s)

    # Center frequencies (excluding edges)
    cf = cf_hz[1:-1]

    # Build the filterbank matrix
    x = np.zeros((p, nf))
    for i in range(p):
        lo = cf_hz[i]
        mid = cf_hz[i + 1]
        hi = cf_hz[i + 2]

        # Rising slope
        mask_rise = (fax >= lo) & (fax <= mid)
        if mid > lo:
            x[i, mask_rise] = (fax[mask_rise] - lo) / (mid - lo)

        # Falling slope
        mask_fall = (fax >= mid) & (fax <= hi)
        if hi > mid:
            x[i, mask_fall] = (hi - fax[mask_fall]) / (hi - mid)

    return x, cf
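A standalone sketch of the same triangular-filterbank construction, using the common HTK mel formula mel = 2595*log10(1 + f/700) (which may differ in detail from pyvoicebox's v_frq2mel):

```python
import numpy as np

def mel_fbank_sketch(p, n, fs, fl=0.0, fh=None):
    """p triangular filters, mel-spaced between fl and fh, over n-point DFT bins."""
    fh = fs / 2.0 if fh is None else fh
    mel = lambda f: 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(fl), mel(fh), p + 2))  # p + 2 edge frequencies
    fax = np.arange(n // 2 + 1) * fs / n                # DFT bin frequencies
    x = np.zeros((p, len(fax)))
    for i in range(p):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rise = (fax >= lo) & (fax <= mid)               # rising slope of filter i
        fall = (fax > mid) & (fax <= hi)                # falling slope of filter i
        x[i, rise] = (fax[rise] - lo) / (mid - lo)
        x[i, fall] = (hi - fax[fall]) / (hi - mid)
    return x, edges[1:-1]

fb, cf = mel_fbank_sketch(10, 256, 8000)
mel_power = fb @ np.ones(129)  # apply the filterbank to a (flat) power spectrum
```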

v_gammabank

V_GAMMABANK - Gammatone filterbank (stub).

v_gammabank(*args, **kwargs) -> None

Create a bank of gammatone filters.

The full MATLAB implementation creates gammatone filters at ERB-spaced center frequencies. A stub is provided.

Raises:

Type Description
NotImplementedError

Full implementation pending.

Source code in pyvoicebox/v_gammabank.py
def v_gammabank(*args, **kwargs) -> None:
    """Create a bank of gammatone filters.

    The full MATLAB implementation creates gammatone filters at
    ERB-spaced center frequencies. A stub is provided.

    Raises
    ------
    NotImplementedError
        Full implementation pending.
    """
    raise NotImplementedError(
        "v_gammabank full implementation pending. "
        "Consider using brian2hears or gammatone Python packages."
    )

v_spgrambw

V_SPGRAMBW - Spectrogram computation with configurable bandwidth.

v_spgrambw(
    s, fs, mode="", bw=200, fmax=None, db=40, tinc=0
) -> tuple[ndarray, ndarray, ndarray]

Compute spectrogram with configurable bandwidth (no plotting).

Parameters:

Name Type Description Default
s array_like

Speech signal or power spectrum array.

required
fs float or array_like

Sample frequency in Hz, or [fs, t1].

required
mode str

Mode options:
  'p' : output power per decade
  'P' : output power per mel/bark/erb
  'd' : output in dB
  'm' : mel scale
  'b' : bark scale
  'e' : erb scale
  'l' : log10 Hz scale

''
bw float

Bandwidth resolution in Hz.

200
fmax array_like

Frequency range [Fmin, Fstep, Fmax].

None
db float

dB range for plotting/clipping.

40
tinc float

Output frame increment in seconds.

0

Returns:

Name Type Description
t ndarray

Time axis values (seconds).

f ndarray

Frequency axis values.

b ndarray

Spectrogram values (power per Hz unless mode changes this).

Source code in pyvoicebox/v_spgrambw.py
def v_spgrambw(s, fs, mode='', bw=200, fmax=None, db=40, tinc=0) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute spectrogram with configurable bandwidth (no plotting).

    Parameters
    ----------
    s : array_like
        Speech signal or power spectrum array.
    fs : float or array_like
        Sample frequency in Hz, or [fs, t1].
    mode : str
        Mode options:
          'p' : output power per decade
          'P' : output power per mel/bark/erb
          'd' : output in dB
          'm' : mel scale
          'b' : bark scale
          'e' : erb scale
          'l' : log10 Hz scale
    bw : float
        Bandwidth resolution in Hz.
    fmax : array_like, optional
        Frequency range [Fmin, Fstep, Fmax].
    db : float
        dB range for plotting/clipping.
    tinc : float
        Output frame increment in seconds.

    Returns
    -------
    t : ndarray
        Time axis values (seconds).
    f : ndarray
        Frequency axis values.
    b : ndarray
        Spectrogram values (power per Hz unless mode changes this).
    """
    s = np.asarray(s, dtype=float)
    if s.ndim == 1:
        ns1, ns2 = len(s), 1
    else:
        ns1, ns2 = s.shape

    fs_arr = np.atleast_1d(np.asarray(fs, dtype=float))
    if len(fs_arr) < 2:
        fs_arr = np.array([fs_arr[0], 1.0 / fs_arr[0]])

    if tinc == 0:
        tinc = 1.81 / (4 * bw)

    # Determine frequency scale
    mdsw = ' '
    for ch in mode:
        if ch in 'lmbe':
            mdsw = ch

    nfrq = 257
    if ns2 == 1:
        fnyq = fs_arr[0] / 2.0
    else:
        if len(fs_arr) < 3:
            fs_arr = np.append(fs_arr, fs_arr[0] * 0.25)
        if len(fs_arr) < 4:
            fs_arr = np.append(fs_arr, 0)
        fnyq = fs_arr[3] + (ns2 - 1) * fs_arr[2]

    flmin = 30.0
    if fmax is None:
        if mdsw == 'l':
            fx = np.linspace(np.log10(flmin), np.log10(fnyq), nfrq)
        elif mdsw == 'm':
            fx = np.linspace(0, v_frq2mel(fnyq)[0], nfrq)
        elif mdsw == 'b':
            fx = np.linspace(0, v_frq2bark(fnyq)[0], nfrq)
        elif mdsw == 'e':
            fx = np.linspace(0, v_frq2erb(fnyq)[0], nfrq)
        else:
            fx = np.arange(nfrq) * fnyq / (nfrq - 1)
    else:
        fmax = np.atleast_1d(np.asarray(fmax, dtype=float))
        fmaxu = fmax.copy()
        if 'h' in mode:
            if mdsw == 'l':
                fmaxu = np.log10(fmax)
            elif mdsw == 'm':
                fmaxu = v_frq2mel(fmax)[0]
            elif mdsw == 'b':
                fmaxu = v_frq2bark(fmax)[0]
            elif mdsw == 'e':
                fmaxu = v_frq2erb(fmax)[0]

        if len(fmaxu) < 2:
            if mdsw == 'l':
                fx = np.linspace(np.log10(flmin), fmaxu[0], nfrq)
            else:
                fx = np.linspace(0, fmaxu[0], nfrq)
        elif len(fmaxu) < 3:
            fx = np.linspace(fmaxu[0], fmaxu[1], nfrq)
        else:
            fx = np.arange(fmaxu[0], fmaxu[2] + fmaxu[1] / 2, fmaxu[1])
            nfrq = len(fx)

    # Convert frequency range to Hz
    if mdsw == 'l':
        f_hz = 10.0 ** fx
    elif mdsw == 'm':
        f_hz = v_mel2frq(fx)[0]
    elif mdsw == 'b':
        f_hz = v_bark2frq(fx)[0]
    elif mdsw == 'e':
        f_hz = v_erb2frq(fx)[0]
    else:
        f_hz = fx.copy()

    f = fx.copy()  # output frequencies in native units

    # Calculate spectrogram
    if ns2 == 1:
        winlen = int(round(1.81 * fs_arr[0] / bw))
        win = 0.54 + 0.46 * np.cos(np.arange(1 - winlen, winlen + 1, 2) * np.pi / winlen)
        ninc = max(round(tinc * fs_arr[0]), 1)
        fftlen = int(2 ** np.ceil(np.log2(4 * winlen)))
        win = win / np.sqrt(np.sum(win ** 2))

        sf, t, *_ = v_enframe(s, win, ninc)
        t = fs_arr[1] + (t - 1) / fs_arr[0]  # time axis
        b_spec = v_rfft(sf, fftlen, 1)
        b_spec = np.real(b_spec * np.conj(b_spec)) * 2.0 / fs_arr[0]
        b_spec[:, 0] *= 0.5
        b_spec[:, -1] *= 0.5
        fb = np.arange(fftlen // 2 + 1) * fs_arr[0] / fftlen
    else:
        b_spec = s.copy()
        t = fs_arr[1] + np.arange(ns1) / fs_arr[0]
        fb = fs_arr[3] + np.arange(ns2) * fs_arr[2]
        fftlen = ns2

    nfr = len(t)

    # Jacobian scaling: convert power per Hz to power per decade ('p')
    # or power per mel/bark/erb ('P')
    punit = 'P' in mode
    if 'p' in mode or (punit and mdsw == 'l'):
        b_spec = b_spec * fb[np.newaxis, :] * np.log(10)
    elif punit and mdsw == 'm':
        b_spec = b_spec * ((700 + fb) * np.log(1 + 1000 / 700) / 1000)[np.newaxis, :]
    elif punit and mdsw == 'b':
        b_spec = b_spec * ((1960 + fb) ** 2 / 52547.6)[np.newaxis, :]
    elif punit and mdsw == 'e':
        b_spec = b_spec * (6.23 * fb ** 2 + 93.39 * fb + 28.52)[np.newaxis, :]

    # Interpolate onto desired frequency axis
    b_out = np.zeros((nfr, nfrq))
    for i in range(nfrq):
        # Find interpolation position
        fi = f_hz[i]
        idx = np.searchsorted(fb, fi)
        if idx <= 0:
            b_out[:, i] = b_spec[:, 0]
        elif idx >= len(fb):
            b_out[:, i] = b_spec[:, -1]
        else:
            frac = (fi - fb[idx - 1]) / (fb[idx] - fb[idx - 1])
            b_out[:, i] = (1 - frac) * b_spec[:, idx - 1] + frac * b_spec[:, idx]

    if 'd' in mode:
        b_out = 10.0 * np.log10(np.maximum(b_out, np.max(b_out) * 1e-30))

    return t, f, b_out
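The key parameter relations can be checked numerically: a Hamming-shaped window of winlen samples has a bandwidth of roughly 1.81*fs/winlen Hz, so the requested bw fixes the window length, the default hop, and the FFT size. The values below follow the formulas in the source; fs and bw are illustrative.

```python
import numpy as np

fs, bw = 16000, 200                              # illustrative sample rate / bandwidth
winlen = round(1.81 * fs / bw)                   # analysis window: 145 samples (~9 ms)
tinc = 1.81 / (4 * bw)                           # default frame increment in seconds
fftlen = int(2 ** np.ceil(np.log2(4 * winlen)))  # zero-padded FFT length: 1024
```

A wider bandwidth therefore means a shorter window (better time resolution) and a smaller default hop, and vice versa.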

v_modspect

V_MODSPECT - Calculate modulation spectrum of a signal.

v_modspect(
    s, fs=11025, m="", nf=None, nq=None
) -> tuple[ndarray, ndarray, ndarray, ndarray]

Calculate the modulation spectrum of a signal.

This is a simplified implementation that computes the mel spectrogram and then the modulation spectrum for each mel channel.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sample rate in Hz.

11025
m str

Mode string.

''
nf array_like

[num_mel_bins, fmin, fmax, num_dct].

None
nq array_like

[num_mod_bins, mod_fmin, mod_fmax, num_mod_dct].

None

Returns:

Name Type Description
c ndarray

Modulation spectrum (mod_freq, mel_freq, time).

qq ndarray

Modulation frequency centres.

ff ndarray

Mel frequency centres.

tt ndarray

Time axis.

Source code in pyvoicebox/v_modspect.py
def v_modspect(s, fs=11025, m='', nf=None, nq=None) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Calculate the modulation spectrum of a signal.

    This is a simplified implementation that computes the mel spectrogram
    and then the modulation spectrum for each mel channel.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sample rate in Hz.
    m : str
        Mode string.
    nf : array_like, optional
        [num_mel_bins, fmin, fmax, num_dct].
    nq : array_like, optional
        [num_mod_bins, mod_fmin, mod_fmax, num_mod_dct].

    Returns
    -------
    c : ndarray
        Modulation spectrum (mod_freq, mel_freq, time).
    qq : ndarray
        Modulation frequency centres.
    ff : ndarray
        Mel frequency centres.
    tt : ndarray
        Time axis.
    """
    s = np.asarray(s, dtype=float).ravel()

    # Parse nf parameters
    nf_defaults = [0.1, 40, min(10000, fs / 2), 25]  # nf[0] < 1 selects auto bin count
    if nf is not None:
        nf = list(np.atleast_1d(nf))
        while len(nf) < 4:
            nf.append(nf_defaults[len(nf)])
    else:
        nf = nf_defaults

    nq_defaults = [0.1, 0.5, 20, 15]
    if nq is not None:
        nq = list(np.atleast_1d(nq))
        while len(nq) < 4:
            nq.append(nq_defaults[len(nq)])
    else:
        nq = nq_defaults

    # Step 1: Short-time Fourier Transform
    n_fft = int(2 ** np.floor(np.log2(0.03 * fs)))
    inc = n_fft // 2
    win = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n_fft) / (n_fft - 1))
    frames, tt, *_ = v_enframe(s, win, inc)
    tt = tt / fs

    spec = v_rfft(frames, n_fft, 1)
    pw = np.real(spec * np.conj(spec))

    # Step 2: Mel filterbank
    n_mel = int(nf[0]) if nf[0] >= 1 else int(np.ceil(4.6 * np.log10(fs)))
    mb, mc, _, _ = v_melbankm(n_mel, n_fft, fs, nf[1] / fs, nf[2] / fs)
    mel_spec = np.sqrt(np.maximum(mb.toarray() @ pw.T, 1e-20))  # (n_mel, n_frames)

    n_frames = mel_spec.shape[1]
    ff = mc

    # Step 3: Modulation spectrum per mel channel
    mod_inc = max(1, round(0.01 * fs / inc))  # ~10ms in frame units
    mod_win_len = max(4, min(n_frames, round(0.2 * fs / inc)))
    mod_nfft = int(2 ** np.ceil(np.log2(mod_win_len)))

    n_mod_frames = max(1, (n_frames - mod_win_len) // mod_inc + 1)
    n_mod_bins = mod_nfft // 2 + 1

    mod_freqs = np.arange(n_mod_bins) / mod_nfft * (fs / inc)
    c = np.zeros((n_mod_bins, n_mel, n_mod_frames))

    mod_win = np.hamming(mod_win_len)
    for i_mel in range(n_mel):
        sig = mel_spec[i_mel, :]
        for i_frame in range(n_mod_frames):
            start = i_frame * mod_inc
            end = start + mod_win_len
            if end > n_frames:
                break
            seg = sig[start:end] * mod_win
            f_mod = np.fft.rfft(seg, mod_nfft)
            c[:, i_mel, i_frame] = np.abs(f_mod) ** 2

    qq = mod_freqs
    tt_out = tt[::mod_inc][:n_mod_frames] if len(tt) > 0 else np.arange(n_mod_frames)

    return c, qq, ff, tt_out
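The two-stage analysis above (a spectral envelope per frequency band, then a spectrum of that envelope over time) can be sketched standalone. Everything below uses plain NumPy and illustrative variable names, not the library API: a 1 kHz tone is amplitude-modulated at 4 Hz, and the modulation spectrum of the carrier band recovers that 4 Hz rate.

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
sig = np.sin(2 * np.pi * 1000 * t) * (1 + 0.8 * np.sin(2 * np.pi * 4 * t))

# Stage 1: STFT magnitude (Hamming window, 50% overlap)
nwin, hop = 256, 128
win = np.hamming(nwin)
n_frames = (len(sig) - nwin) // hop + 1
frames = np.stack([sig[i * hop:i * hop + nwin] * win for i in range(n_frames)])
mag = np.abs(np.fft.rfft(frames, axis=1))        # (frames, bins)

# Stage 2: spectrum of the band envelope over time
band = round(1000 * nwin / fs)                   # bin holding the 1 kHz carrier
env = mag[:, band] - mag[:, band].mean()         # band envelope, DC removed
mod = np.abs(np.fft.rfft(env, 128))              # modulation spectrum of that band
mod_freqs = np.fft.rfftfreq(128, d=hop / fs)     # envelope sampled at fs/hop
f_peak = mod_freqs[1 + np.argmax(mod[1:])]       # strongest non-DC modulation
```

The detected `f_peak` sits near the imposed 4 Hz modulation rate.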

v_correlogram

V_CORRELOGRAM - Calculate correlogram.

v_correlogram

v_correlogram(
    x, inc=128, nw=None, nlag=None, m="h", fs=1
) -> tuple[ndarray, ndarray]

Calculate correlogram.

Parameters:

Name Type Description Default
x ndarray

Input signal (samples, channels) from a filterbank.

required
inc int

Frame increment in samples.

128
nw int or array_like

Window length in samples or window function. Default: inc.

None
nlag int

Number of lags to calculate. Default: nw.

None
m str

Mode: 'h' for Hamming window.

'h'
fs float

Sample frequency.

1

Returns:

Name Type Description
y ndarray

Correlogram (nlag, channels, frames).

ty ndarray

Time of window centre for each frame.

Source code in pyvoicebox/v_correlogram.py
def v_correlogram(x, inc=128, nw=None, nlag=None, m='h', fs=1) -> tuple[np.ndarray, np.ndarray]:
    """Calculate correlogram.

    Parameters
    ----------
    x : ndarray
        Input signal (samples, channels) from a filterbank.
    inc : int
        Frame increment in samples.
    nw : int or array_like, optional
        Window length in samples or window function. Default: inc.
    nlag : int, optional
        Number of lags to calculate. Default: nw.
    m : str
        Mode: 'h' for Hamming window.
    fs : float
        Sample frequency.

    Returns
    -------
    y : ndarray
        Correlogram (nlag, channels, frames).
    ty : ndarray
        Time of window centre for each frame.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    nx, nc = x.shape

    if nw is None:
        nw = inc

    nw_arr = np.atleast_1d(np.asarray(nw, dtype=float))
    if len(nw_arr) > 1:
        win = nw_arr.ravel()
        nwin = len(win)
    else:
        nwin = int(nw_arr[0])
        if 'h' in m:
            win = v_windows(3, nwin).ravel()
        else:
            win = np.ones(nwin)

    if nlag is None:
        nlag = nwin

    nwl = nwin + nlag - 1
    nt = int(2 ** np.ceil(np.log2(nwl)))
    nf_frames = int(np.floor((nx - nwl + inc) / inc))

    if nf_frames <= 0:
        return np.zeros((nlag, nc, 0)), np.array([])

    wincg = np.dot(np.arange(1, nwin + 1), win ** 2) / np.dot(win, win)
    fwin = np.conj(np.fft.fft(win, nt))

    y = np.zeros((nlag, nc, nf_frames))

    for iframe in range(nf_frames):
        for ic in range(nc):
            start = iframe * inc
            x_seg = x[start:start + nwl, ic]
            if len(x_seg) < nwl:
                x_seg = np.concatenate([x_seg, np.zeros(nwl - len(x_seg))])

            x_win = x_seg[:nwin] * win
            X_win = np.fft.fft(x_win, nt)
            X_full = np.fft.fft(x_seg, nt)
            v = np.fft.ifft(np.conj(X_win) * X_full)

            x_sq = x_seg ** 2
            X_sq = np.fft.fft(x_sq, nt)
            w_energy = np.real(np.fft.ifft(fwin * X_sq))
            w_energy = np.maximum(w_energy[:nlag], 0)
            w0 = w_energy[0]
            norm = np.sqrt(w_energy * w0)
            norm[norm == 0] = 1.0

            # x was cast to real float above, so v is (numerically) real
            y[:, ic, iframe] = np.real(v[:nlag]) / norm

    ty = np.arange(nf_frames) * inc + wincg
    return y, ty
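The core of the inner loop above is a lagged windowed product computed by FFT: `ifft(conj(FFT(windowed segment)) * FFT(full segment))` at lag L equals the sum over n of `win[n]*x[n]*x[n+L]`. A minimal single-frame sketch, checked against the direct summation:

```python
import numpy as np

rng = np.random.default_rng(0)
nwin, nlag = 64, 32
nwl = nwin + nlag - 1            # samples needed to cover all lags
x_seg = rng.standard_normal(nwl)
win = np.hamming(nwin)

# FFT size large enough that the circular correlation has no wraparound
nt = int(2 ** np.ceil(np.log2(nwl)))
v = np.fft.ifft(np.conj(np.fft.fft(x_seg[:nwin] * win, nt)) * np.fft.fft(x_seg, nt))
fast = np.real(v[:nlag])

# Direct evaluation of the same lagged products
direct = np.array([np.dot(x_seg[:nwin] * win, x_seg[l:l + nwin])
                   for l in range(nlag)])
```

Both routes agree to floating-point precision, which is why the frame loop can use a single FFT pair per channel.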

v_ewgrpdel

V_EWGRPDEL - Energy-weighted group delay waveform.

v_ewgrpdel

v_ewgrpdel(x, w=None, m=None) -> tuple[ndarray, int]

Calculate energy-weighted group delay waveform.

Parameters:

Name Type Description Default
x array_like

Input signal.

required
w int or array_like

Window or window length (default: Hamming of length(x)).

None
m int

Center sample of window (1-based, default: (1+len(w))/2).

None

Returns:

Name Type Description
y ndarray

Energy-weighted group delay waveform.

mm int

Actual value of m used.

Source code in pyvoicebox/v_ewgrpdel.py
def v_ewgrpdel(x, w=None, m=None) -> tuple[np.ndarray, int]:
    """Calculate energy-weighted group delay waveform.

    Parameters
    ----------
    x : array_like
        Input signal.
    w : int or array_like, optional
        Window or window length (default: Hamming of length(x)).
    m : int, optional
        Center sample of window (1-based, default: (1+len(w))/2).

    Returns
    -------
    y : ndarray
        Energy-weighted group delay waveform.
    mm : int
        Actual value of m used.
    """
    x = np.asarray(x, dtype=float).ravel()

    if w is None:
        w = hamming(len(x), sym=True)
    elif np.isscalar(w):
        w = hamming(int(w), sym=True)
    else:
        w = np.asarray(w, dtype=float).ravel()

    w = w ** 2
    lw = len(w)

    if m is None:
        m = (1 + lw) / 2.0
    m = max(int(round(m)), 1)
    mm = m

    wn = w * (m - np.arange(1, lw + 1))
    x2 = np.concatenate([x, np.zeros(m - 1)]) ** 2
    yn = lfilter(wn, [1.0], x2)
    yd = lfilter(w, [1.0], x2)
    yd[yd < np.finfo(float).eps] = 1.0
    y = yn[m - 1:] / yd[m - 1:]
    return y, mm
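For a lone excitation impulse, the energy-weighted group delay crosses zero exactly at the impulse position, which is what makes it useful for glottal event detection. A numpy-only sketch of the same computation, with `np.convolve` standing in for `lfilter`:

```python
import numpy as np

lw = 31                        # window length
m = (1 + lw) // 2              # centre tap (1-based), here 16
w2 = np.hamming(lw) ** 2       # squared analysis window
x = np.zeros(200)
p = 120
x[p] = 1.0                     # single excitation impulse
x2 = np.concatenate([x, np.zeros(m - 1)]) ** 2

# Numerator taps weight each squared sample by its offset from the centre
wn = w2 * (m - np.arange(1, lw + 1))
yn = np.convolve(x2, wn)[:len(x2)]   # causal FIR filtering == lfilter(wn, 1, x2)
yd = np.convolve(x2, w2)[:len(x2)]
yd[yd < np.finfo(float).eps] = 1.0   # avoid division by zero in silence
y = (yn / yd)[m - 1:]                # trim so y is aligned with x
```

`y` crosses zero exactly at sample `p`, positive just before and negative just after.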

Pitch and voicing

v_fxpefac

V_FXPEFAC - PEFAC pitch extraction algorithm.

v_fxpefac

v_fxpefac(
    s, fs, tinc=0.01
) -> tuple[ndarray, ndarray, ndarray]

PEFAC pitch extraction algorithm.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
tinc float

Frame increment in seconds.

0.01

Returns:

Name Type Description
fx ndarray

Estimated pitch frequency per frame (0 = unvoiced).

tt ndarray

Time of each frame centre (seconds).

pv ndarray

Probability of voicing per frame.

References

[1] Gonzalez & Brookes, PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise, IEEE/ACM TASLP, 2014.

Source code in pyvoicebox/v_fxpefac.py
def v_fxpefac(s, fs, tinc=0.01) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """PEFAC pitch extraction algorithm.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    tinc : float
        Frame increment in seconds.

    Returns
    -------
    fx : ndarray
        Estimated pitch frequency per frame (0 = unvoiced).
    tt : ndarray
        Time of each frame centre (seconds).
    pv : ndarray
        Probability of voicing per frame.

    References
    ----------
    [1] Gonzalez & Brookes, PEFAC - A Pitch Estimation Algorithm
        Robust to High Levels of Noise, IEEE/ACM TASLP, 2014.
    """
    s = np.asarray(s, dtype=float).ravel()

    # Parameters
    fxmin = 60.0
    fxmax = 500.0
    n_fft = int(2 ** np.ceil(np.log2(2 * fs / fxmin)))
    ni = max(1, round(tinc * fs))

    # Window
    win_len = int(round(2 * fs / fxmin))
    if win_len > n_fft:
        win_len = n_fft
    win = np.hamming(win_len)

    # Enframe
    frames, tt, *_ = v_enframe(s, win, ni)
    tt = tt / fs
    nr = frames.shape[0]

    # Compute log power spectrum
    spec = v_rfft(frames, n_fft, 1)
    pw = np.real(spec * np.conj(spec))
    pw = np.maximum(pw, 1e-20)
    log_pw = np.log(pw)

    fx = np.zeros(nr)
    pv = np.zeros(nr)

    for i in range(nr):
        # Compute autocorrelation via IFFT of power spectrum
        full_pw = pw[i, :]
        acf = np.fft.irfft(full_pw)[:n_fft // 2 + 1]
        acf = acf / (acf[0] + 1e-20)

        # Search for pitch in valid range
        min_lag = max(1, int(np.floor(fs / fxmax)))
        max_lag = min(len(acf) - 1, int(np.ceil(fs / fxmin)))

        if max_lag <= min_lag:
            continue

        acf_search = acf[min_lag:max_lag + 1]
        if len(acf_search) == 0:
            continue

        peak_idx = np.argmax(acf_search)
        peak_val = acf_search[peak_idx]

        if peak_val > 0.3:  # voicing threshold
            lag = peak_idx + min_lag
            # Parabolic interpolation
            if 0 < peak_idx < len(acf_search) - 1:
                a = acf_search[peak_idx - 1]
                b = acf_search[peak_idx]
                c = acf_search[peak_idx + 1]
                delta = 0.5 * (a - c) / (a - 2 * b + c + 1e-20)
                lag = peak_idx + min_lag + delta
            fx[i] = fs / lag
            pv[i] = peak_val

    return fx, tt, pv
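A minimal numpy-only sketch of the simplified autocorrelation search used above (not the full PEFAC algorithm), estimating the pitch of a 200 Hz tone from a single 40 ms frame:

```python
import numpy as np

fs, f0 = 8000, 200
frame = np.sin(2 * np.pi * f0 * np.arange(320) / fs)   # one 40 ms frame

# Linear autocorrelation via zero-padded FFT of the power spectrum
nfft = 1024                                  # >= 2 * len(frame)
F = np.fft.rfft(frame, nfft)
acf = np.fft.irfft(F * np.conj(F))[:len(frame)]
acf /= acf[0]

# Search the 60-500 Hz pitch range for the strongest lag
min_lag = int(np.floor(fs / 500))
max_lag = int(np.ceil(fs / 60))
lag = min_lag + np.argmax(acf[min_lag:max_lag + 1])
fx = fs / lag
```

The peak lands at lag 40 (8000/200), so the estimate is exact here; real speech would also need the voicing threshold and parabolic refinement shown in the source.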

v_fxrapt

V_FXRAPT - RAPT pitch extraction algorithm.

v_fxrapt

v_fxrapt(
    s, fs, tinc=0.01
) -> tuple[ndarray, ndarray, ndarray]

RAPT pitch extraction algorithm.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
tinc float

Frame increment in seconds.

0.01

Returns:

Name Type Description
fx ndarray

Estimated pitch frequency per frame (0 = unvoiced).

tt ndarray

Time of each frame centre (seconds).

pv ndarray

Probability of voicing per frame.

References

[1] Talkin, D. A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis, ch.14, Elsevier, 1995.

Source code in pyvoicebox/v_fxrapt.py
def v_fxrapt(s, fs, tinc=0.01) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """RAPT pitch extraction algorithm.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    tinc : float
        Frame increment in seconds.

    Returns
    -------
    fx : ndarray
        Estimated pitch frequency per frame (0 = unvoiced).
    tt : ndarray
        Time of each frame centre (seconds).
    pv : ndarray
        Probability of voicing per frame.

    References
    ----------
    [1] Talkin, D. A robust algorithm for pitch tracking (RAPT).
        In Speech Coding and Synthesis, ch.14, Elsevier, 1995.
    """
    s = np.asarray(s, dtype=float).ravel()

    # Parameters
    fxmin = 50.0
    fxmax = 500.0
    ni = max(1, round(tinc * fs))

    # Simplified: no downsampled first pass; analysis is at the original rate

    # Enframe at original rate
    frames, tt, *_ = v_enframe(s, np.hamming(int(round(2 * fs / fxmin))), ni)
    tt = tt / fs
    nr = frames.shape[0]

    fx = np.zeros(nr)
    pv = np.zeros(nr)

    for i in range(nr):
        frame = frames[i, :]
        n_frame = len(frame)
        nfft_frame = int(2 ** np.ceil(np.log2(2 * n_frame)))

        # Compute normalized cross-correlation
        # Using autocorrelation via FFT
        F = np.fft.rfft(frame, nfft_frame)
        acf_full = np.fft.irfft(F * np.conj(F))
        acf = acf_full[:n_frame]
        if acf[0] > 0:
            acf = acf / acf[0]
        else:
            continue

        # Search for pitch
        min_lag = max(1, int(np.floor(fs / fxmax)))
        max_lag = min(n_frame - 1, int(np.ceil(fs / fxmin)))

        if max_lag <= min_lag:
            continue

        acf_search = acf[min_lag:max_lag + 1]
        if len(acf_search) == 0:
            continue

        # Find peaks in the valid range
        peak_idx = np.argmax(acf_search)
        peak_val = acf_search[peak_idx]

        if peak_val > 0.25:  # voicing threshold
            lag = peak_idx + min_lag
            # Parabolic interpolation
            if 0 < peak_idx < len(acf_search) - 1:
                a = acf_search[peak_idx - 1]
                b = acf_search[peak_idx]
                c = acf_search[peak_idx + 1]
                denom = a - 2 * b + c
                if abs(denom) > 1e-10:
                    delta = 0.5 * (a - c) / denom
                    lag = peak_idx + min_lag + delta

            fx[i] = fs / lag
            pv[i] = peak_val

    return fx, tt, pv
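The sub-sample refinement step above relies on three-point parabolic interpolation; for samples of an exact parabola the formula recovers the true vertex, so the check below is exact up to rounding:

```python
import numpy as np

true_peak = 0.3                               # vertex offset from the centre sample
# Sample a concave parabola at lags -1, 0, +1 around the coarse peak
a, b, c = (1 - (x - true_peak) ** 2 for x in (-1.0, 0.0, 1.0))
delta = 0.5 * (a - c) / (a - 2 * b + c)       # estimated sub-sample offset
```

`delta` recovers `true_peak`, which is why the refined lag `peak_idx + min_lag + delta` can resolve pitch periods between integer sample lags.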

v_dypsa

V_DYPSA - Derive glottal closure instants from speech using the DYPSA algorithm.

v_dypsa

v_dypsa(s, fs) -> tuple[ndarray, ndarray]

Derive glottal closure instants from speech.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sampling frequency in Hz.

required

Returns:

Name Type Description
gci ndarray

Vector of glottal closure sample numbers (0-based).

goi ndarray

Vector of glottal opening sample numbers (0-based).

References

[1] Naylor et al., IEEE Trans Speech Audio Proc, 15:34-43, 2007.

Source code in pyvoicebox/v_dypsa.py
def v_dypsa(s, fs) -> tuple[np.ndarray, np.ndarray]:
    """Derive glottal closure instances from speech.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sampling frequency in Hz.

    Returns
    -------
    gci : ndarray
        Vector of glottal closure sample numbers (0-based).
    goi : ndarray
        Vector of glottal opening sample numbers (0-based).

    References
    ----------
    [1] Naylor et al., IEEE Trans Speech Audio Proc, 15:34-43, 2007.
    """
    s = np.asarray(s, dtype=float).ravel()
    ns = len(s)

    # Algorithm parameters
    cpfrac = 0.3
    fxmax = 500
    fxmin = 50
    lpcdur = 0.020
    lpcstep = 0.010
    lpcn = 2
    lpcnf = 0.001
    preemph_freq = 50
    gwlen = 0.003
    fwlen = 0.00045

    # Pre-emphasis
    alpha = np.exp(-2 * np.pi * preemph_freq / fs)
    sp = np.concatenate([[s[0]], s[1:] - alpha * s[:-1]])

    # LPC analysis parameters
    lpc_order = int(round(lpcnf * fs) + lpcn)
    lpc_frame = int(round(lpcdur * fs))
    lpc_step = int(round(lpcstep * fs))

    # Compute LPC residual
    n_frames = max(1, int(np.floor((ns - lpc_frame) / lpc_step)) + 1)
    residual = np.zeros(ns)

    for i in range(n_frames):
        start = i * lpc_step
        end = min(start + lpc_frame, ns)
        seg = sp[start:end]
        if len(seg) < lpc_order + 1:
            continue
        ar, *_ = v_lpcauto(seg, lpc_order)
        if ar.ndim > 1:
            ar = ar[0, :]
        filt_out = v_lpcifilt(ar, sp[start:min(start + lpc_frame + lpc_step, ns)])
        end_res = min(start + len(filt_out), ns)
        residual[start:end_res] = filt_out[:end_res - start]

    # Group delay computation
    gw = int(round(gwlen * fs))
    if gw < 2:
        gw = 2
    fw = int(round(fwlen * fs))
    if fw < 1:
        fw = 1

    # Compute group delay function using the residual
    gdwav = np.zeros(ns)
    half_gw = gw // 2

    for i in range(half_gw, ns - half_gw):
        seg = residual[i - half_gw:i + half_gw]
        if len(seg) < 2 * half_gw:
            continue
        win = np.hamming(len(seg))
        # Weighted average of derivative
        n_idx = np.arange(len(seg)) - half_gw
        gdwav[i] = np.sum(seg * win * n_idx) / (np.sum(seg * win) + 1e-20)

    # Smooth group delay
    if fw > 0:
        kernel = np.ones(fw) / fw
        gdwav = np.convolve(gdwav, kernel, mode='same')

    # Find negative-going zero crossings (candidates for GCI)
    min_period = int(round(fs / fxmax))
    max_period = int(round(fs / fxmin))

    # Find zero crossings of group delay
    zc, _ = v_zerocros(gdwav, 'n')  # negative-going
    if len(zc) == 0:
        return np.array([], dtype=int), np.array([], dtype=int)

    # Filter candidates by minimum spacing
    gci = []
    last = -max_period
    for z in zc:
        if z - last >= min_period:
            gci.append(int(z))
            last = z

    gci = np.array(gci, dtype=int)

    # Estimate glottal openings
    goi = np.round(gci - cpfrac * np.concatenate([[max_period], np.diff(gci)])).astype(int)
    goi = np.clip(goi, 0, ns - 1)

    return gci, goi
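The candidate-selection step (negative-going zero crossings of the group-delay waveform, thinned by a minimum pitch period) can be illustrated on a synthetic waveform with one event every 10 ms:

```python
import numpy as np

fs = 8000
n = np.arange(800)
gdwav = np.sin(2 * np.pi * 100 * n / fs)   # stand-in group delay: one GCI per 10 ms

# Negative-going zero crossings: positive sample followed by a non-positive one
zc = np.where((gdwav[:-1] > 0) & (gdwav[1:] <= 0))[0]

# Keep only candidates at least one minimum pitch period (2 ms at 500 Hz) apart
min_period = int(round(fs / 500))
gci = [int(zc[0])]
for z in zc[1:]:
    if z - gci[-1] >= min_period:
        gci.append(int(z))
```

The retained candidates are spaced one 100 Hz period (80 samples) apart, matching the synthetic event rate.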

v_vadsohn

V_VADSOHN - Voice activity detector (Sohn et al.).

v_vadsohn

v_vadsohn(si, fs, m='a', pp=None) -> ndarray

Voice activity detector based on Sohn et al.

Parameters:

Name Type Description Default
si array_like

Input speech signal.

required
fs float

Sample frequency in Hz.

required
m str

Mode: 'a' binary activity decision, 'b' speech presence probability.

'a'
pp dict

Algorithm parameters.

None

Returns:

Name Type Description
vs ndarray

VAD output, one value per sample: a binary decision for mode 'a' (and by default), or the speech presence probability for mode 'b'.

Source code in pyvoicebox/v_vadsohn.py
def v_vadsohn(si, fs, m='a', pp=None) -> np.ndarray:
    """Voice activity detector based on Sohn et al.

    Parameters
    ----------
    si : array_like
        Input speech signal.
    fs : float
        Sample frequency in Hz.
    m : str
        Mode: 'a' binary activity decision, 'b' speech presence probability.
    pp : dict, optional
        Algorithm parameters.

    Returns
    -------
    vs : ndarray
        VAD output, one value per sample: a binary decision for mode 'a'
        (and by default), or the speech presence probability for mode 'b'.
    """
    s = np.asarray(si, dtype=float).ravel()

    qq = {
        'of': 2, 'pr': 0.7, 'ts': 0.1, 'tn': 0.05,
        'ti': 10e-3, 'tj': 10e-3, 'ri': 0,
        'ta': 0.396, 'gx': 1000, 'gz': 1e-4, 'xn': 0, 'ne': 0,
    }
    qp = {}
    if pp is not None:
        for key in qq:
            if key in pp:
                qq[key] = pp[key]
        qp = pp

    qq['tj'] = min(qq['tj'], 0.5 * qq['ts'], 0.5 * qq['tn'])
    nj = max(round(qq['ti'] / qq['tj']), 1)
    if qq['ri']:
        ni = int(2 ** np.ceil(np.log2(qq['ti'] * fs * np.sqrt(0.5) / nj)))
    else:
        ni = round(qq['ti'] * fs / nj)

    tinc = ni / fs
    a_coeff = np.exp(-tinc / qq['ta'])
    gmax = qq['gx']
    kk = np.sqrt(2 * np.pi)
    xn = qq['xn']
    gz = qq['gz']
    a01 = tinc / qq['tn']
    a00 = 1 - a01
    a10 = tinc / qq['ts']
    a11 = 1 - a10
    b11 = a11 / a10
    b01 = a01 / a00
    b00 = a01 - a00 * a11 / a10
    b10 = a11 - a10 * a01 / a00
    prat = a10 / a01
    lprat = np.log(prat)

    no = round(qq['of'])
    nf = ni * no
    w = np.hamming(nf + 1)[:nf]
    w = w / np.sqrt(np.sum(w[::ni] ** 2))

    ns = len(s)
    y, *_ = v_enframe(s, w, ni)
    yf = v_rfft(y, nf, 1)

    if yf.shape[0] == 0:
        return np.array([])

    yp = np.real(yf * np.conj(yf))
    nr, nf2 = yp.shape

    if qq['ne'] > 0:
        dp, ze = v_estnoiseg(yp, tinc, qp)
    else:
        dp, ze, _ = v_estnoisem(yp, tinc, qp)

    xu = 1.0
    lggami = 0.0

    gam = np.clip(yp / dp, gz, gmax)
    prsp = np.zeros(nr)

    for i in range(nr):
        gami = gam[i, :]
        xi = a_coeff * xu + (1 - a_coeff) * np.maximum(gami - 1, xn)
        xi1 = 1 + xi
        v = 0.5 * xi * gami / xi1
        lamk = 2 * v - np.log(xi1)
        lami = np.sum(lamk[1:nf2]) / (nf2 - 1)

        if lggami < 0:
            lggami = lprat + lami + np.log(b11 + b00 / (a00 + a10 * np.exp(lggami)))
        else:
            lggami = lprat + lami + np.log(b01 + b10 / (a10 + a00 * np.exp(-lggami)))

        if lggami < 0:
            gg = np.exp(lggami)
            prsp[i] = gg / (1 + gg)
        else:
            prsp[i] = 1.0 / (1 + np.exp(-lggami))

        gi = (0.277 + 2 * v) / gami
        mv = v < 0.5
        if np.any(mv):
            vmv = v[mv]
            gi[mv] = kk * np.sqrt(vmv) * ((0.5 + vmv) * besseli(0, vmv) + vmv * besseli(1, vmv)) / (gami[mv] * np.exp(vmv))
        xu = gami * gi ** 2

    if 'b' in m and 'a' not in m:
        # Output speech presence probability per sample
        vs = np.zeros(ns)
        for i in range(nr):
            start = i * ni
            end = min(start + ni, ns)
            vs[start:end] = prsp[i]
        return vs
    # Mode 'a' (and default): binary activity decision per sample
    vs = np.zeros(ns)
    for i in range(nr):
        start = i * ni
        end = min(start + ni, ns)
        vs[start:end] = float(prsp[i] > qq['pr'])
    return vs
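The per-sample output convention can be illustrated with a toy energy gate (a fixed threshold on frame energy, not Sohn's statistical likelihood-ratio model): each frame decision is expanded back to one value per input sample.

```python
import numpy as np

fs, ni = 8000, 80                              # 10 ms frames
sig = np.concatenate([np.zeros(4000),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(4000) / fs),
                      np.zeros(4000)])

# Frame energies and a crude activity threshold
nr = len(sig) // ni
energy = np.array([np.sum(sig[i * ni:(i + 1) * ni] ** 2) for i in range(nr)])
decision = energy > 0.1 * energy.max()

# Expand the per-frame decision to one value per sample, as in mode 'a'
vs = np.repeat(decision.astype(float), ni)
```

The tone occupies the middle third of the signal, so a third of the samples are flagged active.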

Level, noise and alignment

v_activlev

V_ACTIVLEV - Measure active speech level as per ITU-T P.56.

v_activlev

v_activlev(sp, fs, mode='') -> tuple[ndarray, ndarray]

Measure active speech level as in ITU-T P.56 (Method B).

Parameters:

Name Type Description Default
sp array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
mode str

Mode string:
    'd' - output in dB
    'n' - normalize speech to 0 dB active level
    '0' - omit high pass filter
    'h' - omit low pass filter
    'l' - output long-term power level too

''

Returns:

Name Type Description
lev float or ndarray

Active speech level.

af float

Activity factor.

Source code in pyvoicebox/v_activlev.py
def v_activlev(sp, fs, mode='') -> tuple[np.ndarray, np.ndarray]:
    """Measure active speech level as in ITU-T P.56 (Method B).

    Parameters
    ----------
    sp : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    mode : str, optional
        Mode string:
            'd' - output in dB
            'n' - normalize speech to 0 dB active level
            '0' - omit high pass filter
            'h' - omit low pass filter
            'l' - output long-term power level too

    Returns
    -------
    lev : float or ndarray
        Active speech level.
    af : float
        Activity factor.

    Notes
    -----
    With the 'n' option the normalized signal is returned first,
    giving (sp_normalized, lev, af).
    """
    nbin = 20
    thresh = 15.9

    sp = np.asarray(sp, dtype=float).ravel()
    ns = len(sp)

    if ns == 0:
        if 'd' in mode:
            return -np.inf, 0.0
        return 0.0, 0.0

    # Simplified implementation: compute active level without bandpass filtering
    # (use '0h' mode equivalent for simplicity)
    ti = 1.0 / fs
    g = np.exp(-ti / 0.03)
    ae = np.array([1, -2 * g, g ** 2]) / (1 - g) ** 2
    nh = int(np.ceil(0.2 / ti)) + 1

    # Zero-pad
    nz = int(np.ceil(0.35 * fs))
    sq = np.concatenate([sp, np.zeros(nz)])
    ns_total = len(sq)

    # Envelope filter
    s = lfilter([1], ae, np.abs(sq))

    # Sum of squares
    ssq = np.sum(sq ** 2)
    if ssq == 0:
        af = 0.0
        if 'd' in mode:
            lev = -np.inf
        else:
            lev = 0.0
        return lev, af

    sf = ns_total / ssq
    sfdb = 10 * np.log10(sf)

    # Energy histogram using log2
    qf, qe = np.frexp(sf * s ** 2)
    qe = qe.astype(float)
    qe[qf == 0] = -np.inf

    # Apply hangover
    qe_hang = v_maxfilt(qe, 1, nh)[0]

    emax = np.max(qe_hang) + 1
    if emax == -np.inf:
        if 'd' in mode:
            return -np.inf, 0.0
        return 0.0, 0.0

    qe_idx = np.minimum(emax - qe_hang, nbin).astype(int)
    qe_idx = np.clip(qe_idx, 1, nbin)

    kc = np.cumsum(np.bincount(qe_idx, minlength=nbin + 1)[1:nbin + 1])

    # Calculate active level
    aj = 10 * np.log10(ssq / np.maximum(kc, 1))
    cj = 10 * np.log10(2) * (emax - np.arange(1, nbin + 1) - 1) - sfdb
    mj = aj - cj - thresh

    # Find positive transition through threshold
    transitions = np.where((mj[:-1] < 0) & (mj[1:] >= 0))[0]
    if len(transitions) == 0:
        if mj[-1] <= 0:
            jj = len(mj) - 2
            jf = 1.0
        else:
            jj = 0
            jf = 0.0
    else:
        jj = transitions[0]
        denom = mj[jj + 1] - mj[jj]
        if abs(denom) < 1e-30:
            jf = 0.5
        else:
            jf = 1.0 / (1.0 - mj[jj + 1] / mj[jj])

    lev_db = aj[jj] + jf * (aj[min(jj + 1, nbin - 1)] - aj[jj])
    lp = 10 ** (lev_db / 10)
    af = ssq / ((ns_total - nz) * lp) if lp > 0 else 0.0

    if 'd' in mode:
        lev = lev_db
        if 'l' in mode:
            lev = np.array([lev_db, 10 * np.log10(ssq / ns_total)])
    else:
        lev = lp
        if 'l' in mode:
            lev = np.array([lp, ssq / ns_total])

    if 'n' in mode:
        if lp > 0:
            return sp / np.sqrt(lp), lev, af
        return sp, lev, af

    return lev, af
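The relationship between the two returned quantities can be checked numerically. Below, a crude smoothed-power threshold stands in for the full P.56 procedure (only the 15.9 dB margin is borrowed from it); by construction, active level times activity factor equals the long-term power.

```python
import numpy as np

fs = 8000
sig = np.concatenate([np.sin(2 * np.pi * 440 * np.arange(fs) / fs),
                      np.zeros(fs)])           # 1 s of tone, then 1 s of silence

p = sig ** 2
env = np.convolve(p, np.ones(400) / 400, mode='same')   # smoothed power envelope
active = env > env.max() * 10 ** (-15.9 / 10)           # 15.9 dB activity margin

af = active.mean()                             # activity factor, about 0.5 here
active_level = p.sum() / active.sum()          # mean power over active samples
long_term = p.mean()                           # mean power over everything
```

The identity `active_level * af == long_term` holds regardless of how the activity mask is chosen, which is why the activity factor links the two level readings.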

v_activlevg

V_ACTIVLEVG - Measure active speech level robustly.

v_activlevg

v_activlevg(sp, fs, mode='') -> tuple[ndarray, ndarray]

Measure active speech level robustly.

This is a simplified wrapper around v_activlev. The full MATLAB implementation uses a Gaussian mixture model approach for robust estimation in noisy conditions.

Parameters:

Name Type Description Default
sp array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
mode str

Mode string (same as v_activlev).

''

Returns:

Name Type Description
lev float

Active speech level.

af float

Activity factor.

Source code in pyvoicebox/v_activlevg.py
def v_activlevg(sp, fs, mode='') -> tuple[np.ndarray, np.ndarray]:
    """Measure active speech level robustly.

    This is a simplified wrapper around v_activlev. The full MATLAB
    implementation uses a Gaussian mixture model approach for robust
    estimation in noisy conditions.

    Parameters
    ----------
    sp : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    mode : str, optional
        Mode string (same as v_activlev).

    Returns
    -------
    lev : float
        Active speech level.
    af : float
        Activity factor.
    """
    return v_activlev(sp, fs, mode)

v_earnoise

V_EARNOISE - Add noise to simulate hearing threshold.

v_earnoise

v_earnoise(
    s, fs, m="", spl=62.35
) -> tuple[ndarray, ndarray, ndarray]

Add noise to simulate the hearing threshold of a listener.

Simplified version that adds white noise scaled to simulate internal ear noise.

Parameters:

Name Type Description Default
s (array_like, shape(n) or (n, c))

Speech signal.

required
fs float

Sample frequency in Hz.

required
m str

Mode string: 'n' input normalized, 'u' input already scaled.

''
spl float

Target active speech level in dB SPL. Default 62.35.

62.35

Returns:

Name Type Description
y ndarray

Speech with simulated ear noise.

x ndarray

Filtered speech signal.

v ndarray

Added noise.

Source code in pyvoicebox/v_earnoise.py
def v_earnoise(s, fs, m='', spl=62.35) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Add noise to simulate the hearing threshold of a listener.

    Simplified version that adds white noise scaled to simulate internal ear noise.

    Parameters
    ----------
    s : array_like, shape (n,) or (n, c)
        Speech signal.
    fs : float
        Sample frequency in Hz.
    m : str, optional
        Mode string: 'n' input normalized, 'u' input already scaled.
    spl : float, optional
        Target active speech level in dB SPL. Default 62.35.

    Returns
    -------
    y : ndarray
        Speech with simulated ear noise.
    x : ndarray
        Filtered speech signal.
    v : ndarray
        Added noise.
    """
    s = np.asarray(s, dtype=float)
    if s.ndim == 1:
        s = s[:, np.newaxis]

    ns, nc = s.shape

    if 'u' in m:
        dboff = 0.0
    else:
        dboff = spl  # simplified: 'n' and the default both assume a normalized input

    x = 10 ** (0.05 * dboff) * s
    v = np.sqrt(0.5 * fs) * np.random.randn(*s.shape)
    y = x + v

    if nc == 1:
        return y.ravel(), x.ravel(), v.ravel()
    return y, x, v

v_ppmvu

V_PPMVU - Calculate PPM and VU meter readings (stub).

The full implementation requires specific filter design. A simplified stub is provided.

v_ppmvu

v_ppmvu(sp, fs, mode='') -> float

Calculate PPM and VU meter readings.

Simplified implementation. The full MATLAB version implements detailed PPM and VU metering standards.

Parameters:

Name Type Description Default
sp array_like

Input signal.

required
fs float

Sample frequency in Hz.

required
mode str

Mode string.

''

Returns:

Name Type Description
lev float

Level reading.

Source code in pyvoicebox/v_ppmvu.py
def v_ppmvu(sp, fs, mode='') -> float:
    """Calculate PPM and VU meter readings.

    Simplified implementation. The full MATLAB version implements
    detailed PPM and VU metering standards.

    Parameters
    ----------
    sp : array_like
        Input signal.
    fs : float
        Sample frequency in Hz.
    mode : str, optional
        Mode string.

    Returns
    -------
    lev : float
        Level reading.
    """
    sp = np.asarray(sp, dtype=float).ravel()
    if len(sp) == 0:
        return 0.0

    # Simple RMS level
    rms = np.sqrt(np.mean(sp ** 2))
    if rms > 0:
        return 20 * np.log10(rms)
    return -np.inf
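With this simplified RMS meter, a full-scale sine reads 20*log10(1/sqrt(2)), about -3.01 dB:

```python
import numpy as np

fs = 48000
sp = np.sin(2 * np.pi * 997 * np.arange(fs) / fs)   # 1 s: exactly 997 full cycles
lev = 20 * np.log10(np.sqrt(np.mean(sp ** 2)))      # RMS level in dB
```

Because the signal contains a whole number of cycles, the mean square is exactly 0.5 and the reading matches the textbook figure.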

v_snrseg

V_SNRSEG - Measure segmental and global SNR.

v_snrseg

v_snrseg(
    s, r, fs, m="wz", tf=0.01
) -> tuple[float, float, ndarray, ndarray, ndarray]

Measure segmental and global SNR.

Parameters:

Name Type Description Default
s array_like

Test signal (noisy).

required
r array_like

Reference signal (clean).

required
fs float

Sample frequency in Hz.

required
m str

Mode string:
    'w' : No VAD - use whole file (default).
    'z' : Do not do any alignment (default).
    'q' : Use linear interpolation to remove delays ± 1 sample.

'wz'
tf float

Frame increment in seconds. Default: 0.01.

0.01

Returns:

Name Type Description
seg float

Segmental SNR in dB.

glo float

Global SNR in dB.

tc ndarray

Time at centre of each frame (seconds).

snf ndarray

Segmental SNR in dB in each frame.

vf ndarray

Boolean mask indicating valid frames.

Source code in pyvoicebox/v_snrseg.py
def v_snrseg(s, r, fs, m='wz', tf=0.01) -> tuple[float, float, np.ndarray, np.ndarray, np.ndarray]:
    """Measure segmental and global SNR.

    Parameters
    ----------
    s : array_like
        Test signal (noisy).
    r : array_like
        Reference signal (clean).
    fs : float
        Sample frequency in Hz.
    m : str, optional
        Mode string:
          'w' : No VAD - use whole file (default).
          'z' : Do not do any alignment (default).
          'q' : Use linear interpolation to remove delays +/- 1 sample.
    tf : float, optional
        Frame increment in seconds. Default: 0.01.

    Returns
    -------
    seg : float
        Segmental SNR in dB.
    glo : float
        Global SNR in dB.
    tc : ndarray
        Time at centre of each frame (seconds).
    snf : ndarray
        Segmental SNR in dB in each frame.
    vf : ndarray
        Boolean mask indicating valid frames.
    """
    s = np.asarray(s, dtype=float).ravel()
    r = np.asarray(r, dtype=float).ravel()
    snmax = 100.0  # clipping limit for SNR

    mq = 'z' not in m  # flag for alignment
    nr = min(len(r), len(s))
    kf = round(tf * fs)  # frame length in samples
    ifr_start = kf + mq  # starting sample for first frame end (1-based in MATLAB)

    # Build frame end indices (0-based)
    ifr = np.arange(ifr_start, nr - mq + 1, kf) - 1  # 0-based ending sample
    nf = len(ifr)
    if nf == 0:
        return 0.0, 0.0, np.array([]), np.array([]), np.array([])

    ifl = ifr[-1] + 1  # last sample index + 1 (exclusive)

    if mq:
        # Linear interpolation alignment
        # For each frame, find optimal shift
        # s[1:ifl-1] vs s[2:ifl], s[0:ifl-2] (0-based)
        ssm = np.zeros((kf, nf))
        ssp = np.zeros((kf, nf))
        sr_mat = np.zeros((kf, nf))
        for j in range(nf):
            start = j * kf + 1  # 1-based converted: frame starts at mq + j*kf (0-based)
            for k in range(kf):
                idx = start + k  # 0-based index into s[1:ifl]
                ssm[k, j] = s[idx] - s[idx + 1]
                ssp[k, j] = s[idx] - s[idx - 1]
                sr_mat[k, j] = s[idx] - r[idx]

        am = np.clip(np.sum(sr_mat * ssm, axis=0) / np.sum(ssm**2, axis=0), 0, 1)
        ap = np.clip(np.sum(sr_mat * ssp, axis=0) / np.sum(ssp**2, axis=0), 0, 1)
        ef = np.minimum(
            np.sum((sr_mat - am[np.newaxis, :] * ssm)**2, axis=0),
            np.sum((sr_mat - ap[np.newaxis, :] * ssp)**2, axis=0)
        )
    else:
        # No interpolation
        ef = np.zeros(nf)
        for j in range(nf):
            start = j * kf
            frame_s = s[start:start + kf]
            frame_r = r[start:start + kf]
            ef[j] = np.sum((frame_s - frame_r)**2)

    # Calculate reference power per frame
    rf = np.zeros(nf)
    for j in range(nf):
        start = mq + j * kf
        rf[j] = np.sum(r[start:start + kf]**2)

    em = ef == 0  # zero noise frames
    rm = rf == 0  # zero reference frames
    snf = 10.0 * np.log10((rf + rm) / (ef + em))
    snf[rm] = -snmax
    snf[em] = snmax

    # Select frames to include: VAD-based selection is not implemented in
    # this port, so all frames are used whether or not 'w' is specified
    vf = np.ones(nf, dtype=bool)

    tc = (np.arange(1, nf + 1) * kf + (1 - kf) / 2.0) / fs

    seg = np.mean(snf[vf]) if np.any(vf) else 0.0
    glo = 10.0 * np.log10(np.sum(rf[vf]) / np.sum(ef[vf])) if np.any(vf) and np.sum(ef[vf]) > 0 else 0.0

    return seg, glo, tc, snf, vf
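As a sanity check on the frame-wise arithmetic above, the core of the 'z' (no-alignment) path can be reproduced in a few lines of self-contained NumPy. `segsnr_sketch` is a hypothetical helper, not part of pyvoicebox, and omits the alignment, VAD, and timing outputs.

```python
import numpy as np

def segsnr_sketch(s, r, fs, tf=0.01, snmax=100.0):
    """Segmental SNR of processed signal s against reference r, 'z' mode."""
    s = np.asarray(s, float).ravel()
    r = np.asarray(r, float).ravel()
    kf = int(round(tf * fs))                # frame length in samples
    nf = min(len(s), len(r)) // kf          # number of whole frames
    sf = s[:nf * kf].reshape(nf, kf)        # processed signal, framed
    rf = r[:nf * kf].reshape(nf, kf)        # reference signal, framed
    rp = np.sum(rf ** 2, axis=1)            # reference power per frame
    ep = np.sum((sf - rf) ** 2, axis=1)     # error power per frame
    snf = 10.0 * np.log10((rp + (rp == 0)) / (ep + (ep == 0)))
    snf[rp == 0] = -snmax                   # clip silent reference frames
    snf[ep == 0] = snmax                    # clip noise-free frames
    return float(np.mean(snf)), snf

fs = 1000
r = np.ones(100)
s = r + 0.01            # constant error: every frame is exactly 40 dB
seg, snf = segsnr_sketch(s, r, fs)
```

With a constant error of 0.01 against a unit reference, each frame gives 10*log10(1/0.0001) = 40 dB, so the segmental SNR is exactly 40 dB.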

v_addnoise

V_ADDNOISE - Add noise at a chosen SNR.

v_addnoise

v_addnoise(s, fs, snr=inf, m='') -> tuple[ndarray, ndarray]

Add white noise at a chosen SNR using energy-based measurement.

This is a simplified version that supports white noise addition with energy-based level measurement. For more advanced noise types, see the MATLAB original.

Parameters:

Name Type Description Default
s array_like

Input speech signal (1-D).

required
fs float

Sample frequency in Hz.

required
snr float

Target SNR in dB. Default: Inf (no noise).

inf
m str

Mode string: 'D' : SNR input given as power ratio instead of dB. 'e' : Use energy to calculate signal level (default). 'k' : Preserve original signal power. 'n' : Make signal level = 0 dB. 'N' : Make noise level = 0 dB. 't' : Make total = 0 dB (default). 'x' : Output signal and noise as separate columns.

''

Returns:

Name Type Description
z ndarray

Noisy signal (or [signal, noise] if 'x' option).

p ndarray

Levels: [s-in n-in s-out n-out] as power ratios or dB.

Source code in pyvoicebox/v_addnoise.py
def v_addnoise(s, fs, snr=np.inf, m='') -> tuple[np.ndarray, np.ndarray]:
    """Add white noise at a chosen SNR using energy-based measurement.

    This is a simplified version that supports white noise addition with
    energy-based level measurement. For more advanced noise types, see the
    MATLAB original.

    Parameters
    ----------
    s : array_like
        Input speech signal (1-D).
    fs : float
        Sample frequency in Hz.
    snr : float, optional
        Target SNR in dB. Default: Inf (no noise).
    m : str, optional
        Mode string:
          'D' : SNR input given as power ratio instead of dB.
          'e' : Use energy to calculate signal level (default).
          'k' : Preserve original signal power.
          'n' : Make signal level = 0 dB.
          'N' : Make noise level = 0 dB.
          't' : Make total = 0 dB (default).
          'x' : Output signal and noise as separate columns.

    Returns
    -------
    z : ndarray
        Noisy signal (or [signal, noise] if 'x' option).
    p : ndarray
        Levels: [s-in n-in s-out n-out] as power ratios or dB.
    """
    s = np.asarray(s, dtype=float).ravel()
    ns = len(s)

    # Calculate signal energy
    se = np.mean(s**2)

    # Generate white noise
    n = np.random.randn(ns)
    ne = np.mean(n**2)

    # Convert SNR
    if 'D' in m:
        snre = snr
    else:
        snre = 10.0 ** (0.1 * snr)

    # Determine scaling factors
    if snre > 1:
        if 'n' in m:
            sze = 1.0
        elif 'N' in m:
            sze = snre
        elif 'k' in m:
            sze = se
        else:
            sze = 1.0 / (1.0 + 1.0 / snre)
        nze = sze / snre
    else:
        if 'n' in m:
            nze = 1.0 / snre
        elif 'N' in m:
            nze = 1.0
        elif 'k' in m:
            nze = se / snre
        else:
            nze = 1.0 / (1.0 + snre)
        sze = nze * snre

    pe = np.array([se, ne, sze, nze])
    gm = np.array([
        np.sqrt(sze / (se + (se == 0))),
        np.sqrt(nze / (ne + (ne == 0)))
    ])

    if gm[1] > 0:
        if 'x' in m:
            z = np.column_stack([gm[0] * s, gm[1] * n])
        else:
            z = gm[0] * s + gm[1] * n
    elif 'x' in m:
        z = np.column_stack([gm[0] * s, np.zeros(ns)])
    else:
        z = gm[0] * s

    p = pe.copy()
    if 'D' not in m:
        mk = (pe != np.inf) & (pe != 0)
        p[mk] = 10.0 * np.log10(pe[mk])
        p[pe == 0] = -np.inf

    return z, p
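The scaling logic above can be verified independently: in the default 't' mode the gains are chosen so that the output signal and noise powers are exactly `sze` and `nze`, which makes the achieved SNR hit the target by construction. A self-contained sketch (not calling pyvoicebox) of the check for a 10 dB target:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s tone at 8 kHz
n = rng.standard_normal(len(s))
snr_db = 10.0
snre = 10.0 ** (0.1 * snr_db)     # target SNR as a power ratio

se, ne = np.mean(s ** 2), np.mean(n ** 2)
sze = 1.0 / (1.0 + 1.0 / snre)    # 't' mode: signal + noise powers sum to 1
nze = sze / snre
gs, gn = np.sqrt(sze / se), np.sqrt(nze / ne)
z = gs * s + gn * n               # noisy signal at the requested SNR

achieved = 10.0 * np.log10(np.mean((gs * s) ** 2) / np.mean((gn * n) ** 2))
```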

v_sigalign

V_SIGALIGN - Align a clean reference with a noisy signal.

v_sigalign

v_sigalign(
    s, r, maxd=None, m="gs", fs=None
) -> tuple[int, float, ndarray, ndarray]

Align a clean reference with a noisy signal.

Parameters:

Name Type Description Default
s array_like

Test signal.

required
r array_like

Reference signal.

required
maxd float or array_like

[+-max] or [min, max] delay allowed in samples. Fractions of len(r) are used if abs(maxd) < 1. Default ensures at least 50% overlap.

None
m str

Mode string: 'u' - unity gain 'g' - find optimal gain (default) 's' - maximize correlation coefficient (default) 'S' - maximize energy of common component

'gs'
fs float

Sample frequency (only used for filtering).

None

Returns:

Name Type Description
d int

Optimum delay to apply to r.

g float

Optimal gain to apply to r.

rr ndarray

g * r shifted by -d, restricted to the region overlapping s.

ss ndarray

s truncated to match rr.

Source code in pyvoicebox/v_sigalign.py
def v_sigalign(s, r, maxd=None, m='gs', fs=None) -> tuple[int, float, np.ndarray, np.ndarray]:
    """Align a clean reference with a noisy signal.

    Parameters
    ----------
    s : array_like
        Test signal.
    r : array_like
        Reference signal.
    maxd : float or array_like, optional
        [+-max] or [min, max] delay allowed in samples. Fractions of len(r)
        are used if abs(maxd) < 1. Default ensures at least 50% overlap.
    m : str, optional
        Mode string:
          'u' - unity gain
          'g' - find optimal gain (default)
          's' - maximize correlation coefficient (default)
          'S' - maximize energy of common component
    fs : float, optional
        Sample frequency (only used for filtering).

    Returns
    -------
    d : int
        Optimum delay to apply to r.
    g : float
        Optimal gain to apply to r.
    rr : ndarray
        g * r shifted by -d, restricted to the region overlapping s.
    ss : ndarray
        s truncated to match rr.
    """
    s = np.asarray(s, dtype=float).ravel()
    r = np.asarray(r, dtype=float).ravel()
    ns = len(s)
    nr = len(r)

    if maxd is None:
        maxd_arr = np.array([])
    else:
        maxd_arr = np.atleast_1d(np.asarray(maxd, dtype=float))

    if len(maxd_arr) == 0:
        if nr < ns:
            lmm = np.array([-0.25 * nr, ns - 0.75 * nr])
        else:
            lmm = np.array([-0.25 * ns, nr - 0.75 * ns])
    elif len(maxd_arr) == 1:
        lmm = np.array([-maxd_arr[0], maxd_arr[0]])
    else:
        lmm = maxd_arr[:2].copy()

    # Convert fractions of nr to samples
    lmm = np.round(lmm * (1 + (nr - 1) * (np.abs(lmm) < 1))).astype(int)
    lmin = int(lmm[0])
    lmax = int(lmm[1])
    lags = lmax - lmin + 1

    if lags <= 0:
        raise ValueError('Invalid lag limits')

    # Note: A-weighting and BS-468 weighting (m='a', m='b') require
    # v_stdspectrum which is not yet converted. Skip filtering.

    # Cross correlation
    rxi = max(1, 1 - lmin)       # first reference sample needed (1-based)
    rxj = min(nr, ns - lmax)     # last reference sample needed (1-based)
    nrx = rxj - rxi + 1         # length of reference segment

    if nrx < 1:
        raise ValueError('Reference signal too short')

    fl = int(2 ** np.ceil(np.log2(lmax - lmin + nrx)))
    sxi = max(1, rxi + lmin)    # first signal sample needed (1-based)
    sxj = min(ns, rxj + lmax)   # last signal sample needed (1-based)

    # Zero-padded FFT cross-correlation
    s_seg = np.zeros(fl)
    s_seg[:sxj - sxi + 1] = s[sxi - 1:sxj]
    r_seg = np.zeros(fl)
    r_seg[:rxj - rxi + 1] = r[rxi - 1:rxj]

    S = np.fft.fft(s_seg)
    R = np.fft.fft(r_seg)
    rs_full = np.real(np.fft.ifft(S * np.conj(R)))
    rsu = rs_full[:lags]

    ssq = np.cumsum(s[sxi - 1:sxj] ** 2)
    ssqd = np.zeros(lags)
    ssqd[0] = ssq[nrx - 1]
    if lags > 1:
        ssqd[1:] = ssq[nrx:nrx + lags - 1] - ssq[:lags - 1]

    if 'S' in m:
        icx = np.argmax(np.abs(rsu))
    else:
        # Maximize correlation coefficient
        with np.errstate(divide='ignore', invalid='ignore'):
            corr_coeff = rsu ** 2 / ssqd
            corr_coeff[ssqd == 0] = 0
        icx = np.argmax(corr_coeff)

    d = icx + lmin

    # Extract common region
    ia = max(1, d + 1)       # first sample of s in common region (1-based)
    ja = min(ns, d + nr)     # last sample of s in common region (1-based)
    ija = np.arange(ia, ja + 1)  # 1-based
    ijad = ija - d

    rr = r[ijad - 1].copy()
    ss = s[ija - 1].copy()

    if 'u' in m:
        g = 1.0
    else:
        g = np.sum(rr * ss) / np.sum(rr ** 2)

    rr = rr * g

    return d, g, rr, ss
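The delay search above rests on FFT-based cross-correlation. A stripped-down, self-contained illustration (hypothetical setup, no pyvoicebox imports): delay a reference by a known amount and recover the lag from the correlation peak.

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal(512)                 # reference signal
d_true = 37
s = np.concatenate([np.zeros(d_true), r])    # test signal: r delayed by 37

fl = int(2 ** np.ceil(np.log2(len(s) + len(r))))   # FFT length, zero-padded
# Circular cross-correlation via the frequency domain, as in the code above
xc = np.real(np.fft.ifft(np.fft.fft(s, fl) * np.conj(np.fft.fft(r, fl))))
d_est = int(np.argmax(xc[:len(s)]))          # lag of the correlation peak
```

The zero padding to a power-of-two length avoids circular wrap-around over the lag range of interest, mirroring the `fl` computation in the source.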

v_txalign

V_TXALIGN - Find best alignment of two sets of time markers.

v_txalign

v_txalign(
    x, y, maxt, nsd=None
) -> tuple[ndarray, ndarray, int, float, float]

Find the best alignment of two sets of time markers.

Parameters:

Name Type Description Default
x array_like

First set of non-decreasing time values.

required
y array_like

Second set of non-decreasing time values.

required
maxt float

Penalty threshold.

required
nsd float

If specified, threshold is nsd times std dev from mean.

None

Returns:

Name Type Description
kx ndarray

Alignment from x to y (kx[i]=j means x[i] matched to y[j], 0=unmatched).

ky ndarray

Alignment from y to x.

nxy int

Number of matched pairs.

mxy float

Mean of y-x for matched pairs.

sxy float

Std dev of y-x for matched pairs.

Source code in pyvoicebox/v_txalign.py
def v_txalign(x, y, maxt, nsd=None) -> tuple[np.ndarray, np.ndarray, int, float, float]:
    """Find the best alignment of two sets of time markers.

    Parameters
    ----------
    x : array_like
        First set of non-decreasing time values.
    y : array_like
        Second set of non-decreasing time values.
    maxt : float
        Penalty threshold.
    nsd : float, optional
        If specified, threshold is nsd times std dev from mean.

    Returns
    -------
    kx : ndarray
        Alignment from x to y (kx[i]=j means x[i] matched to y[j], 0=unmatched).
    ky : ndarray
        Alignment from y to x.
    nxy : int
        Number of matched pairs.
    mxy : float
        Mean of y-x for matched pairs.
    sxy : float
        Std dev of y-x for matched pairs.
    """
    x = np.asarray(x, dtype=float).ravel().copy()
    y = np.asarray(y, dtype=float).ravel().copy()
    lx = len(x)
    ly = len(y)

    if lx == 0 or ly == 0:
        return np.zeros(lx, dtype=int), np.zeros(ly, dtype=int), 0, 0.0, 0.0

    if nsd is not None:
        kx, ky, nxy, mxy, sxy = v_txalign(x, y, maxt)
        nxy0 = nxy + 1
        while nxy < nxy0:
            nxy0 = nxy
            mxy0 = mxy
            kx, ky, nxy, mxy, sxy = v_txalign(x + mxy0, y, nsd * sxy)
        mxy = mxy + mxy0
        return kx, ky, nxy, mxy, sxy

    # Add sentinel values
    x_ext = np.append(x, 2 * abs(y[-1]) + 1)
    y_ext = np.append(y, 2 * abs(x[-1]) + 1)
    lxy = lx + ly + 1

    d = np.zeros((lxy, 4))
    c0 = maxt ** 2
    ix = 0  # 0-based index into x_ext
    iy = 0  # 0-based index into y_ext
    d[0, :] = [0, 0, -1, 1 if y_ext[0] < x_ext[0] else 0]
    vp = 0.0

    for id_ in range(1, lxy):
        if y_ext[iy] < x_ext[ix]:
            v = y_ext[iy]
            d[id_, 3] = 1
            iy += 1
        else:
            v = x_ext[ix]
            d[id_, 3] = 0
            ix += 1

        if d[id_, 3] == d[id_ - 1, 3]:
            d[id_, 2] = -1
        else:
            d[id_, 1] = d[id_ - 1, 0] - c0 + (v - vp) ** 2

        if d[id_ - 1, 2] == 0 and d[id_ - 1, 0] >= d[id_ - 1, 1]:
            d[id_, 0] = d[id_ - 1, 1]
            d[id_ - 1, 2] = 1
        else:
            d[id_, 0] = d[id_ - 1, 0]

        vp = v

    if d[lxy - 1, 2] == 0 and d[lxy - 1, 0] >= d[lxy - 1, 1]:
        d[lxy - 1, 2] = 1

    # Traceback
    ix = lx - 1
    iy = ly - 1
    nxy = 0
    mxy = 0.0
    sxy = 0.0
    kx = np.zeros(lx, dtype=int)
    ky = np.zeros(ly, dtype=int)

    while ix >= 0 and iy >= 0:
        id_ = ix + iy + 2  # 0-based in d array (was ix+iy+1 in 1-based)
        if d[id_, 2] > 0:
            ky[iy] = ix + 1  # 1-based output
            kx[ix] = iy + 1
            dt = y[iy] - x[ix]
            nxy += 1
            mxy += dt
            sxy += dt ** 2
            ix -= 1
            iy -= 1
        else:
            if d[id_, 3] == 1:
                iy -= 1
            else:
                ix -= 1

    if nxy > 0:
        mxy = mxy / nxy
        sxy = np.sqrt(sxy / nxy - mxy ** 2)

    return kx, ky, nxy, mxy, sxy
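v_txalign solves the pairing with a dynamic program; for intuition, a naive greedy nearest-neighbour matcher (a simplification, not the algorithm above) already recovers the same offset statistics (nxy, mxy) on a clean example:

```python
import numpy as np

x = np.array([0.10, 0.55, 1.02, 1.48])
y = np.array([0.105, 0.560, 1.025, 1.487, 2.30])  # last marker is spurious
maxt = 0.02

kx = np.zeros(len(x), dtype=int)   # 0 means unmatched, else 1-based y index
dts = []
used = set()
for i, xi in enumerate(x):
    j = int(np.argmin(np.abs(y - xi)))        # nearest y marker
    if abs(y[j] - xi) <= maxt and j not in used:
        kx[i] = j + 1
        used.add(j)
        dts.append(y[j] - xi)

nxy = len(dts)
mxy = float(np.mean(dts))          # mean offset y - x over matched pairs
```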

Psychoacoustics

v_importsii

V_IMPORTSII - Calculate the SII importance function.

v_importsii

v_importsii(f, m='') -> ndarray

Calculate the SII importance function per Hz or per Bark.

Parameters:

Name Type Description Default
f array_like

Frequencies in Hz (or Bark if 'b' flag).

required
m str

Mode string: 'b' : Frequencies given in Bark rather than Hz. 'c' : Calculate cumulative importance. 'd' : Calculate importance of n-1 bands. 'h' : Calculate importance per Hz or per Bark.

''

Returns:

Name Type Description
q ndarray

Importance values.

References

[1] ANSI Standard S3.5-1997 (R2007). [2] C. V. Pavlovic. JASA, 82:413-422, 1987.

Source code in pyvoicebox/v_importsii.py
def v_importsii(f, m='') -> np.ndarray:
    """Calculate the SII importance function per Hz or per Bark.

    Parameters
    ----------
    f : array_like
        Frequencies in Hz (or Bark if 'b' flag).
    m : str, optional
        Mode string:
          'b' : Frequencies given in Bark rather than Hz.
          'c' : Calculate cumulative importance.
          'd' : Calculate importance of n-1 bands.
          'h' : Calculate importance per Hz or per Bark.

    Returns
    -------
    q : ndarray
        Importance values.

    References
    ----------
    [1] ANSI Standard S3.5-1997 (R2007).
    [2] C. V. Pavlovic. JASA, 82:413-422, 1987.
    """
    f = np.asarray(f, dtype=float)

    if 'b' in m:
        b = f.copy()
        d_bark = np.ones_like(f)  # not needed for bark input
    else:
        b, d_bark = v_frq2bark(f)

    if 'c' in m or 'd' in m:
        q = _mi * b + _ci + _ai * (b < 4) * (b - 4)**2 - _bi * (b > 18) * (b - 18)**2
        q[b < _xi0] = 0.0
        q[b > _xi1] = 1.0
        if 'd' in m:
            q = q[1:] - q[:-1]
    else:
        q = _mi + _ai * (b < 4) * (b - 4) - _bi * (b > 18) * (b - 18)
        q[b < _xi0] = 0.0
        q[b > _xi1] = 0.0
        if 'b' not in m:
            q = q / d_bark

    return q
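The 'd' option converts cumulative importance at n band edges into n-1 per-band importances by first differencing. A toy cumulative curve (hypothetical values, not the ANSI constants used above) illustrates the bookkeeping:

```python
import numpy as np

Q = np.array([0.0, 0.10, 0.35, 0.70, 0.92, 1.0])  # cumulative at 6 edges
q_band = np.diff(Q)           # importance of the 5 bands between the edges
total = float(q_band.sum())   # telescopes to Q[-1] - Q[0] = 1
```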

v_phon2sone

V_PHON2SONE - Convert PHON loudness values to SONEs.

v_phon2sone

v_phon2sone(p) -> ndarray

Convert PHON loudness values to SONEs.

Parameters:

Name Type Description Default
p array_like

Matrix of phon values.

required

Returns:

Name Type Description
s ndarray

Matrix of sone values, same shape as p.

Notes

The phon scale measures perceived loudness in dB; at 1 kHz it is identical to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional to apparent loudness and, by definition, equals 1 at 40 phon.

References

[1] J. Lochner and J. Burger. Form of the loudness function in the presence of masking noise. JASA, 33: 1705, 1961. [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours. Standard ISO 226:2003, Aug. 2003.

Source code in pyvoicebox/v_phon2sone.py
def v_phon2sone(p) -> np.ndarray:
    """Convert PHON loudness values to SONEs.

    Parameters
    ----------
    p : array_like
        Matrix of phon values.

    Returns
    -------
    s : ndarray
        Matrix of sone values, same shape as p.

    Notes
    -----
    The phon scale measures perceived loudness in dB; at 1 kHz it is identical
    to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional
    to apparent loudness and, by definition, equals 1 at 40 phon.

    References
    ----------
    [1] J. Lochner and J. Burger. Form of the loudness function in the presence
        of masking noise. JASA, 33: 1705, 1961.
    [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours.
        Standard ISO 226:2003, Aug. 2003.
    """
    p = np.asarray(p, dtype=float)
    b = np.log(10) * 0.1 * 0.27  # 0.27 is the exponent from [1] and [2]
    d = np.exp(b * 2.4)           # 2.4 dB is the hearing threshold from [2]
    a = 1.0 / (np.exp(b * 40) - d)  # scale factor to make p=40 give s=1
    s = a * (np.exp(b * p) - d)
    return s
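Re-deriving the three constants inline (no pyvoicebox import) confirms the calibration: 40 phon maps to 1 sone by construction, and with the 0.27 exponent a 10 phon increase multiplies loudness by roughly 1.95, close to the commonly quoted doubling.

```python
import numpy as np

b = np.log(10) * 0.1 * 0.27       # same constants as the function body
d = np.exp(b * 2.4)               # 2.4 dB hearing-threshold term
a = 1.0 / (np.exp(b * 40) - d)    # normalization: 40 phon -> 1 sone
phon2sone = lambda p: a * (np.exp(b * np.asarray(p, float)) - d)

s40 = float(phon2sone(40.0))      # 1 sone by construction
ratio = float(phon2sone(50.0) / phon2sone(40.0))   # ~1.95 per 10 phon
```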

v_sone2phon

V_SONE2PHON - Convert SONE loudness values to PHONs.

v_sone2phon

v_sone2phon(s) -> ndarray

Convert SONE loudness values to PHONs.

Parameters:

Name Type Description Default
s array_like

Matrix of sone values.

required

Returns:

Name Type Description
p ndarray

Matrix of phon values, same shape as s.

Notes

The phon scale measures perceived loudness in dB; at 1 kHz it is identical to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional to apparent loudness and, by definition, equals 1 at 40 phon.

References

[1] J. Lochner and J. Burger. Form of the loudness function in the presence of masking noise. JASA, 33: 1705, 1961. [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours. Standard ISO 226:2003, Aug. 2003.

Source code in pyvoicebox/v_sone2phon.py
def v_sone2phon(s) -> np.ndarray:
    """Convert SONE loudness values to PHONs.

    Parameters
    ----------
    s : array_like
        Matrix of sone values.

    Returns
    -------
    p : ndarray
        Matrix of phon values, same shape as s.

    Notes
    -----
    The phon scale measures perceived loudness in dB; at 1 kHz it is identical
    to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional
    to apparent loudness and, by definition, equals 1 at 40 phon.

    References
    ----------
    [1] J. Lochner and J. Burger. Form of the loudness function in the presence
        of masking noise. JASA, 33: 1705, 1961.
    [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours.
        Standard ISO 226:2003, Aug. 2003.
    """
    s = np.asarray(s, dtype=float)
    b = 1.0 / (np.log(10) * 0.1 * 0.27)  # 0.27 is the exponent from [1] and [2]
    d = np.exp(2.4 / b)                    # 2.4 dB is the hearing threshold from [2]
    a = np.exp(40.0 / b) - d               # scale factor to make p=40 give s=1
    p = b * np.log(a * s + d)
    return p
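Since v_sone2phon is the exact inverse of v_phon2sone, a round trip should return the input to machine precision. A self-contained sketch of both mappings (same constants as the function bodies) checks this:

```python
import numpy as np

b = np.log(10) * 0.1 * 0.27
d = np.exp(b * 2.4)
a = 1.0 / (np.exp(b * 40) - d)
phon2sone = lambda p: a * (np.exp(b * np.asarray(p, float)) - d)
sone2phon = lambda s: np.log(np.asarray(s, float) / a + d) / b  # exact inverse

p0 = np.array([20.0, 40.0, 60.0, 80.0])
p_back = sone2phon(phon2sone(p0))  # round trip recovers the phon values
```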

v_pesq2mos

V_PESQ2MOS - Convert PESQ speech quality scores to MOS.

v_pesq2mos

v_pesq2mos(p) -> ndarray

Convert PESQ speech quality scores to MOS.

Parameters:

Name Type Description Default
p array_like

Matrix of PESQ scores.

required

Returns:

Name Type Description
m ndarray

Matrix of MOS scores, same shape as p.

References

[1] ITU-T Recommendation P.862.1, Nov. 2003.

Source code in pyvoicebox/v_pesq2mos.py
def v_pesq2mos(p) -> np.ndarray:
    """Convert PESQ speech quality scores to MOS.

    Parameters
    ----------
    p : array_like
        Matrix of PESQ scores.

    Returns
    -------
    m : ndarray
        Matrix of MOS scores, same shape as p.

    References
    ----------
    [1] ITU-T Recommendation P.862.1, Nov. 2003.
    """
    p = np.asarray(p, dtype=float)
    a = 0.999
    b = 4.999 - a
    c = -1.4945
    d = 4.6607
    m = a + b / (1.0 + np.exp(c * p + d))
    return m

v_mos2pesq

V_MOS2PESQ - Convert MOS speech quality scores to PESQ.

v_mos2pesq

v_mos2pesq(m) -> ndarray

Convert MOS speech quality scores to PESQ.

Parameters:

Name Type Description Default
m array_like

Matrix of MOS scores.

required

Returns:

Name Type Description
p ndarray

Matrix of PESQ scores, same shape as m.

References

[1] ITU-T Recommendation P.862.1, Nov. 2003.

Source code in pyvoicebox/v_mos2pesq.py
def v_mos2pesq(m) -> np.ndarray:
    """Convert MOS speech quality scores to PESQ.

    Parameters
    ----------
    m : array_like
        Matrix of MOS scores.

    Returns
    -------
    p : ndarray
        Matrix of PESQ scores, same shape as m.

    References
    ----------
    [1] ITU-T Recommendation P.862.1, Nov. 2003.
    """
    m = np.asarray(m, dtype=float)
    a = 0.999
    b = 4.999 - a
    c = -1.4945
    d = 4.6607
    p = (np.log(b / (m - a) - 1.0) - d) / c
    return p
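v_pesq2mos and v_mos2pesq are exact inverses of the same logistic, so a round trip through both formulas (reproduced inline here, not imported from pyvoicebox) recovers the original PESQ scores:

```python
import numpy as np

a, c, d = 0.999, -1.4945, 4.6607   # P.862.1 constants from the code above
b = 4.999 - a
pesq2mos = lambda p: a + b / (1.0 + np.exp(c * np.asarray(p, float) + d))
mos2pesq = lambda m: (np.log(b / (np.asarray(m, float) - a) - 1.0) - d) / c

p0 = np.array([1.5, 2.7, 4.0])
m0 = pesq2mos(p0)                  # MOS scores, strictly increasing with PESQ
p_back = mos2pesq(m0)              # round trip recovers the inputs
```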

v_stoi2prob

V_STOI2PROB - Convert STOI to probability.

v_stoi2prob

v_stoi2prob(s, m='i') -> ndarray

Convert STOI to probability.

Parameters:

Name Type Description Default
s array_like

Matrix containing STOI values.

required
m str

Mapping: 'i' for IEEE sentences (default), 'd' for Dantale corpus.

'i'

Returns:

Name Type Description
p ndarray

Corresponding probability values.

References

[1] C. H. Taal et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio, Speech, Language Processing, 19(7):2125-2136, 2011.

Source code in pyvoicebox/v_stoi2prob.py
def v_stoi2prob(s, m='i') -> np.ndarray:
    """Convert STOI to probability.

    Parameters
    ----------
    s : array_like
        Matrix containing STOI values.
    m : str, optional
        Mapping: 'i' for IEEE sentences (default), 'd' for Dantale corpus.

    Returns
    -------
    p : ndarray
        Corresponding probability values.

    References
    ----------
    [1] C. H. Taal et al. An algorithm for intelligibility prediction of
        time-frequency weighted noisy speech. IEEE Trans. Audio, Speech,
        Language Processing, 19(7):2125-2136, 2011.
    """
    s = np.asarray(s, dtype=float)
    if m and m[0] == 'd':
        a = -14.5435
        b = 7.0792
    else:
        a = -17.4906
        b = 9.6921
    p = 1.0 / (1.0 + np.exp(a * s + b))
    return p
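The mapping is a plain logistic in the STOI value; reproducing the IEEE-sentence constants inline shows it is monotone and saturates near 0 and 1 at the ends of the STOI range:

```python
import numpy as np

a, b = -17.4906, 9.6921            # IEEE-sentence constants from the code above
stoi2prob = lambda s: 1.0 / (1.0 + np.exp(a * np.asarray(s, float) + b))

s = np.linspace(0.0, 1.0, 11)
p = stoi2prob(s)                   # monotone, ~0 at s=0, ~1 at s=1
```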

v_psycdigit

V_PSYCDIGIT - Psychoacoustic digit recognition test (stub).

v_psycdigit

v_psycdigit(*args, **kwargs) -> None

Run a psychoacoustic digit recognition test.

This is an interactive GUI-based function that has no direct Python equivalent without a GUI framework.

Raises:

Type Description
NotImplementedError

Requires interactive GUI.

Source code in pyvoicebox/v_psycdigit.py
def v_psycdigit(*args, **kwargs) -> None:
    """Run a psychoacoustic digit recognition test.

    This is an interactive GUI-based function that has no direct
    Python equivalent without a GUI framework.

    Raises
    ------
    NotImplementedError
        Requires interactive GUI.
    """
    raise NotImplementedError(
        "v_psycdigit is an interactive MATLAB GUI function. "
        "Consider implementing with tkinter or a web interface."
    )

v_psycest

V_PSYCEST - Psychoacoustic estimation (stub).

v_psycest

v_psycest(*args, **kwargs) -> None

Psychoacoustic parameter estimation.

This is a complex estimation function with many options. A stub is provided.

Raises:

Type Description
NotImplementedError

Full implementation pending.

Source code in pyvoicebox/v_psycest.py
def v_psycest(*args, **kwargs) -> None:
    """Psychoacoustic parameter estimation.

    This is a complex estimation function with many options.
    A stub is provided.

    Raises
    ------
    NotImplementedError
        Full implementation pending.
    """
    raise NotImplementedError(
        "v_psycest is a complex psychoacoustic estimation function. "
        "Full implementation pending."
    )

v_psycestu

V_PSYCESTU - Psychoacoustic estimation utilities (stub).

v_psycestu

v_psycestu(*args, **kwargs) -> None

Psychoacoustic estimation utilities.

Raises:

Type Description
NotImplementedError

Full implementation pending.

Source code in pyvoicebox/v_psycestu.py
def v_psycestu(*args, **kwargs) -> None:
    """Psychoacoustic estimation utilities.

    Raises
    ------
    NotImplementedError
        Full implementation pending.
    """
    raise NotImplementedError(
        "v_psycestu is a complex psychoacoustic utility function. "
        "Full implementation pending."
    )

v_psychofunc

V_PSYCHOFUNC - Calculate psychometric functions.

v_psychofunc

v_psychofunc(
    m_or_q=None, q=None, x=None, r=None
) -> ndarray

Calculate psychometric functions: trial success probability vs SNR.

Parameters:

Name Type Description Default
m_or_q str or array_like

Mode string or q parameters (if mode omitted).

None
q array_like

Model parameters [p_threshold, threshold, slope, miss_prob, guess_prob, type].

None
x array_like

SNR values or probability values (for inverse).

None
r array_like

Test results (0 or 1).

None

Returns:

Name Type Description
p ndarray

Probabilities.

Source code in pyvoicebox/v_psychofunc.py
def v_psychofunc(m_or_q=None, q=None, x=None, r=None) -> np.ndarray:
    """Calculate psychometric functions: trial success probability vs SNR.

    Parameters
    ----------
    m_or_q : str or array_like
        Mode string or q parameters (if mode omitted).
    q : array_like, optional
        Model parameters [p_threshold, threshold, slope, miss_prob, guess_prob, type].
    x : array_like, optional
        SNR values or probability values (for inverse).
    r : array_like, optional
        Test results (0 or 1).

    Returns
    -------
    p : ndarray
        Probabilities.
    """
    # Parse arguments
    if isinstance(m_or_q, str):
        m = m_or_q
    elif m_or_q is not None:
        # Mode argument was omitted
        r = x
        x = q
        q = m_or_q
        m = ''
    else:
        m = ''

    qq = np.array([0.5, 0, 0.1, 0, 0, 1])  # defaults

    if q is not None:
        q = np.asarray(q, dtype=float).ravel()
        qq[:len(q)] = q[:min(len(q), 6)]

    pt = qq[0]    # probability at threshold
    xt = qq[1]    # threshold
    sl = qq[2]    # slope at threshold
    pm = qq[3]    # miss/lapse probability
    pg = qq[4]    # guess probability
    ft = int(qq[5])  # function type

    if x is None:
        x = np.linspace(xt - 20, xt + 20, 100)
    x = np.asarray(x, dtype=float)

    if 'i' in m:
        # Inverse function
        p = np.clip(x, pg + 1e-10, 1 - pm - 1e-10)
        u = (p - pg) / (1 - pm - pg)
        u = np.clip(u, 1e-15, 1 - 1e-15)

        if ft == 1:  # logistic
            ut = (pt - pg) / (1 - pm - pg)
            k = sl / (ut * (1 - ut) * (1 - pm - pg))
            return xt + (np.log(u / (1 - u)) - np.log(ut / (1 - ut))) / k
        elif ft == 2:  # cumulative Gaussian
            from scipy.special import erfinv
            ut = (pt - pg) / (1 - pm - pg)
            sig = (1 - pm - pg) / (sl * np.sqrt(2 * np.pi))
            return xt + sig * np.sqrt(2) * (erfinv(2 * u - 1) - erfinv(2 * ut - 1))
        else:
            raise NotImplementedError(f"Function type {ft} inverse not implemented")

    # Calculate normalized variable
    pr = 1 - pm - pg  # probability range

    if ft == 1:  # logistic
        ut = (pt - pg) / pr
        k = sl / (ut * (1 - ut) * pr)
        v = k * (x - xt)
        # logistic with offset
        vt = np.log(ut / (1 - ut))
        u = 1.0 / (1.0 + np.exp(-(v + vt)))
    elif ft == 2:  # cumulative Gaussian
        ut = (pt - pg) / pr
        from scipy.special import erfinv, erfc
        sig = pr / (sl * np.sqrt(2 * np.pi))
        z = (x - xt) / (sig * np.sqrt(2))
        zt = erfinv(2 * ut - 1)
        u = 0.5 * erfc(-(z + zt))
    elif ft == 3:  # Weibull
        ut = (pt - pg) / pr
        k = sl * np.exp(1) * np.log(1 / (1 - ut)) / pr
        u = 1 - np.exp(-np.exp(k * (x - xt) + np.log(np.log(1 / (1 - ut)))))
        u = np.clip(u, 0, 1)
    else:
        raise NotImplementedError(f"Psychometric function type {ft} not implemented")

    p = pg + pr * u

    if r is not None:
        r = np.asarray(r, dtype=float)
        if 'n' not in m:
            # Normalize likelihoods
            p = np.clip(p, 1e-15, 1 - 1e-15)
            likelihood = r * np.log(p) + (1 - r) * np.log(1 - p)
            return np.exp(likelihood - np.max(likelihood))

    if 'r' in m:
        return (np.random.rand(*p.shape) < p).astype(float)

    return p
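For the type-1 (logistic) curve, the parametrization above guarantees p(xt) = pt and a slope of exactly sl at the threshold. A self-contained sketch (no pyvoicebox import, arbitrary example parameters) verifies both properties numerically:

```python
import numpy as np

pt, xt, sl, pm, pg = 0.5, -2.0, 0.1, 0.02, 0.05   # example parameters
pr = 1.0 - pm - pg                 # probability range between guess and lapse
ut = (pt - pg) / pr                # normalized probability at threshold
k = sl / (ut * (1.0 - ut) * pr)    # logistic rate giving slope sl at xt
vt = np.log(ut / (1.0 - ut))       # offset so that p(xt) = pt
p = lambda x: pg + pr / (1.0 + np.exp(-(k * (np.asarray(x, float) - xt) + vt)))

h = 1e-4
slope_at_xt = float((p(xt + h) - p(xt - h)) / (2 * h))  # central difference
```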

Miscellaneous

v_soundspeed

V_SOUNDSPEED - Speed of sound, density and impedance of air.

v_soundspeed

v_soundspeed(
    t=20, p=1, m=0.0289644, g=1.4
) -> tuple[float, float, float]

Calculate speed of sound, density, and acoustic impedance.

Parameters:

Name Type Description Default
t float

Air temperature in Celsius. Default 20.

20
p float

Air pressure in atm. Default 1.

1
m float

Average molecular weight of air in kg/mol. Default 0.0289644.

0.0289644
g float

Adiabatic constant for air. Default 1.4.

1.4

Returns:

Name Type Description
v float

Speed of sound in m/s.

d float

Density of air in kg/m^3.

z float

Characteristic impedance in Pa.s/m.

Source code in pyvoicebox/v_soundspeed.py
def v_soundspeed(t=20, p=1, m=0.0289644, g=1.4) -> tuple[float, float, float]:
    """Calculate speed of sound, density, and acoustic impedance.

    Parameters
    ----------
    t : float, optional
        Air temperature in Celsius. Default 20.
    p : float, optional
        Air pressure in atm. Default 1.
    m : float, optional
        Average molecular weight of air in kg/mol. Default 0.0289644.
    g : float, optional
        Adiabatic constant for air. Default 1.4.

    Returns
    -------
    v : float
        Speed of sound in m/s.
    d : float
        Density of air in kg/m^3.
    z : float
        Characteristic impedance in Pa.s/m.
    """
    p_pa = p * 101325  # convert atm to pascal
    k = t + 273.15     # absolute temperature
    r = 8.3144         # J/(mol K) universal gas constant
    d = p_pa * m / (r * k)
    v = np.sqrt(g * r * k / m)
    z = v * d
    return v, d, z
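Plugging the default arguments into the same ideal-gas formulas (inline, not imported) reproduces the textbook values at 20 C and 1 atm: roughly 343 m/s, 1.20 kg/m^3, and 413 Pa.s/m.

```python
import numpy as np

t, p_atm, m, g = 20.0, 1.0, 0.0289644, 1.4   # the function's default arguments
r = 8.3144                                    # universal gas constant, J/(mol K)
k = t + 273.15                                # absolute temperature
p_pa = p_atm * 101325.0                       # pressure in pascal
d = p_pa * m / (r * k)                        # density, ~1.20 kg/m^3
v = float(np.sqrt(g * r * k / m))             # speed of sound, ~343 m/s
z = v * d                                     # impedance, ~413 Pa.s/m
```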

v_sigma

V_SIGMA - Estimate glottal opening and closing instants using SIGMA algorithm.

This function requires SWT (Stationary Wavelet Transform) which is complex to implement. A simplified stub is provided that raises NotImplementedError.

v_sigma

v_sigma(lx, fs, fmax=400) -> None

Estimate glottal opening and closing instants (SIGMA algorithm).

This function requires the Stationary Wavelet Transform (SWT), which is available in MATLAB's Wavelet Toolbox but has no direct equivalent in scipy.

Parameters:

Name Type Description Default
lx array_like

LX (laryngograph) signal.

required
fs float

Sampling frequency in Hz.

required
fmax float

Maximum laryngeal frequency. Default 400 Hz.

400

Raises:

Type Description
NotImplementedError

SWT-based SIGMA algorithm requires specialized wavelet toolbox support.

Source code in pyvoicebox/v_sigma.py
def v_sigma(lx, fs, fmax=400) -> None:
    """Estimate glottal opening and closing instants (SIGMA algorithm).

    This function requires the Stationary Wavelet Transform (SWT), which
    is available in MATLAB's Wavelet Toolbox but has no direct equivalent
    in scipy.

    Parameters
    ----------
    lx : array_like
        LX (laryngograph) signal.
    fs : float
        Sampling frequency in Hz.
    fmax : float, optional
        Maximum laryngeal frequency. Default 400 Hz.

    Raises
    ------
    NotImplementedError
        SWT-based SIGMA algorithm requires specialized wavelet toolbox support.
    """
    raise NotImplementedError(
        "v_sigma requires SWT (Stationary Wavelet Transform) which is not "
        "readily available in scipy. Consider using pywt.swt from PyWavelets."
    )