Speech Analysis

Frame-based analysis, spectrograms, pitch trackers, voice activity detection, level measurement, and psychoacoustic metrics.

Framing and time-frequency

v_enframe

V_ENFRAME - Split signal into (overlapping) frames: one per row.

v_enframe(
    x, win=None, hop=None, m="", fs=1
) -> tuple[ndarray, ndarray, ndarray]

Split signal up into (overlapping) frames: one per row.

Parameters:

Name Type Description Default
x array_like

Input signal (1-D).

required
win int or array_like

Window or window length in samples. Default is len(x).

None
hop int or float

Frame increment in samples. If < 1, fraction of window length. Default is window length (non-overlapping).

None
m str

Mode string:
  'z' - zero pad final frame
  'r' - reflect last few samples for final frame
  'A' - t output as centre of mass
  'E' - t output as centre of energy
  'f' - 1-sided DFT on each frame
  'F' - 2-sided DFT on each frame
  'p' - 1-sided power spectrum
  'P' - 2-sided power spectrum
  'a' - scale window to give unity gain with overlap-add
  's' - scale so power is preserved
  'S' - scale so total energy is preserved
  'd' - make 's'/'S' give power/energy per Hz

''
fs float

Sample frequency (only needed for 'd' option). Default is 1.

1

Returns:

Name Type Description
f ndarray

Enframed data, one frame per row.

t ndarray

Fractional time in samples at the centre of each frame. First sample is index 1 (MATLAB convention).

w ndarray

Window function used, including scaling.

Source code in pyvoicebox/v_enframe.py
def v_enframe(x, win=None, hop=None, m='', fs=1) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Split signal up into (overlapping) frames: one per row.

    Parameters
    ----------
    x : array_like
        Input signal (1-D).
    win : int or array_like, optional
        Window or window length in samples. Default is len(x).
    hop : int or float, optional
        Frame increment in samples. If < 1, fraction of window length.
        Default is window length (non-overlapping).
    m : str, optional
        Mode string:
          'z' - zero pad final frame
          'r' - reflect last few samples for final frame
          'A' - t output as centre of mass
          'E' - t output as centre of energy
          'f' - 1-sided DFT on each frame
          'F' - 2-sided DFT on each frame
          'p' - 1-sided power spectrum
          'P' - 2-sided power spectrum
          'a' - scale window to give unity gain with overlap-add
          's' - scale so power is preserved
          'S' - scale so total energy is preserved
          'd' - make 's'/'S' give power/energy per Hz
    fs : float, optional
        Sample frequency (only needed for 'd' option). Default is 1.

    Returns
    -------
    f : ndarray
        Enframed data, one frame per row.
    t : ndarray
        Fractional time in samples at the centre of each frame.
        First sample is index 1 (MATLAB convention).
    w : ndarray
        Window function used, including scaling.
    """
    x = np.asarray(x, dtype=float).ravel()
    nx = len(x)

    if win is None:
        win = nx
    win = np.asarray(win, dtype=float).ravel()

    nwin = len(win)
    if nwin == 1:
        lw = int(win[0])
        w = np.ones(lw)
    else:
        lw = nwin
        w = win.copy()

    if hop is None:
        hop = lw
    elif hop < 1:
        hop = int(round(lw * hop))
    else:
        hop = int(hop)

    # Window scaling
    wsc = 1.0
    if 'a' in m:
        wsc = np.sqrt(hop / np.dot(w, w))
    elif 'd' in m:
        if 's' in m:
            wsc = np.sqrt(1.0 / (np.dot(w, w) * fs))
        elif 'S' in m:
            wsc = np.sqrt(hop / np.dot(w, w)) / fs
    else:
        if 's' in m:
            wsc = np.sqrt(1.0 / (np.dot(w, w) * lw))
        elif 'S' in m:
            wsc = np.sqrt(hop / (np.dot(w, w) * lw))

    nli = nx - lw + hop
    nf = max(int(np.fix(nli / hop)), 0)  # number of full frames
    na = nli - hop * nf + (nf == 0) * (lw - hop)  # samples left over
    fx = ('z' in m or 'r' in m) and na > 0  # need extra row

    f = np.zeros((nf + int(fx), lw))
    indf = hop * np.arange(nf)  # (nf,)
    inds = np.arange(lw)        # (lw,)

    if fx:
        # Build index matrix for full frames
        idx = indf[:, np.newaxis] + inds[np.newaxis, :]  # (nf, lw)
        if nf > 0:
            f[:nf, :] = x[idx]
        if 'r' in m:
            ix = 1 + np.mod(nf * hop + np.arange(lw), 2 * nx)
            # MATLAB: x(ix+(ix>nx).*(2*nx+1-2*ix)), 1-based
            ix0 = ix.copy()
            mask = ix0 > nx
            ix0[mask] = 2 * nx + 1 - ix0[mask]
            f[nf, :] = x[ix0.astype(int) - 1]  # convert to 0-based
        else:
            rem_samples = nx - nf * hop
            f[nf, :rem_samples] = x[nf * hop:nx]
        nf = f.shape[0]
    else:
        if nf > 0:
            idx = indf[:, np.newaxis] + inds[np.newaxis, :]
            f[:] = x[idx]

    w = wsc * w
    if nwin > 1:
        f = f * w[np.newaxis, :]
    else:
        f = wsc * f

    # Power spectrum or DFT
    ml = m.lower()
    if 'p' in ml:
        f = np.fft.fft(f, axis=1)
        f = np.real(f * np.conj(f))
        if 'p' in m:  # 1-sided
            imx = int(np.fix((lw + 1) / 2))
            f[:, 1:imx] = f[:, 1:imx] + f[:, lw - 1:lw - imx:-1]
            f = f[:, :int(np.fix(lw / 2)) + 1]
    elif 'f' in ml:
        f = np.fft.fft(f, axis=1)
        if 'f' in m:  # 1-sided
            f = f[:, :int(np.fix(lw / 2)) + 1]

    # Time output
    if 'E' in m:
        t0 = np.sum(np.arange(1, lw + 1) * w ** 2) / np.sum(w ** 2)
    elif 'A' in m:
        t0 = np.sum(np.arange(1, lw + 1) * w) / np.sum(w)
    else:
        t0 = (1 + lw) / 2.0
    t = t0 + hop * np.arange(nf)

    return f, t, w
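As a minimal standalone sketch of the framing step (plain numpy with illustrative names; v_enframe adds the padding, scaling and DFT modes on top of this):

```python
import numpy as np

def enframe_sketch(x, win, hop):
    """Split x into overlapping windowed frames, one per row (sketch only)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(win, dtype=float)
    lw = len(w)
    nf = max((len(x) - lw) // hop + 1, 0)   # number of full frames
    # index matrix: row i holds the sample indices of frame i
    idx = hop * np.arange(nf)[:, None] + np.arange(lw)[None, :]
    return x[idx] * w                        # window applied to each row

f = enframe_sketch(np.arange(10.0), np.ones(4), 2)  # 50% overlap
# rows start at samples 0, 2, 4, 6 -> shape (4, 4)
```

Vectorised fancy indexing replaces an explicit per-frame loop; the real function builds the same index matrix before handling the final partial frame.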

v_overlapadd

V_OVERLAPADD - Join overlapping frames together.

v_overlapadd(f, win=None, inc=None) -> ndarray

Join overlapping frames together.

Parameters:

Name Type Description Default
f (ndarray, shape(nr, nw))

Frames to be added together, one frame per row.

required
win array_like or dict

Window function to multiply each frame, or saved state dict. If omitted, a rectangular window is used.

None
inc int

Time increment (in samples) between successive frames. Default is nw.

None

Returns:

Name Type Description
x ndarray

Output signal of length nw + (nr-1)*inc.

Source code in pyvoicebox/v_overlapadd.py
def v_overlapadd(f, win=None, inc=None) -> np.ndarray:
    """Join overlapping frames together.

    Parameters
    ----------
    f : ndarray, shape (nr, nw)
        Frames to be added together, one frame per row.
    win : array_like or dict, optional
        Window function to multiply each frame, or saved state dict.
        If omitted, a rectangular window is used.
    inc : int, optional
        Time increment (in samples) between successive frames.
        Default is nw.

    Returns
    -------
    x : ndarray
        Output signal of length nw + (nr-1)*inc.
    """
    f = np.asarray(f, dtype=float)
    if f.ndim == 1:
        f = f.reshape(1, -1)
    nr, nf = f.shape

    if win is None:
        if inc is None:
            inc = nf
        w = None
    elif isinstance(win, dict):
        w = win['w']
        inc = win['inc']
        xx = win['xx']
    else:
        win = np.asarray(win, dtype=float).ravel()
        if inc is None:
            if len(win) == 1 and win[0] == int(win[0]):
                inc = int(win[0])
                w = None
            else:
                inc = nf
                w = win
                if len(w) != nf:
                    raise ValueError('window length does not match frame size')
                if np.all(w == 1):
                    w = None
        else:
            w = win
            if len(w) != nf:
                raise ValueError('window length does not match frame size')
            if np.all(w == 1):
                w = None

    if not isinstance(win, dict):
        xx = None

    nb = int(np.ceil(nf / inc))  # number of overlap buffers
    no = nf + (nr - 1) * inc     # total output length

    z = np.zeros((no, nb))

    for i in range(nr):
        buf_idx = i % nb
        start = i * inc
        frame = f[i, :] * w if w is not None else f[i, :]
        z[start:start + nf, buf_idx] += frame

    x = np.sum(z, axis=1)

    if xx is not None and len(xx) > 0:
        x[:len(xx)] += xx

    return x
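The inverse of framing can be sketched the same way (illustrative names; without v_overlapadd's window handling and chunk state): each frame is added into the output at multiples of the increment.

```python
import numpy as np

def overlapadd_sketch(f, inc):
    """Sum frames (one per row) into an output signal at offsets of inc samples."""
    nr, nw = f.shape
    x = np.zeros(nw + (nr - 1) * inc)
    for i in range(nr):
        x[i * inc:i * inc + nw] += f[i]   # accumulate each frame in place
    return x

f = np.ones((3, 4))
x_cat = overlapadd_sketch(f, 4)   # non-overlapping: plain concatenation
x_half = overlapadd_sketch(f, 2)  # 50% overlap: interior samples sum to 2
```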

v_fram2wav

V_FRAM2WAV - Convert frame values to a continuous waveform.

v_fram2wav(x, tt, mode='l') -> tuple[ndarray, ndarray]

Convert frame values to a continuous waveform.

Parameters:

Name Type Description Default
x (ndarray, shape(nf) or (nf, p))

Input signal: one row per frame.

required
tt (ndarray, shape(nf, 2) or (nf, 3))

Frame specifications. Each row: [start_sample, end_sample, flag]. flag=1 for start of new spurt. If tt has only 2 columns, spurts are auto-detected from gaps.

required
mode str

'z' for zero-order hold, 'l' for linear interpolation (default).

'l'

Returns:

Name Type Description
w (ndarray, shape(n, p))

Interpolated waveform of length n = tt[-1, 1].

s (ndarray, shape(ns, 2))

Start and end sample numbers of each spurt.

Source code in pyvoicebox/v_fram2wav.py
def v_fram2wav(x, tt, mode='l') -> tuple[np.ndarray, np.ndarray]:
    """Convert frame values to a continuous waveform.

    Parameters
    ----------
    x : ndarray, shape (nf,) or (nf, p)
        Input signal: one row per frame.
    tt : ndarray, shape (nf, 2) or (nf, 3)
        Frame specifications. Each row: [start_sample, end_sample, flag].
        flag=1 for start of new spurt. If tt has only 2 columns, spurts
        are auto-detected from gaps.
    mode : str, optional
        'z' for zero-order hold, 'l' for linear interpolation (default).

    Returns
    -------
    w : ndarray, shape (n, p)
        Interpolated waveform of length n = tt[-1, 1].
    s : ndarray, shape (ns, 2)
        Start and end sample numbers of each spurt.
    """
    x = np.asarray(x, dtype=float)
    tt = np.asarray(tt, dtype=float)

    if x.ndim == 1:
        x = x[:, np.newaxis]

    nf, p = x.shape
    n = int(np.round(tt[-1, 1]))
    w = np.full((n, p), np.nan)
    nt = tt.shape[1]

    # MATLAB 1-based indexing: ceil(tt(:,1)) and floor(tt(:,2))
    ix1 = np.ceil(tt[:, 0]).astype(int)   # start of frame sample (1-based)
    ix2 = np.floor(tt[:, 1]).astype(int)  # end of frame sample (1-based)

    # Determine spurt boundaries
    if nt > 2:
        ty = (tt[:, 2] > 0).astype(float)
    else:
        ty = np.zeros(nf)
        ty[1:] = (ix1[1:] > ix2[:-1] + 1).astype(float)

    ty[0] = 1  # first frame always starts a spurt
    # NaN always ends previous spurt
    nan_mask = np.any(np.isnan(x), axis=1)
    ty[nan_mask] = 1
    # NaN always forces a new spurt for the frame after
    nan_idx = np.where(nan_mask[:nf - 1])[0]
    ty[nan_idx + 1] = 1

    ty = ty.astype(float)
    # Encode: bit 0 = starts a spurt, bit 1 = ends a spurt
    # MATLAB: ty(1:end-1) = ty(1:end-1) + 2*ty(2:end)
    ty[:nf - 1] = ty[:nf - 1] + 2 * ty[1:]
    ty[nf - 1] = ty[nf - 1] + 2  # last frame always ends a spurt

    nx = ix2 - ix1 + 1

    if 'z' in mode:
        # Zero-order hold
        for i in range(nf):
            if nx[i] > 0:
                # Convert to 0-based
                w[ix1[i] - 1:ix2[i], :] = np.tile(x[i, :], (nx[i], 1))
    else:
        # Linear interpolation (default)
        ttm = (tt[:, 0] + tt[:, 1]) / 2.0  # mid point of frame (1-based)
        ixm = np.floor(ttm).astype(int)     # end of first half (1-based)

        for i in range(nf):
            if nx[i] > 0:
                tyi = int(ty[i])
                if tyi == 3:  # isolated frame: zero-order hold
                    w[ix1[i] - 1:ix2[i], :] = np.tile(x[i, :], (nx[i], 1))
                else:
                    nxm = ixm[i] - ix1[i] + 1
                    if nxm > 0:
                        if tyi == 1:  # start of spurt
                            grad = (x[i + 1, :] - x[i, :]) / (ttm[i + 1] - ttm[i])
                        else:
                            grad = (x[i, :] - x[i - 1, :]) / (ttm[i] - ttm[i - 1])
                        samples = np.arange(ix1[i], ixm[i] + 1)  # 1-based
                        w[ix1[i] - 1:ixm[i], :] = (
                            np.tile(x[i, :], (nxm, 1)) +
                            (samples[:, np.newaxis] - ttm[i]) * grad[np.newaxis, :]
                        )
                    if nx[i] > nxm:
                        if tyi == 2:  # end of spurt
                            grad = (x[i, :] - x[i - 1, :]) / (ttm[i] - ttm[i - 1])
                        else:
                            grad = (x[i + 1, :] - x[i, :]) / (ttm[i + 1] - ttm[i])
                        n_remaining = ix2[i] - ixm[i]
                        samples = np.arange(ixm[i] + 1, ix2[i] + 1)  # 1-based
                        w[ixm[i]:ix2[i], :] = (
                            np.tile(x[i, :], (n_remaining, 1)) +
                            (samples[:, np.newaxis] - ttm[i]) * grad[np.newaxis, :]
                        )

    # Sort out spurt positions
    ty_clean = ty.copy()
    ty_clean[nan_mask] = 0

    # Find starts (bit 0 set) and ends (bit 1 set)
    starts = np.where(np.bitwise_and(ty_clean.astype(int), 1) > 0)[0]
    ends = np.where(np.bitwise_and(ty_clean.astype(int), 2) > 0)[0]

    s = np.column_stack([ix1[starts], ix2[ends]])

    if p == 1:
        w = w.ravel()

    return w, s
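For a single continuous spurt, the linear-interpolation mode is essentially interpolation of the frame values at the frame midpoints; np.interp reproduces that core behaviour (the numbers below are illustrative, and unlike v_fram2wav, np.interp clamps at the ends instead of extrapolating along the edge gradient, with no spurt or NaN handling):

```python
import numpy as np

frame_vals = np.array([100.0, 110.0, 120.0])     # e.g. one pitch value per frame
tt = np.array([[1, 80], [81, 160], [161, 240]])  # [start, end] sample per frame
mid = tt.mean(axis=1)                            # frame midpoints (1-based)
n = int(tt[-1, 1])                               # output length in samples
w = np.interp(np.arange(1, n + 1), mid, frame_vals)
# w ramps linearly from 100 to 120 between the first and last midpoints
```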

v_stftw

V_STFTW - Short-time Fourier Transform.

v_stftw(x, nw, m='', ov=2, nt=None) -> tuple[ndarray, dict]

Convert time-domain signal to time-frequency domain using STFT.

Parameters:

Name Type Description Default
x array_like

Input signal.

required
nw int

Window length (rounded up to multiple of ov).

required
m str

Mode string including window code.

''
ov int

Overlap factor. Default 2.

2
nt int

DFT length. Default nw.

None

Returns:

Name Type Description
y ndarray

STFT output (frames x frequencies).

so dict

Structure for inverse transformation.

Source code in pyvoicebox/v_stftw.py
def v_stftw(x, nw, m='', ov=2, nt=None) -> tuple[np.ndarray, dict]:
    """Convert time-domain signal to time-frequency domain using STFT.

    Parameters
    ----------
    x : array_like
        Input signal.
    nw : int
        Window length (rounded up to multiple of ov).
    m : str, optional
        Mode string including window code.
    ov : int, optional
        Overlap factor. Default 2.
    nt : int, optional
        DFT length. Default nw.

    Returns
    -------
    y : ndarray
        STFT output (frames x frequencies).
    so : dict
        Structure for inverse transformation.
    """
    x = np.asarray(x, dtype=float).ravel()
    nx = len(x)

    # Round nw up to multiple of ov
    nw = int(np.ceil(nw / ov) * ov)
    nh = nw // ov  # hop size

    if nt is None:
        nt = nw

    # Choose window
    if 'n' in m:
        wa = np.hanning(nw + 1)[:nw]  # Hann
    elif 'c' in m:
        wa = np.cos(np.pi * (np.arange(nw) + 0.5) / nw)
    elif 'R' in m:
        wa = np.ones(nw)
    elif 'M' in m or (ov == 2 and 'm' not in m):
        # sqrt Hamming
        wa = np.sqrt(np.hamming(nw))
    else:
        wa = np.hamming(nw)

    # Pad signal
    if 'e' in m:
        # Pad beginning and end
        pad_start = nw - nh
        pad_end = nw - nh
        x = np.concatenate([np.zeros(pad_start), x, np.zeros(pad_end)])
        nx = len(x)

    # Pad final frame
    if 'z' in m:
        pad = nh - (nx % nh) if nx % nh != 0 else 0
        x = np.concatenate([x, np.zeros(pad)])
    else:
        # Pad with reflected data (the default when 'z' is not given)
        pad = nh - (nx % nh) if nx % nh != 0 else 0
        if pad > 0:
            x = np.concatenate([x, x[-1:-(pad + 1):-1]])

    nx = len(x)
    nframes = max(0, (nx - nw) // nh + 1)
    nf = nt // 2 + 1

    y = np.zeros((nframes, nf), dtype=complex)
    for i in range(nframes):
        start = i * nh
        frame = x[start:start + nw] * wa
        if nt > nw:
            frame = np.concatenate([frame, np.zeros(nt - nw)])
        Y = np.fft.rfft(frame, nt)
        y[i, :] = Y

    so = {
        'nt': nt,
        'nh': nh,
        'nw': nw,
        'wa': wa,
        'ov': ov,
    }

    return y, so
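Stripped of the padding modes and the state dict, the analysis loop above amounts to the following (standalone numpy sketch using the default sqrt-Hamming window for ov=2):

```python
import numpy as np

def stft_sketch(x, nw, ov=2):
    """One rfft row per frame; hop nh = nw // ov, sqrt-Hamming analysis window."""
    nh = nw // ov
    wa = np.sqrt(np.hamming(nw))
    nframes = max(0, (len(x) - nw) // nh + 1)
    y = np.empty((nframes, nw // 2 + 1), dtype=complex)
    for i in range(nframes):
        y[i] = np.fft.rfft(x[i * nh:i * nh + nw] * wa)
    return y, wa

x = np.random.default_rng(0).standard_normal(1024)
y, wa = stft_sketch(x, 128)  # 64-sample hop -> 15 frames of 65 frequency bins
```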

v_istftw

V_ISTFTW - Inverse Short-time Fourier Transform.

v_istftw(y, so, io=None) -> ndarray

Convert time-frequency domain back to time domain using inverse STFT.

Parameters:

Name Type Description Default
y array_like

STFT data (frames x frequencies).

required
so dict

Structure from v_stftw containing window and transform parameters.

required
io dict

State from previous call for chunked processing.

None

Returns:

Name Type Description
z ndarray

Reconstructed time-domain signal.

Source code in pyvoicebox/v_istftw.py
def v_istftw(y, so, io=None) -> np.ndarray:
    """Convert time-frequency domain back to time domain using inverse STFT.

    Parameters
    ----------
    y : array_like
        STFT data (frames x frequencies).
    so : dict
        Structure from v_stftw containing window and transform parameters.
    io : dict, optional
        State from previous call for chunked processing.

    Returns
    -------
    z : ndarray
        Reconstructed time-domain signal.
    """
    y = np.asarray(y)
    nframes = y.shape[0]
    nt = so['nt']
    nh = so['nh']
    nw = so['nw']
    wa = so['wa']

    # Synthesis window (same as analysis for now)
    ws = wa.copy()

    # Output length
    out_len = (nframes - 1) * nh + nw
    z = np.zeros(out_len)
    w_sum = np.zeros(out_len)

    for i in range(nframes):
        # Inverse FFT
        frame = np.fft.irfft(y[i, :], nt)[:nw]
        frame *= ws

        start = i * nh
        z[start:start + nw] += frame
        w_sum[start:start + nw] += wa * ws

    # Normalize by window sum
    nz = w_sum > 1e-10
    z[nz] /= w_sum[nz]

    return z
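The synthesis side can be sketched standalone too. Because each inverse-transformed frame equals the original frame times the analysis window wa, multiplying by the synthesis window ws = wa and dividing by the accumulated wa*ws envelope recovers the covered samples exactly, whatever the window shape (sketch only, no chunk state):

```python
import numpy as np

def istft_sketch(y, wa, nh):
    """irfft each row, window, overlap-add, divide by the window-energy sum."""
    nframes = y.shape[0]
    nw = len(wa)
    z = np.zeros((nframes - 1) * nh + nw)
    wsum = np.zeros_like(z)
    for i in range(nframes):
        z[i * nh:i * nh + nw] += np.fft.irfft(y[i], nw) * wa
        wsum[i * nh:i * nh + nw] += wa * wa
    ok = wsum > 1e-10            # avoid dividing where no window energy landed
    z[ok] /= wsum[ok]
    return z

nw, nh = 128, 64
wa = np.sqrt(np.hamming(nw))
x = np.random.default_rng(1).standard_normal(1000)
nframes = (len(x) - nw) // nh + 1
y = np.array([np.fft.rfft(x[i * nh:i * nh + nw] * wa) for i in range(nframes)])
z = istft_sketch(y, wa, nh)  # matches x over the covered samples
```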

v_filtbankm

V_FILTBANKM - General filterbank matrix (mel/bark/erb/linear).

v_filtbankm(
    p, n, fs, fl=0, fh=None, w=""
) -> tuple[ndarray, ndarray]

Determine matrix for a filterbank with various frequency scales.

Simplified implementation supporting mel, bark, erb, linear scales.

Parameters:

Name Type Description Default
p int

Number of filters in filterbank.

required
n int

Length of DFT.

required
fs float

Sample rate in Hz.

required
fl float

Low frequency edge in Hz. Default 0.

0
fh float

High frequency edge in Hz. Default fs/2.

None
w str

Options: 'm'=mel, 'b'=bark, 'e'=erb, 'f'=linear [default].

''

Returns:

Name Type Description
x (ndarray, shape(p, 1 + n // 2))

Filterbank matrix.

cf (ndarray, shape(p))

Filter center frequencies in Hz.

Source code in pyvoicebox/v_filtbankm.py
def v_filtbankm(p, n, fs, fl=0, fh=None, w='') -> tuple[np.ndarray, np.ndarray]:
    """Determine matrix for a filterbank with various frequency scales.

    Simplified implementation supporting mel, bark, erb, linear scales.

    Parameters
    ----------
    p : int
        Number of filters in filterbank.
    n : int
        Length of DFT.
    fs : float
        Sample rate in Hz.
    fl : float, optional
        Low frequency edge in Hz. Default 0.
    fh : float, optional
        High frequency edge in Hz. Default fs/2.
    w : str, optional
        Options: 'm'=mel, 'b'=bark, 'e'=erb, 'f'=linear [default].

    Returns
    -------
    x : ndarray, shape (p, 1+n//2)
        Filterbank matrix.
    cf : ndarray, shape (p,)
        Filter center frequencies in Hz.
    """
    if fh is None:
        fh = fs / 2.0

    nf = 1 + n // 2  # number of frequency bins
    fax = np.arange(nf) * fs / n  # frequency axis

    # Choose frequency scale
    if 'm' in w:
        frq2scale = lambda x: v_frq2mel(x)[0]
        scale2frq = lambda x: v_mel2frq(x)[0]
    elif 'b' in w:
        frq2scale = lambda x: v_frq2bark(x)[0]
        scale2frq = lambda x: v_bark2frq(x)[0]
    elif 'e' in w:
        frq2scale = lambda x: v_frq2erb(x)[0]
        scale2frq = lambda x: v_erb2frq(x)[0]
    else:
        frq2scale = lambda x: x
        scale2frq = lambda x: x

    # Convert frequency limits to chosen scale
    fl_s = frq2scale(fl)
    fh_s = frq2scale(fh)

    # Create p+2 equally spaced points in the chosen scale (including edges)
    cf_s = np.linspace(fl_s, fh_s, p + 2)
    cf_hz = scale2frq(cf_s)

    # Center frequencies (excluding edges)
    cf = cf_hz[1:-1]

    # Build the filterbank matrix
    x = np.zeros((p, nf))
    for i in range(p):
        lo = cf_hz[i]
        mid = cf_hz[i + 1]
        hi = cf_hz[i + 2]

        # Rising slope
        mask_rise = (fax >= lo) & (fax <= mid)
        if mid > lo:
            x[i, mask_rise] = (fax[mask_rise] - lo) / (mid - lo)

        # Falling slope
        mask_fall = (fax >= mid) & (fax <= hi)
        if hi > mid:
            x[i, mask_fall] = (hi - fax[mask_fall]) / (hi - mid)

    return x, cf
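A standalone sketch of the same triangular-filterbank construction, using the common HTK mel formula mel = 2595*log10(1 + f/700) (which may differ in detail from pyvoicebox's v_frq2mel):

```python
import numpy as np

def mel_fbank_sketch(p, n, fs, fl=0.0, fh=None):
    """p triangular filters, mel-spaced between fl and fh, over n-point DFT bins."""
    fh = fs / 2.0 if fh is None else fh
    mel = lambda f: 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(fl), mel(fh), p + 2))  # p + 2 edge frequencies
    fax = np.arange(n // 2 + 1) * fs / n                # DFT bin frequencies
    x = np.zeros((p, len(fax)))
    for i in range(p):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rise = (fax >= lo) & (fax <= mid)               # rising slope of filter i
        fall = (fax > mid) & (fax <= hi)                # falling slope of filter i
        x[i, rise] = (fax[rise] - lo) / (mid - lo)
        x[i, fall] = (hi - fax[fall]) / (hi - mid)
    return x, edges[1:-1]

fb, cf = mel_fbank_sketch(10, 256, 8000)
mel_power = fb @ np.ones(129)  # apply the filterbank to a (flat) power spectrum
```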

v_gammabank

V_GAMMABANK - Gammatone filterbank (stub).

v_gammabank(*args, **kwargs) -> None

Create a bank of gammatone filters.

The full MATLAB implementation creates gammatone filters at ERB-spaced center frequencies. A stub is provided.

Raises:

Type Description
NotImplementedError

Full implementation pending.

Source code in pyvoicebox/v_gammabank.py
def v_gammabank(*args, **kwargs) -> None:
    """Create a bank of gammatone filters.

    The full MATLAB implementation creates gammatone filters at
    ERB-spaced center frequencies. A stub is provided.

    Raises
    ------
    NotImplementedError
        Full implementation pending.
    """
    raise NotImplementedError(
        "v_gammabank full implementation pending. "
        "Consider using brian2hears or gammatone Python packages."
    )

v_spgrambw

V_SPGRAMBW - Spectrogram computation with configurable bandwidth.

v_spgrambw(
    s, fs, mode="", bw=200, fmax=None, db=40, tinc=0
) -> tuple[ndarray, ndarray, ndarray]

Compute spectrogram with configurable bandwidth (no plotting).

Parameters:

Name Type Description Default
s array_like

Speech signal or power spectrum array.

required
fs float or array_like

Sample frequency in Hz, or [fs, t1].

required
mode str

Mode options:
  'p' : output power per decade
  'P' : output power per mel/bark/erb
  'd' : output in dB
  'm' : mel scale
  'b' : bark scale
  'e' : erb scale
  'l' : log10 Hz scale

''
bw float

Bandwidth resolution in Hz.

200
fmax array_like

Frequency range [Fmin, Fstep, Fmax].

None
db float

dB range for plotting/clipping.

40
tinc float

Output frame increment in seconds.

0

Returns:

Name Type Description
t ndarray

Time axis values (seconds).

f ndarray

Frequency axis values.

b ndarray

Spectrogram values (power per Hz unless mode changes this).

Source code in pyvoicebox/v_spgrambw.py
def v_spgrambw(s, fs, mode='', bw=200, fmax=None, db=40, tinc=0) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute spectrogram with configurable bandwidth (no plotting).

    Parameters
    ----------
    s : array_like
        Speech signal or power spectrum array.
    fs : float or array_like
        Sample frequency in Hz, or [fs, t1].
    mode : str
        Mode options:
          'p' : output power per decade
          'P' : output power per mel/bark/erb
          'd' : output in dB
          'm' : mel scale
          'b' : bark scale
          'e' : erb scale
          'l' : log10 Hz scale
    bw : float
        Bandwidth resolution in Hz.
    fmax : array_like, optional
        Frequency range [Fmin, Fstep, Fmax].
    db : float
        dB range for plotting/clipping.
    tinc : float
        Output frame increment in seconds.

    Returns
    -------
    t : ndarray
        Time axis values (seconds).
    f : ndarray
        Frequency axis values.
    b : ndarray
        Spectrogram values (power per Hz unless mode changes this).
    """
    s = np.asarray(s, dtype=float)
    if s.ndim == 1:
        ns1, ns2 = len(s), 1
    else:
        ns1, ns2 = s.shape

    fs_arr = np.atleast_1d(np.asarray(fs, dtype=float))
    if len(fs_arr) < 2:
        fs_arr = np.array([fs_arr[0], 1.0 / fs_arr[0]])

    if tinc == 0:
        tinc = 1.81 / (4 * bw)

    # Determine frequency scale
    mdsw = ' '
    for ch in mode:
        if ch in 'lmbe':
            mdsw = ch

    nfrq = 257
    if ns2 == 1:
        fnyq = fs_arr[0] / 2.0
    else:
        if len(fs_arr) < 3:
            fs_arr = np.append(fs_arr, fs_arr[0] * 0.25)
        if len(fs_arr) < 4:
            fs_arr = np.append(fs_arr, 0)
        fnyq = fs_arr[3] + (ns2 - 1) * fs_arr[2]

    flmin = 30.0
    if fmax is None:
        if mdsw == 'l':
            fx = np.linspace(np.log10(flmin), np.log10(fnyq), nfrq)
        elif mdsw == 'm':
            fx = np.linspace(0, v_frq2mel(fnyq)[0], nfrq)
        elif mdsw == 'b':
            fx = np.linspace(0, v_frq2bark(fnyq)[0], nfrq)
        elif mdsw == 'e':
            fx = np.linspace(0, v_frq2erb(fnyq)[0], nfrq)
        else:
            fx = np.arange(nfrq) * fnyq / (nfrq - 1)
    else:
        fmax = np.atleast_1d(np.asarray(fmax, dtype=float))
        fmaxu = fmax.copy()
        if 'h' in mode:
            if mdsw == 'l':
                fmaxu = np.log10(fmax)
            elif mdsw == 'm':
                fmaxu = v_frq2mel(fmax)[0]
            elif mdsw == 'b':
                fmaxu = v_frq2bark(fmax)[0]
            elif mdsw == 'e':
                fmaxu = v_frq2erb(fmax)[0]

        if len(fmaxu) < 2:
            if mdsw == 'l':
                fx = np.linspace(np.log10(flmin), fmaxu[0], nfrq)
            else:
                fx = np.linspace(0, fmaxu[0], nfrq)
        elif len(fmaxu) < 3:
            fx = np.linspace(fmaxu[0], fmaxu[1], nfrq)
        else:
            fx = np.arange(fmaxu[0], fmaxu[2] + fmaxu[1] / 2, fmaxu[1])
            nfrq = len(fx)

    # Convert frequency range to Hz
    if mdsw == 'l':
        f_hz = 10.0 ** fx
    elif mdsw == 'm':
        f_hz = v_mel2frq(fx)[0]
    elif mdsw == 'b':
        f_hz = v_bark2frq(fx)[0]
    elif mdsw == 'e':
        f_hz = v_erb2frq(fx)[0]
    else:
        f_hz = fx.copy()

    f = fx.copy()  # output frequencies in native units

    # Calculate spectrogram
    if ns2 == 1:
        winlen = int(round(1.81 * fs_arr[0] / bw))
        win = 0.54 + 0.46 * np.cos(np.arange(1 - winlen, winlen + 1, 2) * np.pi / winlen)
        ninc = max(round(tinc * fs_arr[0]), 1)
        fftlen = int(2 ** np.ceil(np.log2(4 * winlen)))
        win = win / np.sqrt(np.sum(win ** 2))

        sf, t, *_ = v_enframe(s, win, ninc)
        t = fs_arr[1] + (t - 1) / fs_arr[0]  # time axis
        b_spec = v_rfft(sf, fftlen, 1)
        b_spec = np.real(b_spec * np.conj(b_spec)) * 2.0 / fs_arr[0]
        b_spec[:, 0] *= 0.5
        b_spec[:, -1] *= 0.5
        fb = np.arange(fftlen // 2 + 1) * fs_arr[0] / fftlen
    else:
        b_spec = s.copy()
        t = fs_arr[1] + np.arange(ns1) / fs_arr[0]
        fb = fs_arr[3] + np.arange(ns2) * fs_arr[2]
        fftlen = ns2

    nfr = len(t)

    # Jacobian scaling: convert power per Hz to power per decade ('p')
    # or power per mel/bark/erb ('P')
    punit = 'P' in mode
    if 'p' in mode or (punit and mdsw == 'l'):
        b_spec = b_spec * fb[np.newaxis, :] * np.log(10)
    elif punit and mdsw == 'm':
        b_spec = b_spec * ((700 + fb) * np.log(1 + 1000 / 700) / 1000)[np.newaxis, :]
    elif punit and mdsw == 'b':
        b_spec = b_spec * ((1960 + fb) ** 2 / 52547.6)[np.newaxis, :]
    elif punit and mdsw == 'e':
        b_spec = b_spec * (6.23 * fb ** 2 + 93.39 * fb + 28.52)[np.newaxis, :]

    # Interpolate onto desired frequency axis
    b_out = np.zeros((nfr, nfrq))
    for i in range(nfrq):
        # Find interpolation position
        fi = f_hz[i]
        idx = np.searchsorted(fb, fi)
        if idx <= 0:
            b_out[:, i] = b_spec[:, 0]
        elif idx >= len(fb):
            b_out[:, i] = b_spec[:, -1]
        else:
            frac = (fi - fb[idx - 1]) / (fb[idx] - fb[idx - 1])
            b_out[:, i] = (1 - frac) * b_spec[:, idx - 1] + frac * b_spec[:, idx]

    if 'd' in mode:
        b_out = 10.0 * np.log10(np.maximum(b_out, np.max(b_out) * 1e-30))

    return t, f, b_out
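The key parameter relations can be checked numerically: a Hamming-shaped window of winlen samples has a bandwidth of roughly 1.81*fs/winlen Hz, so the requested bw fixes the window length, the default hop, and the FFT size. The values below follow the formulas in the source; fs and bw are illustrative.

```python
import numpy as np

fs, bw = 16000, 200                              # illustrative sample rate / bandwidth
winlen = round(1.81 * fs / bw)                   # analysis window: 145 samples (~9 ms)
tinc = 1.81 / (4 * bw)                           # default frame increment in seconds
fftlen = int(2 ** np.ceil(np.log2(4 * winlen)))  # zero-padded FFT length: 1024
```

A wider bandwidth therefore means a shorter window (better time resolution) and a smaller default hop, and vice versa.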

v_modspect

V_MODSPECT - Calculate modulation spectrum of a signal.

v_modspect(
    s, fs=11025, m="", nf=None, nq=None
) -> tuple[ndarray, ndarray, ndarray, ndarray]

Calculate the modulation spectrum of a signal.

This is a simplified implementation that computes the mel spectrogram and then the modulation spectrum for each mel channel.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sample rate in Hz.

11025
m str

Mode string.

''
nf array_like

[num_mel_bins, fmin, fmax, num_dct].

None
nq array_like

[num_mod_bins, mod_fmin, mod_fmax, num_mod_dct].

None

Returns:

Name Type Description
c ndarray

Modulation spectrum (mod_freq, mel_freq, time).

qq ndarray

Modulation frequency centres.

ff ndarray

Mel frequency centres.

tt ndarray

Time axis.

Source code in pyvoicebox/v_modspect.py
def v_modspect(s, fs=11025, m='', nf=None, nq=None) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Calculate the modulation spectrum of a signal.

    This is a simplified implementation that computes the mel spectrogram
    and then the modulation spectrum for each mel channel.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sample rate in Hz.
    m : str
        Mode string.
    nf : array_like, optional
        [num_mel_bins, fmin, fmax, num_dct].
    nq : array_like, optional
        [num_mod_bins, mod_fmin, mod_fmax, num_mod_dct].

    Returns
    -------
    c : ndarray
        Modulation spectrum (mod_freq, mel_freq, time).
    qq : ndarray
        Modulation frequency centres.
    ff : ndarray
        Mel frequency centres.
    tt : ndarray
        Time axis.
    """
    s = np.asarray(s, dtype=float).ravel()

    # Parse nf parameters
    nf_defaults = [0.1, 40, min(10000, fs / 2), 25]  # nf[0] < 1 selects auto bin count
    if nf is not None:
        nf = list(np.atleast_1d(nf))
        while len(nf) < 4:
            nf.append(nf_defaults[len(nf)])
    else:
        nf = nf_defaults

    nq_defaults = [0.1, 0.5, 20, 15]
    if nq is not None:
        nq = list(np.atleast_1d(nq))
        while len(nq) < 4:
            nq.append(nq_defaults[len(nq)])
    else:
        nq = nq_defaults

    # Step 1: Short-time Fourier Transform
    n_fft = int(2 ** np.floor(np.log2(0.03 * fs)))
    inc = n_fft // 2
    win = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n_fft) / (n_fft - 1))
    frames, tt, *_ = v_enframe(s, win, inc)
    tt = tt / fs

    spec = v_rfft(frames, n_fft, 1)
    pw = np.real(spec * np.conj(spec))

    # Step 2: Mel filterbank
    n_mel = int(nf[0]) if nf[0] >= 1 else int(np.ceil(4.6 * np.log10(fs)))
    mb, mc, _, _ = v_melbankm(n_mel, n_fft, fs, nf[1] / fs, nf[2] / fs)
    mel_spec = np.sqrt(np.maximum(mb.toarray() @ pw.T, 1e-20))  # (n_mel, n_frames)

    n_frames = mel_spec.shape[1]
    ff = mc

    # Step 3: Modulation spectrum per mel channel
    mod_inc = max(1, round(0.01 * fs / inc))  # ~10ms in frame units
    mod_win_len = max(4, min(n_frames, round(0.2 * fs / inc)))
    mod_nfft = int(2 ** np.ceil(np.log2(mod_win_len)))

    n_mod_frames = max(1, (n_frames - mod_win_len) // mod_inc + 1)
    n_mod_bins = mod_nfft // 2 + 1

    mod_freqs = np.arange(n_mod_bins) / mod_nfft * (fs / inc)
    c = np.zeros((n_mod_bins, n_mel, n_mod_frames))

    mod_win = np.hamming(mod_win_len)
    for i_mel in range(n_mel):
        sig = mel_spec[i_mel, :]
        for i_frame in range(n_mod_frames):
            start = i_frame * mod_inc
            end = start + mod_win_len
            if end > n_frames:
                break
            seg = sig[start:end] * mod_win
            f_mod = np.fft.rfft(seg, mod_nfft)
            c[:, i_mel, i_frame] = np.abs(f_mod) ** 2

    qq = mod_freqs
    tt_out = tt[::mod_inc][:n_mod_frames] if len(tt) > 0 else np.arange(n_mod_frames)

    return c, qq, ff, tt_out
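The two-stage analysis above (a spectral envelope per frequency band, then a spectrum of that envelope over time) can be sketched standalone. Everything below uses plain NumPy and illustrative variable names, not the library API: a 1 kHz tone is amplitude-modulated at 4 Hz, and the modulation spectrum of the carrier band recovers that 4 Hz rate.

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
sig = np.sin(2 * np.pi * 1000 * t) * (1 + 0.8 * np.sin(2 * np.pi * 4 * t))

# Stage 1: STFT magnitude (Hamming window, 50% overlap)
nwin, hop = 256, 128
win = np.hamming(nwin)
n_frames = (len(sig) - nwin) // hop + 1
frames = np.stack([sig[i * hop:i * hop + nwin] * win for i in range(n_frames)])
mag = np.abs(np.fft.rfft(frames, axis=1))        # (frames, bins)

# Stage 2: spectrum of the band envelope over time
band = round(1000 * nwin / fs)                   # bin holding the 1 kHz carrier
env = mag[:, band] - mag[:, band].mean()         # band envelope, DC removed
mod = np.abs(np.fft.rfft(env, 128))              # modulation spectrum of that band
mod_freqs = np.fft.rfftfreq(128, d=hop / fs)     # envelope sampled at fs/hop
f_peak = mod_freqs[1 + np.argmax(mod[1:])]       # strongest non-DC modulation
```

The detected `f_peak` sits near the imposed 4 Hz modulation rate.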

v_correlogram

V_CORRELOGRAM - Calculate correlogram.

v_correlogram

v_correlogram(
    x, inc=128, nw=None, nlag=None, m="h", fs=1
) -> tuple[ndarray, ndarray]

Calculate correlogram.

Parameters:

Name Type Description Default
x ndarray

Input signal (samples, channels) from a filterbank.

required
inc int

Frame increment in samples.

128
nw int or array_like

Window length in samples or window function. Default: inc.

None
nlag int

Number of lags to calculate. Default: nw.

None
m str

Mode: 'h' for Hamming window.

'h'
fs float

Sample frequency.

1

Returns:

Name Type Description
y ndarray

Correlogram (nlag, channels, frames).

ty ndarray

Time of window centre for each frame.

Source code in pyvoicebox/v_correlogram.py
def v_correlogram(x, inc=128, nw=None, nlag=None, m='h', fs=1) -> tuple[np.ndarray, np.ndarray]:
    """Calculate correlogram.

    Parameters
    ----------
    x : ndarray
        Input signal (samples, channels) from a filterbank.
    inc : int
        Frame increment in samples.
    nw : int or array_like, optional
        Window length in samples or window function. Default: inc.
    nlag : int, optional
        Number of lags to calculate. Default: nw.
    m : str
        Mode: 'h' for Hamming window.
    fs : float
        Sample frequency.

    Returns
    -------
    y : ndarray
        Correlogram (nlag, channels, frames).
    ty : ndarray
        Time of window centre for each frame.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    nx, nc = x.shape

    if nw is None:
        nw = inc

    nw_arr = np.atleast_1d(np.asarray(nw, dtype=float))
    if len(nw_arr) > 1:
        win = nw_arr.ravel()
        nwin = len(win)
    else:
        nwin = int(nw_arr[0])
        if 'h' in m:
            win = v_windows(3, nwin).ravel()
        else:
            win = np.ones(nwin)

    if nlag is None:
        nlag = nwin

    nwl = nwin + nlag - 1
    nt = int(2 ** np.ceil(np.log2(nwl)))
    nf_frames = int(np.floor((nx - nwl + inc) / inc))

    if nf_frames <= 0:
        return np.zeros((nlag, nc, 0)), np.array([])

    wincg = np.dot(np.arange(1, nwin + 1), win ** 2) / np.dot(win, win)
    fwin = np.conj(np.fft.fft(win, nt))

    y = np.zeros((nlag, nc, nf_frames))

    for iframe in range(nf_frames):
        for ic in range(nc):
            start = iframe * inc
            x_seg = x[start:start + nwl, ic]
            if len(x_seg) < nwl:
                x_seg = np.concatenate([x_seg, np.zeros(nwl - len(x_seg))])

            x_win = x_seg[:nwin] * win
            X_win = np.fft.fft(x_win, nt)
            X_full = np.fft.fft(x_seg, nt)
            v = np.fft.ifft(np.conj(X_win) * X_full)

            x_sq = x_seg ** 2
            X_sq = np.fft.fft(x_sq, nt)
            w_energy = np.real(np.fft.ifft(fwin * X_sq))
            w_energy = np.maximum(w_energy[:nlag], 0)
            w0 = w_energy[0]
            norm = np.sqrt(w_energy * w0)
            norm[norm == 0] = 1.0

            # x was cast to real float above, so v is (numerically) real
            y[:, ic, iframe] = np.real(v[:nlag]) / norm

    ty = np.arange(nf_frames) * inc + wincg
    return y, ty
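The core of the inner loop above is a lagged windowed product computed by FFT: `ifft(conj(FFT(windowed segment)) * FFT(full segment))` at lag L equals the sum over n of `win[n]*x[n]*x[n+L]`. A minimal single-frame sketch, checked against the direct summation:

```python
import numpy as np

rng = np.random.default_rng(0)
nwin, nlag = 64, 32
nwl = nwin + nlag - 1            # samples needed to cover all lags
x_seg = rng.standard_normal(nwl)
win = np.hamming(nwin)

# FFT size large enough that the circular correlation has no wraparound
nt = int(2 ** np.ceil(np.log2(nwl)))
v = np.fft.ifft(np.conj(np.fft.fft(x_seg[:nwin] * win, nt)) * np.fft.fft(x_seg, nt))
fast = np.real(v[:nlag])

# Direct evaluation of the same lagged products
direct = np.array([np.dot(x_seg[:nwin] * win, x_seg[l:l + nwin])
                   for l in range(nlag)])
```

Both routes agree to floating-point precision, which is why the frame loop can use a single FFT pair per channel.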

v_ewgrpdel

V_EWGRPDEL - Energy-weighted group delay waveform.

v_ewgrpdel

v_ewgrpdel(x, w=None, m=None) -> tuple[ndarray, int]

Calculate energy-weighted group delay waveform.

Parameters:

Name Type Description Default
x array_like

Input signal.

required
w int or array_like

Window or window length (default: Hamming of length(x)).

None
m int

Center sample of window (1-based, default: (1+len(w))/2).

None

Returns:

Name Type Description
y ndarray

Energy-weighted group delay waveform.

mm int

Actual value of m used.

Source code in pyvoicebox/v_ewgrpdel.py
def v_ewgrpdel(x, w=None, m=None) -> tuple[np.ndarray, int]:
    """Calculate energy-weighted group delay waveform.

    Parameters
    ----------
    x : array_like
        Input signal.
    w : int or array_like, optional
        Window or window length (default: Hamming of length(x)).
    m : int, optional
        Center sample of window (1-based, default: (1+len(w))/2).

    Returns
    -------
    y : ndarray
        Energy-weighted group delay waveform.
    mm : int
        Actual value of m used.
    """
    x = np.asarray(x, dtype=float).ravel()

    if w is None:
        w = hamming(len(x), sym=True)
    elif np.isscalar(w):
        w = hamming(int(w), sym=True)
    else:
        w = np.asarray(w, dtype=float).ravel()

    w = w ** 2
    lw = len(w)

    if m is None:
        m = (1 + lw) / 2.0
    m = max(int(round(m)), 1)
    mm = m

    wn = w * (m - np.arange(1, lw + 1))
    x2 = np.concatenate([x, np.zeros(m - 1)]) ** 2
    yn = lfilter(wn, [1.0], x2)
    yd = lfilter(w, [1.0], x2)
    yd[yd < np.finfo(float).eps] = 1.0
    y = yn[m - 1:] / yd[m - 1:]
    return y, mm
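For a lone excitation impulse, the energy-weighted group delay crosses zero exactly at the impulse position, which is what makes it useful for glottal event detection. A numpy-only sketch of the same computation, with `np.convolve` standing in for `lfilter`:

```python
import numpy as np

lw = 31                        # window length
m = (1 + lw) // 2              # centre tap (1-based), here 16
w2 = np.hamming(lw) ** 2       # squared analysis window
x = np.zeros(200)
p = 120
x[p] = 1.0                     # single excitation impulse
x2 = np.concatenate([x, np.zeros(m - 1)]) ** 2

# Numerator taps weight each squared sample by its offset from the centre
wn = w2 * (m - np.arange(1, lw + 1))
yn = np.convolve(x2, wn)[:len(x2)]   # causal FIR filtering == lfilter(wn, 1, x2)
yd = np.convolve(x2, w2)[:len(x2)]
yd[yd < np.finfo(float).eps] = 1.0   # avoid division by zero in silence
y = (yn / yd)[m - 1:]                # trim so y is aligned with x
```

`y` crosses zero exactly at sample `p`, positive just before and negative just after.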

Pitch and voicing

v_fxpefac

V_FXPEFAC - PEFAC pitch extraction algorithm.

v_fxpefac

v_fxpefac(
    s, fs, tinc=0.01
) -> tuple[ndarray, ndarray, ndarray]

PEFAC pitch extraction algorithm.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
tinc float

Frame increment in seconds.

0.01

Returns:

Name Type Description
fx ndarray

Estimated pitch frequency per frame (0 = unvoiced).

tt ndarray

Time of each frame centre (seconds).

pv ndarray

Probability of voicing per frame.

References

[1] Gonzalez & Brookes, PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise, IEEE/ACM TASLP, 2014.

Source code in pyvoicebox/v_fxpefac.py
def v_fxpefac(s, fs, tinc=0.01) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """PEFAC pitch extraction algorithm.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    tinc : float
        Frame increment in seconds.

    Returns
    -------
    fx : ndarray
        Estimated pitch frequency per frame (0 = unvoiced).
    tt : ndarray
        Time of each frame centre (seconds).
    pv : ndarray
        Probability of voicing per frame.

    References
    ----------
    [1] Gonzalez & Brookes, PEFAC - A Pitch Estimation Algorithm
        Robust to High Levels of Noise, IEEE/ACM TASLP, 2014.
    """
    s = np.asarray(s, dtype=float).ravel()

    # Parameters
    fxmin = 60.0
    fxmax = 500.0
    n_fft = int(2 ** np.ceil(np.log2(2 * fs / fxmin)))
    ni = max(1, round(tinc * fs))

    # Window
    win_len = int(round(2 * fs / fxmin))
    if win_len > n_fft:
        win_len = n_fft
    win = np.hamming(win_len)

    # Enframe
    frames, tt, *_ = v_enframe(s, win, ni)
    tt = tt / fs
    nr = frames.shape[0]

    # Compute log power spectrum
    spec = v_rfft(frames, n_fft, 1)
    pw = np.real(spec * np.conj(spec))
    pw = np.maximum(pw, 1e-20)
    log_pw = np.log(pw)

    fx = np.zeros(nr)
    pv = np.zeros(nr)

    for i in range(nr):
        # Compute autocorrelation via IFFT of power spectrum
        full_pw = pw[i, :]
        acf = np.fft.irfft(full_pw)[:n_fft // 2 + 1]
        acf = acf / (acf[0] + 1e-20)

        # Search for pitch in valid range
        min_lag = max(1, int(np.floor(fs / fxmax)))
        max_lag = min(len(acf) - 1, int(np.ceil(fs / fxmin)))

        if max_lag <= min_lag:
            continue

        acf_search = acf[min_lag:max_lag + 1]
        if len(acf_search) == 0:
            continue

        peak_idx = np.argmax(acf_search)
        peak_val = acf_search[peak_idx]

        if peak_val > 0.3:  # voicing threshold
            lag = peak_idx + min_lag
            # Parabolic interpolation
            if 0 < peak_idx < len(acf_search) - 1:
                a = acf_search[peak_idx - 1]
                b = acf_search[peak_idx]
                c = acf_search[peak_idx + 1]
                delta = 0.5 * (a - c) / (a - 2 * b + c + 1e-20)
                lag = peak_idx + min_lag + delta
            fx[i] = fs / lag
            pv[i] = peak_val

    return fx, tt, pv
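A minimal numpy-only sketch of the simplified autocorrelation search used above (not the full PEFAC algorithm), estimating the pitch of a 200 Hz tone from a single 40 ms frame:

```python
import numpy as np

fs, f0 = 8000, 200
frame = np.sin(2 * np.pi * f0 * np.arange(320) / fs)   # one 40 ms frame

# Linear autocorrelation via zero-padded FFT of the power spectrum
nfft = 1024                                  # >= 2 * len(frame)
F = np.fft.rfft(frame, nfft)
acf = np.fft.irfft(F * np.conj(F))[:len(frame)]
acf /= acf[0]

# Search the 60-500 Hz pitch range for the strongest lag
min_lag = int(np.floor(fs / 500))
max_lag = int(np.ceil(fs / 60))
lag = min_lag + np.argmax(acf[min_lag:max_lag + 1])
fx = fs / lag
```

The peak lands at lag 40 (8000/200), so the estimate is exact here; real speech would also need the voicing threshold and parabolic refinement shown in the source.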

v_fxrapt

V_FXRAPT - RAPT pitch extraction algorithm.

v_fxrapt

v_fxrapt(
    s, fs, tinc=0.01
) -> tuple[ndarray, ndarray, ndarray]

RAPT pitch extraction algorithm.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
tinc float

Frame increment in seconds.

0.01

Returns:

Name Type Description
fx ndarray

Estimated pitch frequency per frame (0 = unvoiced).

tt ndarray

Time of each frame centre (seconds).

pv ndarray

Probability of voicing per frame.

References

[1] Talkin, D. A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis, ch.14, Elsevier, 1995.

Source code in pyvoicebox/v_fxrapt.py
def v_fxrapt(s, fs, tinc=0.01) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """RAPT pitch extraction algorithm.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    tinc : float
        Frame increment in seconds.

    Returns
    -------
    fx : ndarray
        Estimated pitch frequency per frame (0 = unvoiced).
    tt : ndarray
        Time of each frame centre (seconds).
    pv : ndarray
        Probability of voicing per frame.

    References
    ----------
    [1] Talkin, D. A robust algorithm for pitch tracking (RAPT).
        In Speech Coding and Synthesis, ch.14, Elsevier, 1995.
    """
    s = np.asarray(s, dtype=float).ravel()

    # Parameters
    fxmin = 50.0
    fxmax = 500.0
    ni = max(1, round(tinc * fs))

    # Simplified: no downsampled first pass; analysis is at the original rate

    # Enframe at original rate
    frames, tt, *_ = v_enframe(s, np.hamming(int(round(2 * fs / fxmin))), ni)
    tt = tt / fs
    nr = frames.shape[0]

    fx = np.zeros(nr)
    pv = np.zeros(nr)

    for i in range(nr):
        frame = frames[i, :]
        n_frame = len(frame)
        nfft_frame = int(2 ** np.ceil(np.log2(2 * n_frame)))

        # Compute normalized cross-correlation
        # Using autocorrelation via FFT
        F = np.fft.rfft(frame, nfft_frame)
        acf_full = np.fft.irfft(F * np.conj(F))
        acf = acf_full[:n_frame]
        if acf[0] > 0:
            acf = acf / acf[0]
        else:
            continue

        # Search for pitch
        min_lag = max(1, int(np.floor(fs / fxmax)))
        max_lag = min(n_frame - 1, int(np.ceil(fs / fxmin)))

        if max_lag <= min_lag:
            continue

        acf_search = acf[min_lag:max_lag + 1]
        if len(acf_search) == 0:
            continue

        # Find peaks in the valid range
        peak_idx = np.argmax(acf_search)
        peak_val = acf_search[peak_idx]

        if peak_val > 0.25:  # voicing threshold
            lag = peak_idx + min_lag
            # Parabolic interpolation
            if 0 < peak_idx < len(acf_search) - 1:
                a = acf_search[peak_idx - 1]
                b = acf_search[peak_idx]
                c = acf_search[peak_idx + 1]
                denom = a - 2 * b + c
                if abs(denom) > 1e-10:
                    delta = 0.5 * (a - c) / denom
                    lag = peak_idx + min_lag + delta

            fx[i] = fs / lag
            pv[i] = peak_val

    return fx, tt, pv
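The sub-sample refinement step above relies on three-point parabolic interpolation; for samples of an exact parabola the formula recovers the true vertex, so the check below is exact up to rounding:

```python
import numpy as np

true_peak = 0.3                               # vertex offset from the centre sample
# Sample a concave parabola at lags -1, 0, +1 around the coarse peak
a, b, c = (1 - (x - true_peak) ** 2 for x in (-1.0, 0.0, 1.0))
delta = 0.5 * (a - c) / (a - 2 * b + c)       # estimated sub-sample offset
```

`delta` recovers `true_peak`, which is why the refined lag `peak_idx + min_lag + delta` can resolve pitch periods between integer sample lags.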

v_dypsa

V_DYPSA - Derive glottal closure instants from speech using the DYPSA algorithm.

v_dypsa

v_dypsa(s, fs) -> tuple[ndarray, ndarray]

Derive glottal closure instants from speech.

Parameters:

Name Type Description Default
s array_like

Speech signal.

required
fs float

Sampling frequency in Hz.

required

Returns:

Name Type Description
gci ndarray

Vector of glottal closure sample numbers (0-based).

goi ndarray

Vector of glottal opening sample numbers (0-based).

References

[1] Naylor et al., IEEE Trans Speech Audio Proc, 15:34-43, 2007.

Source code in pyvoicebox/v_dypsa.py
def v_dypsa(s, fs) -> tuple[np.ndarray, np.ndarray]:
    """Derive glottal closure instances from speech.

    Parameters
    ----------
    s : array_like
        Speech signal.
    fs : float
        Sampling frequency in Hz.

    Returns
    -------
    gci : ndarray
        Vector of glottal closure sample numbers (0-based).
    goi : ndarray
        Vector of glottal opening sample numbers (0-based).

    References
    ----------
    [1] Naylor et al., IEEE Trans Speech Audio Proc, 15:34-43, 2007.
    """
    s = np.asarray(s, dtype=float).ravel()
    ns = len(s)

    # Algorithm parameters
    cpfrac = 0.3
    fxmax = 500
    fxmin = 50
    lpcdur = 0.020
    lpcstep = 0.010
    lpcn = 2
    lpcnf = 0.001
    preemph_freq = 50
    gwlen = 0.003
    fwlen = 0.00045

    # Pre-emphasis
    alpha = np.exp(-2 * np.pi * preemph_freq / fs)
    sp = np.concatenate([[s[0]], s[1:] - alpha * s[:-1]])

    # LPC analysis parameters
    lpc_order = int(round(lpcnf * fs) + lpcn)
    lpc_frame = int(round(lpcdur * fs))
    lpc_step = int(round(lpcstep * fs))

    # Compute LPC residual
    n_frames = max(1, int(np.floor((ns - lpc_frame) / lpc_step)) + 1)
    residual = np.zeros(ns)

    for i in range(n_frames):
        start = i * lpc_step
        end = min(start + lpc_frame, ns)
        seg = sp[start:end]
        if len(seg) < lpc_order + 1:
            continue
        ar, *_ = v_lpcauto(seg, lpc_order)
        if ar.ndim > 1:
            ar = ar[0, :]
        filt_out = v_lpcifilt(ar, sp[start:min(start + lpc_frame + lpc_step, ns)])
        end_res = min(start + len(filt_out), ns)
        residual[start:end_res] = filt_out[:end_res - start]

    # Group delay computation
    gw = int(round(gwlen * fs))
    if gw < 2:
        gw = 2
    fw = int(round(fwlen * fs))
    if fw < 1:
        fw = 1

    # Compute group delay function using the residual
    gdwav = np.zeros(ns)
    half_gw = gw // 2

    for i in range(half_gw, ns - half_gw):
        seg = residual[i - half_gw:i + half_gw]
        if len(seg) < 2 * half_gw:
            continue
        win = np.hamming(len(seg))
        # Weighted average of derivative
        n_idx = np.arange(len(seg)) - half_gw
        gdwav[i] = np.sum(seg * win * n_idx) / (np.sum(seg * win) + 1e-20)

    # Smooth group delay
    if fw > 0:
        kernel = np.ones(fw) / fw
        gdwav = np.convolve(gdwav, kernel, mode='same')

    # Find negative-going zero crossings (candidates for GCI)
    min_period = int(round(fs / fxmax))
    max_period = int(round(fs / fxmin))

    # Find zero crossings of group delay
    zc, _ = v_zerocros(gdwav, 'n')  # negative-going
    if len(zc) == 0:
        return np.array([], dtype=int), np.array([], dtype=int)

    # Filter candidates by minimum spacing
    gci = []
    last = -max_period
    for z in zc:
        if z - last >= min_period:
            gci.append(int(z))
            last = z

    gci = np.array(gci, dtype=int)

    # Estimate glottal openings
    goi = np.round(gci - cpfrac * np.concatenate([[max_period], np.diff(gci)])).astype(int)
    goi = np.clip(goi, 0, ns - 1)

    return gci, goi
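The candidate-selection step (negative-going zero crossings of the group-delay waveform, thinned by a minimum pitch period) can be illustrated on a synthetic waveform with one event every 10 ms:

```python
import numpy as np

fs = 8000
n = np.arange(800)
gdwav = np.sin(2 * np.pi * 100 * n / fs)   # stand-in group delay: one GCI per 10 ms

# Negative-going zero crossings: positive sample followed by a non-positive one
zc = np.where((gdwav[:-1] > 0) & (gdwav[1:] <= 0))[0]

# Keep only candidates at least one minimum pitch period (2 ms at 500 Hz) apart
min_period = int(round(fs / 500))
gci = [int(zc[0])]
for z in zc[1:]:
    if z - gci[-1] >= min_period:
        gci.append(int(z))
```

The retained candidates are spaced one 100 Hz period (80 samples) apart, matching the synthetic event rate.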

v_vadsohn

V_VADSOHN - Voice activity detector (Sohn et al.).

v_vadsohn

v_vadsohn(si, fs, m='a', pp=None) -> ndarray

Voice activity detector based on Sohn et al.

Parameters:

Name Type Description Default
si array_like

Input speech signal.

required
fs float

Sample frequency in Hz.

required
m str

Mode: 'a' binary activity decision, 'b' speech presence probability.

'a'
pp dict

Algorithm parameters.

None

Returns:

Name Type Description
vs ndarray

VAD output, one value per sample: a binary decision for mode 'a' (and by default), or the speech presence probability for mode 'b'.

Source code in pyvoicebox/v_vadsohn.py
def v_vadsohn(si, fs, m='a', pp=None) -> np.ndarray:
    """Voice activity detector based on Sohn et al.

    Parameters
    ----------
    si : array_like
        Input speech signal.
    fs : float
        Sample frequency in Hz.
    m : str
        Mode: 'a' binary activity decision, 'b' speech presence probability.
    pp : dict, optional
        Algorithm parameters.

    Returns
    -------
    vs : ndarray
        VAD output, one value per sample: a binary decision for mode 'a'
        (and by default), or the speech presence probability for mode 'b'.
    """
    s = np.asarray(si, dtype=float).ravel()

    qq = {
        'of': 2, 'pr': 0.7, 'ts': 0.1, 'tn': 0.05,
        'ti': 10e-3, 'tj': 10e-3, 'ri': 0,
        'ta': 0.396, 'gx': 1000, 'gz': 1e-4, 'xn': 0, 'ne': 0,
    }
    qp = {}
    if pp is not None:
        for key in qq:
            if key in pp:
                qq[key] = pp[key]
        qp = pp

    qq['tj'] = min(qq['tj'], 0.5 * qq['ts'], 0.5 * qq['tn'])
    nj = max(round(qq['ti'] / qq['tj']), 1)
    if qq['ri']:
        ni = int(2 ** np.ceil(np.log2(qq['ti'] * fs * np.sqrt(0.5) / nj)))
    else:
        ni = round(qq['ti'] * fs / nj)

    tinc = ni / fs
    a_coeff = np.exp(-tinc / qq['ta'])
    gmax = qq['gx']
    kk = np.sqrt(2 * np.pi)
    xn = qq['xn']
    gz = qq['gz']
    a01 = tinc / qq['tn']
    a00 = 1 - a01
    a10 = tinc / qq['ts']
    a11 = 1 - a10
    b11 = a11 / a10
    b01 = a01 / a00
    b00 = a01 - a00 * a11 / a10
    b10 = a11 - a10 * a01 / a00
    prat = a10 / a01
    lprat = np.log(prat)

    no = round(qq['of'])
    nf = ni * no
    w = np.hamming(nf + 1)[:nf]
    w = w / np.sqrt(np.sum(w[::ni] ** 2))

    ns = len(s)
    y, *_ = v_enframe(s, w, ni)
    yf = v_rfft(y, nf, 1)

    if yf.shape[0] == 0:
        return np.array([])

    yp = np.real(yf * np.conj(yf))
    nr, nf2 = yp.shape

    if qq['ne'] > 0:
        dp, ze = v_estnoiseg(yp, tinc, qp)
    else:
        dp, ze, _ = v_estnoisem(yp, tinc, qp)

    xu = 1.0
    lggami = 0.0

    gam = np.clip(yp / dp, gz, gmax)
    prsp = np.zeros(nr)

    for i in range(nr):
        gami = gam[i, :]
        xi = a_coeff * xu + (1 - a_coeff) * np.maximum(gami - 1, xn)
        xi1 = 1 + xi
        v = 0.5 * xi * gami / xi1
        lamk = 2 * v - np.log(xi1)
        lami = np.sum(lamk[1:nf2]) / (nf2 - 1)

        if lggami < 0:
            lggami = lprat + lami + np.log(b11 + b00 / (a00 + a10 * np.exp(lggami)))
        else:
            lggami = lprat + lami + np.log(b01 + b10 / (a10 + a00 * np.exp(-lggami)))

        if lggami < 0:
            gg = np.exp(lggami)
            prsp[i] = gg / (1 + gg)
        else:
            prsp[i] = 1.0 / (1 + np.exp(-lggami))

        gi = (0.277 + 2 * v) / gami
        mv = v < 0.5
        if np.any(mv):
            vmv = v[mv]
            gi[mv] = kk * np.sqrt(vmv) * ((0.5 + vmv) * besseli(0, vmv) + vmv * besseli(1, vmv)) / (gami[mv] * np.exp(vmv))
        xu = gami * gi ** 2

    if 'b' in m and 'a' not in m:
        # Output speech presence probability per sample
        vs = np.zeros(ns)
        for i in range(nr):
            start = i * ni
            end = min(start + ni, ns)
            vs[start:end] = prsp[i]
        return vs
    # Mode 'a' (and default): binary activity decision per sample
    vs = np.zeros(ns)
    for i in range(nr):
        start = i * ni
        end = min(start + ni, ns)
        vs[start:end] = float(prsp[i] > qq['pr'])
    return vs
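The per-sample output convention can be illustrated with a toy energy gate (a fixed threshold on frame energy, not Sohn's statistical likelihood-ratio model): each frame decision is expanded back to one value per input sample.

```python
import numpy as np

fs, ni = 8000, 80                              # 10 ms frames
sig = np.concatenate([np.zeros(4000),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(4000) / fs),
                      np.zeros(4000)])

# Frame energies and a crude activity threshold
nr = len(sig) // ni
energy = np.array([np.sum(sig[i * ni:(i + 1) * ni] ** 2) for i in range(nr)])
decision = energy > 0.1 * energy.max()

# Expand the per-frame decision to one value per sample, as in mode 'a'
vs = np.repeat(decision.astype(float), ni)
```

The tone occupies the middle third of the signal, so a third of the samples are flagged active.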

Level, noise and alignment

v_activlev

V_ACTIVLEV - Measure active speech level as per ITU-T P.56.

v_activlev

v_activlev(sp, fs, mode='') -> tuple[ndarray, ndarray]

Measure active speech level as in ITU-T P.56 (Method B).

Parameters:

Name Type Description Default
sp array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
mode str

Mode string:
    'd' - output in dB
    'n' - normalize speech to 0 dB active level
    '0' - omit high pass filter
    'h' - omit low pass filter
    'l' - output long-term power level too

''

Returns:

Name Type Description
lev float or ndarray

Active speech level.

af float

Activity factor.

Source code in pyvoicebox/v_activlev.py
def v_activlev(sp, fs, mode='') -> tuple[np.ndarray, np.ndarray]:
    """Measure active speech level as in ITU-T P.56 (Method B).

    Parameters
    ----------
    sp : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    mode : str, optional
        Mode string:
            'd' - output in dB
            'n' - normalize speech to 0 dB active level
            '0' - omit high pass filter
            'h' - omit low pass filter
            'l' - output long-term power level too

    Returns
    -------
    lev : float or ndarray
        Active speech level.
    af : float
        Activity factor.

    Notes
    -----
    With the 'n' option the normalized signal is returned first,
    giving (sp_normalized, lev, af).
    """
    nbin = 20
    thresh = 15.9

    sp = np.asarray(sp, dtype=float).ravel()
    ns = len(sp)

    if ns == 0:
        if 'd' in mode:
            return -np.inf, 0.0
        return 0.0, 0.0

    # Simplified implementation: compute active level without bandpass filtering
    # (use '0h' mode equivalent for simplicity)
    ti = 1.0 / fs
    g = np.exp(-ti / 0.03)
    ae = np.array([1, -2 * g, g ** 2]) / (1 - g) ** 2
    nh = int(np.ceil(0.2 / ti)) + 1

    # Zero-pad
    nz = int(np.ceil(0.35 * fs))
    sq = np.concatenate([sp, np.zeros(nz)])
    ns_total = len(sq)

    # Envelope filter
    s = lfilter([1], ae, np.abs(sq))

    # Sum of squares
    ssq = np.sum(sq ** 2)
    if ssq == 0:
        af = 0.0
        if 'd' in mode:
            lev = -np.inf
        else:
            lev = 0.0
        return lev, af

    sf = ns_total / ssq
    sfdb = 10 * np.log10(sf)

    # Energy histogram using log2
    qf, qe = np.frexp(sf * s ** 2)
    qe = qe.astype(float)
    qe[qf == 0] = -np.inf

    # Apply hangover
    qe_hang = v_maxfilt(qe, 1, nh)[0]

    emax = np.max(qe_hang) + 1
    if emax == -np.inf:
        if 'd' in mode:
            return -np.inf, 0.0
        return 0.0, 0.0

    qe_idx = np.minimum(emax - qe_hang, nbin).astype(int)
    qe_idx = np.clip(qe_idx, 1, nbin)

    kc = np.cumsum(np.bincount(qe_idx, minlength=nbin + 1)[1:nbin + 1])

    # Calculate active level
    aj = 10 * np.log10(ssq / np.maximum(kc, 1))
    cj = 10 * np.log10(2) * (emax - np.arange(1, nbin + 1) - 1) - sfdb
    mj = aj - cj - thresh

    # Find positive transition through threshold
    transitions = np.where((mj[:-1] < 0) & (mj[1:] >= 0))[0]
    if len(transitions) == 0:
        if mj[-1] <= 0:
            jj = len(mj) - 2
            jf = 1.0
        else:
            jj = 0
            jf = 0.0
    else:
        jj = transitions[0]
        denom = mj[jj + 1] - mj[jj]
        if abs(denom) < 1e-30:
            jf = 0.5
        else:
            jf = 1.0 / (1.0 - mj[jj + 1] / mj[jj])

    lev_db = aj[jj] + jf * (aj[min(jj + 1, nbin - 1)] - aj[jj])
    lp = 10 ** (lev_db / 10)
    af = ssq / ((ns_total - nz) * lp) if lp > 0 else 0.0

    if 'd' in mode:
        lev = lev_db
        if 'l' in mode:
            lev = np.array([lev_db, 10 * np.log10(ssq / ns_total)])
    else:
        lev = lp
        if 'l' in mode:
            lev = np.array([lp, ssq / ns_total])

    if 'n' in mode:
        if lp > 0:
            return sp / np.sqrt(lp), lev, af
        return sp, lev, af

    return lev, af
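The relationship between the two returned quantities can be checked numerically. Below, a crude smoothed-power threshold stands in for the full P.56 procedure (only the 15.9 dB margin is borrowed from it); by construction, active level times activity factor equals the long-term power.

```python
import numpy as np

fs = 8000
sig = np.concatenate([np.sin(2 * np.pi * 440 * np.arange(fs) / fs),
                      np.zeros(fs)])           # 1 s of tone, then 1 s of silence

p = sig ** 2
env = np.convolve(p, np.ones(400) / 400, mode='same')   # smoothed power envelope
active = env > env.max() * 10 ** (-15.9 / 10)           # 15.9 dB activity margin

af = active.mean()                             # activity factor, about 0.5 here
active_level = p.sum() / active.sum()          # mean power over active samples
long_term = p.mean()                           # mean power over everything
```

The identity `active_level * af == long_term` holds regardless of how the activity mask is chosen, which is why the activity factor links the two level readings.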

v_activlevg

V_ACTIVLEVG - Measure active speech level robustly.

v_activlevg

v_activlevg(sp, fs, mode='') -> tuple[ndarray, ndarray]

Measure active speech level robustly.

This is a simplified wrapper around v_activlev. The full MATLAB implementation uses a Gaussian mixture model approach for robust estimation in noisy conditions.

Parameters:

Name Type Description Default
sp array_like

Speech signal.

required
fs float

Sample frequency in Hz.

required
mode str

Mode string (same as v_activlev).

''

Returns:

Name Type Description
lev float

Active speech level.

af float

Activity factor.

Source code in pyvoicebox/v_activlevg.py
def v_activlevg(sp, fs, mode='') -> tuple[np.ndarray, np.ndarray]:
    """Measure active speech level robustly.

    This is a simplified wrapper around v_activlev. The full MATLAB
    implementation uses a Gaussian mixture model approach for robust
    estimation in noisy conditions.

    Parameters
    ----------
    sp : array_like
        Speech signal.
    fs : float
        Sample frequency in Hz.
    mode : str, optional
        Mode string (same as v_activlev).

    Returns
    -------
    lev : float
        Active speech level.
    af : float
        Activity factor.
    """
    return v_activlev(sp, fs, mode)

v_earnoise

V_EARNOISE - Add noise to simulate hearing threshold.

v_earnoise

v_earnoise(
    s, fs, m="", spl=62.35
) -> tuple[ndarray, ndarray, ndarray]

Add noise to simulate the hearing threshold of a listener.

Simplified version that adds white noise scaled to simulate internal ear noise.

Parameters:

Name Type Description Default
s (array_like, shape(n) or (n, c))

Speech signal.

required
fs float

Sample frequency in Hz.

required
m str

Mode string: 'n' input normalized, 'u' input already scaled.

''
spl float

Target active speech level in dB SPL. Default 62.35.

62.35

Returns:

Name Type Description
y ndarray

Speech with simulated ear noise.

x ndarray

Filtered speech signal.

v ndarray

Added noise.

Source code in pyvoicebox/v_earnoise.py
def v_earnoise(s, fs, m='', spl=62.35) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Add noise to simulate the hearing threshold of a listener.

    Simplified version that adds white noise scaled to simulate internal ear noise.

    Parameters
    ----------
    s : array_like, shape (n,) or (n, c)
        Speech signal.
    fs : float
        Sample frequency in Hz.
    m : str, optional
        Mode string: 'n' input normalized, 'u' input already scaled.
    spl : float, optional
        Target active speech level in dB SPL. Default 62.35.

    Returns
    -------
    y : ndarray
        Speech with simulated ear noise.
    x : ndarray
        Filtered speech signal.
    v : ndarray
        Added noise.
    """
    s = np.asarray(s, dtype=float)
    if s.ndim == 1:
        s = s[:, np.newaxis]

    ns, nc = s.shape

    if 'u' in m:
        dboff = 0.0
    else:
        dboff = spl  # simplified: 'n' and the default both assume a normalized input

    x = 10 ** (0.05 * dboff) * s
    v = np.sqrt(0.5 * fs) * np.random.randn(*s.shape)
    y = x + v

    if nc == 1:
        return y.ravel(), x.ravel(), v.ravel()
    return y, x, v

v_ppmvu

V_PPMVU - Calculate PPM and VU meter readings (stub).

The full implementation requires specific filter design. A simplified stub is provided.

v_ppmvu

v_ppmvu(sp, fs, mode='') -> float

Calculate PPM and VU meter readings.

Simplified implementation. The full MATLAB version implements detailed PPM and VU metering standards.

Parameters:

Name Type Description Default
sp array_like

Input signal.

required
fs float

Sample frequency in Hz.

required
mode str

Mode string.

''

Returns:

Name Type Description
lev float

Level reading.

Source code in pyvoicebox/v_ppmvu.py
def v_ppmvu(sp, fs, mode='') -> float:
    """Calculate PPM and VU meter readings.

    Simplified implementation. The full MATLAB version implements
    detailed PPM and VU metering standards.

    Parameters
    ----------
    sp : array_like
        Input signal.
    fs : float
        Sample frequency in Hz.
    mode : str, optional
        Mode string.

    Returns
    -------
    lev : float
        Level reading.
    """
    sp = np.asarray(sp, dtype=float).ravel()
    if len(sp) == 0:
        return 0.0

    # Simple RMS level
    rms = np.sqrt(np.mean(sp ** 2))
    if rms > 0:
        return 20 * np.log10(rms)
    return -np.inf
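With this simplified RMS meter, a full-scale sine reads 20*log10(1/sqrt(2)), about -3.01 dB:

```python
import numpy as np

fs = 48000
sp = np.sin(2 * np.pi * 997 * np.arange(fs) / fs)   # 1 s: exactly 997 full cycles
lev = 20 * np.log10(np.sqrt(np.mean(sp ** 2)))      # RMS level in dB
```

Because the signal contains a whole number of cycles, the mean square is exactly 0.5 and the reading matches the textbook figure.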

v_snrseg

V_SNRSEG - Measure segmental and global SNR.

v_snrseg

v_snrseg(
    s, r, fs, m="wz", tf=0.01
) -> tuple[float, float, ndarray, ndarray, ndarray]

Measure segmental and global SNR.

Parameters:

Name Type Description Default
s array_like

Test signal (noisy).

required
r array_like

Reference signal (clean).

required
fs float

Sample frequency in Hz.

required
m str

Mode string:
    'w' : No VAD - use whole file (default).
    'z' : Do not do any alignment (default).
    'q' : Use linear interpolation to remove delays ± 1 sample.

'wz'
tf float

Frame increment in seconds. Default: 0.01.

0.01

Returns:

Name Type Description
seg float

Segmental SNR in dB.

glo float

Global SNR in dB.

tc ndarray

Time at centre of each frame (seconds).

snf ndarray

Segmental SNR in dB in each frame.

vf ndarray

Boolean mask indicating valid frames.

Source code in pyvoicebox/v_snrseg.py
def v_snrseg(s, r, fs, m='wz', tf=0.01) -> tuple[float, float, np.ndarray, np.ndarray, np.ndarray]:
    """Measure segmental and global SNR.

    Parameters
    ----------
    s : array_like
        Test signal (noisy).
    r : array_like
        Reference signal (clean).
    fs : float
        Sample frequency in Hz.
    m : str, optional
        Mode string:
          'w' : No VAD - use whole file (default).
          'z' : Do not do any alignment (default).
          'q' : Use linear interpolation to remove delays +/- 1 sample.
    tf : float, optional
        Frame increment in seconds. Default: 0.01.

    Returns
    -------
    seg : float
        Segmental SNR in dB.
    glo : float
        Global SNR in dB.
    tc : ndarray
        Time at centre of each frame (seconds).
    snf : ndarray
        Segmental SNR in dB in each frame.
    vf : ndarray
        Boolean mask indicating valid frames.
    """
    s = np.asarray(s, dtype=float).ravel()
    r = np.asarray(r, dtype=float).ravel()
    snmax = 100.0  # clipping limit for SNR

    mq = 'z' not in m  # flag for alignment
    nr = min(len(r), len(s))
    kf = round(tf * fs)  # frame length in samples
    ifr_start = kf + mq  # starting sample for first frame end (1-based in MATLAB)

    # Build frame end indices (0-based)
    ifr = np.arange(ifr_start, nr - mq + 1, kf) - 1  # 0-based ending sample
    nf = len(ifr)
    if nf == 0:
        return 0.0, 0.0, np.array([]), np.array([]), np.array([])

    ifl = ifr[-1] + 1  # last sample index + 1 (exclusive)

    if mq:
        # Linear interpolation alignment
        # For each frame, find optimal shift
        # s[1:ifl-1] vs s[2:ifl], s[0:ifl-2] (0-based)
        ssm = np.zeros((kf, nf))
        ssp = np.zeros((kf, nf))
        sr_mat = np.zeros((kf, nf))
        for j in range(nf):
            start = j * kf + 1  # 1-based converted: frame starts at mq + j*kf (0-based)
            for k in range(kf):
                idx = start + k  # 0-based index into s[1:ifl]
                ssm[k, j] = s[idx] - s[idx + 1]
                ssp[k, j] = s[idx] - s[idx - 1]
                sr_mat[k, j] = s[idx] - r[idx]

        am = np.clip(np.sum(sr_mat * ssm, axis=0) / np.sum(ssm**2, axis=0), 0, 1)
        ap = np.clip(np.sum(sr_mat * ssp, axis=0) / np.sum(ssp**2, axis=0), 0, 1)
        ef = np.minimum(
            np.sum((sr_mat - am[np.newaxis, :] * ssm)**2, axis=0),
            np.sum((sr_mat - ap[np.newaxis, :] * ssp)**2, axis=0)
        )
    else:
        # No interpolation
        ef = np.zeros(nf)
        for j in range(nf):
            start = j * kf
            frame_s = s[start:start + kf]
            frame_r = r[start:start + kf]
            ef[j] = np.sum((frame_s - frame_r)**2)

    # Calculate reference power per frame
    rf = np.zeros(nf)
    for j in range(nf):
        start = mq + j * kf
        rf[j] = np.sum(r[start:start + kf]**2)

    em = ef == 0  # zero noise frames
    rm = rf == 0  # zero reference frames
    snf = 10.0 * np.log10((rf + rm) / (ef + em))
    snf[rm] = -snmax
    snf[em] = snmax

    # Select frames to include: VAD-based selection is not implemented in
    # this port, so all frames are used whether or not 'w' is specified
    vf = np.ones(nf, dtype=bool)

    tc = (np.arange(1, nf + 1) * kf + (1 - kf) / 2.0) / fs

    seg = np.mean(snf[vf]) if np.any(vf) else 0.0
    glo = 10.0 * np.log10(np.sum(rf[vf]) / np.sum(ef[vf])) if np.any(vf) and np.sum(ef[vf]) > 0 else 0.0

    return seg, glo, tc, snf, vf
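As a sanity check on the frame-wise arithmetic above, the core of the 'z' (no-alignment) path can be reproduced in a few lines of self-contained NumPy. `segsnr_sketch` is a hypothetical helper, not part of pyvoicebox, and omits the alignment, VAD, and timing outputs.

```python
import numpy as np

def segsnr_sketch(s, r, fs, tf=0.01, snmax=100.0):
    """Segmental SNR of processed signal s against reference r, 'z' mode."""
    s = np.asarray(s, float).ravel()
    r = np.asarray(r, float).ravel()
    kf = int(round(tf * fs))                # frame length in samples
    nf = min(len(s), len(r)) // kf          # number of whole frames
    sf = s[:nf * kf].reshape(nf, kf)        # processed signal, framed
    rf = r[:nf * kf].reshape(nf, kf)        # reference signal, framed
    rp = np.sum(rf ** 2, axis=1)            # reference power per frame
    ep = np.sum((sf - rf) ** 2, axis=1)     # error power per frame
    snf = 10.0 * np.log10((rp + (rp == 0)) / (ep + (ep == 0)))
    snf[rp == 0] = -snmax                   # clip silent reference frames
    snf[ep == 0] = snmax                    # clip noise-free frames
    return float(np.mean(snf)), snf

fs = 1000
r = np.ones(100)
s = r + 0.01            # constant error: every frame is exactly 40 dB
seg, snf = segsnr_sketch(s, r, fs)
```

With a constant error of 0.01 against a unit reference, each frame gives 10*log10(1/0.0001) = 40 dB, so the segmental SNR is exactly 40 dB.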

v_addnoise

V_ADDNOISE - Add noise at a chosen SNR.

v_addnoise

v_addnoise(s, fs, snr=inf, m='') -> tuple[ndarray, ndarray]

Add white noise at a chosen SNR using energy-based measurement.

This is a simplified version that supports white noise addition with energy-based level measurement. For more advanced noise types, see the MATLAB original.

Parameters:

Name Type Description Default
s array_like

Input speech signal (1-D).

required
fs float

Sample frequency in Hz.

required
snr float

Target SNR in dB. Default: Inf (no noise).

inf
m str

Mode string: 'D' : SNR input given as power ratio instead of dB. 'e' : Use energy to calculate signal level (default). 'k' : Preserve original signal power. 'n' : Make signal level = 0 dB. 'N' : Make noise level = 0 dB. 't' : Make total = 0 dB (default). 'x' : Output signal and noise as separate columns.

''

Returns:

Name Type Description
z ndarray

Noisy signal (or [signal, noise] if 'x' option).

p ndarray

Levels: [s-in n-in s-out n-out] as power ratios or dB.

Source code in pyvoicebox/v_addnoise.py
def v_addnoise(s, fs, snr=np.inf, m='') -> tuple[np.ndarray, np.ndarray]:
    """Add white noise at a chosen SNR using energy-based measurement.

    This is a simplified version that supports white noise addition with
    energy-based level measurement. For more advanced noise types, see the
    MATLAB original.

    Parameters
    ----------
    s : array_like
        Input speech signal (1-D).
    fs : float
        Sample frequency in Hz.
    snr : float, optional
        Target SNR in dB. Default: Inf (no noise).
    m : str, optional
        Mode string:
          'D' : SNR input given as power ratio instead of dB.
          'e' : Use energy to calculate signal level (default).
          'k' : Preserve original signal power.
          'n' : Make signal level = 0 dB.
          'N' : Make noise level = 0 dB.
          't' : Make total = 0 dB (default).
          'x' : Output signal and noise as separate columns.

    Returns
    -------
    z : ndarray
        Noisy signal (or [signal, noise] if 'x' option).
    p : ndarray
        Levels: [s-in n-in s-out n-out] as power ratios or dB.
    """
    s = np.asarray(s, dtype=float).ravel()
    ns = len(s)

    # Calculate signal energy
    se = np.mean(s**2)

    # Generate white noise
    n = np.random.randn(ns)
    ne = np.mean(n**2)

    # Convert SNR
    if 'D' in m:
        snre = snr
    else:
        snre = 10.0 ** (0.1 * snr)

    # Determine scaling factors
    if snre > 1:
        if 'n' in m:
            sze = 1.0
        elif 'N' in m:
            sze = snre
        elif 'k' in m:
            sze = se
        else:
            sze = 1.0 / (1.0 + 1.0 / snre)
        nze = sze / snre
    else:
        if 'n' in m:
            nze = 1.0 / snre
        elif 'N' in m:
            nze = 1.0
        elif 'k' in m:
            nze = se / snre
        else:
            nze = 1.0 / (1.0 + snre)
        sze = nze * snre

    pe = np.array([se, ne, sze, nze])
    gm = np.array([
        np.sqrt(sze / (se + (se == 0))),
        np.sqrt(nze / (ne + (ne == 0)))
    ])

    if gm[1] > 0:
        if 'x' in m:
            z = np.column_stack([gm[0] * s, gm[1] * n])
        else:
            z = gm[0] * s + gm[1] * n
    elif 'x' in m:
        z = np.column_stack([gm[0] * s, np.zeros(ns)])
    else:
        z = gm[0] * s

    p = pe.copy()
    if 'D' not in m:
        mk = (pe != np.inf) & (pe != 0)
        p[mk] = 10.0 * np.log10(pe[mk])
        p[pe == 0] = -np.inf

    return z, p
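The scaling logic above can be verified independently: in the default 't' mode the gains are chosen so that the output signal and noise powers are exactly `sze` and `nze`, which makes the achieved SNR hit the target by construction. A self-contained sketch (not calling pyvoicebox) of the check for a 10 dB target:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s tone at 8 kHz
n = rng.standard_normal(len(s))
snr_db = 10.0
snre = 10.0 ** (0.1 * snr_db)     # target SNR as a power ratio

se, ne = np.mean(s ** 2), np.mean(n ** 2)
sze = 1.0 / (1.0 + 1.0 / snre)    # 't' mode: signal + noise powers sum to 1
nze = sze / snre
gs, gn = np.sqrt(sze / se), np.sqrt(nze / ne)
z = gs * s + gn * n               # noisy signal at the requested SNR

achieved = 10.0 * np.log10(np.mean((gs * s) ** 2) / np.mean((gn * n) ** 2))
```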

v_sigalign

V_SIGALIGN - Align a clean reference with a noisy signal.

v_sigalign

v_sigalign(
    s, r, maxd=None, m="gs", fs=None
) -> tuple[int, float, ndarray, ndarray]

Align a clean reference with a noisy signal.

Parameters:

Name Type Description Default
s array_like

Test signal.

required
r array_like

Reference signal.

required
maxd float or array_like

[+-max] or [min, max] delay allowed in samples. Fractions of len(r) are used if abs(maxd) < 1. Default ensures at least 50% overlap.

None
m str

Mode string: 'u' - unity gain 'g' - find optimal gain (default) 's' - maximize correlation coefficient (default) 'S' - maximize energy of common component

'gs'
fs float

Sample frequency (only used for filtering).

None

Returns:

Name Type Description
d int

Optimum delay to apply to r.

g float

Optimal gain to apply to r.

rr ndarray

g * r shifted by -d, restricted to the region overlapping s.

ss ndarray

s truncated to match rr.

Source code in pyvoicebox/v_sigalign.py
def v_sigalign(s, r, maxd=None, m='gs', fs=None) -> tuple[int, float, np.ndarray, np.ndarray]:
    """Align a clean reference with a noisy signal.

    Parameters
    ----------
    s : array_like
        Test signal.
    r : array_like
        Reference signal.
    maxd : float or array_like, optional
        [+-max] or [min, max] delay allowed in samples. Fractions of len(r)
        are used if abs(maxd) < 1. Default ensures at least 50% overlap.
    m : str, optional
        Mode string:
          'u' - unity gain
          'g' - find optimal gain (default)
          's' - maximize correlation coefficient (default)
          'S' - maximize energy of common component
    fs : float, optional
        Sample frequency (only used for filtering).

    Returns
    -------
    d : int
        Optimum delay to apply to r.
    g : float
        Optimal gain to apply to r.
    rr : ndarray
        g * r shifted by -d, restricted to the region overlapping s.
    ss : ndarray
        s truncated to match rr.
    """
    s = np.asarray(s, dtype=float).ravel()
    r = np.asarray(r, dtype=float).ravel()
    ns = len(s)
    nr = len(r)

    if maxd is None:
        maxd_arr = np.array([])
    else:
        maxd_arr = np.atleast_1d(np.asarray(maxd, dtype=float))

    if len(maxd_arr) == 0:
        if nr < ns:
            lmm = np.array([-0.25 * nr, ns - 0.75 * nr])
        else:
            lmm = np.array([-0.25 * ns, nr - 0.75 * ns])
    elif len(maxd_arr) == 1:
        lmm = np.array([-maxd_arr[0], maxd_arr[0]])
    else:
        lmm = maxd_arr[:2].copy()

    # Convert fractions of nr to samples
    lmm = np.round(lmm * (1 + (nr - 1) * (np.abs(lmm) < 1))).astype(int)
    lmin = int(lmm[0])
    lmax = int(lmm[1])
    lags = lmax - lmin + 1

    if lags <= 0:
        raise ValueError('Invalid lag limits')

    # Note: A-weighting and BS-468 weighting (m='a', m='b') require
    # v_stdspectrum which is not yet converted. Skip filtering.

    # Cross correlation
    rxi = max(1, 1 - lmin)       # first reference sample needed (1-based)
    rxj = min(nr, ns - lmax)     # last reference sample needed (1-based)
    nrx = rxj - rxi + 1         # length of reference segment

    if nrx < 1:
        raise ValueError('Reference signal too short')

    fl = int(2 ** np.ceil(np.log2(lmax - lmin + nrx)))
    sxi = max(1, rxi + lmin)    # first signal sample needed (1-based)
    sxj = min(ns, rxj + lmax)   # last signal sample needed (1-based)

    # Zero-padded FFT cross-correlation
    s_seg = np.zeros(fl)
    s_seg[:sxj - sxi + 1] = s[sxi - 1:sxj]
    r_seg = np.zeros(fl)
    r_seg[:rxj - rxi + 1] = r[rxi - 1:rxj]

    S = np.fft.fft(s_seg)
    R = np.fft.fft(r_seg)
    rs_full = np.real(np.fft.ifft(S * np.conj(R)))
    rsu = rs_full[:lags]

    ssq = np.cumsum(s[sxi - 1:sxj] ** 2)
    ssqd = np.zeros(lags)
    ssqd[0] = ssq[nrx - 1]
    if lags > 1:
        ssqd[1:] = ssq[nrx:nrx + lags - 1] - ssq[:lags - 1]

    if 'S' in m:
        icx = np.argmax(np.abs(rsu))
    else:
        # Maximize correlation coefficient
        with np.errstate(divide='ignore', invalid='ignore'):
            corr_coeff = rsu ** 2 / ssqd
            corr_coeff[ssqd == 0] = 0
        icx = np.argmax(corr_coeff)

    d = icx + lmin

    # Extract common region
    ia = max(1, d + 1)       # first sample of s in common region (1-based)
    ja = min(ns, d + nr)     # last sample of s in common region (1-based)
    ija = np.arange(ia, ja + 1)  # 1-based
    ijad = ija - d

    rr = r[ijad - 1].copy()
    ss = s[ija - 1].copy()

    if 'u' in m:
        g = 1.0
    else:
        g = np.sum(rr * ss) / np.sum(rr ** 2)

    rr = rr * g

    return d, g, rr, ss
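The delay search above rests on FFT-based cross-correlation. A stripped-down, self-contained illustration (hypothetical setup, no pyvoicebox imports): delay a reference by a known amount and recover the lag from the correlation peak.

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal(512)                 # reference signal
d_true = 37
s = np.concatenate([np.zeros(d_true), r])    # test signal: r delayed by 37

fl = int(2 ** np.ceil(np.log2(len(s) + len(r))))   # FFT length, zero-padded
# Circular cross-correlation via the frequency domain, as in the code above
xc = np.real(np.fft.ifft(np.fft.fft(s, fl) * np.conj(np.fft.fft(r, fl))))
d_est = int(np.argmax(xc[:len(s)]))          # lag of the correlation peak
```

The zero padding to a power-of-two length avoids circular wrap-around over the lag range of interest, mirroring the `fl` computation in the source.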

v_txalign

V_TXALIGN - Find best alignment of two sets of time markers.

v_txalign

v_txalign(
    x, y, maxt, nsd=None
) -> tuple[ndarray, ndarray, int, float, float]

Find the best alignment of two sets of time markers.

Parameters:

Name Type Description Default
x array_like

First set of non-decreasing time values.

required
y array_like

Second set of non-decreasing time values.

required
maxt float

Penalty threshold.

required
nsd float

If specified, threshold is nsd times std dev from mean.

None

Returns:

Name Type Description
kx ndarray

Alignment from x to y (kx[i]=j means x[i] matched to y[j], 0=unmatched).

ky ndarray

Alignment from y to x.

nxy int

Number of matched pairs.

mxy float

Mean of y-x for matched pairs.

sxy float

Std dev of y-x for matched pairs.

Source code in pyvoicebox/v_txalign.py
def v_txalign(x, y, maxt, nsd=None) -> tuple[np.ndarray, np.ndarray, int, float, float]:
    """Find the best alignment of two sets of time markers.

    Parameters
    ----------
    x : array_like
        First set of non-decreasing time values.
    y : array_like
        Second set of non-decreasing time values.
    maxt : float
        Penalty threshold.
    nsd : float, optional
        If specified, threshold is nsd times std dev from mean.

    Returns
    -------
    kx : ndarray
        Alignment from x to y (kx[i]=j means x[i] matched to y[j], 0=unmatched).
    ky : ndarray
        Alignment from y to x.
    nxy : int
        Number of matched pairs.
    mxy : float
        Mean of y-x for matched pairs.
    sxy : float
        Std dev of y-x for matched pairs.
    """
    x = np.asarray(x, dtype=float).ravel().copy()
    y = np.asarray(y, dtype=float).ravel().copy()
    lx = len(x)
    ly = len(y)

    if lx == 0 or ly == 0:
        return np.zeros(lx, dtype=int), np.zeros(ly, dtype=int), 0, 0.0, 0.0

    if nsd is not None:
        kx, ky, nxy, mxy, sxy = v_txalign(x, y, maxt)
        nxy0 = nxy + 1
        while nxy < nxy0:
            nxy0 = nxy
            mxy0 = mxy
            kx, ky, nxy, mxy, sxy = v_txalign(x + mxy0, y, nsd * sxy)
        mxy = mxy + mxy0
        return kx, ky, nxy, mxy, sxy

    # Add sentinel values
    x_ext = np.append(x, 2 * abs(y[-1]) + 1)
    y_ext = np.append(y, 2 * abs(x[-1]) + 1)
    lxy = lx + ly + 1

    d = np.zeros((lxy, 4))
    c0 = maxt ** 2
    ix = 0  # 0-based index into x_ext
    iy = 0  # 0-based index into y_ext
    d[0, :] = [0, 0, -1, 1 if y_ext[0] < x_ext[0] else 0]
    vp = 0.0

    for id_ in range(1, lxy):
        if y_ext[iy] < x_ext[ix]:
            v = y_ext[iy]
            d[id_, 3] = 1
            iy += 1
        else:
            v = x_ext[ix]
            d[id_, 3] = 0
            ix += 1

        if d[id_, 3] == d[id_ - 1, 3]:
            d[id_, 2] = -1
        else:
            d[id_, 1] = d[id_ - 1, 0] - c0 + (v - vp) ** 2

        if d[id_ - 1, 2] == 0 and d[id_ - 1, 0] >= d[id_ - 1, 1]:
            d[id_, 0] = d[id_ - 1, 1]
            d[id_ - 1, 2] = 1
        else:
            d[id_, 0] = d[id_ - 1, 0]

        vp = v

    if d[lxy - 1, 2] == 0 and d[lxy - 1, 0] >= d[lxy - 1, 1]:
        d[lxy - 1, 2] = 1

    # Traceback
    ix = lx - 1
    iy = ly - 1
    nxy = 0
    mxy = 0.0
    sxy = 0.0
    kx = np.zeros(lx, dtype=int)
    ky = np.zeros(ly, dtype=int)

    while ix >= 0 and iy >= 0:
        id_ = ix + iy + 2  # 0-based in d array (was ix+iy+1 in 1-based)
        if d[id_, 2] > 0:
            ky[iy] = ix + 1  # 1-based output
            kx[ix] = iy + 1
            dt = y[iy] - x[ix]
            nxy += 1
            mxy += dt
            sxy += dt ** 2
            ix -= 1
            iy -= 1
        else:
            if d[id_, 3] == 1:
                iy -= 1
            else:
                ix -= 1

    if nxy > 0:
        mxy = mxy / nxy
        sxy = np.sqrt(sxy / nxy - mxy ** 2)

    return kx, ky, nxy, mxy, sxy
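v_txalign solves the pairing with a dynamic program; for intuition, a naive greedy nearest-neighbour matcher (a simplification, not the algorithm above) already recovers the same offset statistics (nxy, mxy) on a clean example:

```python
import numpy as np

x = np.array([0.10, 0.55, 1.02, 1.48])
y = np.array([0.105, 0.560, 1.025, 1.487, 2.30])  # last marker is spurious
maxt = 0.02

kx = np.zeros(len(x), dtype=int)   # 0 means unmatched, else 1-based y index
dts = []
used = set()
for i, xi in enumerate(x):
    j = int(np.argmin(np.abs(y - xi)))        # nearest y marker
    if abs(y[j] - xi) <= maxt and j not in used:
        kx[i] = j + 1
        used.add(j)
        dts.append(y[j] - xi)

nxy = len(dts)
mxy = float(np.mean(dts))          # mean offset y - x over matched pairs
```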

Psychoacoustics

v_importsii

V_IMPORTSII - Calculate the SII importance function.

v_importsii

v_importsii(f, m='') -> ndarray

Calculate the SII importance function per Hz or per Bark.

Parameters:

Name Type Description Default
f array_like

Frequencies in Hz (or Bark if 'b' flag).

required
m str

Mode string: 'b' : Frequencies given in Bark rather than Hz. 'c' : Calculate cumulative importance. 'd' : Calculate importance of n-1 bands. 'h' : Calculate importance per Hz or per Bark.

''

Returns:

Name Type Description
q ndarray

Importance values.

References

[1] ANSI Standard S3.5-1997 (R2007). [2] C. V. Pavlovic. JASA, 82:413-422, 1987.

Source code in pyvoicebox/v_importsii.py
def v_importsii(f, m='') -> np.ndarray:
    """Calculate the SII importance function per Hz or per Bark.

    Parameters
    ----------
    f : array_like
        Frequencies in Hz (or Bark if 'b' flag).
    m : str, optional
        Mode string:
          'b' : Frequencies given in Bark rather than Hz.
          'c' : Calculate cumulative importance.
          'd' : Calculate importance of n-1 bands.
          'h' : Calculate importance per Hz or per Bark.

    Returns
    -------
    q : ndarray
        Importance values.

    References
    ----------
    [1] ANSI Standard S3.5-1997 (R2007).
    [2] C. V. Pavlovic. JASA, 82:413-422, 1987.
    """
    f = np.asarray(f, dtype=float)

    if 'b' in m:
        b = f.copy()
        d_bark = np.ones_like(f)  # not needed for bark input
    else:
        b, d_bark = v_frq2bark(f)

    if 'c' in m or 'd' in m:
        q = _mi * b + _ci + _ai * (b < 4) * (b - 4)**2 - _bi * (b > 18) * (b - 18)**2
        q[b < _xi0] = 0.0
        q[b > _xi1] = 1.0
        if 'd' in m:
            q = q[1:] - q[:-1]
    else:
        q = _mi + _ai * (b < 4) * (b - 4) - _bi * (b > 18) * (b - 18)
        q[b < _xi0] = 0.0
        q[b > _xi1] = 0.0
        if 'b' not in m:
            q = q / d_bark

    return q
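The 'd' option converts cumulative importance at n band edges into n-1 per-band importances by first differencing. A toy cumulative curve (hypothetical values, not the ANSI constants used above) illustrates the bookkeeping:

```python
import numpy as np

Q = np.array([0.0, 0.10, 0.35, 0.70, 0.92, 1.0])  # cumulative at 6 edges
q_band = np.diff(Q)           # importance of the 5 bands between the edges
total = float(q_band.sum())   # telescopes to Q[-1] - Q[0] = 1
```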

v_phon2sone

V_PHON2SONE - Convert PHON loudness values to SONEs.

v_phon2sone

v_phon2sone(p) -> ndarray

Convert PHON loudness values to SONEs.

Parameters:

Name Type Description Default
p array_like

Matrix of phon values.

required

Returns:

Name Type Description
s ndarray

Matrix of sone values, same shape as p.

Notes

The phon scale measures perceived loudness in dB; at 1 kHz it is identical to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional to apparent loudness and, by definition, equals 1 at 40 phon.

References

[1] J. Lochner and J. Burger. Form of the loudness function in the presence of masking noise. JASA, 33: 1705, 1961. [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours. Standard ISO 226:2003, Aug. 2003.

Source code in pyvoicebox/v_phon2sone.py
def v_phon2sone(p) -> np.ndarray:
    """Convert PHON loudness values to SONEs.

    Parameters
    ----------
    p : array_like
        Matrix of phon values.

    Returns
    -------
    s : ndarray
        Matrix of sone values, same shape as p.

    Notes
    -----
    The phon scale measures perceived loudness in dB; at 1 kHz it is identical
    to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional
    to apparent loudness and, by definition, equals 1 at 40 phon.

    References
    ----------
    [1] J. Lochner and J. Burger. Form of the loudness function in the presence
        of masking noise. JASA, 33: 1705, 1961.
    [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours.
        Standard ISO 226:2003, Aug. 2003.
    """
    p = np.asarray(p, dtype=float)
    b = np.log(10) * 0.1 * 0.27  # 0.27 is the exponent from [1] and [2]
    d = np.exp(b * 2.4)           # 2.4 dB is the hearing threshold from [2]
    a = 1.0 / (np.exp(b * 40) - d)  # scale factor to make p=40 give s=1
    s = a * (np.exp(b * p) - d)
    return s
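Re-deriving the three constants inline (no pyvoicebox import) confirms the calibration: 40 phon maps to 1 sone by construction, and with the 0.27 exponent a 10 phon increase multiplies loudness by roughly 1.95, close to the commonly quoted doubling.

```python
import numpy as np

b = np.log(10) * 0.1 * 0.27       # same constants as the function body
d = np.exp(b * 2.4)               # 2.4 dB hearing-threshold term
a = 1.0 / (np.exp(b * 40) - d)    # normalization: 40 phon -> 1 sone
phon2sone = lambda p: a * (np.exp(b * np.asarray(p, float)) - d)

s40 = float(phon2sone(40.0))      # 1 sone by construction
ratio = float(phon2sone(50.0) / phon2sone(40.0))   # ~1.95 per 10 phon
```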

v_sone2phon

V_SONE2PHON - Convert SONE loudness values to PHONs.

v_sone2phon

v_sone2phon(s) -> ndarray

Convert SONE loudness values to PHONs.

Parameters:

Name Type Description Default
s array_like

Matrix of sone values.

required

Returns:

Name Type Description
p ndarray

Matrix of phon values, same shape as s.

Notes

The phon scale measures perceived loudness in dB; at 1 kHz it is identical to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional to apparent loudness and, by definition, equals 1 at 40 phon.

References

[1] J. Lochner and J. Burger. Form of the loudness function in the presence of masking noise. JASA, 33: 1705, 1961. [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours. Standard ISO 226:2003, Aug. 2003.

Source code in pyvoicebox/v_sone2phon.py
def v_sone2phon(s) -> np.ndarray:
    """Convert SONE loudness values to PHONs.

    Parameters
    ----------
    s : array_like
        Matrix of sone values.

    Returns
    -------
    p : ndarray
        Matrix of phon values, same shape as s.

    Notes
    -----
    The phon scale measures perceived loudness in dB; at 1 kHz it is identical
    to dB SPL relative to 20e-6 Pa sound pressure. The sone scale is proportional
    to apparent loudness and, by definition, equals 1 at 40 phon.

    References
    ----------
    [1] J. Lochner and J. Burger. Form of the loudness function in the presence
        of masking noise. JASA, 33: 1705, 1961.
    [2] ISO/TC43. Acoustics - Normal equal-loudness-level contours.
        Standard ISO 226:2003, Aug. 2003.
    """
    s = np.asarray(s, dtype=float)
    b = 1.0 / (np.log(10) * 0.1 * 0.27)  # 0.27 is the exponent from [1] and [2]
    d = np.exp(2.4 / b)                    # 2.4 dB is the hearing threshold from [2]
    a = np.exp(40.0 / b) - d               # scale factor to make p=40 give s=1
    p = b * np.log(a * s + d)
    return p
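Since v_sone2phon is the exact inverse of v_phon2sone, a round trip should return the input to machine precision. A self-contained sketch of both mappings (same constants as the function bodies) checks this:

```python
import numpy as np

b = np.log(10) * 0.1 * 0.27
d = np.exp(b * 2.4)
a = 1.0 / (np.exp(b * 40) - d)
phon2sone = lambda p: a * (np.exp(b * np.asarray(p, float)) - d)
sone2phon = lambda s: np.log(np.asarray(s, float) / a + d) / b  # exact inverse

p0 = np.array([20.0, 40.0, 60.0, 80.0])
p_back = sone2phon(phon2sone(p0))  # round trip recovers the phon values
```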

v_pesq2mos

V_PESQ2MOS - Convert PESQ speech quality scores to MOS.

v_pesq2mos

v_pesq2mos(p) -> ndarray

Convert PESQ speech quality scores to MOS.

Parameters:

Name Type Description Default
p array_like

Matrix of PESQ scores.

required

Returns:

Name Type Description
m ndarray

Matrix of MOS scores, same shape as p.

References

[1] ITU-T Recommendation P.862.1, Nov. 2003.

Source code in pyvoicebox/v_pesq2mos.py
def v_pesq2mos(p) -> np.ndarray:
    """Convert PESQ speech quality scores to MOS.

    Parameters
    ----------
    p : array_like
        Matrix of PESQ scores.

    Returns
    -------
    m : ndarray
        Matrix of MOS scores, same shape as p.

    References
    ----------
    [1] ITU-T Recommendation P.862.1, Nov. 2003.
    """
    p = np.asarray(p, dtype=float)
    a = 0.999
    b = 4.999 - a
    c = -1.4945
    d = 4.6607
    m = a + b / (1.0 + np.exp(c * p + d))
    return m

v_mos2pesq

V_MOS2PESQ - Convert MOS speech quality scores to PESQ.

v_mos2pesq

v_mos2pesq(m) -> ndarray

Convert MOS speech quality scores to PESQ.

Parameters:

Name Type Description Default
m array_like

Matrix of MOS scores.

required

Returns:

Name Type Description
p ndarray

Matrix of PESQ scores, same shape as m.

References

[1] ITU-T Recommendation P.862.1, Nov. 2003.

Source code in pyvoicebox/v_mos2pesq.py
def v_mos2pesq(m) -> np.ndarray:
    """Convert MOS speech quality scores to PESQ.

    Parameters
    ----------
    m : array_like
        Matrix of MOS scores.

    Returns
    -------
    p : ndarray
        Matrix of PESQ scores, same shape as m.

    References
    ----------
    [1] ITU-T Recommendation P.862.1, Nov. 2003.
    """
    m = np.asarray(m, dtype=float)
    a = 0.999
    b = 4.999 - a
    c = -1.4945
    d = 4.6607
    p = (np.log(b / (m - a) - 1.0) - d) / c
    return p
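v_pesq2mos and v_mos2pesq are exact inverses of the same logistic, so a round trip through both formulas (reproduced inline here, not imported from pyvoicebox) recovers the original PESQ scores:

```python
import numpy as np

a, c, d = 0.999, -1.4945, 4.6607   # P.862.1 constants from the code above
b = 4.999 - a
pesq2mos = lambda p: a + b / (1.0 + np.exp(c * np.asarray(p, float) + d))
mos2pesq = lambda m: (np.log(b / (np.asarray(m, float) - a) - 1.0) - d) / c

p0 = np.array([1.5, 2.7, 4.0])
m0 = pesq2mos(p0)                  # MOS scores, strictly increasing with PESQ
p_back = mos2pesq(m0)              # round trip recovers the inputs
```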

v_stoi2prob

V_STOI2PROB - Convert STOI to probability.

v_stoi2prob

v_stoi2prob(s, m='i') -> ndarray

Convert STOI to probability.

Parameters:

Name Type Description Default
s array_like

Matrix containing STOI values.

required
m str

Mapping: 'i' for IEEE sentences (default), 'd' for Dantale corpus.

'i'

Returns:

Name Type Description
p ndarray

Corresponding probability values.

References

[1] C. H. Taal et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio, Speech, Language Processing, 19(7):2125-2136, 2011.

Source code in pyvoicebox/v_stoi2prob.py
def v_stoi2prob(s, m='i') -> np.ndarray:
    """Convert STOI to probability.

    Parameters
    ----------
    s : array_like
        Matrix containing STOI values.
    m : str, optional
        Mapping: 'i' for IEEE sentences (default), 'd' for Dantale corpus.

    Returns
    -------
    p : ndarray
        Corresponding probability values.

    References
    ----------
    [1] C. H. Taal et al. An algorithm for intelligibility prediction of
        time-frequency weighted noisy speech. IEEE Trans. Audio, Speech,
        Language Processing, 19(7):2125-2136, 2011.
    """
    s = np.asarray(s, dtype=float)
    if m and m[0] == 'd':
        a = -14.5435
        b = 7.0792
    else:
        a = -17.4906
        b = 9.6921
    p = 1.0 / (1.0 + np.exp(a * s + b))
    return p
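The mapping is a plain logistic in the STOI value; reproducing the IEEE-sentence constants inline shows it is monotone and saturates near 0 and 1 at the ends of the STOI range:

```python
import numpy as np

a, b = -17.4906, 9.6921            # IEEE-sentence constants from the code above
stoi2prob = lambda s: 1.0 / (1.0 + np.exp(a * np.asarray(s, float) + b))

s = np.linspace(0.0, 1.0, 11)
p = stoi2prob(s)                   # monotone, ~0 at s=0, ~1 at s=1
```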

v_psycdigit

V_PSYCDIGIT - Psychoacoustic digit recognition test (stub).

v_psycdigit

v_psycdigit(*args, **kwargs) -> None

Run a psychoacoustic digit recognition test.

This is an interactive GUI-based function that has no direct Python equivalent without a GUI framework.

Raises:

Type Description
NotImplementedError

Requires interactive GUI.

Source code in pyvoicebox/v_psycdigit.py
def v_psycdigit(*args, **kwargs) -> None:
    """Run a psychoacoustic digit recognition test.

    This is an interactive GUI-based function that has no direct
    Python equivalent without a GUI framework.

    Raises
    ------
    NotImplementedError
        Requires interactive GUI.
    """
    raise NotImplementedError(
        "v_psycdigit is an interactive MATLAB GUI function. "
        "Consider implementing with tkinter or a web interface."
    )

v_psycest

V_PSYCEST - Psychoacoustic estimation (stub).

v_psycest

v_psycest(*args, **kwargs) -> None

Psychoacoustic parameter estimation.

This is a complex estimation function with many options. A stub is provided.

Raises:

Type Description
NotImplementedError

Full implementation pending.

Source code in pyvoicebox/v_psycest.py
def v_psycest(*args, **kwargs) -> None:
    """Psychoacoustic parameter estimation.

    This is a complex estimation function with many options.
    A stub is provided.

    Raises
    ------
    NotImplementedError
        Full implementation pending.
    """
    raise NotImplementedError(
        "v_psycest is a complex psychoacoustic estimation function. "
        "Full implementation pending."
    )

v_psycestu

V_PSYCESTU - Psychoacoustic estimation utilities (stub).

v_psycestu

v_psycestu(*args, **kwargs) -> None

Psychoacoustic estimation utilities.

Raises:

Type Description
NotImplementedError

Full implementation pending.

Source code in pyvoicebox/v_psycestu.py
def v_psycestu(*args, **kwargs) -> None:
    """Psychoacoustic estimation utilities.

    Raises
    ------
    NotImplementedError
        Full implementation pending.
    """
    raise NotImplementedError(
        "v_psycestu is a complex psychoacoustic utility function. "
        "Full implementation pending."
    )

v_psychofunc

V_PSYCHOFUNC - Calculate psychometric functions.

v_psychofunc

v_psychofunc(
    m_or_q=None, q=None, x=None, r=None
) -> ndarray

Calculate psychometric functions: trial success probability vs SNR.

Parameters:

Name Type Description Default
m_or_q str or array_like

Mode string or q parameters (if mode omitted).

None
q array_like

Model parameters [p_threshold, threshold, slope, miss_prob, guess_prob, type].

None
x array_like

SNR values or probability values (for inverse).

None
r array_like

Test results (0 or 1).

None

Returns:

Name Type Description
p ndarray

Probabilities.

Source code in pyvoicebox/v_psychofunc.py
def v_psychofunc(m_or_q=None, q=None, x=None, r=None) -> np.ndarray:
    """Calculate psychometric functions: trial success probability vs SNR.

    Parameters
    ----------
    m_or_q : str or array_like
        Mode string or q parameters (if mode omitted).
    q : array_like, optional
        Model parameters [p_threshold, threshold, slope, miss_prob, guess_prob, type].
    x : array_like, optional
        SNR values or probability values (for inverse).
    r : array_like, optional
        Test results (0 or 1).

    Returns
    -------
    p : ndarray
        Probabilities.
    """
    # Parse arguments
    if isinstance(m_or_q, str):
        m = m_or_q
    elif m_or_q is not None:
        # Mode argument was omitted
        r = x
        x = q
        q = m_or_q
        m = ''
    else:
        m = ''

    qq = np.array([0.5, 0, 0.1, 0, 0, 1])  # defaults

    if q is not None:
        q = np.asarray(q, dtype=float).ravel()
        qq[:len(q)] = q[:min(len(q), 6)]

    pt = qq[0]    # probability at threshold
    xt = qq[1]    # threshold
    sl = qq[2]    # slope at threshold
    pm = qq[3]    # miss/lapse probability
    pg = qq[4]    # guess probability
    ft = int(qq[5])  # function type

    if x is None:
        x = np.linspace(xt - 20, xt + 20, 100)
    x = np.asarray(x, dtype=float)

    if 'i' in m:
        # Inverse function
        p = np.clip(x, pg + 1e-10, 1 - pm - 1e-10)
        u = (p - pg) / (1 - pm - pg)
        u = np.clip(u, 1e-15, 1 - 1e-15)

        if ft == 1:  # logistic
            ut = (pt - pg) / (1 - pm - pg)
            k = sl / (ut * (1 - ut) * (1 - pm - pg))
            return xt + (np.log(u / (1 - u)) - np.log(ut / (1 - ut))) / k
        elif ft == 2:  # cumulative Gaussian
            from scipy.special import erfinv
            ut = (pt - pg) / (1 - pm - pg)
            sig = (1 - pm - pg) / (sl * np.sqrt(2 * np.pi))
            return xt + sig * np.sqrt(2) * (erfinv(2 * u - 1) - erfinv(2 * ut - 1))
        else:
            raise NotImplementedError(f"Function type {ft} inverse not implemented")

    # Calculate normalized variable
    pr = 1 - pm - pg  # probability range

    if ft == 1:  # logistic
        ut = (pt - pg) / pr
        k = sl / (ut * (1 - ut) * pr)
        v = k * (x - xt)
        # logistic with offset
        vt = np.log(ut / (1 - ut))
        u = 1.0 / (1.0 + np.exp(-(v + vt)))
    elif ft == 2:  # cumulative Gaussian
        ut = (pt - pg) / pr
        from scipy.special import erfinv, erfc
        sig = pr / (sl * np.sqrt(2 * np.pi))
        z = (x - xt) / (sig * np.sqrt(2))
        zt = erfinv(2 * ut - 1)
        u = 0.5 * erfc(-(z + zt))
    elif ft == 3:  # Weibull
        ut = (pt - pg) / pr
        k = sl * np.exp(1) * np.log(1 / (1 - ut)) / pr
        u = 1 - np.exp(-np.exp(k * (x - xt) + np.log(np.log(1 / (1 - ut)))))
        u = np.clip(u, 0, 1)
    else:
        raise NotImplementedError(f"Psychometric function type {ft} not implemented")

    p = pg + pr * u

    if r is not None:
        r = np.asarray(r, dtype=float)
        if 'n' not in m:
            # Normalize likelihoods
            p = np.clip(p, 1e-15, 1 - 1e-15)
            likelihood = r * np.log(p) + (1 - r) * np.log(1 - p)
            return np.exp(likelihood - np.max(likelihood))

    if 'r' in m:
        return (np.random.rand(*p.shape) < p).astype(float)

    return p
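For the type-1 (logistic) curve, the parametrization above guarantees p(xt) = pt and a slope of exactly sl at the threshold. A self-contained sketch (no pyvoicebox import, arbitrary example parameters) verifies both properties numerically:

```python
import numpy as np

pt, xt, sl, pm, pg = 0.5, -2.0, 0.1, 0.02, 0.05   # example parameters
pr = 1.0 - pm - pg                 # probability range between guess and lapse
ut = (pt - pg) / pr                # normalized probability at threshold
k = sl / (ut * (1.0 - ut) * pr)    # logistic rate giving slope sl at xt
vt = np.log(ut / (1.0 - ut))       # offset so that p(xt) = pt
p = lambda x: pg + pr / (1.0 + np.exp(-(k * (np.asarray(x, float) - xt) + vt)))

h = 1e-4
slope_at_xt = float((p(xt + h) - p(xt - h)) / (2 * h))  # central difference
```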

Miscellaneous

v_soundspeed

V_SOUNDSPEED - Speed of sound, density and impedance of air.

v_soundspeed

v_soundspeed(
    t=20, p=1, m=0.0289644, g=1.4
) -> tuple[float, float, float]

Calculate speed of sound, density, and acoustic impedance.

Parameters:

Name Type Description Default
t float

Air temperature in Celsius. Default 20.

20
p float

Air pressure in atm. Default 1.

1
m float

Average molecular weight of air in kg/mol. Default 0.0289644.

0.0289644
g float

Adiabatic constant for air. Default 1.4.

1.4

Returns:

Name Type Description
v float

Speed of sound in m/s.

d float

Density of air in kg/m^3.

z float

Characteristic impedance in Pa.s/m.

Source code in pyvoicebox/v_soundspeed.py
def v_soundspeed(t=20, p=1, m=0.0289644, g=1.4) -> tuple[float, float, float]:
    """Calculate speed of sound, density, and acoustic impedance.

    Parameters
    ----------
    t : float, optional
        Air temperature in Celsius. Default 20.
    p : float, optional
        Air pressure in atm. Default 1.
    m : float, optional
        Average molecular weight of air in kg/mol. Default 0.0289644.
    g : float, optional
        Adiabatic constant for air. Default 1.4.

    Returns
    -------
    v : float
        Speed of sound in m/s.
    d : float
        Density of air in kg/m^3.
    z : float
        Characteristic impedance in Pa.s/m.
    """
    p_pa = p * 101325  # convert atm to pascal
    k = t + 273.15     # absolute temperature
    r = 8.3144         # J/(mol K) universal gas constant
    d = p_pa * m / (r * k)
    v = np.sqrt(g * r * k / m)
    z = v * d
    return v, d, z
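Plugging the default arguments into the same ideal-gas formulas (inline, not imported) reproduces the textbook values at 20 C and 1 atm: roughly 343 m/s, 1.20 kg/m^3, and 413 Pa.s/m.

```python
import numpy as np

t, p_atm, m, g = 20.0, 1.0, 0.0289644, 1.4   # the function's default arguments
r = 8.3144                                    # universal gas constant, J/(mol K)
k = t + 273.15                                # absolute temperature
p_pa = p_atm * 101325.0                       # pressure in pascal
d = p_pa * m / (r * k)                        # density, ~1.20 kg/m^3
v = float(np.sqrt(g * r * k / m))             # speed of sound, ~343 m/s
z = v * d                                     # impedance, ~413 Pa.s/m
```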

v_sigma

V_SIGMA - Estimate glottal opening and closing instants using SIGMA algorithm.

This function requires SWT (Stationary Wavelet Transform) which is complex to implement. A simplified stub is provided that raises NotImplementedError.

v_sigma

v_sigma(lx, fs, fmax=400) -> None

Estimate glottal opening and closing instants (SIGMA algorithm).

This function requires the Stationary Wavelet Transform (SWT), which is available in MATLAB's Wavelet Toolbox but has no direct equivalent in scipy.

Parameters:

Name Type Description Default
lx array_like

LX (laryngograph) signal.

required
fs float

Sampling frequency in Hz.

required
fmax float

Maximum laryngeal frequency. Default 400 Hz.

400

Raises:

Type Description
NotImplementedError

SWT-based SIGMA algorithm requires specialized wavelet toolbox support.

Source code in pyvoicebox/v_sigma.py
def v_sigma(lx, fs, fmax=400) -> None:
    """Estimate glottal opening and closing instants (SIGMA algorithm).

    This function requires the Stationary Wavelet Transform (SWT), which
    is available in MATLAB's Wavelet Toolbox but has no direct equivalent
    in scipy.

    Parameters
    ----------
    lx : array_like
        LX (laryngograph) signal.
    fs : float
        Sampling frequency in Hz.
    fmax : float, optional
        Maximum laryngeal frequency. Default 400 Hz.

    Raises
    ------
    NotImplementedError
        SWT-based SIGMA algorithm requires specialized wavelet toolbox support.
    """
    raise NotImplementedError(
        "v_sigma requires SWT (Stationary Wavelet Transform) which is not "
        "readily available in scipy. Consider using pywt.swt from PyWavelets."
    )