
ONNX Real-Time DOA Streaming

Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real-time and displays detected sound source directions.

Overview

The script performs the following process:

  1. Audio Capture: Streams audio from a 6-channel microphone array (ReSpeaker)
  2. Channel Selection: Selects and reorders channels [1, 4, 3, 2] to get 4 channels
  3. Feature Extraction: Computes STFT features (magnitude, phase, cosine, sine) from the audio
  4. ONNX Inference: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
  5. Histogram Aggregation: Aggregates logits into a circular histogram of azimuth angles
  6. Peak Detection: Finds peaks in the histogram to identify sound source directions
  7. Event Gating: Filters detections based on audio level changes and coherence
  8. Visualization: Displays detected directions on a polar plot in real-time

Prerequisites

Hardware

  • ReSpeaker 6-Mic Array (or compatible multi-channel microphone)
  • Microphone positions (x, y relative to the array center):
    • [0.0277, 0.0] # Mic 0: 0°
    • [0.0, 0.0277] # Mic 1: 90°
    • [-0.0277, 0.0] # Mic 2: 180°
    • [0.0, -0.0277] # Mic 3: 270°
  • NVIDIA GPU (optional, for faster inference)
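
As a quick sanity check, the listed positions map to the stated angles. A minimal sketch using `atan2` (position values copied from the list above):

```python
import math

# Microphone positions from the list above (x, y relative to array center)
mic_positions = [
    (0.0277, 0.0),    # Mic 0
    (0.0, 0.0277),    # Mic 1
    (-0.0277, 0.0),   # Mic 2
    (0.0, -0.0277),   # Mic 3
]

# Azimuth of each mic, counter-clockwise from the +x axis
angles = [math.degrees(math.atan2(y, x)) % 360 for x, y in mic_positions]
print(angles)  # [0.0, 90.0, 180.0, 270.0]
```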

Software Dependencies

Install the required packages:

conda activate doaEnv
pip install onnxruntime-gpu  # For GPU inference
# OR
pip install onnxruntime      # For CPU-only inference

pip install pyaudio numpy matplotlib torch pyyaml

ONNX Model

You need a converted ONNX model file. If you haven't converted your PyTorch model yet:

python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx

Quick Start

1. List Available Audio Devices

First, find your ReSpeaker device index:

python onnx_stream_microphone.py --list-devices

Look for a device whose name contains "ReSpeaker", "Seeed", or "2886", and note its device index.

2. Stop PulseAudio (Required)

On Linux, PulseAudio often locks the ALSA devices. You need to temporarily stop it:

pulseaudio --kill

Note: You can use the helper script run_onnx_stream.sh which automates this (see below).

3. Run the Streaming Script

Basic usage:

python onnx_stream_microphone.py \
    --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
    --device-index 9

4. Restart PulseAudio (After Stopping)

After you're done, restart PulseAudio:

pulseaudio --start

Using the Helper Script

A helper script automates PulseAudio management:

chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9

This script will:

  1. Stop PulseAudio
  2. Run the streaming script
  3. Restart PulseAudio when you exit (Ctrl+C)

Command-Line Arguments

Required Arguments

  • --onnx PATH: Path to the ONNX model file

Audio Configuration

  • --device-index INT: Audio device index (use --list-devices to find it)
  • --sample-rate INT: Sample rate in Hz (default: 16000)
  • --window-ms INT: Analysis window length in milliseconds (default: 200)
  • --hop-ms INT: Hop size (frame advance) in milliseconds (default: 100)
  • --chunk-size INT: Audio buffer chunk size (default: 1600)
  • --cpu-only: Use CPU only (disable GPU inference)
  • --list-devices: List all available audio input devices and exit

Model Configuration

  • --config PATH: Path to config.yaml (default: configs/train.yaml)

Histogram Detection Parameters

These control how DOA peaks are detected from the model logits:

  • --K INT: Number of azimuth bins (default: 72, should match model)
  • --tau FLOAT: Softmax temperature for histogram (default: 0.8)
  • --smooth-k INT: Histogram smoothing kernel size (default: 1)
  • --min-peak-height FLOAT: Minimum peak height threshold (default: 0.10)
  • --min-window-mass FLOAT: Minimum window mass for peak validation (default: 0.24)
  • --min-sep-deg FLOAT: Minimum angular separation between peaks in degrees (default: 20.0)
  • --min-active-ratio FLOAT: Minimum active frame ratio (default: 0.20)
  • --max-sources INT: Maximum number of sources to detect (default: 3)

Event Gate Parameters

These control when detections are considered valid (filtering noise):

  • --level-delta-on-db FLOAT: Level increase threshold to open gate (default: 2.5)
  • --level-delta-off-db FLOAT: Level decrease threshold to close gate (default: 1.0)
  • --level-min-dbfs FLOAT: Minimum audio level in dBFS (default: -60.0)
  • --level-ema-alpha FLOAT: Exponential moving average alpha for level tracking (default: 0.05)
  • --event-hold-ms INT: Minimum time to keep gate open after detection (default: 300)
  • --min-R-clip FLOAT: Minimum R_clip (coherence measure) to open gate (default: 0.18)
  • --event-refractory-ms INT: Minimum time between gate state changes (default: 120)

Onset Detection Parameters

  • --onset-alpha FLOAT: EMA alpha for spectral flux tracking (default: 0.05)

Example with Custom Parameters

python onnx_stream_microphone.py \
    --onnx doa_model.onnx \
    --device-index 9 \
    --window-ms 400 \
    --hop-ms 100 \
    --K 72 \
    --max-sources 2 \
    --tau 0.8 \
    --smooth-k 1 \
    --min-peak-height 0.08 \
    --min-window-mass 0.16 \
    --min-sep-deg 22.5 \
    --min-active-ratio 0.15 \
    --level-delta-on-db 4.0 \
    --level-delta-off-db 1.5 \
    --level-min-dbfs -55.0 \
    --level-ema-alpha 0.05 \
    --event-hold-ms 320 \
    --event-refractory-ms 200 \
    --min-R-clip 0.30 \
    --onset-alpha 0.05

Understanding the Output

Console Output

Each line shows:

[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]

  • [time]: Elapsed time in seconds
  • LVL: Audio level in dBFS
  • diff: Level difference from background (dB)
  • FLUXz: Spectral flux z-score (onset detection)
  • COH: Inter-microphone coherence
  • GATE: Gate state (OPEN/CLOSED)
  • MODEL: Model inference time (ms)
  • HIST: Histogram processing time (ms)
  • DOA(R=..., n=...): R_clip value and number of detected peaks
  • [angles]: Detected azimuth angles in degrees

Visual Output

A polar plot window shows:

  • Green lines: Detected sound source directions
  • Line thickness: Proportional to confidence score
  • Angle labels: Azimuth in degrees (0° = North/front)

Azimuth Convention

  • 0° = North (front of microphone)
  • 90° = East (right)
  • 180° = South (back)
  • 270° = West (left)
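
With K histogram bins, a detection in bin k maps to an azimuth of k·360/K degrees under this convention. A minimal sketch (the exact bin-center convention is an assumption):

```python
def bin_to_deg(k: int, K: int = 72) -> float:
    """Center angle of histogram bin k, in degrees (K matches the --K default)."""
    return (k * 360.0 / K) % 360.0

# With K=72, each bin spans 5 degrees
print(bin_to_deg(0), bin_to_deg(18), bin_to_deg(36), bin_to_deg(54))
# 0.0 90.0 180.0 270.0
```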

How It Works

1. Audio Processing Pipeline

Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
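
A sketch of the de-interleave and channel-selection step with NumPy (the buffer layout is assumed frame-major, as PyAudio returns interleaved samples):

```python
import numpy as np

# Simulated interleaved 6-channel int16 buffer, as read from the stream
n_frames = 4
raw = np.arange(n_frames * 6, dtype=np.int16)  # frame-major, channel-minor

# De-interleave to (n_frames, 6), then pick channels [1, 4, 3, 2]
audio = raw.reshape(n_frames, 6)
selected = audio[:, [1, 4, 3, 2]]              # shape (n_frames, 4)
print(selected[0])                             # first frame → [1 4 3 2]
```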

2. Feature Extraction

For each analysis window:

  • Compute STFT for all 4 channels
  • Extract magnitude, phase, cosine, and sine components
  • Result: (T_frames, 12_features, F_freq_bins)
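
The 12-feature shape suggests magnitude plus cosine and sine of the phase for each of the 4 channels (4 × 3 = 12); that is the assumption in this NumPy sketch, and the window/hop values are illustrative:

```python
import numpy as np

def stft_features(audio, n_fft=512, hop=160):
    """Per-window features: for each of the 4 channels, the STFT magnitude
    and the cosine/sine of the phase (4 channels x 3 = 12 features)."""
    C, N = audio.shape                            # (channels, samples)
    win = np.hanning(n_fft)
    n_frames = 1 + (N - n_fft) // hop
    feats = []
    for t in range(n_frames):
        seg = audio[:, t * hop : t * hop + n_fft] * win
        spec = np.fft.rfft(seg, axis=-1)          # (C, F)
        mag, phase = np.abs(spec), np.angle(spec)
        feats.append(np.concatenate([mag, np.cos(phase), np.sin(phase)], axis=0))
    return np.stack(feats)                        # (T_frames, 12, F)

x = np.random.randn(4, 3200)                      # 200 ms of 4-ch audio at 16 kHz
f = stft_features(x)
print(f.shape)                                    # (17, 12, 257)
```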

3. Model Inference

  • Batch process features through ONNX model
  • Output: (T_frames, K_bins) logits per frame
  • Each frame has K probability scores for different azimuth angles

4. Histogram Aggregation

  • Apply softmax with temperature tau to logits
  • Weight by circular coherence (R_clip)
  • Aggregate across all frames into a single histogram
  • Smooth the histogram

5. Peak Detection

  • Find local maxima in the histogram
  • Filter by minimum height, separation, and window mass
  • Refine peak positions using parabolic interpolation
  • Return up to max_sources peaks
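
A sketch of circular peak picking with greedy separation and parabolic refinement (thresholds mirror the CLI defaults; the exact selection order in the script is an assumption):

```python
import numpy as np

def find_peaks_circular(hist, min_height=0.10, min_sep_bins=4, max_sources=3):
    """Local maxima on a circular histogram, kept strongest-first subject to
    a minimum bin separation, then refined by parabolic interpolation."""
    K = len(hist)
    left, right = np.roll(hist, 1), np.roll(hist, -1)
    cand = np.where((hist > left) & (hist >= right) & (hist >= min_height))[0]
    cand = cand[np.argsort(hist[cand])[::-1]]     # strongest first
    peaks = []
    for k in cand:
        if all(min(abs(k - p), K - abs(k - p)) >= min_sep_bins for p in peaks):
            peaks.append(k)
        if len(peaks) == max_sources:
            break
    refined = []
    for k in peaks:                               # parabolic refinement
        y0, y1, y2 = hist[(k - 1) % K], hist[k], hist[(k + 1) % K]
        denom = y0 - 2 * y1 + y2
        delta = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom
        refined.append(((k + delta) % K) * 360.0 / K)
    return refined

hist = np.zeros(72)
hist[[9, 36]] = [0.3, 0.2]                        # peaks at 45° and 180°
print(find_peaks_circular(hist))                  # [45.0, 180.0]
```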

6. Event Gating

  • Track audio level with exponential moving average
  • Open gate when:
    • Level increases by level_delta_on_db OR
    • Valid peaks detected AND R_clip > min_R_clip
  • Close gate when level drops and no valid peaks
  • Apply hold and refractory periods to prevent flickering
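
The gating logic above amounts to a hysteresis gate over an EMA-tracked background level. A minimal sketch (parameter defaults mirror the CLI values; the exact close condition and refractory handling in the script are assumptions):

```python
import time

class EventGate:
    """Hysteresis gate: opens on a level jump above the EMA background
    (or on confident DOA peaks), closes only after a hold period."""
    def __init__(self, on_db=2.5, off_db=1.0, alpha=0.05, hold_ms=300,
                 min_r_clip=0.18):
        self.on_db, self.off_db, self.alpha = on_db, off_db, alpha
        self.hold_s, self.min_r = hold_ms / 1000.0, min_r_clip
        self.bg = None            # EMA of the background level (dBFS)
        self.open = False
        self.open_t = 0.0

    def update(self, level_dbfs, r_clip, n_peaks, now=None):
        now = time.monotonic() if now is None else now
        if self.bg is None:
            self.bg = level_dbfs
        diff = level_dbfs - self.bg
        if not self.open:
            self.bg += self.alpha * diff          # track background while closed
            if diff > self.on_db or (n_peaks > 0 and r_clip > self.min_r):
                self.open, self.open_t = True, now
        elif diff < self.off_db and n_peaks == 0 \
                and now - self.open_t > self.hold_s:
            self.open = False
        return self.open

gate = EventGate()
print(gate.update(-60.0, 0.0, 0, now=0.0))  # quiet baseline → False
print(gate.update(-50.0, 0.3, 1, now=0.1))  # loud + coherent peak → True
```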

Troubleshooting

"Invalid number of channels" Error

Problem: Device reports 0 channels or PyAudio can't open it.

Solution:

  1. Stop PulseAudio: pulseaudio --kill
  2. Run the script
  3. Restart PulseAudio: pulseaudio --start

Or use the helper script run_onnx_stream.sh.

No Audio Detected

  • Check microphone connections
  • Verify device index with --list-devices
  • Check audio levels (should be above level_min_dbfs)
  • Adjust level_delta_on_db to be more sensitive

GPU Not Used

  • Verify CUDA is available: python -c "import torch; print(torch.cuda.is_available())"
  • Install onnxruntime-gpu instead of onnxruntime
  • Check that CUDA providers are listed in the model loading message

Model Mismatch Errors

  • Ensure --K matches the model's K value (usually 72)
  • Check that the ONNX model was exported with the correct input shape
  • Verify config.yaml matches training configuration

Poor DOA Accuracy

  • Increase --window-ms for longer analysis windows (more stable)
  • Adjust --min-peak-height and --min-window-mass thresholds
  • Tune --tau (lower = sharper peaks, higher = smoother)
  • Check microphone array calibration and positioning

Performance Tips

  • GPU Inference: Use onnxruntime-gpu for 5-10x speedup
  • Window Size: Larger windows (400ms) = more stable but higher latency
  • Hop Size: Smaller hops (50ms) = more responsive but more computation
  • Batch Size: The script uses batch_size=25 internally for efficient GPU usage
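
The batching strategy can be sketched as follows; `fake_infer` stands in for an onnxruntime session call (the K=72 output size is an assumption):

```python
import numpy as np

def run_batched(infer, feats, batch_size=25):
    """Run per-frame inference in fixed-size batches (mirrors the script's
    internal batch_size=25). `infer` maps (B, 12, F) -> (B, K) logits."""
    outs = [infer(feats[i:i + batch_size])
            for i in range(0, len(feats), batch_size)]
    return np.concatenate(outs)

# Stand-in for session.run; a real model would produce K=72 logits per frame
fake_infer = lambda x: np.zeros((len(x), 72))
logits = run_batched(fake_infer, np.zeros((60, 12, 257)))
print(logits.shape)  # (60, 72) — batches of 25, 25, 10
```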

Stopping the Script

Press Ctrl+C to stop the stream. The script will:

  • Close the audio stream
  • Close the visualization window
  • Clean up resources

Integration

To use this in your own code, see onnx_doa_inference.py which provides a standalone inference class that can be integrated into other projects.
