Multi-Modal (Multi-Channel Audio-Visual) Speech Recognition, Separation and Diarization, Everything Streaming All at Once
The directed graphical model for the multi-modal cocktail party problem.
Goal of the task
## Streaming Input:
- multi-channel audio from microphone array
- video from RGB/depth cameras
## Streaming Output:
- [Who] says [What] from [t1] to [t2] at [Location Where]
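
As a rough illustration of the output tuple above, one streamed result could be represented as a small record. This is only a sketch: the class name `SpeechEvent`, its field names, the use of seconds for timestamps, and the 3-D coordinate for the location are all assumptions made for this example, not the project's actual interface.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpeechEvent:
    """Hypothetical record for one streamed output item."""
    speaker: str                            # [Who]
    text: str                               # [What]
    t_start: float                          # [t1], assumed to be in seconds
    t_end: float                            # [t2], assumed to be in seconds
    location: Tuple[float, float, float]    # [Location Where], assumed (x, y, z)

    def describe(self) -> str:
        # Render the event in the bracketed form used in this README.
        return (
            f"[{self.speaker}] says [{self.text}] "
            f"from [{self.t_start:.2f}] to [{self.t_end:.2f}] "
            f"at [{self.location}]"
        )


event = SpeechEvent("spk1", "hello", 0.5, 1.25, (1.0, 2.0, 0.0))
print(event.describe())
```

In a streaming setting, a sequence of such records would be emitted incrementally as each speaker's segment is recognized.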
Overall multi-modal system
# demo available by 7/20
Streaming multi-talker ASR demo (with 4 simultaneous speakers)
# demo available by 7/20
Streaming multi-talker Diarization demo
# demo available by 7/20