Input Tensor Shape Explanation
Hello, I am trying to integrate this model in my project (using Unity Sentis).
However, I am struggling to create the proper input tensor.
The input tensor shape is shown to me as [2, 'num_splits', 512, 1024] of type float (I read the input shape via https://stackoverflow.com/a/73955585).
My questions:
- What does each input dimension represent? Is there any mapping to well-known Spleeter parameters?
- Is the input tensor different from the original Spleeter model?
- How do I construct the proper input tensor from a float[] of audio samples? I assume the model expects 44100 Hz audio, is that correct?
- More specifically, which parameters are needed for the short-time Fourier transform (STFT) (windowSize, hopSize, etc.)?
- Similarly, could you please also explain the output tensor shape?
Thanks for the help!
You can find my current attempt on GitHub: https://github.com/achimmihca/SpleeterAiUnityDemo
More specifically: https://github.com/achimmihca/SpleeterAiUnityDemo/blob/main/Assets/Scenes/SpleeterAudioSeparator.cs
First of all, sherpa-onnx provides a C# API.
Second, if you don't want to use sherpa-onnx, you can have a look at
https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/spleeter/separate_onnx.py
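For readers who only want the shape logic, here is a minimal NumPy sketch of how a [2, num_splits, 512, 1024] tensor could be built from stereo samples. It is not taken from the linked script. The window size (4096) and hop size (1024) are assumptions based on the original Spleeter default configuration, and the choice of a Hann window, magnitude values, and zero-padding of the last split are likewise assumptions. The helper names `stft_magnitude` and `build_input` are hypothetical. Please check every value against separate_onnx.py before relying on it.

```python
import numpy as np

# Assumed values, based on the original Spleeter default configuration.
# The thread does not confirm them for this ONNX export.
FRAME_LENGTH = 4096  # STFT window size (assumption)
FRAME_STEP = 1024    # STFT hop size (assumption)
T = 512              # frames per split, matches input dimension 2
F = 1024             # frequency bins kept, matches input dimension 3


def stft_magnitude(channel: np.ndarray) -> np.ndarray:
    """Magnitude spectrogram of one channel, shape (num_frames, F)."""
    # Zero-pad very short inputs so that at least one full frame exists.
    if len(channel) < FRAME_LENGTH:
        channel = np.pad(channel, (0, FRAME_LENGTH - len(channel)))
    num_frames = 1 + (len(channel) - FRAME_LENGTH) // FRAME_STEP
    window = np.hanning(FRAME_LENGTH)  # Hann window (assumption)
    frames = np.stack([
        channel[i * FRAME_STEP: i * FRAME_STEP + FRAME_LENGTH] * window
        for i in range(num_frames)
    ])
    # rfft yields FRAME_LENGTH // 2 + 1 bins; keep only the first F.
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum[:, :F]).astype(np.float32)


def build_input(stereo: np.ndarray) -> np.ndarray:
    """stereo: (num_samples, 2) floats -> (2, num_splits, T, F) float32."""
    mags = np.stack([stft_magnitude(stereo[:, c]) for c in range(2)])
    # Zero-pad the frame axis to a multiple of T, then cut it into splits.
    pad = (-mags.shape[1]) % T
    mags = np.pad(mags, ((0, 0), (0, pad), (0, 0)))
    return mags.reshape(2, -1, T, F)
```

Under these assumptions, one second of 44100 Hz stereo audio yields 40 STFT frames, which are padded to a single split, so `build_input(np.zeros((44100, 2)))` has shape (2, 1, 512, 1024).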
Thank you, the linked Python implementation is very helpful for me.
Somehow I was not able to find it. Sorry for the inconvenience.