Input Tensor Shape Explanation

#2
by anstdev - opened

Hello, I am trying to integrate this model into my project (using Unity Sentis).
However, I am struggling to create the proper input tensor.

The input tensor shape is reported as [2, 'num_splits', 512, 1024] of type float (I read the input shape via https://stackoverflow.com/a/73955585).

My questions:

  • What does each input dimension represent? Is there any mapping to well-known Spleeter parameters?
  • Is the input tensor different from the original Spleeter model?
  • How do I construct the proper input tensor from a float[] of audio samples? I assume the model expects 44100 Hz audio, is that correct?
    • More specifically, which parameters are needed for the short-time Fourier transform (STFT), e.g. windowSize, hopSize, etc.?
  • Similarly, could you please also explain the output tensor shape?

Thanks for the help!

First of all, sherpa-onnx provides a C# API.

Second, if you don't want to use sherpa-onnx, you can have a look at
https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/spleeter/separate_onnx.py
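
For readers who only want to see how a [2, 'num_splits', 512, 1024] layout could be filled, here is a minimal sketch. It is not taken from the linked script, which remains the reference. It assumes the dimensions mean (channels, splits, frames per split, frequency bins), matching the original Spleeter defaults T=512 and F=1024, and that the STFT magnitudes have already been computed elsewhere (Spleeter's defaults there are a 4096-sample frame and a 1024-sample hop at 44100 Hz). The function name `pack_spectrogram` is hypothetical.

```python
import numpy as np

FRAMES_PER_SPLIT = 512  # assumption: corresponds to Spleeter's T
NUM_BINS = 1024         # assumption: corresponds to Spleeter's F


def pack_spectrogram(mag):
    """Pack stereo STFT magnitudes into (2, num_splits, 512, 1024).

    mag: array of shape (2, num_frames, num_bins_full), e.g. 2049 bins
    for a 4096-point FFT. Extra bins are dropped and the frame axis is
    zero-padded up to a multiple of FRAMES_PER_SPLIT.
    """
    mag = np.asarray(mag, dtype=np.float32)
    if mag.ndim != 3 or mag.shape[0] != 2:
        raise ValueError("expected shape (2, num_frames, num_bins)")
    if mag.shape[2] < NUM_BINS:
        raise ValueError("need at least %d frequency bins" % NUM_BINS)

    # Keep only the lowest NUM_BINS frequency bins.
    mag = mag[:, :, :NUM_BINS]

    # Zero-pad the frame axis so it divides evenly into splits.
    num_frames = mag.shape[1]
    pad = (-num_frames) % FRAMES_PER_SPLIT
    mag = np.pad(mag, ((0, 0), (0, pad), (0, 0)))

    num_splits = (num_frames + pad) // FRAMES_PER_SPLIT
    return mag.reshape(2, num_splits, FRAMES_PER_SPLIT, NUM_BINS)
```

With 700 frames per channel, for example, the frame axis is padded to 1024 and the result has num_splits = 2. How the model's output is cropped and inverted back to audio is not covered here; see the linked script for that.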

csukuangfj changed discussion status to closed

Thank you, the linked Python implementation is very helpful for me.
Somehow I was not able to find it. Sorry for the inconvenience.
