It's not a model you can run on your own server but a free service on revoldiv.com. You can expect 40 to 50 second wait time to transcribe an hour long video/audio. We combine whisper with our model to get word level timestamps, paragraph separation and sound detections like laughter, music etc... We recently added very basic podcast search and transcription.