OpenAI Whisper Audio Transcriber
The code defines a `SpeechToTextAction` class inside the `Sublayer::Actions` module that converts audio data to text. The action receives an IO-like audio object at initialization. The `call` method first writes the audio bytes to a temporary file with a `.webm` extension, opened with the binary ASCII-8BIT encoding so Ruby does not transcode the raw data.
The temp file is then POSTed to OpenAI's audio transcription endpoint (`https://api.openai.com/v1/audio/transcriptions`). The request carries a Bearer authorization header built from the `OPENAI_API_KEY` environment variable and is encoded as `multipart/form-data`, with the request body setting `model` to `whisper-1` to select OpenAI's Whisper transcription model.
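For reference, here is the same request in isolation, outside the action class. This is a minimal sketch assuming the `httparty` gem is installed, `OPENAI_API_KEY` is exported, and a local recording named `sample.webm` exists (a hypothetical file name). HTTParty detects the file object in the body and generates the multipart boundary itself, so no explicit Content-Type header is required:

require "httparty"

response = HTTParty.post(
  "https://api.openai.com/v1/audio/transcriptions",
  headers: { "Authorization" => "Bearer #{ENV["OPENAI_API_KEY"]}" },
  body: {
    file: File.open("sample.webm", "rb"), # hypothetical local recording
    model: "whisper-1"
  }
)
puts response["text"] # the transcription string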
Back in the action, once the request completes, `call` closes and deletes the temporary file (in an `ensure` block, so cleanup happens even if the request raises) and returns the transcription taken from the `text` field of the API's JSON response.
require "httparty"
require "tempfile"

module Sublayer
  module Actions
    class SpeechToTextAction < Base
      def initialize(audio_data)
        @audio_data = audio_data
      end

      def call
        # Buffer the incoming audio in a temp file; the binary (ASCII-8BIT)
        # encoding stops Ruby from transcoding the raw webm bytes.
        tempfile = Tempfile.new(['audio', '.webm'], encoding: 'ascii-8bit')
        tempfile.write(@audio_data.read)
        tempfile.rewind

        # HTTParty spots the Tempfile in the body and sends the request as
        # multipart/form-data, generating the boundary itself, so no explicit
        # Content-Type header is needed.
        response = HTTParty.post(
          "https://api.openai.com/v1/audio/transcriptions",
          headers: {
            "Authorization" => "Bearer #{ENV["OPENAI_API_KEY"]}"
          },
          body: {
            file: tempfile,
            model: "whisper-1"
          }
        )

        # The transcription lives in the "text" field of the JSON response.
        response["text"]
      ensure
        # Remove the temp file even if the request raises.
        tempfile&.close
        tempfile&.unlink
      end
    end
  end
end
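A hypothetical caller might look like the following, assuming `OPENAI_API_KEY` is set and `meeting.webm` is an assumed local file; any IO-like object that responds to `read` would work as the input:

# Hypothetical usage of the action.
audio = File.open("meeting.webm", "rb")
transcript = Sublayer::Actions::SpeechToTextAction.new(audio).call
puts transcript
audio.close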