I would like to request two or three modes for answering a voice (or video) message:
- transcribe the message and reply
- transcribe the message only (no reply; go straight to Whisper, which saves tokens)
- reply only (the current implementation; could be kept as the default or replaced by transcribe + reply)
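
To make the request concrete, here is a rough sketch of how the three modes could be dispatched. None of these names come from the existing code: `VoiceMode`, `VoiceResult`, `handle_voice_message`, and the `transcribe` / `reply` callables are all hypothetical. It also assumes that the current "reply only" behaviour transcribes internally and simply does not show the transcript.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class VoiceMode(Enum):
    """Hypothetical setting for how a voice/video message is answered."""
    TRANSCRIBE_AND_REPLY = "transcribe_and_reply"
    TRANSCRIBE_ONLY = "transcribe_only"
    REPLY_ONLY = "reply_only"


@dataclass
class VoiceResult:
    transcript: Optional[str] = None  # text shown to the user, if any
    reply: Optional[str] = None       # model answer, if any


def handle_voice_message(
    audio: bytes,
    mode: VoiceMode,
    transcribe: Callable[[bytes], str],  # e.g. a wrapper around Whisper (assumed)
    reply: Callable[[str], str],         # e.g. a wrapper around the chat model (assumed)
) -> VoiceResult:
    # Every mode needs the text of the message first.
    text = transcribe(audio)

    if mode is VoiceMode.TRANSCRIBE_ONLY:
        # Skip the chat model entirely, so no reply tokens are spent.
        return VoiceResult(transcript=text)

    answer = reply(text)

    if mode is VoiceMode.REPLY_ONLY:
        # Assumed current behaviour: answer without showing the transcript.
        return VoiceResult(reply=answer)

    return VoiceResult(transcript=text, reply=answer)
```

With this shape, the default could stay `VoiceMode.REPLY_ONLY` or be switched to `VoiceMode.TRANSCRIBE_AND_REPLY` without touching the other branches.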