So I've seen this talked about here a bit, but I wanted to give more context on Kokoro TTS. The model was open-sourced back on December 25 and was trained almost entirely on synthetic data generated by ElevenLabs and OpenAI voices. Legality aside, the quality speaks for itself. At 82 million parameters it's very small by today's standards, but that also means it's incredibly fast, even on CPU.
The main dev responsible for training seems to know much more than the average open source enthusiast about how to make high-quality TTS, and it shows in the results. The model is under very active development and still quite young: more data is currently being collected, and a new version will likely be trained and released in the coming months. Their Discord is quite active, and I'm over there as well if you'd like to join. I think this has the potential to be a great option for blind screen reader users who may not be able to afford something like Vocalizer on Windows, but we're not quite there just yet in terms of performance.
Here is a demo of one of the voices reading about Android.
Link to model card on Huggingface: https://huggingface.co/hexgrad/Kokoro-82M
Link to Discord: https://discord.gg/QuGxSWBfQy
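For anyone who wants to try it locally, here's a rough sketch of CPU inference. This assumes the `kokoro` Python package and the `KPipeline` interface described on the model card; the project is young and the API may change, so treat this as a sketch and check the card for current usage and voice names:

```python
# Rough sketch of local Kokoro inference; assumes `pip install kokoro soundfile`
# and the KPipeline interface described on the Hugging Face model card.
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "Kokoro is an 82 million parameter TTS model that is fast even on CPU."
# The pipeline yields (graphemes, phonemes, audio) per generated segment.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'segment_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```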
@ZBennoui that voice somehow oddly reminds me of Vocalizer Zoe. Hopefully some day we'll be able to fine-tune our own model from theirs with our voice training data; I'll probably jump more on board once we get a notebook and more tools and instructions for doing so. But I'm glad to know others are making them aware of this potential, as it would be a shame to let an engine like that go to waste.
@Tamasg This was already brought up: the model is just a fine-tune of StyleTTS 2, which can already be trained. There's no benefit to additional fine-tuning on this model, since it was almost exclusively trained on synthetic data, so they recommend that if you'd like to make your own voices, you fine-tune the base StyleTTS 2 instead.
@ZBennoui ooh, interesting, I hadn't heard of StyleTTS. Looks like they do have a way to train through notebooks (https://github.com/yl4579/StyleTTS2/discussions/144) - but does that mean it might be easier to get an NVDA driver for that specifically, rather than for this derived model? Not sure.
@Tamasg Yeah. StyleTTS 2 came out last year, and at the time it was competitive with the more recent transformer-based commercial models such as ElevenLabs. I'm not sure how it's held up or how development has been proceeding; I haven't tried fine-tuning that model myself yet, as it looked quite hard initially.
@Tamasg The main advantage of something like StyleTTS is that you can give it data from, say, an audiobook, and the model will have a much easier time picking up the intonation patterns and expression in the speech, as opposed to a simpler architecture such as the one used by Piper.
@ZBennoui yeah, I did partially wonder that. One limitation we have with screen reader voices is not just that speech arrives as lots of small chunks, but that we can't feed the model text ahead of time to help it sound more natural. That could still produce disjointed speech patterns if the model can't adapt its delivery to punctuation as dynamically as it would across a full text passage.
@Tamasg that's a great point and something I hadn't really considered. I wonder if that's part of the reason so many screen reader companies hesitate to switch over to neural options, especially the more recent transformer-based approaches that really require proper sentence- or paragraph-level context in order to sound natural.
@ZBennoui it would be interesting if you tried a sample inference of chunks that way for a simple dialog flow - but you'd need to manually split the text as though the screen reader were indexing it. For NVDA I've noticed this may be 2-4 chunks per flow: dialog name and dialog role (1), body text (2), focused item (3), and additional state for that first-focused item, such as collapsed (4). So you'd need to inference a passage with smaller splits like that to truly know, then merge all the WAVs into one single file xD
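Something like this, roughly. This is just a sketch reusing the assumed `KPipeline` interface from the model card linked above; the chunk texts are made-up stand-ins for those 2-4 splits:

```python
# Sketch: synthesize NVDA-style chunks separately, then merge into one WAV to
# hear how disjointed per-chunk inference sounds. Reuses the assumed KPipeline
# interface from the model card; chunk texts are made-up examples of the
# 2-4 splits described above.
import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')
SR = 24000  # Kokoro's output sample rate

chunks = [
    "Save changes dialog",                               # 1: dialog name and role
    "Do you want to save your changes before closing?",  # 2: body text
    "Save button",                                       # 3: focused item
    # a 4th chunk would carry extra state, e.g. "collapsed", when present
]

pieces = []
for chunk in chunks:
    for _, _, audio in pipeline(chunk, voice='af_heart'):
        pieces.append(np.asarray(audio, dtype=np.float32))
    # short gap between chunks, roughly like a screen reader pause
    pieces.append(np.zeros(int(0.15 * SR), dtype=np.float32))

sf.write('dialog_flow.wav', np.concatenate(pieces), SR)
```

Listening to the merged file next to a single-pass synthesis of the same dialog text should make the prosody difference pretty obvious.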
@pvagner @Tamasg @ZBennoui I would love to know how to train one of these and make a cool-sounding Bulgarian voice - IF it's really lighter than the heavy stuff I assume we currently have at our disposal as far as voices go. Because even my AMD processor doesn't take kindly to however these new neural voices are made. Now that'll be something to experiment with. :D
@tardis I hope you don't mind me being so curious. I am learning slowly and working on these things even slower, but I have already helped to train a Slovak Piper voice. The result was not that awful, however it wasn't great either. So for a few months now I have been working on improving the eSpeak phonemization rules for the Slovak language, trying to make sure all the language-specific features are respected as much as possible even before training. I know at least Piper and OptiSpeech use eSpeak for phonemization under the hood. These days another Piper training run of a Slovak voice with these improvements is running over here, and it turns out it sounds much better in terms of pronunciation. Only time will tell whether people will like it, though. All this work is based on previous work we did with friends while making Slovak voices for RHVoice, so we have text prompts and high-quality voice recordings for Slovak.
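For example, here's the kind of spot-check I run before a training pass. It's a small sketch that assumes the espeak-ng binary is installed and on PATH, since Piper uses espeak-ng for phonemization:

```python
# Sketch: spot-check espeak-ng's phonemization for a language before training.
# Assumes the espeak-ng binary is installed and on PATH; -q suppresses audio,
# --ipa prints the phonemes, -v selects the language voice.
import subprocess

def phonemize(text: str, voice: str = "sk") -> str:
    """Return espeak-ng's IPA transcription of `text` without speaking it."""
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for phrase in ["ďakujem", "dobrý deň"]:  # sample Slovak phrases
    print(phrase, "->", phonemize(phrase))
```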
And now to those questions:
Apart from the robotic-sounding voice, how do you like the Bulgarian eSpeak pronunciation?
Is it similar in complexity to Russian? I can't speak or understand Russian, but I do know the Russian eSpeak rules include a huge list of exceptions, and Russian speakers still don't like it that much.
Do you have high-quality Bulgarian recordings of a single speaker, or do you know of a public dataset that may include such recordings?
@pvagner Hmm, other than eSpeak's R being pronounced weird and throaty instead of the rolling R we're all familiar with, I haven't had other issues.
@pvagner Oh, that sure sounds like a lot of work. If I had no job I could surely do it, but it seems time-consuming. And I understand basic phrases; I am learning, but slowly. I took Spanish as a second language in university and have not graduated due to difficulties with the teachers and a lot of fighting for accessibility. It sucks. And the fact that Spanish isn't coming to me intuitively like, say, English isn't nice either. But I can speak both English and Bulgarian quite fluently.
@pvagner Yes, I have access to them all, but you're right that I would need to ask the person to grant me authorization to use them. Since I'm not sure I'll get it, I'm wondering whether Eloquence in Portuguese would be OK.
@pvagner OK, that's because eSpeak sounds too artificial in Portuguese, though there are people who like it because it stays intelligible at fast speeds. In any case, I'll try contacting the person who recorded the speeches for RHVoice.
@pvagner OK, thanks very much. I actually like to study the basics of linguistics. If eSpeak's tone won't interfere with the final voice quality, then most of the problem is probably solved. I believe I still have all the written sentences used for RHVoice. Is it OK to take those sentences and record them using eSpeak?
@pvagner @clv0 interesting that you helped the Slovak voice like that :) great dedication. Around 2009 or so we got a Hungarian voice in eSpeak after Jonathan used my recordings to construct the phoneme data, but I must say the Hungarian RHVoice does have the better accent, although eSpeak still isn't unpleasant over longer passages the way Vocalizer can be, so it's still a win. Since some of the RHVoice voices are taken from Piper and other open-source data, voice quality can be a bit inconsistent, but the phoneme data is a bit better.
@clv0 @pvagner @tardis And a small note: I am also still learning, and I will have knowledge to pass on very soon, but I will need to systematize it. It also concerns our RHVoice. I am learning the labelling.xml thing, the low-level stuff that flags the properties of the trained voice, not just the foma code. I am happy that I can work with other people on maintaining and developing new languages.
@pvagner I could find recordings, but I would not record off of movies. The thing is, people have dialects, so the best you could do with Bulgarian is read texts yourself and record them; otherwise it gets really weird. If I only knew how the neural engines are trained, I would provide you the recordings. Sadly, they don't give us that kind of info.