Update: Thanks to @pitermach for a great demo showing that it's actually Mist World upsampling to 48 kHz here, not NVDA downsampling to 16 kHz!
I stitched together an audio file showing how badly it behaves by ignoring the setting of -1 as the output. Instead it tries to be too smart: it enumerates the list, works out which device you have set as your sound mapper output, and explicitly calls that sound device when passing to the TTS outputs.
I updated this to add a little more at the end, showing how Mist World handles audio output switching, which I had thought was proper but now know is not.
Good night, Mastodon. This really ruined my weekend at first, until that amazing demo in my mentions by @pitermach clarified things. :)
Update: People are asking, "how can I tell?" Listen for the sharpness of S's and other consonants. If you have the ear you'll notice.

[Audio attachment, 01:20]

@Tamasg What you're hearing isn't actually downsampling to 16 kHz, it's aliasing artifacts introduced by whatever resampling algorithm Mist World's audio library is using. Vocalizer actually runs at a native 22 kHz as far as I know. I recorded a quick demo of what it sounds like when you bring a 22 kHz file to 48 kHz with a low-quality resampling algorithm versus a file that's actually at 16 kHz.

[Audio attachment, 04:25]
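
For anyone who wants to see the effect described above in numbers, here is a minimal sketch. It is not Mist World's or any screen reader's actual resampler. It assumes the crudest possible method (repeating each sample, with no filtering) and a clean 2x step from 22050 Hz to 44100 Hz rather than the 48 kHz in the demo, just to keep the arithmetic exact. The function names and the 3000 Hz test tone are made up for illustration. The unwanted copy it measures is strictly called a spectral image, which is what people in this thread are loosely calling aliasing.

```python
import math

def tone(freq_hz, rate_hz, n):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def upsample_repeat(samples, factor):
    """Low-quality upsampling: repeat each sample, with no filtering at all."""
    return [s for s in samples for _ in range(factor)]

def magnitude_at(samples, freq_hz, rate_hz):
    """Amplitude of one frequency component, measured with a single-bin DFT."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / rate_hz)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / rate_hz)
             for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n

SRC_RATE = 22050              # native rate of the hypothetical voice
FACTOR = 2                    # assumed integer ratio, for simplicity
DST_RATE = SRC_RATE * FACTOR  # 44100
TONE_HZ = 3000                # arbitrary test tone

src = tone(TONE_HZ, SRC_RATE, 2205)   # 0.1 seconds of audio
up = upsample_repeat(src, FACTOR)

# The unfiltered upsample leaves a mirror image of the tone above the
# old Nyquist limit of 11025 Hz, at SRC_RATE - TONE_HZ.
image_hz = SRC_RATE - TONE_HZ         # 19050
original_level = magnitude_at(up, TONE_HZ, DST_RATE)
image_level = magnitude_at(up, image_hz, DST_RATE)

print(f"{TONE_HZ} Hz tone level:  {original_level:.3f}")
print(f"{image_hz} Hz image level: {image_level:.3f}")
```

The source only contained a 3000 Hz tone, so any energy at 19050 Hz was added by the upsampling step itself. A better resampler low-pass filters at the old Nyquist limit to suppress that image.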

@Tamasg @pitermach and... if you want this aliasing, use ZDSR. I keep trying to argue with people that this aliasing is bad. No, they know better, because they have more experience, and stuff like that... If I find I can't listen to something for a long time... I stop using it.

@asael @pitermach yeah, I do think that the new R8Brain algo mentioned there does a much better job of still giving the voice that higher, crisper quality without so much sharpness on the actual consonants, which to me feels like the best of both worlds. You get a lot higher quality to the ear but also don't get those weird artifacts the older ways of upsampling to 44 or 48 introduce.
I don't think people who don't like it are wrong. For some minds especially, sharper noises like that in the audio can really stand out and become annoying or a headache.

@Tamasg @pitermach Because of this harsh aliasing, I start to lose concentration while listening to speech. I don't know, but RHVoice uses its original sample rates in NVDA or when using SAPI. Another screen reader that doesn't handle sample rates well is JFW.

@asael @pitermach I wonder if we'll ever get a true TTS that's not 22050 but a true 44.1 kHz sampling rate. Now I think I'm on the hunt for that. My guess is the newer AI voices might be the first of their kind this way if so. It's interesting because 22050 in actuality only gets you audio up to 11025 Hz. The Nyquist frequency (or Nyquist limit) refers to the highest frequency that can be accurately represented by a given sample rate. It is half of the sample rate. The reason for this is how digital sampling works: you need at least two samples per cycle of the waveform to fully capture its shape. This is known as the Nyquist-Shannon sampling theorem. So really, any TTS claiming to be 22050 Hz can only reproduce frequencies up to 11025 Hz, and any TTS claiming to be 11025 Hz tops out at about 5.5 kHz, youch
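
The halving rule above is easy to check for the sample rates that come up in this thread. This is just a small sketch; the function name `nyquist_hz` is made up for illustration.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent: half the rate."""
    return sample_rate_hz / 2

# Sample rates mentioned in this thread.
for rate in (11025, 16000, 22050, 24000, 44100, 48000):
    print(f"{rate} Hz sample rate -> audio up to {nyquist_hz(rate):g} Hz")
```

Human hearing reaches roughly 20 kHz, which is why 44.1 kHz and 48 kHz are the usual choices for full-range audio.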

@Tamasg @asael We had one, for a very brief moment: Innoetics. But they got bought out by Samsung. I think there's also a Bulgarian TTS that runs at 44 kHz. I saw someone use it once but I'm not sure what it's called.

@pitermach @Tamasg @asael I'm really sad that Innoetics got killed. :(

@masonasons @pitermach @Tamasg yes, and unfortunately I was never able to try their voices, though I wanted to, for Greek.

@masonasons @pitermach @asael wow did they ever do any demos with actual Innoetics voices to show that? Never heard of that company, but it would make sense why or how Samsung got such high quality voices and a major improvement from the sing-song Samsung TTS they had to the way better voices used now.

@pitermach @Tamasg I have the installer. It is a DNN synthesizer, but Bulgarians don't like it much.

@pitermach @Tamasg @asael I have (had?) the Innoetics John voice. It's probably still installed on my old laptop. It was kind of odd and weirdly concatenated, but definitely had highs that nothing else did.

@BorrisInABox @pitermach @Tamasg but they used internet activation, so there's no way to use it...

@BorrisInABox @pitermach @Tamasg I have improved the SLT voice for RHVoice myself.

@asael @BorrisInABox @pitermach @Tamasg are your improvements to that voice publicly available?

@valiant8086 @BorrisInABox @pitermach @Tamasg yes, on rhvoice.org and in the Android app. It was trained using pyworld, and with better recordings.

@asael I've thought of making a voice of myself in RHVoice but I'm unsure about the intonation. I often talk with a higher inflection and want that to be reflected, and I haven't quite figured out the ridiculously complicated pipeline. Plus I've only got WSL to train it so it will be sloooooow! @valiant8086 @BorrisInABox @pitermach @Tamasg Oh, has that replaced the one they'll ship for the NVDA add-on? I only see one version of SLT on rhvoice.org

@x0 @asael @BorrisInABox @pitermach @Tamasg I kinda would like to do something like that too

@valiant8086 @x0 @BorrisInABox @pitermach @Tamasg You can do it. If you don't have good resources, I can help with training.

@BorrisInABox @pitermach @asael OMG. This was a thing? What year did that all get created? Feel like I've missed, like, a major milestone in speech history. LOL. It may be that actual 44 kHz data is too large in size, so that's why even the human voices by companies such as Nuance stuck with 22050: it's tolerable enough not to sound like AM radio, but not high enough quality to sound even like an FM signal could.

@BorrisInABox @Tamasg @pitermach but, when it comes to Innoetics, their Greek voices are a thing

@asael @BorrisInABox @Tamasg Believe it or not I still have it here, yay for a Windows install on its 9th year now. Fun fact, that John voice was created from the voice of Jon St. John, aka the guy who voiced Duke Nukem, sadly not doing the Duke voice in this case. But that single reason is why we used it for a very long time either for reading chats or games while streaming with @talon. Here's a quick demo. Not a super responsive voice, but yeah there's a lot of highs.

[Audio attachment, 01:14]

@pitermach @asael @BorrisInABox @talon wow yeah, that for sure has some lag, although I have heard worse. Sounds more like what you'd get on a Piper voice. But yeah, the quality in this is for sure what I like about the upsampled Vocalizer, and there's none of that sharpness crap at the ends of certain letters people can notice, wow.

@Tamasg @pitermach @BorrisInABox @talon btw, there is something new coming in NVDA. The Chinese dev who is developing the natural SAPI adapter has started making improvements to the speech protocols. He is going to add silence trimming for all TTS, making them more responsive.
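
As a rough illustration of why silence trimming helps responsiveness, here is a minimal sketch. It is not the NVDA or ZDSR implementation. The function name, the 0.01 amplitude threshold, and the 100 ms of leading silence are all assumptions made up for the example.

```python
def trim_leading_silence(samples, threshold=0.01):
    """Drop samples from the start until one exceeds the threshold."""
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return samples[i:]
    return []  # the whole buffer was silent

RATE_HZ = 22050
# Hypothetical synthesizer output: 100 ms of silence, then speech.
raw = [0.0] * 2205 + [0.5, -0.5, 0.25]

trimmed = trim_leading_silence(raw)
removed = len(raw) - len(trimmed)
saved_ms = 1000 * removed / RATE_HZ

print(f"Removed {removed} silent samples, speech starts {saved_ms:.0f} ms sooner")
```

Any silence a synthesizer puts in front of each utterance is heard as delay, so dropping it makes speech start sooner after a key press.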

@asael @Tamasg @BorrisInABox @talon Oh neat. That's definitely something that's giving ZDSR a decent performance edge over NVDA. With some SAPI voices it makes a very noticeable improvement.

@pitermach @Tamasg @BorrisInABox @talon and... silence removal will be there in NVDA, too. Planned for 2025.1

@pitermach @BorrisInABox @Tamasg @talon sibilants of this voice are weird... and they tried to create a giant.

@Tamasg @pitermach Okay... RHVoice has a 24000 Hz sampling rate. Regarding the 48 kHz synths... it will be interesting to see how responsive they are going to be.