SogetiLabs Posted

Speech isn’t free, but it can cost less

In our current peri-COVID world, we all now have far more experience than we could ever have imagined in remote working.

Our homes are now our offices; dress codes have become more relaxed; we can work somewhat more flexible hours to accommodate our personal lives.

This has all come at a cost, of course. The biggest, in my opinion, is the need for higher bandwidth and more reliable Internet connections to our homes. In many cases, Internet Service Providers (ISPs) have been hard-pressed to provide new pipes, and “last mile” service installations have lagged.

The Internet core network has been similarly stressed: analyses comparing pre- and peri-COVID data in several cities around the world show that backbone data usage has risen by as much as 40% year over year.

Much of this “need for speed” has been driven by widespread use of teleconferencing software. Zoom, Microsoft Teams, Skype, Chime and others are in constant use around the world. Even with clever bandwidth-saving measures, the massively increased use of teleconferencing has created a demand for bandwidth that will probably remain with us post-COVID.

One of the contributors to the need for higher bandwidth in teleconferencing is the requirement to transmit timely and clear representations of speech in a digital format. Generally, audio is highly resistant to most compression technologies–it’s too full of unpredictable data patterns and, with noise added in, becomes even more of a problem.
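As a rough illustration of why noisy audio resists general-purpose compression, the sketch below packs one second of a synthetic tone as 16-bit PCM and runs it through zlib, once clean and once with added noise. The pure tone is only a stand-in for "predictable" audio, not real speech, and the 8000 Hz sample rate, 440 Hz pitch and noise level are all assumptions chosen for the example.

```python
import math
import random
import zlib


def pcm16(samples):
    """Pack floats in [-1, 1] as little-endian signed 16-bit PCM bytes."""
    out = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        out += round(clipped * 32767).to_bytes(2, "little", signed=True)
    return bytes(out)


def compression_ratio(data):
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)


SAMPLE_RATE = 8000       # Hz; assumed narrowband rate for the example
N = SAMPLE_RATE          # one second of samples
rng = random.Random(0)   # fixed seed so repeated runs give the same result

# A pure 440 Hz tone: highly repetitive, so a lossless compressor can exploit it.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(N)]
# The same tone with Gaussian noise added: the low-order bits become unpredictable.
noisy = [s + rng.gauss(0.0, 0.05) for s in tone]

clean_ratio = compression_ratio(pcm16(tone))
noisy_ratio = compression_ratio(pcm16(noisy))
print(f"clean tone: {clean_ratio:.2f}, noisy tone: {noisy_ratio:.2f}")
```

The noisy signal should compress far less well than the clean one, which is one reason speech codecs rely on models of the signal rather than on generic byte-level compression.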

A number of coder/decoder algorithms have been invented for the problem of transforming speech, in particular, to a digital form. Some are very clever, making use of models of speech generation to build compression models that are reasonably efficient in both time and bandwidth. The models are made much more complex by the need to model a wide range of languages–many of which have substantial differences in their phonemes. Add in accents, speaking rate, and other variables and the models become extremely complex.

With the long history of language coder/decoder research, it would be easy to believe that there would be nothing new under the sun.

And that would be wrong.

Google has announced a new speech coding algorithm that appears to use much less bandwidth than existing algorithms, while preserving speech clarity and “normalness” better.

The new algorithm, named “Lyra”, is based on research into a new class of models for speech coding: generative models.

These shortcomings have led to the development of a new generation of high-quality audio generative models that have revolutionized the field by being able to not only differentiate between signals, but also generate completely new ones.

One of the major issues with using these generative models is their computational complexity. Google has offered a solution to that problem and the solution appears to offer better performance, at lower bandwidth, and with better apparent normalness to the sound quality.

Lyra is currently designed to operate at 3kbps and listening tests show that Lyra outperforms any other codec at that bitrate and is compared favorably to Opus at 8kbps, thus achieving more than a 60% reduction in bandwidth. Lyra can be used wherever the bandwidth conditions are insufficient for higher-bitrates and existing low-bitrate codecs do not provide adequate quality.
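The "more than a 60% reduction" figure follows directly from the two bitrates quoted above. A minimal sketch of the arithmetic, using a hypothetical helper name and taking the 3 kbps and 8 kbps figures from the text:

```python
def bandwidth_reduction(new_kbps, baseline_kbps):
    """Fractional bitrate saving of new_kbps relative to baseline_kbps."""
    if baseline_kbps <= 0:
        raise ValueError("baseline bitrate must be positive")
    return 1.0 - new_kbps / baseline_kbps


LYRA_KBPS = 3.0   # Lyra's quoted operating bitrate
OPUS_KBPS = 8.0   # the Opus bitrate it is compared against

saving = bandwidth_reduction(LYRA_KBPS, OPUS_KBPS)
print(f"{saving:.1%}")  # prints 62.5%
```

This counts only the encoded speech payload; packet headers and any video stream are outside the comparison.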

The Google webpage announcing this news has examples of their algorithm in action compared to existing, widely used algorithms. The results are quite impressive.

What impact will this have on teleconferencing? Google predicts that it will make teleconferencing possible over lower-bandwidth connections, and provide an algorithm that can be incorporated into existing and new applications.

Google plans to continue work in this area, most importantly to provide implementations that can be accelerated through GPUs and TPUs.

Be sure to listen for more exciting developments in speech coding, no matter what algorithm you use….

SogetiLabs Posted

Apple’s New iPod? A New AI Weakness Revealed

Artificial Intelligence–AI–has come far since its first incarnation in 1956 as a theorem-proving program.

Most recently OpenAI, a machine learning research organization, announced the availability of CLIP, a general-purpose vision system based on neural networks. CLIP outperforms many existing vision systems on many of the most difficult test datasets.

[These datasets] stress test the model’s robustness to recognizing not just simple distortions or changes in lighting or pose, but also to complete abstraction and reconstruction: sketches, cartoons, and even statues of the objects.

It’s been known for several years from work by brain researchers that there exist “multimodal neurons” in the human brain, capable of responding not just to a single stimulus (e.g., vision) but to a variety of sensory inputs (e.g., vision and sound) in an integrated manner. These multimodal neurons permit the human brain to categorize objects in the real world.

The first example found of these multimodal neurons was the “Halle Berry neuron”, discovered by a team of researchers in 2005. It responds to pictures of the actress–including those that are somewhat distorted, such as caricatures–and even to typed letter sequences of her name.

[P]ictures of Halle Berry activated a neuron in the right anterior hippocampus, as did a caricature of the actress, images of her in the lead role of the film Catwoman, and a letter sequence spelling her name.

Many more such neurons have been found since this seminal discovery.

The existence of multimodal neurons in artificial neural networks has been suspected for a while. Now, within the CLIP system, the existence of multimodal neurons has been demonstrated.

One such neuron, for example, is a “Spider-Man” neuron (bearing a remarkable resemblance to the “Halle Berry” neuron) that responds to an image of a spider, an image of the text “spider,” and the comic book character “Spider-Man” either in costume or illustrated.

This evidence for the same structures in both the human brain and neural networks provides a powerful tool for better understanding the functioning of both, and for better developing and training AI systems that use neural networks.

The degree of abstraction found in the CLIP networks, while a powerful investigative tool, also exposes one of its weaknesses.

CLIP’s multimodal neurons generalize across the literal and the iconic, which may be a double-edged sword.

As a result of the multimodal sensory input nature of CLIP, it’s possible to fool the system by providing contradictory inputs.

For instance, providing the system a picture of a standard poodle results in correct identification of the object in a substantial percentage of cases. However, there appears to exist in CLIP a “finance neuron” that responds to pictures of piggy banks and “$” text characters. Forcing this neuron to fire by placing “$” characters over the image of the poodle causes CLIP to identify the dog as a piggy bank with an even higher percentage of confidence.

This discovery leads to the understanding that a new attack vector exists in CLIP, and presumably other similar neural networks. It’s been called the “typographic attack”.

This appears to be more than an academic observation–the attack is simple enough to be done without special tools, and thus may appear easily “in the wild”.

As an example of this, the CLIP researchers showed the network a picture of an apple. CLIP easily identified the apple correctly, even going so far as to identify the type of the apple–a Granny Smith–with high probability.

Adding a handwritten note to the apple with the word “iPod” on it caused CLIP to identify the item as an iPod with an even higher probability.
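The mechanism behind the typographic attack can be sketched with a toy model. This is not CLIP and does not use its weights or API: the two concept axes, the evidence vectors and the TEXT_WEIGHT value are all invented for illustration. The only idea it borrows from the description above is that visual evidence and written text feed the same concept, so strong enough text can outvote what the picture shows.

```python
import math

# Hypothetical toy model, NOT CLIP: every number below is invented for illustration.
LABELS = {
    "Granny Smith": (1.0, 0.0),  # concept axes: (apple-ness, iPod-ness)
    "iPod": (0.0, 1.0),
}
TEXT_WEIGHT = 4.0  # assumed strength with which written text drives a concept


def classify(visual, text=(0.0, 0.0)):
    """Return (best_label, probability) from summed visual and text evidence."""
    combined = [v + TEXT_WEIGHT * t for v, t in zip(visual, text)]
    logits = {
        name: sum(c * w for c, w in zip(combined, vec))
        for name, vec in LABELS.items()
    }
    top = max(logits.values())  # subtract the max for a numerically stable softmax
    exps = {name: math.exp(z - top) for name, z in logits.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total


apple_photo = (2.0, 0.1)  # visual evidence: clearly an apple
ipod_note = (0.0, 1.0)    # text evidence: a note reading "iPod"

clean_label, clean_p = classify(apple_photo)
attacked_label, attacked_p = classify(apple_photo, ipod_note)
print(clean_label, round(clean_p, 2))
print(attacked_label, round(attacked_p, 2))
```

With these made-up weights the note flips the label from the apple to the iPod. In a real system the balance between the two channels is learned rather than set by hand, which is why such attacks have to be found by probing the model.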

The more serious issues here are easy to see: with the increased use of vision systems in the public sphere it would be very easy to fool such a system into making a biased categorization.

There’s certainly humor in being able to fool an AI vision system so easily, but the real lesson here is two-fold.

  • The identification of multimodal neurons in AI systems can be a powerful tool for understanding and improving their behavior.
  • With this power comes the need to understand and prevent the misuse of this power in ways that can seriously undermine the system’s accuracy.

We believe that these tools of interpretability may aid practitioners [in] the ability to preempt potential problems, by discovering some of these associations and ambiguities ahead of time.

With great power comes great responsibility, as Spider-Man has said.