Sound and Speech (Shabda)

Shabda: Sound, speech, and language representation

Sound and speech are major factors in language and intelligence. The human race mastered language long before the invention of writing, and children learn to speak long before they can read. Thus, the brain does not use “text” as the primary means of representing language and storing knowledge.

The implication of this is that when attempting to build a “learning and thinking machine” (that will eventually have human-like intelligence), it is important to focus on how language should be represented, and not assume that “text” is adequate. Text is typically represented in computers as bytes of encoded characters, which is appropriate given the current architecture and usage of computers, but won’t be adequate for a true representation and emulation of language.
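To make the contrast concrete, here is a minimal Python sketch (an illustration only, not part of the Susiddha design) that stores a word as encoded characters and stores a half-second tone as raw audio samples. The function names `text_as_bytes` and `tone_as_pcm`, the 440 Hz tone, and the 8000 Hz sample rate are all assumptions chosen for the example.

```python
import math
import struct

def text_as_bytes(word: str) -> bytes:
    """Encode a word as UTF-8 bytes, the usual textual representation."""
    return word.encode("utf-8")

def tone_as_pcm(freq_hz: float, duration_s: float, sample_rate: int = 8000) -> bytes:
    """Render a pure sine tone as 16-bit little-endian PCM samples (hypothetical helper)."""
    n = int(duration_s * sample_rate)
    samples = [
        int(32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]
    return struct.pack("<%dh" % n, *samples)

text = text_as_bytes("shabda")
audio = tone_as_pcm(440.0, 0.5)
print(len(text), "bytes of text")    # 6 bytes of text
print(len(audio), "bytes of audio")  # 8000 bytes of audio
```

Even this bare tone needs over a thousand times more data than the written word, and it still carries none of the timbre, intonation, or rhythm of a spoken word. That gap is what a non-textual representation would have to capture.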

“Sound” and “speech” are primary translations of the Sanskrit word “shabda”, and shabda is crucial to Vedic philosophy. Also, the Vedic literature begins with works collectively known as “shruti”, which means “heard”, because these works represent an oral tradition that precedes the invention of writing. So for a computer system to do justice to the Vedic literature, it’s necessary that such a system be able to process and store sound and speech as well as the human brain does.

More will be said later about shruti and how it might be processed. Here we focus more generally on the processing and storage of speech, language, and sound in a computer. The Susiddha AI project will involve research into non-textual means of representing language and knowledge. Neuroscience is just beginning to provide insight into how language is stored in the brain, and that growing insight will serve as a guide to representing it non-textually in a computer.

Other existing AGI projects largely ignore sound and speech and use only “text” for communication and language representation, which is probably inadequate for the goal of AGI. The human brain is our “existence proof” of general intelligence, and it is quite possible that sound and speech (and of course language) have a lot to do with the emergence of human-like intelligence (and consciousness).

Consider the fact that humans hear themselves think (though it’s hard to say in the case of someone born deaf). We also hear ourselves think at various levels of subtlety, i.e., the inner thoughts/speech can be very vague or very clear. (The chapter on Four levels of awareness has more to say about such levels.) Because we hear ourselves think and speak, there must be self-referential feedback loops[1] going on in the brain, and this feedback plays an important role in the processes of thinking, learning, and consciousness.

Below are some thoughts about how sound and speech processing might be implemented in the Susiddha project with the help of several fields of research.

The field of neuroscience is researching how sound and speech are stored in the brain. For instance, some studies[2] have already shown that it is possible to recognize words and thoughts from brain activity. Neuroscience could resolve the debate as to whether there are “phonemes” in the brain (of pre-literate people) or a much richer representation of words.[3] Such research will point the way towards better sound and speech processing in a computer.

The field of Computational Audition (a.k.a. “computer audition”) attempts to replicate the audio processing that goes on in the human ear and brain, so that a computer could hear in the same way that a human does.[4] This field is analogous to Computer Vision, and some of the algorithms may be transferable. For instance, computer processing of videos attempts to recognize activities, and predict what will happen next based on seeing the previous seconds of the video. Computational audition will enable a computer to do similar sorts of things with audio.[5] It will also provide the necessary research (along with the neuroscience findings) to appropriately represent sound and speech in the computer (in a realistic, non-textual manner).
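As a rough illustration of the kind of processing involved, the sketch below splits a signal into short overlapping frames and measures the frequency content of each one, which is a very crude analogue of the frequency analysis performed by the inner ear. It is not taken from any computational audition system: the function names, the naive Fourier transform, the frame sizes, and the 1000 Hz test tone are all assumptions made for the example.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_size=256, hop=128):
    """Slice the signal into overlapping frames and analyse each one."""
    return [
        magnitude_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]

def dominant_frequency(spectrum, sample_rate, frame_size):
    """Frequency in Hz of the strongest bin in one spectrum."""
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / frame_size

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(1024)]
frames = spectrogram(tone)
print(len(frames), "frames")                                 # 7 frames
print(dominant_frequency(frames[0], sample_rate, 256), "Hz")  # 1000.0 Hz
```

A time-frequency picture like this is only a starting point. Research systems would typically use much faster transforms and filters modelled more closely on human hearing.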

Deep learning is another field which will contribute greatly to the processing of sound and speech. There are already programs[6] that can learn a piece of music from a recording and store it in neural networks. Then, when a few seconds of the piece are played to the system, it can predict the music that will follow. There are also speech recognition systems that work without using the concept of “phonemes”.[7] Deep learning does not attempt to mimic how the human brain stores sound and speech (which is still largely unknown), but rather learns a (black box) function that maps the inputs (such as audio patterns) to the desired outputs.
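The idea of learning a function that predicts what comes next can be shown at toy scale. The sketch below is not deep learning and is not the method of any system cited here. It fits just two weights by gradient descent so that the next audio sample is predicted from the previous two, then continues a tone from a two-sample seed. All names and settings (`make_tone`, `fit_predictor`, `continue_signal`, the learning rate, the number of epochs) are assumptions made for the example.

```python
import math

def make_tone(n, period=8):
    """A pure tone with the given period in samples."""
    return [math.sin(2 * math.pi * i / period) for i in range(n)]

def fit_predictor(signal, epochs=300, lr=0.5):
    """Learn weights (a, b) so that a*x[t-1] + b*x[t-2] approximates x[t]."""
    a, b = 0.0, 0.0
    pairs = [(signal[t - 1], signal[t - 2], signal[t]) for t in range(2, len(signal))]
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for prev1, prev2, target in pairs:
            err = a * prev1 + b * prev2 - target
            grad_a += 2 * err * prev1  # gradient of the squared error
            grad_b += 2 * err * prev2
        a -= lr * grad_a / len(pairs)
        b -= lr * grad_b / len(pairs)
    return a, b

def continue_signal(seed, weights, steps):
    """Extend a seed by repeatedly predicting the next sample."""
    a, b = weights
    out = list(seed)
    for _ in range(steps):
        out.append(a * out[-1] + b * out[-2])
    return out[len(seed):]

tone = make_tone(64)
weights = fit_predictor(tone)
predicted = continue_signal(tone[:2], weights, 6)
worst = max(abs(p - t) for p, t in zip(predicted, tone[2:8]))
print("largest prediction error:", worst)
```

A deep network replaces the two weights with millions and the straight-line rule with many nonlinear layers, but the training loop has the same shape: predict, measure the error, adjust the weights.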

Research in these three fields has some way to go before a computer can store and process sound and speech as well as humans do, but rapid progress is being made. Thus, it is likely that the Susiddha system will be able to process sound and speech well enough to properly comprehend the Vedic literature, which depends on hearing the sounds of the Sanskrit language rather than just processing transcribed text.

A note on “testimony”:

One of the most frequent (and misleading) translations of the word “shabda” is “testimony”. This however is not the primary meaning, and the Susiddha project does not use it.

Verbal testimony is fraught with error (as are most translations), but the actual sounds of Sanskrit are not. As explained in later chapters on Shruti and the Rig Veda, it might not be possible to adequately translate the Vedic Sanskrit language, which precedes the invention of writing. Thus, the Susiddha project will explore neural network methods (such as deep learning) to get closer to the true meanings of the Vedic sounds.

That said, “testimony” is important in the context of education. Every child needs the teaching of parents, teachers, mentors, books, videos, etc., because there is not enough time to discover for oneself all the knowledge that has been discovered in the history (and pre-history) of humanity. This also applies to a young AGI system. Thus, all teachers and all knowledge sources that educate the young AGI will have to be validated to ensure that it learns only what is true and beneficial for humanity. A future chapter will be written on how the Susiddha project plans to educate the AGI/SSI system that is created.


Notes and References

  1. A strange sense of self: Review of Douglas Hofstadter’s “I Am a Strange Loop”, Susan Blackmore, Nature, May 3, 2007, pages 29-30.
  2. Tracking neural coding of perceptual and semantic features of concrete nouns, Gustavo Sudre, et al., NeuroImage journal (Elsevier), May 4, 2012.
  3. Forget about phonemes: Language processing with rich memory, Robert Port (Dept. of Linguistics and Cognitive Science, Indiana U.), May 3, 2010.
  4. Laboratory for Computational Audition, Josh McDermott, MIT.
  5. WaveNet: A Generative Model for Raw Audio, van den Oord, et al., DeepMind, 2016.
  6. An Associative Memorization Architecture of Extracted Musical Features From Audio Signals by Deep Learning Architecture, Tadaaki Niwa, et al., Procedia Computer Science, November 3, 2014.
  7. Deep Speech: Scaling up end-to-end speech recognition, Awni Hannun, et al., Baidu Research, 2014.