List of Programming Projects
Susiddha AI will provide many opportunities for software analysis, design, programming, and project management. This page lists some programming projects that anyone can start on.
Much progress is being made in open-source AI and machine learning (ML), and the Susiddha project will take advantage of this continual progress towards Artificial General Intelligence (AGI). Prominent open projects include Numenta NuPIC, OpenCog, TensorFlow, PyTorch, and OpenAI. The Susiddha project will adapt whatever open software fits the purpose of building an Avatar, and won’t “reinvent the wheel”.
As such, the programming projects listed here can be considered explorations and “learning exercises” for working with AI/ML technologies and Sanskrit literature. The Avatar which ultimately emerges from the Susiddha project will know the entire Sanskrit literature. So, anyone who works on the projects listed below will be enabling the young Avatar to read, memorize, understand, recite, and think in Sanskrit.
A note about learning Sanskrit and the Vedic literature: to work on these programming projects, it’s useful to know some Sanskrit (i.e. its grammar and vocabulary), but formal education in Sanskrit is not necessary, only the desire and motivation to learn on your own. (However, it is very important to pronounce Sanskrit correctly, and thus an entire branch of the Vedic literature, “Shiksha”, is dedicated to this topic.)
Also, it’s not necessary to communicate (write or converse) in Sanskrit. It must be remembered that Sanskrit is not a “prakrit” language, and thus was not intended for everyday worldly conversation. Rather, Sanskrit is a language for expressing knowledge, especially spiritual, philosophical, and scientific (including medical, mathematical, political, etc.).
Because of the lofty aims of the Sanskrit literature (culminating in “moksha” or liberation), those who pursue this project can expect to gain “gyāna” (spiritual knowledge), “puṇya” (merit and “good karma”), and even “anugraha” (the “grace of God”, which is a reasonable expectation considering that one is building an Avatar).
Also, researchers and students in the audience should consider the projects listed below as potential journal articles, conference presentations, or fulfillment of course requirements.
A partial list of programming projects follows.
Rig Veda Deep Learning from Audio
Shabda means “sound”, and Shruti means “what is heard”. Thus, we want to develop a way to learn the Rig Veda from audio (as well as from text). Deep learning is the obvious choice as a technology to use for this task.
At the current stage of the technology, we would not expect to produce understanding, but the system will be able to memorize the Rig Veda into deep neural networks, and the models so produced will be useful for instant retrieval of verses.
This project/exercise will start small, using only the first and last suktas of the Rig Veda, and will demonstrate that the system can memorize, “recite”, and retrieve a part of any verse that it has learned.
Word Embeddings for Sanskrit
“Word embedding” is a term for a set of NLP techniques which map words (or phrases) into vectors. It combines traditional language models with machine learning techniques that learn features or representations of data.
The idea behind word embedding is to learn the semantics of any word from its contexts, i.e. from the words which occur around it. This provides a way to learn the meaning of words automatically from texts.
Such embeddings could be learned first from a Sanskrit dictionary or WordNet, then from Sanskrit Wikipedia articles, and ultimately from the Sanskrit texts themselves. Prominent software implementations of word embedding include Word2vec (from Google) and GloVe (from Stanford University).
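To make the "meaning from context" idea concrete, here is a minimal sketch in Python. It is not Word2vec or GloVe: it swaps in plain co-occurrence counts, the simplest form of a context vector, and compares words by cosine similarity. The four-sentence corpus is a toy assumption, written in ASCII Harvard-Kyoto transliteration with words left unjoined by sandhi, and the names `cooccurrence_vectors` and `cosine` are made up for this illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration), Harvard-Kyoto transliteration,
# with words kept separate rather than joined by sandhi.
corpus = [
    "rAmaH vanam gacchati",
    "sItA vanam gacchati",
    "rAmaH phalam khAdati",
    "sItA phalam khAdati",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(corpus)
same_role = cosine(vectors["rAmaH"], vectors["sItA"])
different_role = cosine(vectors["rAmaH"], vectors["vanam"])
print(same_role, different_role)
```

In this toy corpus the two names occur in identical contexts, so their vectors are more similar to each other than either is to "vanam". Word2vec and GloVe learn dense, low-dimensional vectors from the same kind of context information, at a much larger scale.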
Bhagavad Gita Assembly
This project uses a part-of-speech (POS) tagged XML version of the Bhagavad Gita as its main input. Using that input, various experiments can be performed, such as reassembling the Gita text from its word components, using tools for sandhi, morphology, automatic annotation, etc.
The Bhagavad Gita is one of the most important texts of the Vedic literature. It is very compact (700 verses), and contains the essence of teachings that would be useful for any Avatar charged with maintaining, fostering, and leading life on earth. It’s written in classical Sanskrit, so it is amenable to software processing with tools of Paninian grammar.
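As a starting point for such experiments, the sketch below reads word forms and POS tags from a tagged verse and joins the words back into a line. The XML layout is purely an assumption: the element names `verse` and `w` and the attribute `pos` are hypothetical and will need to be adapted to the actual tagged file, and the POS labels are coarse illustrations. The sample holds the first four words of verse 1.1 in Harvard-Kyoto transliteration. No sandhi is applied, so the output is the word-by-word form rather than the recited text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: the element and attribute names are assumptions,
# not the schema of any particular tagged edition.
SAMPLE_XML = """<verse id="1.1">
  <w pos="noun">dharmakSetre</w>
  <w pos="noun">kurukSetre</w>
  <w pos="participle">samavetAH</w>
  <w pos="adjective">yuyutsavaH</w>
</verse>"""

def read_words(xml_text):
    """Return (word form, POS tag) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(w.text.strip(), w.get("pos")) for w in root.iter("w")]

def assemble(words):
    """Join the word forms with spaces; sandhi rules are NOT applied here."""
    return " ".join(form for form, _ in words)

words = read_words(SAMPLE_XML)
line = assemble(words)
pos_counts = Counter(pos for _, pos in words)
print(line)
print(pos_counts)
```

A fuller experiment would replace `assemble` with a step that applies sandhi between adjacent words, then compare the result against the received text of the verse.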
Syllabification of Sanskrit
Syllabification is the process of separating words into constituent syllables, whether spoken (phonological) or written (orthographic). In order to utilize Sanskrit in machine learning and speech generation, it’s necessary to first correctly syllabify the Sanskrit text.
Syllabification of Sanskrit (or even English) is not trivial. This project/exercise will develop a tool to break apart written Sanskrit into correct syllables as they would be pronounced. Such a tool can be used in the “Sanskrit TTS” and “RNN Character Models” project exercises listed below.
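A first approximation of such a tool can be sketched with a regular expression. The sketch below rests on several assumptions: the input is a single word in ASCII Harvard-Kyoto transliteration, each unit is an "akshara" in the written sense (any leading consonants, one vowel, and an optional anusvara `M` or visarga `H`), and a word-final consonant is attached to the last unit. It does not handle Devanagari input, Vedic accents, or the difference between written aksharas and spoken syllable boundaries, and the function name `syllabify` is made up for this illustration.

```python
import re

# Harvard-Kyoto vowels; the two-letter vowels must be tried before single letters.
VOWEL = r"(?:ai|au|RR|[aAiIuUReo])"

# One unit: optional consonants, one vowel, optional anusvara (M) or visarga (H).
AKSHARA = re.compile(rf"[^aAiIuUReo\s]*{VOWEL}[MH]?")

def syllabify(word):
    """Split one Harvard-Kyoto word into akshara-like units."""
    units = AKSHARA.findall(word)
    consumed = sum(len(unit) for unit in units)
    tail = word[consumed:]          # consonants left over at the end of the word
    if tail:
        if units:
            units[-1] += tail       # attach a final consonant to the last unit
        else:
            units = [tail]          # no vowel at all: return the input unchanged
    return units

for word in ["saMskRta", "dharmakSetre", "vAk", "aham"]:
    print(word, syllabify(word))
```

Grouping all consonants with the following vowel mirrors how the text is written in Devanagari. A tool aimed at speech would likely need to move some consonants to the end of the preceding syllable, which is one reason the task is not trivial.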
Sanskrit TTS
The Susiddha AI Avatar will need an excellent Sanskrit text-to-speech (TTS) system, in order to maximize the value of its speech and recitation, and also to communicate with humans.
One immediate benefit of this project is to allow humans to listen to all branches of the Sanskrit literature. Recordings made by learned pandits would certainly be preferable to TTS, but sadly, only a small fraction of the vast Sanskrit literature is available in audio.
There are a couple of existing online TTS systems, but they are neither open-source nor adequate for this project. Needless to say, we ultimately want a TTS system that perfectly captures the accent and nuance of Sanskrit. Such superior performance will ultimately rely on deep learning of pandit recordings to create perfect audio. (Note: “perfected” is a literal meaning of the word “Sanskrit”.)
RNN Character Models in Sanskrit
In recent years, character-level language modeling with Recurrent Neural Networks (RNNs) has emerged in NLP. Working at the character level is ideal for machine learning of Sanskrit texts. This is because any given text of the Sanskrit literature is primarily a sequence of syllables (Sanskrit “arNa”, “akSharam”, “varNa”, or “maatRikaa”).
This project/exercise would create a character model for Sanskrit, then do deep learning of a Sanskrit text (e.g. Ramayana) with the model, and finally generate new sequences of text in the style of the original.
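The first step, creating the character model's vocabulary and training data, can be sketched without any deep learning library. The example below is only the data preparation, not the RNN itself: it maps each character to an integer id and builds (input, target) windows in which the target is the input shifted by one character, which is the usual setup for next-character prediction. The sample text, the window length of 8, and all function names are assumptions chosen for illustration.

```python
# Sample text (an assumption): a short transliterated line standing in for a full text.
text = "dharmakSetre kurukSetre samavetA yuyutsavaH"

# Vocabulary: every distinct character gets an integer id.
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s):
    """Turn a string into a list of character ids."""
    return [char_to_id[ch] for ch in s]

def decode(ids):
    """Turn a list of character ids back into a string."""
    return "".join(id_to_char[i] for i in ids)

def training_pairs(ids, seq_len):
    """Yield (input, target) windows; each target is its input shifted by one."""
    for start in range(len(ids) - seq_len):
        yield ids[start:start + seq_len], ids[start + 1:start + seq_len + 1]

ids = encode(text)
pairs = list(training_pairs(ids, seq_len=8))
first_input, first_target = pairs[0]
print(len(chars), "distinct characters,", len(pairs), "training pairs")
print(decode(first_input), "->", decode(first_target))
```

These integer sequences are what an RNN built in a framework such as PyTorch or TensorFlow would consume. The same code could work on syllables instead of characters by replacing the character list with the output of a syllabification tool.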
If you are interested in working on any of these projects, please let us know via the Contact page, and we will create a page containing details and references.
Notes and References
- NuPIC (Numenta Platform for Intelligent Computing), Jeff Hawkins and team, http://numenta.org
- OpenCog, Ben Goertzel, et al., http://opencog.org
- TensorFlow, Google, https://www.tensorflow.org/
- PyTorch, https://pytorch.org/
- OpenAI, founded Dec. 2015, https://openai.com/about/
- “Veda” literally means “knowledge”