In a move to bridge communication gaps across diverse languages, Cohere has unveiled a pair of open-weight models under its Aya project. The new Aya Expanse models, at 8 billion and 35 billion parameters, have debuted on Hugging Face, extending AI's multilingual capabilities across 23 languages. According to the company's blog post, the 8B model makes those advances accessible to researchers worldwide, while the 35B variant pushes the state of the art in multilingual language processing.
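Because the weights are openly available on Hugging Face, the models can be tried with the standard transformers workflow. The sketch below is a minimal example, assuming the model ID CohereForAI/aya-expanse-8b as listed on the hub at launch and enough GPU memory for the 8B variant:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID as published on Hugging Face at launch; verify on the hub.
model_id = "CohereForAI/aya-expanse-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model ships with a chat template; prompts can be in any of the 23 languages.
messages = [{"role": "user", "content": "¿Cuál es la capital de Perú?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```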
The Aya initiative, launched last year by Cohere for AI, aims to democratize access to foundation models beyond English. It followed February's release of Aya 101, a 13-billion-parameter model supporting 101 languages, with the Expanse models now extending that work. Notably, the project also produced the Aya dataset to support training efforts for underrepresented languages, reinforcing its inclusive ethos.
Aya Expanse builds on the principles that shaped Aya 101, yet it represents a significant evolution in AI language processing. Cohere attributes the gains to a sustained research agenda dedicated to narrowing the language divide, rethinking standard machine-learning approaches with core innovations in data arbitrage, safety training, and model merging.
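Of those three techniques, model merging is the most self-contained to illustrate. The sketch below shows only the simplest flavor, a weighted linear average of checkpoints that share an architecture ("model soup" style); Cohere's actual merging recipe is more sophisticated and not detailed in the post.

```python
import torch

def linear_merge(state_dicts, weights):
    """Weighted linear average of checkpoints with identical architectures.

    The simplest form of model merging; assumes every state_dict has the
    same keys and tensor shapes.
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "merge weights should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Example: average two fine-tuned checkpoints, weighting the first more heavily.
# merged = linear_merge([torch.load("ckpt_a.pt"), torch.load("ckpt_b.pt")], [0.7, 0.3])
```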
Aya’s Stellar Performance
The new Aya Expanse models have consistently outperformed comparable models from tech giants like Google, Mistral, and Meta. In multilingual benchmark tests, the 35B model outclassed Gemma 2 27B and the much larger Llama 3.1 70B, while the 8B model beat the 8B and 9B offerings from the same companies.
To achieve these results, Cohere employed a data-sampling technique it calls data arbitrage, designed to avoid the nonsensical outputs that can arise when models are trained on synthetic data. The common practice of training on data generated by a single "teacher" model falters in particular for low-resource languages, where suitable teacher models are hard to come by.
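At a high level, data arbitrage can be read as routing each prompt to whichever model in a pool produces the best completion, rather than distilling from one fixed teacher. The sketch below is an illustrative reading of that idea, not Cohere's implementation; the Teacher abstraction and scoring function are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Teacher:
    name: str
    generate: Callable[[str], str]  # prompt -> completion

def arbitrage_sample(prompt: str, teachers: List[Teacher],
                     score: Callable[[str, str], float]) -> Dict[str, str]:
    """Generate one completion per teacher and keep the highest-scoring one."""
    candidates = [(t.name, t.generate(prompt)) for t in teachers]
    best_name, best_text = max(candidates, key=lambda c: score(prompt, c[1]))
    return {"prompt": prompt, "completion": best_text, "teacher": best_name}

def build_dataset(prompts: List[str], teachers: List[Teacher],
                  score: Callable[[str, str], float]) -> List[Dict[str, str]]:
    """Assemble a synthetic training set by arbitraging across the pool."""
    return [arbitrage_sample(p, teachers, score) for p in prompts]
```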
Moreover, Cohere prioritized training for "global preferences," accounting for the cultural and linguistic nuances that vary from region to region; in its own words, this step was the "final sparkle" in refining the models. Existing safety protocols, often built around Western-centric datasets, translate poorly to multilingual applications, and Cohere describes its work as one of the first efforts to extend preference training to a massively multilingual setting.
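Cohere does not publish its preference-training code in the post, but a common formulation of preference training is Direct Preference Optimization (DPO), which optimizes a policy against chosen/rejected response pairs relative to a frozen reference model. A minimal sketch of that loss, offered only as a representative technique rather than Cohere's method:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each tensor holds per-example sequence log-probabilities; the loss widens
    the chosen-vs-rejected margin of the policy relative to the reference.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```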
Diversification in Language Models
At the core of the Aya initiative is a commitment to making large language models (LLMs) accessible in languages beyond English. Although many models support multiple languages, training data overwhelmingly favors English, a byproduct of its dominance in global governance, finance, and online discourse. That imbalance makes it difficult for model developers to curate comparable datasets for less widely spoken languages.
Additionally, the nuances of translation complicate performance evaluation and can skew benchmarks. In response, several developers, including OpenAI, have built multilingual evaluation datasets, such as OpenAI's recently released Multilingual Massive Multitask Language Understanding (MMMLU) dataset, designed to measure model performance across many languages.
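The MMMLU test set is published on Hugging Face, so inspecting it takes a few lines with the datasets library. A sketch, assuming the openai/MMMLU dataset ID; the config name and column names below are read off the dataset card and should be verified there:

```python
from datasets import load_dataset

# Subsets are keyed by locale; "FR_FR" (French) is one example config.
mmmlu = load_dataset("openai/MMMLU", "FR_FR", split="test")

example = mmmlu[0]
print(example["Question"])
for choice in ("A", "B", "C", "D"):
    print(f"  {choice}. {example[choice]}")
print("Answer:", example["Answer"])
```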
Cohere has been active on other fronts in recent weeks as well, adding image search capabilities to its enterprise embedding product, Embed 3, and rolling out fine-tuning support for its Command R 08-2024 model.
With these strides, Cohere is not merely sculpting the future of AI language processing; it is igniting a cultural and linguistic renaissance, one model at a time.