Open Source is the Future of LLMs

Before Large Language Models (LLMs) came into vogue, normal-sized (<1B parameters) Language Models (LMs) were released as open source with permissive licensing terms. BERT, and the Transformer architecture in general, ushered in a revolution in the Natural Language Processing (NLP) space that drove significant innovation and democratized the power of ML for the masses. All of this was made possible because Google chose to release BERT with a commercially viable license. With LLMs demonstrating human-level performance on a number of challenging tasks, we’ve observed a concerning trend of large companies favoring closed-source development and restricted API access (as OpenAI has done with its recent models). At the same time, we’ve seen amazing progress from the OSS community and from startups that are leading the way with smaller, better-tuned models developed out in the open.

Here at VMware, we believe strongly in the power of open source. In the R&D AI Lab, we have helped, and will continue to help, drive the development of free and open LLMs. Our goal is to openly improve existing models and provide access to fine-tuning pipelines. Recently, we released a number of improved models along with the code used to fine-tune both encoder/decoder and decoder-only models.

Instruction Tuning Encoder/Decoder Models

In the quest to recreate the instruction following capabilities of ChatGPT, the open-source community primarily adopted the LLaMA model released by Meta. However, the LLaMA model’s licensing prevents its usage in commercial settings.

To bridge this gap, we turned our attention to the Flan-UL2 model, which is fully open-source and based on the encoder-decoder architecture of T5. Flan-UL2 is a powerful LLM with 20 billion parameters, licensed for commercial usage, and has already been fine-tuned on various academic NLP tasks.
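As a minimal sketch of what working with this model looks like, the snippet below loads Flan-UL2 with the Hugging Face Transformers library; the 8-bit quantization and automatic device placement are illustrative choices for fitting a 20-billion-parameter model into limited GPU memory, not a prescription.

```python
# Sketch: loading Flan-UL2 with Hugging Face Transformers. The 8-bit loading
# and automatic device placement are illustrative choices for squeezing a
# 20B-parameter model into limited GPU memory (they require accelerate and
# bitsandbytes to be installed).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    device_map="auto",   # shard the model across available GPUs
    load_in_8bit=True,   # quantize weights to 8-bit to reduce memory
)

inputs = tokenizer("Summarize: open source accelerates LLM research.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```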

To enhance its instruction-following capability, we conducted further fine-tuning of the Flan-UL2 model on the Alpaca instructions dataset. Because our initial experimentation was limited to three Nvidia V100 32GB GPUs, we applied Low-Rank Adaptation (LoRA) to make fine-tuning feasible on that hardware. By combining DeepSpeed’s CPU offloading with LoRA, we trained the model for three epochs over a span of twelve days. As part of our commitment to open source, we have shared the code for fine-tuning an LLM with limited resources through a Medium article and our GitHub repository.
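For readers who want the general shape of that setup, the sketch below wraps a T5-style model with LoRA adapters via the peft library and trains it with a DeepSpeed configuration; the hyperparameters, sequence lengths, and DeepSpeed config filename are illustrative assumptions rather than the exact settings in our repository.

```python
# Sketch: LoRA fine-tuning of a T5-style seq2seq model on Alpaca with
# peft + transformers + DeepSpeed. Hyperparameters, max lengths, and the
# DeepSpeed config filename are illustrative assumptions, not our exact settings.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Wrap the frozen base model with small, trainable low-rank adapters.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5/UL2 blocks
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train

# Turn Alpaca records into (prompt -> response) features for seq2seq training.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def to_features(example):
    prompt = f"{example['instruction']}\n{example['input']}".strip()
    features = tokenizer(prompt, truncation=True, max_length=512)
    features["labels"] = tokenizer(example["output"], truncation=True,
                                   max_length=512)["input_ids"]
    return features

train_ds = alpaca.map(to_features, remove_columns=alpaca.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-ul2-alpaca-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-4,
    deepspeed="ds_config_zero3_cpu_offload.json",  # hypothetical ZeRO-3 + CPU-offload config
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```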

In addition to the Flan-UL2 model, we also trained the T5-Large and T5-XL models on the Alpaca dataset, further expanding the range of available models. All of these models, including Flan-UL2, T5-Large, and T5-XL, have been released on the Hugging Face hub for broader access and usage. Together they have been downloaded roughly 2,500 times since their release, indicating that the open-source community is actively engaged with our work.

Instruction Tuning Decoder-Only Models

While Flan-UL2 addressed the lack of commercially usable LLMs, the Alpaca dataset used in our initial experiments was created using OpenAI’s API and is therefore subject to OpenAI’s Terms of Service. As a result, models trained on it may face limitations on usage in commercial settings. To overcome this challenge, we expanded Mosaic’s dolly-HHRLHF, a commercially usable instruction-tuning dataset, with instructions extracted from Open Assistant’s OASST-RLHF dataset to create a comprehensive 63k-example Open-Instruct dataset. We published this dataset on Hugging Face to provide an even broader range of instruction data.
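The sketch below shows, under simplified assumptions, how two such datasets can be merged with the Hugging Face datasets library; the single-turn OASST extraction and column handling are approximations, and the deduplication and quality filtering behind the final 63k examples are omitted.

```python
# Sketch: building a combined instruction dataset from dolly-HHRLHF and OASST1
# with the Hugging Face datasets library. The single-turn OASST extraction and
# column handling are simplified; deduplication and quality filtering are omitted.
from datasets import Dataset, concatenate_datasets, load_dataset

# Dolly-HHRLHF already ships as prompt/response pairs.
dolly = load_dataset("mosaicml/dolly_hhrlhf", split="train")
dolly = dolly.remove_columns(
    [c for c in dolly.column_names if c not in ("prompt", "response")]
)

# OASST1 is a tree of messages; pair each English assistant reply with its
# parent prompt (a simplified, single-turn extraction).
oasst = load_dataset("OpenAssistant/oasst1", split="train")
by_id = {m["message_id"]: m for m in oasst}
pairs = [
    {"prompt": by_id[m["parent_id"]]["text"], "response": m["text"]}
    for m in oasst
    if m["role"] == "assistant" and m["lang"] == "en" and m["parent_id"] in by_id
]
oasst_pairs = Dataset.from_list(pairs).cast(dolly.features)  # align schemas

open_instruct = concatenate_datasets([dolly, oasst_pairs])
print(open_instruct)
```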

During our work with the Flan-UL2 model, we observed certain limitations in its ability to handle code and mathematical expressions, owing to its tokenizer and pretraining data. Fortunately, the Open-LLaMA project has made remarkable progress in creating high-quality, commercially viable base LLMs with 7 billion and 13 billion parameters. These models can match the performance of state-of-the-art models of comparable size on academic benchmarks, including the original LLaMA-7B and LLaMA-13B models from Meta. To make them more suitable for instruction following, we performed instruction tuning on the Open-LLaMA models using our Open-Instruct dataset. The result is a pair of fully open-source instruction-following models, Open_LLaMA_7B_Open_Instruct and Open_LLaMA_13B_Open_Instruct, which exhibit performance similar to that of non-commercial counterparts such as Vicuna.
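A minimal usage sketch, assuming the Hugging Face model ID VMware/open-llama-7b-open-instruct and an Alpaca-style prompt template, is shown below; consult the model card on the hub for the exact prompt format.

```python
# Sketch: generating with an instruction-tuned Open-LLaMA model.
# The Hub model ID and prompt template below are assumptions; consult the
# model card on the Hugging Face hub for the exact format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VMware/open-llama-7b-open-instruct"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nExplain LoRA fine-tuning in two sentences.\n\n"
    "### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```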

Looking Forward

As shown in the previous sections, a lot of recent research and dataset creation has been done with data generated by OpenAI models. That’s not bad for research, but OpenAI’s Terms of Service forbid using any OpenAI-generated content to train a competing model:

2. Usage Requirements. (c) Restrictions. You may not […] (iii) use output from the Services to develop models that compete with OpenAI.

Therefore, none of those datasets can be used to train a model that is deployed commercially in place of an OpenAI model. This is a detail that can be easily missed when looking at a model’s license on Hugging Face. For example, the instruction-tuned variants of Falcon (both 7B and 40B) are listed as Apache 2.0. Whether that license holds up is an argument for the lawyers, because when you dig into the datasets used to train those models, you find they were generated by OpenAI models. So the model weights may very well be Apache 2.0, but the model can’t be used commercially.

The solution to this is straightforward but not easy. We, the greater NLP community, will have to reproduce those datasets with commercially viable models. What makes ChatGPT and GPT-4 special isn’t the language model itself. There’s no architectural magic there that hasn’t already been reproduced (or exceeded) by the open-source community. It’s the training data. That’s why models trained on the output from ChatGPT/GPT-4 can mimic the expressive power of those models with an order of magnitude (or two) fewer parameters. If we can come together as a community and create commercially viable training data, then we can train dramatically smaller models with the same reasoning and text-generating capabilities as the larger models from OpenAI!

Disclaimer: We have spoken to lawyers about dataset licensing issues, but we are not lawyers, and this should not be taken as legal advice.

Stay tuned to the Open Source Blog and follow us on Twitter for more deep dives into the world of open source contributing.