Since the release of OpenAI’s ChatGPT in November 2022, large language models (LLMs) have gone mainstream and are slowly making their way into open source.
Unlike similar chatbots that came before it, ChatGPT spread quickly, gaining more than 100 million monthly active users in just two months. With a simple user interface backed by the LLMs GPT-3.5 (free tier) and GPT-4 (paid tier), which understand and generate human-like responses on a broad range of topics, LLMs have become accessible to the general public.
LLMs like GPT, LaMDA, and BLOOM are deep learning models trained on massive amounts of data from sources including books, Wikipedia, academic papers, and social media, learning the patterns and structures that connect words and phrases. This training allows an LLM to generalize to new, unseen text and perform natural language processing tasks that mimic human speech and writing. The “large” in the name refers to the number of parameters, which lets the model capture more complex relationships between words, statistically model typical language, and generate text in the most human-natural way possible. In fact, the most significant difference between GPT-2 and GPT-3 is scale: GPT-3 has 175 billion parameters and was trained on more diverse datasets, while GPT-2 has 1.5 billion parameters and was trained on 40GB of text data.
In simpler terms, LLMs predict the most plausible words to form a text, and they’re really good at it.
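To make that concrete, here is a deliberately tiny sketch of the core idea: given the text so far, pick the statistically most plausible next word. This toy bigram table is my own illustration, not how GPT actually works; real LLMs learn billions of parameters over subword tokens rather than using a hand-written lookup table.

```python
# Toy bigram "language model": for each word, the counts of words
# observed to follow it in a (tiny, made-up) training corpus.
bigram_counts = {
    "open":   {"source": 8, "issue": 2},
    "source": {"software": 5, "license": 4, "code": 1},
}

def predict_next(word):
    """Return the statistically most plausible next word."""
    followers = bigram_counts[word]
    return max(followers, key=followers.get)

print(predict_next("open"))    # "source"
print(predict_next("source"))  # "software"
```

Scaled up by many orders of magnitude, this is why LLM output reads so naturally: each word is, by construction, a highly plausible continuation of the words before it.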
With GPT-4.5 rumored to be on the way and other competitors joining the artificial intelligence (AI) race, LLM capabilities will continue to grow and tremendously impact various industries. Open source is no exception.
Many AI-powered tools have already been released for software development teams. Some promise to increase the productivity of software engineers by auto-generating code; others assist technical writers and engineers in producing better documentation, or improve communication among stakeholders by transcribing meetings and distilling the notes into action items. A few have already made their way into open source communities, and more are expected to be adopted as demand grows.
The recent release of GitHub’s Copilot X is a prime example of how AI-powered tools will be used in open source. Its new copilot for pull requests aims to help review pull requests faster by writing better descriptions, generating missing tests for detected code changes, and suggesting ghost text for better communication. GitHub is also experimenting with resolving open issues faster, using an LLM-powered bot to explain, generate, and suggest changes and to create pull requests addressing open issues. With tools like this, a future where open source maintainers and contributors use AI to automate and streamline their work doesn’t seem so far away. Or is it?
Exploring LLM capabilities in open source is exciting but not without concerns. Several major concerns have been raised with AI-powered tools, and many are complicated. However, these questions need to be addressed to ensure AI tools are available fairly and safely to all open source communities.
Let’s take a look at a few that are on my mind.
Open Source Data and Generated Ownership
Many developer-friendly AI-powered tools are advertised as pair programming tools: they auto-complete your code, refactor it, or simply write it for you from a description. This has raised red flags in open source communities, prompting questions about the data used to train the LLMs. Specifically, could using or distributing AI-generated code violate the open source licenses of the source code used in training?
In layman’s terms, as stated in the Open Source Guides, copyright means “someone cannot legally use any part of your GitHub project in their code, even if it’s public, unless you explicitly give them the right to do so.” Open source licenses are how we explicitly grant such rights, under specific terms and conditions. How does this apply to AI systems trained on open source data that output content derived from that training data?
Questions about ownership and intellectual property in AI-generated content have been a huge issue since the release of AI image generators, and several lawsuits now in progress may provide the answer for us all:
- Lawsuit against GitHub, Microsoft, and OpenAI
- Lawsuit against Stability AI, Midjourney, DeviantArt
- Lawsuit against Stability AI by Getty Images
- Lawsuit against Prisma Labs
The lawsuit against GitHub Copilot alleges that the AI system was trained on public repositories hosted on GitHub and violates the rights of the many engineers who contributed their code under various open source licenses. GitHub and Microsoft respond that “… there is no basis for any of the underlying claims, dismissal is appropriate.”
This lawsuit may be just the start of many more to come. Still, one thing is for sure: the outcome of any of the cases against producers and users of generative AI systems will affect both the machine learning and open source communities, and may force us to rethink how we ingest data and make it publicly available.
Authority Bias and Misinformation
LLM-powered tools generate outputs that always sound plausible. But just because it sounds plausible doesn’t mean it is correct.
Just as we are biased to treat search engines as a source of truth, AI-powered tools are treated as an intellectual authority. However, any machine learning model is only as good as the data it was trained on; the model takes its original dataset as its representation of the world. If you ask an LLM-powered tool about a software tool released a few weeks ago, or anything else absent from its training data, it will have no idea what you mean, yet it will still generate the statistically best output it can, which will likely be plausible-sounding but incorrect. Adding new information to an LLM requires fine-tuning or retraining on a new dataset, which can take days, months, or even years, depending on resource availability, ML operations setup, and other factors.
Another way to think about it: no model will ever have all the information in the world, because that dataset does not exist. All LLMs are therefore limited. Many important artifacts and much information simply aren’t available in a digital format these models can be trained on; without them, models can only depend on their training data. Much of that data comes unvetted from the internet, often carrying implicit or explicit biases, which creates a risk of AI-generated content amplifying misinformation. Furthermore, since model authors typically do not publish details of the specific training data, you never really know what a model was trained on or how much trust to place in it. You only know, in the abstract, that it was content sourced largely from the internet.
Due to these same concerns about misinformation and the trustworthiness of AI-powered tools, Stack Overflow was one of the first sites to ban LLM-generated text, stating that “because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking and looking for correct answers.”
So, before you trust an AI-powered tool’s output, always evaluate it critically and verify the information before deciding how to use it. While these tools excel at many tasks and are viable assistants for coding, they are just computer systems after all. Today, verifying an AI-powered tool’s output can be as costly as creating that output from scratch, with high quality, by traditional means.
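As a minimal sketch of what that verification can look like in practice, treat generated code like any untrusted patch and check it against cases you wrote yourself. The function and test cases below are hypothetical, my own illustration rather than output from any real tool:

```python
# Hypothetical scenario: an AI assistant generated this function for us.
# Before trusting it, we verify it against test cases we wrote ourselves.
def ai_generated_slug(title):
    """Supposedly converts a post title into a URL slug."""
    return title.lower().replace(" ", "-")

# Our own verification: include edge cases the tool may not have considered.
test_cases = {
    "Hello World": "hello-world",
    "  Leading space": "leading-space",  # fails: leading hyphens remain
}

for title, expected in test_cases.items():
    actual = ai_generated_slug(title)
    status = "PASS" if actual == expected else "FAIL"
    print(f"{status}: {title!r} -> {actual!r}")
```

The plausible-looking code passes the obvious case and fails the edge case, which is exactly the failure mode that makes unverified AI output risky.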
Data Privacy and AI Regulations
Various security concerns exist for all machine learning systems, including adversarial attacks, model stealing, and data privacy. Recently, though, the concern making the biggest headlines relates to data and a lack of risk management.
The biggest news is Italy’s decision to block ChatGPT over a breach of the European Union’s General Data Protection Regulation (GDPR), specifically the unlawful processing of EU users’ personal data to train the algorithm. The decision has drawn attention to the AI Act, the first legislation on artificial intelligence now in progress at the European institutions, and fueled the debate on the need to regulate AI research and development. It came a few days after “Pause Giant AI Experiments: An Open Letter,” which proposed that all AI labs pause for at least six months the training of systems more powerful than GPT-4 in order to “develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts.”
It’s no secret that laws and policies have been falling behind technological advancement, and AI systems are no exception; many experts in the ML field clearly fear unknown consequences. At the same time, regulating AI risks limiting the vast capabilities that machine learning communities have unlocked.
Shaping the Future of AI in Open Source Communities
Overall, the state of AI is constantly evolving, and many questions remain unanswered. There is no doubt that open source communities will need to adapt to the changes to come, and it will be crucial for the community to take collective responsibility for ensuring that AI is applied fairly and safely, providing inclusive environments for all.
Whether you’re ready or not, AI is coming. Let’s not forget about current concerns and shape AI’s impact and responsible usage in open source together as a community.
Stay tuned to the Open Source Blog and follow us on Twitter for more deep dives into the world of open source contributing.