What is Open-Source AI?
The debate over what constitutes open-source AI has long been contentious. A breakthrough may now be at hand, however: the Open Source Initiative (OSI) has released a new definition intended to clarify what qualifies as open-source AI and to guide lawmakers crafting regulations that protect consumers from AI risks.
The OSI, known for setting standards for open-source technology in other fields, has now turned its attention to artificial intelligence. This is the organization's first effort to define what open source means specifically for AI models. To develop the definition, the OSI convened a diverse group of 70 participants, including researchers, legal experts, policymakers, and representatives from major tech companies like Meta, Google, and Amazon.
According to the new definition, an open-source AI system must meet several criteria, summarized in the sketch after this list:
- Freedom of Use: The system can be utilized for any purpose without needing prior permission.
- Transparency: Researchers should be able to inspect the system’s components and understand how it operates.
- Modification and Sharing: The system should be adaptable and shareable with others, whether or not modifications have been made.
- Data and Code Disclosure: The model should provide clear information about its training data, source code, and weights.
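To make the four requirements concrete, here is a minimal, purely illustrative sketch in Python that treats them as an all-or-nothing checklist. The class name, field names, and `qualifies()` helper are assumptions made for illustration; they are not part of the OSI definition or any official tooling.

```python
# Illustrative sketch only: a hypothetical checklist mirroring the four criteria above.
from dataclasses import dataclass, fields

@dataclass
class OpenSourceAICheck:
    free_use_for_any_purpose: bool      # usable without asking for prior permission
    components_inspectable: bool        # researchers can examine how the system works
    modifiable_and_shareable: bool      # adaptable and redistributable, modified or not
    data_code_weights_disclosed: bool   # training-data details, source code, and weights available

    def qualifies(self) -> bool:
        # Under the definition, all four criteria must hold at once.
        return all(getattr(self, f.name) for f in fields(self))

# A model with a restrictive license and undisclosed training data would fail the check.
print(OpenSourceAICheck(True, True, False, False).qualifies())  # False
```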
Before this definition, the concept of open-source AI lacked clarity. While OpenAI's and Anthropic's models are clearly closed source, given how little the companies disclose about them, there has been debate about whether Meta's and Google's models, which are accessible but come with restrictive licenses and undisclosed training data, truly fit the open-source label.
Avijit Ghosh, an applied policy researcher at Hugging Face, notes that companies have often misused the term "open source" to enhance their models' perceived trustworthiness, even when independent verification of their openness isn’t possible.
Ayah Bdeir, a senior advisor to Mozilla who participated in the OSI's definition process, says that some aspects of the definition, such as the requirement to release model weights, were relatively straightforward to agree upon. Deciding how much information about training data must be made public, however, proved more contentious.
The opacity around training data has led to numerous lawsuits against AI companies. For instance, OpenAI and the music-generation company Suno have been criticized for not fully disclosing their training datasets, often stating only that they use "publicly accessible information." Advocates argue that full disclosure of training data is necessary, but Bdeir points out that enforcing such a standard is challenging because of issues around copyright and data ownership.
The new definition requires that open-source models disclose enough information about their training data so that a knowledgeable individual could recreate a similar system using the same or similar data. This compromise aims to balance transparency with practical enforceability.
The OSI plans to introduce an enforcement mechanism to flag models that claim to be open source but do not adhere to the new definition. It also plans to publish a list of AI models that meet the criteria. While specifics are yet to be confirmed, expected models include Pythia by EleutherAI, OLMo by Ai2, and models from the open-source collective LLM360.
In summary, the OSI’s new definition of open-source AI represents a significant step towards clarity and accountability in AI, aiming to foster innovation while addressing concerns about transparency and consumer protection.