Guest post from Andrew Dang, ASU Law Student and LLM Developer.
This week, OpenAI announced new features for its platform at its first keynote event, including GPT-4 Turbo with a 128K context window, GPT-4 Turbo with Vision, the DALL·E 3 API, and more. It also announced the Assistants API for building agents, which includes OpenAI’s own retrieval-augmented generation (RAG) pipeline. Today, we will focus on OpenAI’s entry into the RAG market.
At the surface level, RAG boils down to text generation models like ChatGPT retrieving data such as documents to assist users with question answering, summarization, and so on. Behind the scenes, however, other factors are at play, such as vector databases, document chunking, and embedding models. Most RAG pipelines rely on an external vector database and require compute to create the embeddings. What OpenAI’s retrieval tool brings to the table is an all-encompassing RAG system that eliminates the need for an external database and for the compute required to create and store the embeddings. Whether OpenAI’s retrieval system is optimal is a story for another day. Today we are focusing on the data implications.
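Before turning to those implications, it may help to see the moving parts named above (chunking, embedding, vector storage, retrieval) in one place. The sketch below is a toy illustration only and uses nothing beyond the Python standard library. The names `chunk`, `embed`, `LocalVectorStore`, and `build_prompt` are hypothetical, and the term-frequency "embedding" is a stand-in for a real embedding model, not how OpenAI or any particular framework does it. The point is that every step can run on hardware you control.

```python
import math
import re
from collections import Counter


def chunk(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    pieces = [" ".join(words[i:i + size]) for i in range(0, stop, step)]
    return [p for p in pieces if p]


def embed(text):
    """Toy embedding: a unit-length sparse term-frequency vector.

    Assumption: a real pipeline would call an embedding model here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


class LocalVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, chunk_text) pairs

    def add(self, document):
        for piece in chunk(document):
            self.items.append((embed(piece), piece))

    def search(self, query, k=2):
        """Return the k chunks with the highest cosine similarity to the query."""
        q = embed(query)
        scored = [
            (sum(w * vec.get(tok, 0.0) for tok, w in q.items()), piece)
            for vec, piece in self.items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question, store, k=2):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(piece for _, piece in store.search(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    store = LocalVectorStore()
    store.add("The indemnification clause covers copyright claims brought against the customer.")
    store.add("Lunch is served in the cafeteria at noon every weekday.")
    # The resulting prompt would then be sent to a text generation model.
    print(build_prompt("copyright indemnification", store, k=1))
```

In a hosted retrieval tool, the documents, the embeddings, and the store all live with the provider. In a sketch like this one, only the final prompt ever needs to leave your machine, and not even that if the generation model is self-hosted.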
Data is the new currency fueling the new economy. Big Tech aims to take control of that economy by ingesting organizations’ private data, including IP, leading to a “monolithic system” that completely controls users’ data. Google, Microsoft, Adobe, and OpenAI now offer to indemnify their users against potential copyright infringement lawsuits related to Generative AI, aiming to protect their business model by securing more favorable legal precedents. This strategy rests on the argument that neither the input (ideas, which are uncopyrightable) nor the output (machine-generated expression, deemed uncopyrightable by the US Copyright Office) of a Generative AI process constitutes copyright infringement. The consequences of Big Tech having its way could be dire, leading us to a cyberpunk dystopia that none of us wants to live in. Technology and its algorithms would be in charge, and our personal data could be used to manipulate us. Our data reveals our interests, private health information, location, and more. When algorithms feed us only limited, targeted information based on our existing interests and views, they restrict the outside influence and diversity of opinion that are crucial to freedom of thought. Organizations must not contribute to this cyberpunk dystopia where Big Tech becomes Big Brother. Furthermore, companies put their employees, clients, and stakeholders at risk when they hand data to Big Tech. Big Tech favors the role of tortfeasor over that of the good Samaritan who complies with consumer privacy laws.
To prevent Big Brother, organizations should implement their own RAG pipelines. Open-source tools such as LlamaIndex, Qdrant, and LangChain can be used to create powerful RAG pipelines that keep your privacy and interests protected. LLMWare has also released an open-source RAG pipeline and domain-specific embedding models. Generative AI is a powerful tool that can enhance our lives, but in the wrong hands it can make the cyberpunk nightmare a reality. The ease of using prebuilt, turnkey systems, such as those offered by OpenAI, is appealing. However, entrusting our valuable data to corporations without a regulatory framework or protections carries long-term risks and points in a potentially perilous direction.