Roost.ai blog on Generative AI and Large Language Models

#142 The Inclusive Lake Vectoria

Written by Rishi Yadav | January 2024

In the latest installment of the #lakevectoria series, which delves into the fascinating landscape of vector and embedding data lakes, I proposed a thought-provoking trajectory: just as big data lakes evolved, Lake Vectoria is likely to follow a similar path. Initially, it will primarily absorb unstructured text data, but it will soon expand its reach to include semi-structured and structured data as well. Let's explore this spectrum in today's blog.

Lake Vectoria is highly inclusive, provided the data within the lake exists in the form of embeddings. Outside the lake, data can take any shape before it enters (ingress) and after it leaves (egress) the lake.

Optimizing Unstructured Data

The majority of applications you've seen built on large language models (LLMs) and generative AI focus primarily on unstructured data. Labeling data as 'unstructured' is a somewhat narrow perspective, especially given our traditional understanding of normalized forms. The classification may feel old-fashioned, but for the sake of continuity and legacy, it's useful to maintain. It's true that text, images, and videos possess structures of their own, but these aren't normalized structures in the traditional sense.

One of the significant challenges with these formats is that their apparent structure isn't sufficient for deriving meaning. Large language models have therefore had to embark on an almost superhuman learning journey, one that is complex and multifaceted: they must understand context and subtleties, deciphering meaning from arrangements that might seem chaotic or random.
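To make the ingress side concrete, here is a minimal sketch of the pipeline that turns unstructured text into vectors before it enters the lake. The `toy_embed` function below is a hypothetical stand-in: it hashes tokens into buckets, whereas a real embedding model would be a trained network that captures meaning, not mere token overlap. Only the shape of the pipeline, text in, normalized vector out, is the point.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-length vector by hashing tokens into buckets.
    A stand-in for a trained embedding model, which would capture
    semantics rather than token identity."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-length, ready for cosine similarity

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit vectors is just their dot product."""
    return sum(x * y for x, y in zip(a, b))

e1 = toy_embed("the lake absorbs unstructured text")
e2 = toy_embed("unstructured text enters the lake")
e3 = toy_embed("quarterly revenue grew nine percent")
# Overlapping wording yields a higher similarity than unrelated text.
print(cosine(e1, e2) > cosine(e1, e3))
```

Once text lives in this vector form, "inside the lake" queries reduce to nearest-neighbor search over embeddings, which is exactly what vector databases are built for.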

Optimizing Semi-Structured Data

Ingesting semi-structured data, such as JSON and XML, into Lake Vectoria presents a significant challenge for LLMs. This data type blends structured and unstructured elements in a complex way, requiring LLMs to have a high level of adaptability and a deep understanding of context. The main difficulty lies in the intricate mix of textual and structural content, which complicates parsing and interpretation during the data ingress phase. This necessitates specialized training for LLMs to effectively manage these nuances.

However, the egress process in Lake Vectoria offers a notable bright spot. While the ingress of semi-structured data is challenging, the egress has been streamlined effectively. In fact, JSON has become the preferred format for data extraction from LLMs. This standardization simplifies the handling of data after analysis, representing a significant stride in enhancing data management within the system. It shows that despite the complexities in the initial stages, the output process has been optimized for efficiency and effectiveness.
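In practice, the one wrinkle on the egress side is that models often wrap their JSON in Markdown code fences. A small, hedged sketch of the extraction step (the `reply` string here is a made-up example of model output, not from any specific API):

```python
import json

def extract_json(reply: str) -> dict:
    """Pull a JSON object out of an LLM reply, tolerating the code
    fences models often wrap around structured output."""
    cleaned = reply.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        # drop an optional language tag such as 'json' after the fence
        if cleaned.startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)

reply = '```json\n{"sentiment": "positive", "score": 0.91}\n```'
print(extract_json(reply)["sentiment"])  # positive
```

Guarding egress this way keeps downstream systems decoupled from the quirks of any particular model's formatting habits.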

To address the ingress challenges, Lake Vectoria employs vector databases, or feeder lakes, for storing transformed semi-structured data embeddings. This approach simplifies complex data structures, easing the integration process for LLMs. Vector databases efficiently manage the high-dimensional nature of embeddings, enhancing data retrieval and processing. This strategy ensures seamless ingestion and optimizes Lake Vectoria’s capacity in managing semi-structured data.
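One common way to tame that mix of textual and structural content before embedding is to flatten nested documents into small "path: value" chunks, each of which can then be embedded and stored in a feeder lake. The sketch below shows the flattening step only; the field names in the sample document are invented for illustration.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into 'path: value' strings so the mix of
    structure and text becomes prose-like chunks an embedding model
    can consume."""
    items = []
    if isinstance(obj, dict):
        for key, val in obj.items():
            items += flatten(val, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            items += flatten(val, f"{prefix}{i}.")
    else:
        items.append(f"{prefix.rstrip('.')}: {obj}")
    return items

doc = json.loads('{"service": "roost", "tags": ["ai", "devops"], '
                 '"owner": {"name": "rishi"}}')
chunks = flatten(doc)
print(chunks)
# ['service: roost', 'tags.0: ai', 'tags.1: devops', 'owner.name: rishi']
```

Each chunk keeps just enough structural context (its path) to remain meaningful on its own after it has been reduced to a vector.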

Optimizing Structured Data

Handling structured data in Lake Vectoria is inherently more straightforward for LLMs due to its compatibility with vector representations. Entities and relationships in structured databases are effectively converted into vectors, aligning seamlessly with LLMs' processing capabilities. This technical ease highlights the suitability of LLMs for efficiently managing structured data.

However, the integration of structured data into Lake Vectoria involves not just a technical shift but also a cultural one. Traditional SQL queries undergo a transformation into embeddings, a process that requires database professionals to adapt to a new paradigm of data retrieval. This adaptation signifies a significant change from conventional database querying to a more AI-driven approach.

To enhance this process, structured data in Lake Vectoria is directed into feeder lakes or vector databases, similar to semi-structured data. This storage method ensures uniformity across data types and optimizes the efficiency of LLMs in processing high-dimensional data. Feeder lakes offer scalable storage, advanced search capabilities, and better integration with existing systems, facilitating a smooth transition for database professionals and maximizing the utility of structured data within the Lake Vectoria ecosystem.

Conclusion

As Lake Vectoria evolves, mirroring the growth trajectory of Big Data Lakes, it is poised to inclusively ingest both semi-structured and structured data. The utilization of vector embeddings significantly eases the integration of semi-structured data, while structured data becomes even simpler to handle, thanks to the natural alignment with vectors. The primary challenge in this evolution lies not in the technical realm but in the cultural shift towards prioritizing embeddings.