April Castillo
- Dec 18, 2023
- 3 min read

Unlocking Data Insights: LakeHouse and Gen AI with Databricks

Updated: Mar 4

A prospective client wants to build an LLM-powered document summarization chatbot. Triseed was brought in to work with multiple teams of consultants on the overall architecture and design, and potentially on the implementation of the solution.

Context:

The client wants to improve customer service efficiency and to help its technicians and IT staff support, maintain, and build internet-connected network systems, including connecting offices and buildings to the network for efficient communication. The primary design goal is consistent, strong network connectivity: minimal faults and downtime, and predictable connections for an ever-increasing user base.

The flow of their customer communication is like this:

Customer -> Customer Care 1 -> Customer Care 2 -> Technical Support -> Researching all documentation

Technicians in the Field -> Internal Communications Care 1 -> Technical Support -> Research for Answers

Technicians either replace the equipment altogether or return for a second visit once the information is more readily available. Given the cost of this inefficiency, the client estimates that a customer-facing chatbot and an internal chatbot tool for technicians will save the company millions, along with time and resources, while improving the overall customer experience. The additional revenue from retained customers and upsell opportunities, plus the resources saved, outweighs the cost of building a new model, and doing nothing carries a cost of its own.

Problem:

The client has many specification documents for the products they sell: electronics, gadgets, server peripherals, mobile phones, modems, routers, switches, and other networking equipment. The combination of documents across different models and versions makes support very complicated, both for customer service agents resolving customer needs and for technicians installing the equipment.

Another issue is that data sources are scattered and need to be consolidated. The data comes in different formats and from different sources, each of which needs to be handled differently and refreshed frequently, using reproducible pipelines that keep the Delta tables fresh. Incremental updates are necessary to reduce load on the source databases, with the update frequency tuned per source, up to near-real-time where needed.
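To make the incremental-update idea concrete, here is a minimal, library-free sketch of a watermark-based load followed by an upsert. It is not the client's pipeline: the `id` and `updated_at` fields, the sample rows, and all function names are assumptions for illustration. On Databricks the same pattern would typically be expressed as a filtered read from the source plus a merge into a Delta table.

```python
from datetime import datetime

def fetch_changed_since(source_rows, watermark):
    """Simulate a source query such as WHERE updated_at > :watermark."""
    return [row for row in source_rows if row["updated_at"] > watermark]

def upsert(target, rows):
    """Insert new keys and overwrite existing ones, keyed by 'id'."""
    for row in rows:
        target[row["id"]] = row

def run_incremental_load(target, source_rows, watermark):
    """Load only rows changed since the watermark and return the new watermark."""
    changed = fetch_changed_since(source_rows, watermark)
    upsert(target, changed)
    return max((row["updated_at"] for row in changed), default=watermark)

# Hypothetical sample data: the target already holds ids 1 and 3.
target = {
    1: {"id": 1, "model": "router-a", "updated_at": datetime(2023, 11, 15)},
    3: {"id": 3, "model": "modem-c", "updated_at": datetime(2023, 11, 20)},
}
source = [
    {"id": 1, "model": "router-a-v2", "updated_at": datetime(2023, 12, 10)},  # changed
    {"id": 2, "model": "switch-b", "updated_at": datetime(2023, 12, 12)},     # new
    {"id": 3, "model": "modem-c", "updated_at": datetime(2023, 11, 20)},      # unchanged
]

watermark = run_incremental_load(target, source, datetime(2023, 12, 1))
# A second run with the new watermark finds nothing to do.
watermark_after_rerun = run_incremental_load(target, source, watermark)
```

Because only rows newer than the watermark are requested, the source database is asked for a small slice on each run instead of a full extract.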

Solution:

For this approach we separated the Data Engineering work from the Data Science + GenAI work, both using Databricks.

A - Data Lakehouse Solution - Databricks. The design manages data from multiple sources and stores it in S3-like blob storage for long-term retention. Databricks will manage Delta tables for all transformation and cleaning steps, and the gold tables will also be hosted on Databricks. Delta tables will be the strategy for data marts and feature stores, including feature engineering.
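The cleaning and gold-table steps described above can be sketched in plain Python. This is only an illustration of the raw-to-clean-to-aggregated flow, not the project's actual transforms: the record fields (`product_id`, `version`, `category`), the sample rows, and the function names are all assumptions.

```python
def to_silver(raw_rows):
    """Clean raw records: drop rows without a product id, normalize text, deduplicate."""
    seen = set()
    silver = []
    for row in raw_rows:
        product_id = (row.get("product_id") or "").strip().upper()
        if not product_id:
            continue  # unusable without a key
        key = (product_id, row.get("version"))
        if key in seen:
            continue  # same product and version already kept
        seen.add(key)
        silver.append({
            "product_id": product_id,
            "version": row.get("version"),
            "category": (row.get("category") or "unknown").strip().lower(),
        })
    return silver

def to_gold(silver_rows):
    """Aggregate: number of documented product versions per category."""
    counts = {}
    for row in silver_rows:
        counts[row["category"]] = counts.get(row["category"], 0) + 1
    return counts

# Hypothetical raw records as they might land from different sources.
raw = [
    {"product_id": " rt-100 ", "version": "1.0", "category": "Router"},
    {"product_id": "RT-100", "version": "1.0", "category": "router"},  # duplicate
    {"product_id": "RT-100", "version": "2.0", "category": "router"},
    {"product_id": None, "version": "1.0", "category": "modem"},       # no key
    {"product_id": "sw-20", "version": "1.0", "category": " Switch"},
]

silver = to_silver(raw)
gold = to_gold(silver)
```

In the lakehouse design each stage would be a Delta table rather than an in-memory list, so every step stays reproducible and queryable.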

Source: https://docs.databricks.com/en/_images/lakehouse-diagram.png

B - Data Science Feature Engineering + Unity Catalog

Source: https://cms.databricks.com/sites/default/files/inline-images/operating-feature-store.png

For feature building, we will use data from our gold tables: the clean destination tables that are ready for further transformation and joins before a machine learning training pipeline runs.
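As a rough illustration of building features from a gold table, the sketch below aggregates technician visits into per-product features. The visit records, the field names, and the two features (`visit_count`, `repeat_visit_rate`) are hypothetical examples chosen to match the second-visit problem described earlier; the real feature set has not been defined in this post.

```python
from collections import defaultdict

def build_features(gold_visits):
    """Aggregate visit rows into one feature record per product."""
    totals = defaultdict(lambda: {"visits": 0, "repeat_visits": 0})
    for visit in gold_visits:
        agg = totals[visit["product_id"]]
        agg["visits"] += 1
        if visit["visit_number"] > 1:
            agg["repeat_visits"] += 1  # technician had to come back
    return {
        product_id: {
            "visit_count": agg["visits"],
            "repeat_visit_rate": agg["repeat_visits"] / agg["visits"],
        }
        for product_id, agg in totals.items()
    }

# Hypothetical gold-table rows: one row per technician visit.
gold_visits = [
    {"product_id": "RT-100", "visit_number": 1},
    {"product_id": "RT-100", "visit_number": 2},
    {"product_id": "RT-100", "visit_number": 1},
    {"product_id": "SW-20", "visit_number": 1},
]

features = build_features(gold_visits)
```

Features like these would be written to a feature table so that training and serving read the same values.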

Databricks provides training pipelines and data lineage, which give a DAG-style view of how the data will be transformed and how the model will be trained.

Source: https://docs.databricks.com/en/_images/uc-expanded-lineage-graph.png

After the model is trained, we will version it using MLflow on Databricks. This tracks the model and its performance, and helps ensure that the model and the features it depends on are deployed correctly.

With Databricks' comprehensive AI deployment process, managing the LLM's document summarization functionality should be much simpler than a typical feature engineering and deployment workflow. Databricks should also streamline fine-tuning the model to build the chatbot with search and recommendation capabilities.

Source: https://docs.databricks.com/en/_images/ml-diagram-model-development-deployment.png

More technical discussion of actually building and fine-tuning Llama 2 for document summarization will follow soon.

Summary:

As you can see, Databricks' end-to-end approach to Data Engineering and ML feature engineering is superb. The introduction of Unity Catalog changes how we store and develop against Delta tables. As the business climate keeps changing, Delta tables help companies stay agile: they can adapt to change and apply it quickly, adjusting the model as needed. Companies are shifting to AI, and that shift will accelerate. In follow-up posts we will discuss other areas of the actual implementation, such as the types of tables that are candidates for feature engineering, data marts, deployment and monitoring of models, and some of the techniques we are considering to develop and fine-tune the model.
