Where do dads keep all of their jokes? In their dad-a-base!
I am excited and a bit nervous to share a project that has, in some ways, been years in the making. Ever since I learned about the power of embeddings and their potential for understanding data, I've been trying to refine a combination of ML and data visualization techniques for exploring large unstructured datasets. I feel I've finally built the tool I was missing in my explorations: a workflow that supports experimentation and paves a path through the dizzying array of choices in this fast-moving environment.
I give you Latent Scope!
https://github.com/enjalot/latent-scope
I want to start by sharing what it does, because there are two important sides to it: the Python-based data process and the web-based interactive visualization.
Exploring
The goal of the process is to get your data into the interface for exploring your dataset. The main idea is to be able to zoom in on interesting subsets of your data while staying in the context of all of your data. Interesting subsets can be similarity search results, the "retrieval" part of the popular RAG technique. They could also be the clusters determined by carving up the latent space, giving you a sense of how well represented various concepts are. Even just zooming in, clicking around, and selecting data points can be an enlightening exercise, because you stay in context rather than just pulling random values.
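The "retrieval" part is conceptually simple: embed the query, then rank every row by cosine similarity. A minimal sketch with NumPy (the toy 3-d embeddings and the `top_k_similar` helper are illustrative, not Latent Scope's API; real models emit hundreds of dimensions):

```python
import numpy as np

def top_k_similar(embeddings, query, k=3):
    """Return row indices of the k embeddings most cosine-similar to query."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = emb @ q  # cosine similarity, since both sides are unit-length
    return np.argsort(-scores)[:k]

# A toy 4-row "dataset" of 3-d embeddings.
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
hits = top_k_similar(embeddings, query, k=2)  # indices of the two nearest rows
```

The key point is that the result is a list of indices into your original rows, which is exactly how staying "in context" works: a search result is a subset of the full scatter plot, not a separate dataset.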
You can try exploring some demo datasets, and then read on to see how to run your own data through the process to explore it.
Trust the process
The data process captured in the tool is a series of scripts that help you ingest a tabular dataset (CSV, Parquet, or pandas DataFrame), calculate embeddings for a text field from each row, run UMAP on the embeddings, cluster the UMAP points, and optionally label the clusters with an LLM. The scripts can be run from the web UI or individually via the command line.
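The shape of that pipeline can be sketched in a few lines. This is a toy stand-in, not what Latent Scope runs: the hash-bucket "embedding", the SVD projection, and the inline k-means are my placeholders for a real embedding model, UMAP, and HDBSCAN respectively.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. Ingest: a toy tabular dataset with one text field per row.
rows = ["the cat sat", "a dog barked", "cats and dogs",
        "stock prices rose", "markets fell", "the cat purred"]

# 2. Embed: hash tokens into a fixed-size vector (stand-in for a real model).
def embed(text, dim=16):
    v = np.zeros(dim)
    for tok in text.split():
        v[sum(map(ord, tok)) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

emb = np.stack([embed(r) for r in rows])

# 3. Project: reduce to 2-D with SVD (stand-in for UMAP).
u, s, _ = np.linalg.svd(emb - emb.mean(axis=0), full_matrices=False)
points = u[:, :2] * s[:2]

# 4. Cluster: tiny k-means (stand-in for HDBSCAN).
k = 2
centers = points[rng.choice(len(points), k, replace=False)].copy()
for _ in range(10):
    labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
    for c in range(k):
        if np.any(labels == c):
            centers[c] = points[labels == c].mean(axis=0)
```

Each step consumes the previous step's output, which is why the stages are separable: you can swap the embedding model or re-cluster without redoing everything upstream.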
While the steps are linear, at each stage you may find yourself wanting to redo it with a different choice. You may want to embed your dataset with multiple models to see which captures more meaning. Experimentation is key: there is no predetermined best choice for the UMAP or HDBSCAN parameters, so you'll want to play with them. When you experiment you want to keep track of what you've tried, so you can backtrack and go down a different path.
Latent Scope was designed to facilitate this: each run of each step writes both its output and the metadata capturing the choices made to flat files. It should be easy to see what you've done, and easy to pick up the process from any point.
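One way to picture that layout (the file names and `save_run` helper here are illustrative, not Latent Scope's actual naming scheme): every run gets an output file and a sibling metadata file, so two runs with different parameters can live side by side and be compared later.

```python
import json
import tempfile
from pathlib import Path

def save_run(out_dir, step, run_id, output, params):
    """Write a step's output next to a JSON file recording the choices made."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{step}-{run_id}.json").write_text(json.dumps(output))
    (out_dir / f"{step}-{run_id}.meta.json").write_text(json.dumps(params))

# Two UMAP runs with different parameters, both preserved on disk.
d = tempfile.mkdtemp()
save_run(d, "umap", "001", [[0.1, 0.2]], {"n_neighbors": 15, "min_dist": 0.1})
save_run(d, "umap", "002", [[0.3, 0.1]], {"n_neighbors": 50, "min_dist": 0.5})

# Reading the metadata back tells you exactly what produced each output.
meta = json.loads((Path(d) / "umap-002.meta.json").read_text())
```

Because everything is flat files, no database is needed to answer "what did I try?", and any step can be re-run from its inputs.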
Another important aspect of the process is that all of the steps use indices into the original input dataset. This means we never modify the input, and there is no possibility of copy errors. It also has some handy UX implications I'll get into in a future update!
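The index-based bookkeeping looks roughly like this (a minimal sketch; the row contents and cluster names are made up):

```python
import numpy as np

# The input dataset is never modified; every downstream artifact
# refers back to it by row index.
texts = np.array(["apple pie", "banana bread", "car engine", "disk brake"])

# A cluster (or a search result) is just an array of indices into the rows.
cluster_food = np.array([0, 1])
cluster_cars = np.array([2, 3])

# Selecting a cluster's rows is a lookup, not a copy of the data.
selected = texts[cluster_food]
```

Storing indices instead of copies means every embedding, 2-D point, and cluster label stays trivially joinable back to the source row, no matter how many experimental runs pile up.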
Getting Started
The README has detailed instructions, but getting started really is as simple as:
pip install latentscope
ls-serve ~/latent-scope-data
Try it out locally with a bunch of Transformers models like Jina v2, Nomic, and e5, or set up API keys for popular services like OpenAI, Mistral, Cohere, Voyage, and Together.
Just the beginning
I wanted to build this as an open source Python module because I'm hoping we can have a solid foundation for processing messy unstructured data into structures that facilitate exploration and understanding. I've tried to make it as easy as possible to install and use, and to keep the output standard and portable. I have a whole bunch of things I still want to build, so I've started listing them as GitHub issues, and a couple would be great starting points for anyone who wants to get involved.
If you want to chat about using it, or need some support getting started you can now join me on the Latent Interfaces discord!
I’d love to hear from you!
P.S. Some Thanks & Acknowledgements:
This isn't exactly a scientific publication, but I have received a lot of support that enabled me to put this project together.
I need to thank Erik Hazzard for tons of feedback and encouragement. My wife Agnes for endless encouragement and patience. EJ Fox for early prototyping sessions and especially the idea to summarize clusters before I saw anyone else doing it. David Ha, Shan Carter and Chris Olah for patiently teaching me the foundation for everything I know about ML when we worked together years ago.
I had been working for 24 hours straight... so I decided to call it a day.
find more in the dadabase