Teaching AI from Search
Then:
Enterprises spend a lot of time curating their content so that it can be exposed through a search engine.
Now:
Enterprises spend a lot of time curating their content so that it can be used to train machine learning models and power AI.
Search Architecture is AI Architecture.
I feel these two statements speak for themselves: there is a huge overlap between the two use cases. It’s no secret that enterprises are focusing a lot of development attention on projects that involve machine learning and artificial intelligence. Some projects are more successful than others. I believe that enterprises with a firm grasp on their enterprise search infrastructure are positioned to be exceedingly successful in their AI and ML development efforts.
As it turns out, one of the biggest challenges in creating artificial intelligence is access to data. Data scientists need to be empowered to assemble training data sets as they see fit for the task at hand. Unfortunately, on many projects this data access has been a blocker that prevents teams from reaching development milestones.
It’s often the case that a data scientist is handed a single static set of training data from which to build a machine-learned model or classifier. Data scientists often have little option other than to work with the data provided.
Data scientists on a project may hear that some data set exists somewhere within the enterprise, but the exact location of that data is unknown, or access to it is prohibitively difficult. The models produced are only as good as the data provided. Additionally, data in the real world changes over time, yet the feedback loop back to the data scientist is often left unimplemented.
Businesses that have taken control of their enterprise search solution have a clear advantage here. Those enterprises discovered long ago that connecting to all of the different data repositories and pulling them into a centralized search index is the only way to provide uniform, consistent access to all of the firm’s digital assets.
The reality is that a robust, agile enterprise connector framework that makes it easy to onboard new data sources is exactly the framework data scientists need in order to generate their training data sets. Mapping disparate data from different sources into a consistent format suitable for machine learning is often the most difficult task a data scientist faces in their day-to-day work, and it is precisely what such a framework provides (see the sketch below).
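As a concrete illustration, here is a minimal sketch of such a mapping layer in Python. The source systems, field names, and record shapes are all hypothetical:

```python
# Each connector maps its source's native record shape onto one
# common document schema. Source systems and fields are hypothetical.

def from_crm(record: dict) -> dict:
    """Map a CRM export row onto the common schema."""
    return {
        "id": f"crm-{record['AccountId']}",
        "title": record.get("AccountName", ""),
        "body": record.get("Notes", ""),
        "source": "crm",
    }

def from_wiki(page: dict) -> dict:
    """Map a wiki page onto the same schema."""
    return {
        "id": f"wiki-{page['pageId']}",
        "title": page.get("heading", ""),
        "body": page.get("content", ""),
        "source": "wiki",
    }
```

Once every connector emits the same shape, the search index and the training-set builder consume identical documents, regardless of origin.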
This raises the question: why should a data scientist have to worry about which systems the data they need is stored in? A data scientist delivers the most value when they have easy access to all of the data through a consistent API. The more time a data scientist spends debugging database connectivity issues, the less time they can spend adding value with their specialized skill set.
Perhaps at this point some light bulbs are going off: why not use the search engine to create the training sets? I believe this use case is addressed by search platforms that support “deep pagination”: running a search against the engine and then iterating through a very large number of results, sometimes all the way to the end of the result set.
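For instance, Solr implements deep pagination with cursors. Here is a minimal sketch in Python, assuming a local Solr instance and a hypothetical documents collection whose uniqueKey field is id:

```python
import requests

# Hypothetical Solr host and collection.
SOLR_SELECT = "http://localhost:8983/solr/documents/select"

def iterate_all(query="*:*", fields="id,title,body", page_size=500):
    """Deep-paginate through every match using Solr's cursorMark.

    The sort must include the uniqueKey field (here: id) so the
    cursor stays stable from one request to the next.
    """
    cursor = "*"
    while True:
        resp = requests.get(SOLR_SELECT, params={
            "q": query,
            "fl": fields,
            "sort": "id asc",
            "rows": page_size,
            "cursorMark": cursor,
        }).json()
        for doc in resp["response"]["docs"]:
            yield doc
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # cursor stopped advancing: end of results
            break
        cursor = next_cursor
```

Unlike paging with start and rows, the cursor approach keeps the cost of each page roughly constant no matter how deep into the result set you go.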
Search engines have always excelled at extremely high-performance retrieval of data, and modern engines have added specialized APIs to facilitate exporting it: Solr introduced the export handler, and Elasticsearch has the Scroll API. Both can export all of the documents that match a query at an extremely fast rate. The engines can also report important statistics about the data being exported, such as the min, max, and standard deviation, which many machine learning algorithms need in order to properly normalize the data for training.
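Here is a similar sketch against Elasticsearch, pairing the Scroll API with an extended_stats aggregation so the exported values can be z-score normalized. The host, index name, and price field are hypothetical:

```python
import requests

ES = "http://localhost:9200"   # hypothetical cluster
INDEX = "documents"            # hypothetical index

# One aggregation call returns min, max, avg, and std_deviation
# for a numeric field across the whole index.
stats = requests.post(f"{ES}/{INDEX}/_search", json={
    "size": 0,
    "aggs": {"price_stats": {"extended_stats": {"field": "price"}}},
}).json()["aggregations"]["price_stats"]

def scroll_all(page_size=1000, keep_alive="2m"):
    """Export every matching document with the Scroll API."""
    resp = requests.post(
        f"{ES}/{INDEX}/_search",
        params={"scroll": keep_alive},
        json={"size": page_size, "query": {"match_all": {}}},
    ).json()
    while resp["hits"]["hits"]:
        for hit in resp["hits"]["hits"]:
            yield hit["_source"]
        resp = requests.post(f"{ES}/_search/scroll", json={
            "scroll": keep_alive,
            "scroll_id": resp["_scroll_id"],
        }).json()

# z-score normalization using the engine-supplied statistics
# (assumes std_deviation is non-zero for the field)
for doc in scroll_all():
    doc["price_norm"] = (doc["price"] - stats["avg"]) / stats["std_deviation"]
```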
Are you using search to power your AI and ML projects? Do your data scientists complain about access to data, or about having to normalize it? The solution to those problems might be easier to find than you think.