Search: Should You Build, Buy, Or Borrow?
A challenge that many companies face is deciding whether to build, buy, or borrow their core search technology. As with most technology decisions, there’s no single right answer. But it’s important to understand the space of options, and their associated trade-offs.
Should you build your own search engine from scratch?
For most companies, the answer is absolutely not. As far as I can tell, only three kinds of companies should build core search technology from scratch:
- Huge technology companies for whom search represents a critical core competency. The obvious examples are Google and Amazon, and there are a handful of other technology giants in this category. Even for these companies, building in-house technology may come at the cost of not being able to leverage the best technology built elsewhere.
- Vendors whose main business is developing core search technology. The biggest examples here are Lucidworks and Elastic, the main commercial entities developing search technology on top of the Apache Lucene open-source platform. This category also includes smaller vendors that provide closed-source and managed search solutions.
- Companies taking a non-traditional approach to search, such as visual search. Outside of text search, there aren’t any mature or dominant platforms. Companies using non-traditional search either have to rely on bleeding-edge startups or have to build the technology themselves. If this technology is key to their success, it should probably be built in-house.
If your company is not in one of these categories, you probably should not be building your own core search technology.
Should you buy rather than build?
In 2018, “buying” a search solution generally means using a fully managed search service like Amazon CloudSearch or Algolia. There are still vendors that provide on-premises search solutions, but mostly they serve companies that haven’t gotten around to migrating off legacy search technology.
The main advantage of using a fully managed search service is that you can have a smaller in-house team. Talented search engineers and product managers are scarce and expensive, as are engineers who can manage operations. For many companies, the cost of using a fully managed search service is significantly lower than that of hiring a full-scale in-house team.
But if you go this route, make sure that you understand the downsides:
- Loss of flexibility. Search may seem like a simple search box and a ranked list of results, but it offers a rich and complex space of possibilities, not all of which are obvious from the surface user experience. A fully managed service will constrain you, including in ways that you might not anticipate.
- Lock-in. If you work with a search vendor or consultancy, be prepared to stay with that solution for a long time. Migration projects tend to be long and expensive, especially if you don’t know what’s going on inside the system. You can’t have your cake and eat it: minimizing your investment in in-house search talent makes you highly dependent on your vendor.
A fully managed search service works best for companies that have very basic search needs, minimal resources to invest in building and maintaining search, and a willingness to accept a good-enough out-of-the-box solution.
Should you borrow technology, i.e., use an open-source search platform?
Open-source search has come a very long way. Lucene has been around since 1999, and it has become ubiquitous in the software industry, from small businesses to large technology companies like Twitter and Salesforce. Most companies that manage their own search engines, and certainly most companies that aren’t weighed down by a legacy technology stack, are doing so on top of Lucene, usually by way of Solr or Elasticsearch.
If you think your search needs go beyond the capabilities of a fully managed service, or you want to preserve some degree of flexibility, then you should probably build on top of an open-source search platform. Specifically, you should consider using either Solr or Elasticsearch, both of which offer stability and strong community support. You’ll get decent out-of-the-box functionality, as well as the ability to tinker almost anywhere in the processing stack.
What about document understanding and query understanding?
Some of the toughest search challenges involve extracting structure or meaning from documents and search queries. These are hard problems, but solving them well can dramatically improve search quality. Not surprisingly, there are companies that specialize in this area.
If you are using an open-source search platform, it should be straightforward to integrate document understanding products into your indexing pipeline and query understanding products into your run-time query processing. You may even be able to do so with a closed-source platform, if it provides the right entry points.
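As a rough illustration of those two integration points, here is a minimal Python sketch. Everything in it is hypothetical: `enrich_document`, `rewrite_query`, and the toy `BRANDS` vocabulary stand in for whatever document and query understanding products you use, and `ToyIndex` stands in for a real engine such as Solr or Elasticsearch.

```python
from collections import defaultdict

# Hypothetical domain vocabulary; a real product would supply something far richer.
BRANDS = {"acme", "globex"}

def enrich_document(doc):
    """Index-time hook: add a structured 'brand' field extracted from the title."""
    tokens = doc["title"].lower().split()
    brand = next((t for t in tokens if t in BRANDS), None)
    return {**doc, "brand": brand}

def rewrite_query(query):
    """Query-time hook: split a raw query into free-text terms and a brand filter."""
    tokens = query.lower().split()
    brand = next((t for t in tokens if t in BRANDS), None)
    terms = [t for t in tokens if t != brand]
    return {"terms": terms, "brand": brand}

class ToyIndex:
    """Stand-in for a search engine: an inverted index over title tokens."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def add(self, doc):
        doc = enrich_document(doc)  # document understanding plugs in here
        self.docs[doc["id"]] = doc
        for token in doc["title"].lower().split():
            self.postings[token].add(doc["id"])

    def search(self, query):
        parsed = rewrite_query(query)  # query understanding plugs in here
        ids = set(self.docs)
        for term in parsed["terms"]:
            ids &= self.postings.get(term, set())
        if parsed["brand"]:
            ids = {i for i in ids if self.docs[i]["brand"] == parsed["brand"]}
        return sorted(ids)

index = ToyIndex()
index.add({"id": 1, "title": "Acme cordless drill"})
index.add({"id": 2, "title": "Globex cordless drill"})
index.add({"id": 3, "title": "Acme drill bits"})
print(index.search("acme drill"))
```

The point of the sketch is the shape, not the logic: the engine itself stays untouched, and the understanding components sit in two narrow hooks that you could swap for a licensed product or a trained model.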
The catch is that you need to find products that understand your particular documents and queries. No software can provide universal understanding.
You have two options.
The first option is to find a product built specifically for your domain. If a suitable one exists, you don’t need to reinvent the wheel. Indeed, licensing such a product is likely to cost far less than building one.
The second option is to use a generic machine learning platform and train it using your own labeled examples. In theory, you can address any possible domain using this approach. In practice, it’s a lot of work, and you’ll need to have talented data scientists and machine learning engineers on your team to do it well. Still, if it’s your only option, it’s an area where you should consider investing. Ultimately, your search can only be as good as your document and query understanding.
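To make the "train it using your own labeled examples" idea concrete, here is a minimal sketch of a query classifier in plain Python. It uses multinomial naive Bayes with add-one smoothing, chosen only because it fits in a few lines; it is not a recommendation, and a real project would use an established machine learning library and far more data. The example queries and category labels are made up.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Train a multinomial naive Bayes model from (query, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for query, label in examples:
        label_counts[label] += 1
        for word in query.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(model, query):
    """Return the most likely label for a query; unseen words are ignored."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in query.lower().split():
            if word in vocab:
                # add-one smoothed log likelihood
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled queries, e.g. collected from search logs and hand-labeled.
examples = [
    ("red running shoes", "footwear"),
    ("leather boots", "footwear"),
    ("wireless headphones", "electronics"),
    ("usb charger cable", "electronics"),
]
model = train(examples)
print(classify(model, "running boots"))
print(classify(model, "usb headphones"))
```

Even in this toy form, the hard part is visible: the model is only as good as the labeled examples, which is where most of the work in this approach goes.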
Summary
To wrap it up: don’t build core text search technology unless you’re a huge technology company or a search vendor. Consider a fully managed search solution if your budget is tight, but make sure you’re willing to live with the loss of flexibility. If you are implementing your own search engine, then leverage open-source technology, specifically Lucene. Finally, don’t forget document and query understanding, which can dramatically affect search quality. Buy or borrow what you can, but build if you have to.