[{"data":1,"prerenderedAt":188},["ShallowReactive",2],{"blog-\u002Fblog\u002Fchoosing-a-query-engine":3,"blog-related-\u002Fblog\u002Fchoosing-a-query-engine":187},{"id":4,"title":5,"author":6,"body":7,"category":170,"date":171,"description":172,"extension":173,"featured":174,"meta":175,"navigation":174,"path":176,"readingTime":177,"seo":178,"stem":179,"tags":180,"__hash__":186},"blog\u002Fblog\u002Fchoosing-a-query-engine.md","Choosing a query engine you can defend","Lucas Jahier",{"type":8,"value":9,"toc":156},"minimark",[10,14,17,23,28,31,66,69,73,76,81,84,88,91,95,98,102,105,119,123,126,146,150,153],[11,12,13],"p",{},"\"Which database should we use?\" is one of the most consequential questions a data team asks, and one of the most often answered by reflex. Someone used ClickHouse at their last job; someone read that DuckDB is fast; someone has a Postgres instance already running. None of these are reasons. They are starting points dressed up as conclusions.",[11,15,16],{},"The honest answer is that there is no fastest engine. There is only the engine that fits a given workload, team, and set of constraints. The job is not to find the best tool in the abstract. It is to make a defensible choice you can explain a year later when the workload has changed and someone asks why.",[18,19,20],"blockquote",{},[11,21,22],{},"There is no fastest engine. There is only the right fit for a workload.",[24,25,27],"h2",{"id":26},"start-with-the-workload-not-the-tool","Start with the workload, not the tool",[11,29,30],{},"Before naming a single technology, we describe the workload in plain terms. A few questions do most of the work:",[32,33,34,42,48,54,60],"ul",{},[35,36,37,41],"li",{},[38,39,40],"strong",{},"Read shape."," Are queries point lookups (give me this one row) or wide scans and aggregations (summarize a billion rows)? These pull toward completely different engine families.",[35,43,44,47],{},[38,45,46],{},"Write pattern."," High frequency single inserts, or large batch loads? Append only, or updates and deletes? Mutability is where many engines quietly fall apart.",[35,49,50,53],{},[38,51,52],{},"Latency budget."," Is \"fast\" 10 milliseconds or 10 seconds? Interactive dashboards and overnight reports are not the same problem.",[35,55,56,59],{},[38,57,58],{},"Concurrency."," Ten analysts, or ten thousand application requests per second? Engines that excel at one usually compromise on the other.",[35,61,62,65],{},[38,63,64],{},"Volume and growth."," Today's size matters less than the slope. An engine that's comfortable at a terabyte may be painful at a hundred.",[11,67,68],{},"Only once these are written down does the field of candidates narrow honestly. It narrows fast.",[24,70,72],{"id":71},"the-families-and-what-they-are-good-at","The families and what they are good at",[11,74,75],{},"Most engines fall into a handful of families, each with a center of gravity:",[77,78,80],"h3",{"id":79},"columnar-olap-stores","Columnar OLAP stores",[11,82,83],{},"Engines like ClickHouse store data by column, which makes wide scans and aggregations dramatically cheaper because you only read the columns you ask for. They reward analytical workloads and punish high frequency single row updates. When the question is \"summarize, group, and aggregate a lot of rows quickly,\" this is usually home.",[77,85,87],{"id":86},"embedded-and-in-process-engines","Embedded and in process engines",[11,89,90],{},"DataFusion and similar engines run inside your application rather than as a separate server. They shine when you want analytical query power without operating a cluster, or when the engine is a component inside a larger Rust service. Less operational surface, fewer network hops, more control.",[77,92,94],{"id":93},"search-and-key-value-stores","Search and key value stores",[11,96,97],{},"When the workload is point lookups at high request rates, or full text and relevance ranking, a columnar OLAP store is the wrong shape entirely. The right answer here looks nothing like an analytics database.",[24,99,101],{"id":100},"benchmark-on-your-data-before-anything-else","Benchmark on your data before anything else",[11,103,104],{},"Published benchmarks are marketing artifacts. They run on someone else's data, someone else's queries, and hardware tuned to flatter a result. They tell you almost nothing about your workload.",[11,106,107,108,112,113,115,116,118],{},"A useful benchmark is unglamorous. Take a representative slice of ",[109,110,111],"em",{},"your"," data, replay ",[109,114,111],{}," real queries at ",[109,117,111],{}," expected concurrency, and measure the metrics that actually constrain you, including p95 latency, cost per query, ingestion throughput, and what happens at the tail under load. Run it on infrastructure you would actually pay for. The result is rarely surprising, but it is defensible, and that is the point.",[24,120,122],{"id":121},"count-the-cost-you-cannot-see","Count the cost you cannot see",[11,124,125],{},"Raw query speed is the easiest number to compare and the least likely to decide the outcome. The costs that matter most are the ones that do not show up in a benchmark:",[32,127,128,134,140],{},[35,129,130,133],{},[38,131,132],{},"Operational burden."," Who runs this at 3am? An engine that is 20% faster but needs a dedicated team is often the slower choice in practice.",[35,135,136,139],{},[38,137,138],{},"Ecosystem and hiring."," Drivers, client libraries, observability, and people who already know it. Boring and well supported beats clever and obscure more often than not.",[35,141,142,145],{},[38,143,144],{},"Exit cost."," How hard is it to leave when the workload outgrows the choice? Open formats like Parquet and Arrow keep that door open.",[24,147,149],{"id":148},"write-the-decision-down","Write the decision down",[11,151,152],{},"The deliverable of this process is not a database. It is a short document that records the workload, the candidates, what was measured, what was chosen, and the condition under which the choice should be revisited. That last line matters most. A decision with a written trigger for reconsidering it is a decision you can defend, and one your successors can inherit without debating it again from scratch.",[11,154,155],{},"This is, in the end, what we mean by knowledge as code. The reasoning behind a system is captured where it can be found, reviewed, and trusted later.",{"title":157,"searchDepth":158,"depth":158,"links":159},"",2,[160,161,167,168,169],{"id":26,"depth":158,"text":27},{"id":71,"depth":158,"text":72,"children":162},[163,165,166],{"id":79,"depth":164,"text":80},3,{"id":86,"depth":164,"text":87},{"id":93,"depth":164,"text":94},{"id":100,"depth":158,"text":101},{"id":121,"depth":158,"text":122},{"id":148,"depth":158,"text":149},"Data Engineering","2026-06-15","There is no universally fastest engine. There is only the right fit for a workload. This is the framework we use to choose between columnar stores, OLAP databases, and embedded engines, and to back the call with evidence.","md",true,{},"\u002Fblog\u002Fchoosing-a-query-engine","4 min read",{"title":5,"description":172},"blog\u002Fchoosing-a-query-engine",[181,182,183,184,185],"Query engines","OLAP","DataFusion","ClickHouse","Benchmarking","BpEVKGhHFdjKu8Dm5UN9II6XNbS22_PIcGibeZ95AtM",[],1782741267337]