Interactive Querying with Apache Spark SQL at Pinterest

Sanchay Javeria | Software Engineer, Big Data Query Platform, Data Engineering
Ashish Singh | Technical Lead, Big Data Query Platform, Data Engineering

To achieve our mission of bringing everyone inspiration through our visual discovery engine, Pinterest relies heavily on making data-driven decisions to improve the Pinner experience for over 475 million monthly active users. Reliable, fast, and scalable interactive querying is essential to make those data-driven decisions possible.

In the past, we published how Presto at Pinterest serves this function. Here, we'll share how we built a scalable, reliable, and efficient interactive querying platform that processes hundreds of petabytes of data daily with Apache Spark SQL. Through an elaborate discussion on various architecture choices, challenges along the way, and our solutions for those challenges, we share how we made interactive querying with Spark SQL a success.

Querying is the most popular way for users to derive understanding from data at Pinterest. The applications of such analysis exist in all business/engineering functions like Machine Learning, Ads, Search, Home Feed Recommendations, Trust & Safety, and so on. There are primarily two ways to submit these queries: scheduled and interactive. Scheduled Queries are queries that run on a pre-defined cadence. These queries usually have strict Service Level Objectives (SLO). Interactive Queries are queries that are executed when needed and are usually not repeated on a pre-defined cadence. Unlike scheduled queries, users wait for interactive queries to finish and are unaware of potential issues that may cause query failures. These characteristics make the needs of an interactive querying platform different from a scheduled querying platform.

In the following sections, we dive deeper into how we extended interactive querying with Spark SQL at Pinterest. We start by discussing how we use Spark SQL at Pinterest and challenges specific to interactive querying with Spark SQL. We follow up by introducing the architecture and discuss how we addressed the challenges we faced along the way.

We support Hive, Presto, and Spark SQL for querying data. However, we are deprecating Hive in favor of Spark SQL, leaving us with two primary query engines (i.e., Presto and Spark SQL). Presto is used for quick interactive queries, as covered in this post. Spark SQL is used for all scheduled queries (once Hive deprecation is complete) and for interactive querying on large datasets. Below are the various approaches we considered to support interactive querying with Spark SQL as we moved from Hive to Spark SQL.

Apache Spark's Thrift JDBC/ODBC server

Apache Spark's Thrift JDBC/ODBC Server (STS) is similar to HiveServer2, allowing clients to execute Spark SQL queries over JDBC/ODBC protocols. JDBC/ODBC protocols are one of the most popular ways for various clients to submit queries. Using the STS would allow existing JDBC/ODBC protocol-supporting tools to seamlessly work with Spark SQL. However, this approach does not provide proper isolation between queries submitted to the same Thrift server. An issue with a single query can affect all other queries running on the same Thrift server. Having used HiveServer2 for interactive querying in the past, we saw several issues where a bad query brought down the entire server, resulting in the killing or failure of all the queries running concurrently. Mostly, this was caused either by a single query running in local mode with query optimization taking too much memory, or by a query loading a native jar that caused a kernel panic on the server. Learning from our experience, we decided not to choose this approach.

Spark SQL queries as shell command applications on Apache YARN

Another common mechanism for running Spark SQL queries is through the spark-sql command-line interface (CLI). However, the CLI approach does not work well for interactive applications and does not provide the best user experience. It is possible to build a service that starts the spark-sql CLI as a shell command application on our YARN cluster from various clients.
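To make the spark-sql CLI approach discussed in this post more concrete, here is a minimal Python sketch of how a service might wrap one CLI invocation per query. It is an illustration under assumptions, not Pinterest's implementation: the function names, the `adhoc` queue name, and the timeout are hypothetical, and the flags used (`--master yarn`, `--queue`, `--name`, `-e`) are the standard spark-sql and spark-submit options.

```python
import subprocess
from typing import List, Optional


def build_spark_sql_command(query: str,
                            queue: str = "default",
                            app_name: Optional[str] = None) -> List[str]:
    """Build the argument list for one spark-sql CLI invocation on YARN.

    Hypothetical helper: the queue and application name are illustrative.
    """
    cmd = ["spark-sql", "--master", "yarn", "--queue", queue]
    if app_name:
        cmd += ["--name", app_name]
    # -e passes the SQL text inline; -f would read it from a file instead.
    cmd += ["-e", query]
    return cmd


def run_query(query: str, timeout_s: int = 600, **kwargs) -> str:
    """Run a query in its own CLI process and return its standard output.

    Assumes a spark-sql binary is on PATH and a YARN cluster is reachable.
    The call blocks until the process exits or the timeout expires.
    """
    cmd = build_spark_sql_command(query, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout_s, check=True)
    return result.stdout
```

Each query gets its own process in this sketch, so a failing query does not take others down with it. The caller only sees output once the process has exited, which hints at why a wrapper like this is awkward for interactive use.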