ggsql: A Grammar of Graphics for SQL

(opensource.posit.co)

174 points | by thomasp85 4 hours ago

19 comments

  • anentropic 3 hours ago
    Maybe I skim read it too fast, but I did not find any clear description in the blog post or website docs of how this relates to SQL databases

    I was kind of guessing that it doesn't run in a database, that it's a SQL-like syntax for a visualisation DSL handled by front end chart library.

    That appears to be what is described in https://ggsql.org/get_started/anatomy.html

    But then https://ggsql.org/faq.html has a section, "Can I use SQL queries inside the VISUALISE clause," which says, "Some parts of the syntax are passed on directly to the database".

    The homepage says "ggsql interfaces directly with your database"

    But it's not shown how that happens AFAICT

    confused

    • thomasp85 3 hours ago
      That is fair - it is somewhat of a special concept.

      ggsql connects directly with your database backend (if you wish - you can also run it with an in-memory DuckDB backend). Your visual query is translated into a SQL query for each layer of the visualisation and the resulting table is then used for rendering.

      E.g.

      VISUALISE page_views AS x FROM visits DRAW smooth

      will create a SQL query that calculates a smoothing kernel over the data and returns points along that. Those points are then used to create the final line chart

    • georgestagg 3 hours ago
      ggsql has the concept of a "reader", which can be thought of as the way ggsql interfaces with a SQL database. It handles the connection to the database and generating the correct dialect of SQL for that database.

      As an alpha, we support just a few readers today: duckdb, sqlite, and an experimental ODBC reader. We have largely been focusing development mainly around driving duckdb with local files, though duckdb has extensions to talk to some other types of database.

      The idea is that ggsql takes your visualisation query, and then generates a selection of SQL queries to be executed on the database. It sends these queries using the reader, then builds the resulting visualisation with the returned data. That is how we can plot a histogram from very many rows of data, the statistics required to produce a histogram are converted into SQL queries, and only a few points are returned to us to draw bars of the correct height.

      By default ggsql will connect to an in-memory duckDB database. If you are using the CLI, you can use the `--reader` argument to connect to files on-disk or an ODBC URI.

      If you use Positron, you can do this a little easier through its dedicated "Connections" pane, and the ggsql Jupyter kernel has a magic SQL comment that can be issued to set up a particular reader. I plan to expand a little more on using ggsql with these external tools in the docs soon.

      • nojito 2 hours ago
        Highly suggest leveraging adbc. I would love to use this against our bigquery tables.
    • password4321 3 hours ago
      Yes this was my question as well, an example showing all the plumbing/dependencies to generate a graph from an external database server would be very helpful.
      • thomasp85 3 hours ago
        We certainly plan to create a few videos showing how to set it up and use it. If you use it in Positron with the ggsql extension it can interact directly with the connection pane to connect to the various backends you have there
    • tantalor 2 hours ago
      > SQL databases ... confused

      "SQL" and "databases" are different things

      SQL is a declarative language for data manipulation. You can use SQL to query a database, but there's nothing special about databases. You can also write SQL to query other non-database sources like flat files, data streams, or data in a program's memory.

      Conversely, you can query a database without SQL.

      • johnthescott 4 minutes ago
        > Conversely, you can query a database without SQL.

        fond memories of quel.

  • getnormality 2 hours ago
    I skimmed the article for an explanation of why this is needed, what problem it solves, and didn't find one I could follow. Is the point that we want to be able to ask for visualizations directly against tables in remote SQL databases, instead of having to first pull the data into R data frames so we can run ggplot on it? But why create a new SQL-like language? We already have a package, dbplyr, that translates between R and SQL. Wouldn't it be more direct to extend ggplot to support dbplyr tbl objects, and have ggplot generate the SQL?

    Or is the idea that SQL is such a great language to write in that a lot of people will be thrilled to do their ggplots in this SQL-like language?

    EDIT: OK, after looking at almost all of the documentation, I think I've finally figured it out. It's a standalone visualization app with a SQL-like API that currently has backends for DuckDB and SQLite and renders plots with Vegalite. They plan to support more backends and renderers in the future. As a commenter below said, it's supposed to help SQL specialists who don't know Python or R make visualizations.

    • nchagnet 2 hours ago
      I was quite psyched when I read this so maybe I can tell you why it's interesting to me, although I agree the announcement could have done a better job at it.

      In my experience, the only thing data fields share is SQL (analysts, scientists and engineers). As you said, you could do the same in R, but your project may not be written in R, or Python, but it likely uses an SQL database and some engine to access the data.

      Also I've been using marimo notebooks a lot of analysis where it's so easy to write SQL cells using the background duckdb that plotting directly from SQL would be great.

      And finally, I have found python APIs for plotting to be really difficult to remember/get used to. The amount of boilerplate for a simple scatterplot in matplotlib is ridiculous, even with a LLM. So a unified grammar within the unified query language would be pretty cool.

    • nojito 2 hours ago
      It seems to be for sql users who don’t know python or r.
      • nchagnet 2 hours ago
        I would even add that it fits into a more general trend where operations are done within SQL instead of in a script/program which would use SQL to load data. Examples of this are duckdb in general, and BigQuery with all its LLM or ML functions.
  • lmeyerov 1 hour ago
    This is great

    We reached a similar conclusion for GFQL (oss graph dataframe query language), where we needed an LLM-friendly interface to our visualization & analytics stack, especially without requiring a code sandbox. We realized we can do quite rich GPU visual analytics pipelines with some basic extensions to opencypher . Doing SQL for the tabular world makes a lot of sense for the same reasons!

    For the GFQL version (OpenCypher), an example of data loading, shaping, algorithmic enrichment, visual encodings, and first-class pipelines:

    - overall pipelines: https://pygraphistry.readthedocs.io/en/latest/gfql/benchmark...

    - declarative visual encodings as simple calls: https://pygraphistry.readthedocs.io/en/latest/gfql/builtin_c...

  • nicoritschel 1 hour ago
    This is neat. I do wish there was a way for this to gracefully degrade in contexts without support for the grammar, though.

    I devised a similar in spirit (inside SQL, very simplified vs GoG) approach that does degrade (but doesn't read as nice): https://sqlnb.com/spec

    • thomasp85 1 hour ago
      I'm not quite sure I understand what you mean by "degrade in context" - care to elaborate?
      • nicoritschel 19 minutes ago
        If you're familiar with the percent format for jupyter notebooks, something like that— so things gracefully degrade in a more "basic" execution context.

        # %%

        foo = 1

        # %%

        print(foo)

        Above is notebook with two "cells" & also a valid Python script. Perhaps it matters less with SQL vs Python, but it's a nice property.

        • thomasp85 9 minutes ago
          Ah - I don't think it really matters here, but if you find yourself in need then you can open a GitHub issue and we can discuss
  • kasperset 2 hours ago
    Will this ever integrate rest of the ggplot2 dependent packages described here: https://exts.ggplot2.tidyverse.org/gallery/ in the near or distant future? Sorry if it already mentioned somewhere.
    • thomasp85 2 hours ago
      I don't think we will get the various niche geoms that have been developed by the ggplot2 community anytime soon.

      The point of this is not to superseed ggplot2 in any way, but to provide a different approach which can do a lot of the things ggplot2 can, and some that it can't. But ggplot2 will remain more powerful for a lot of tasks in many years to come I predict

  • jorin 1 hour ago
    Really cool project! Would love to see a standard established for representing visualizations in SQL! I built a whole dashboarding tool on top of the idea: https://taleshape.com/shaper/docs/getting-started/ But Shaper takes a more pragmatic approach and just uses built in functionality to describe how to visualize the results. The most value I see with viz as SQL is that it's a great format for LLMs to specify what they want while making it easy to audit and reproduce. Just built a slack bot on top of that concept last week: https://taleshape.com/blog/build-your-own-data-analytics-sla...
  • efromvt 3 hours ago
    Love the layering approach - that solves a problem I’ve had with other sql/visual hybrids as you move past the basics charts.
  • thomasp85 4 hours ago
    The new visualisation tool from Posit. Combines SQL with the grammar of graphics, known from ggplot2, D3, and plotnine
    • zcw100 3 hours ago
      Don't forget Vega! https://vega.github.io/vega/
    • thomasp85 3 hours ago
      I'm one of the authors - happy to take any questions!
    • mi_lk 3 hours ago
      I don't think D3 uses grammar of graphics model?
      • thomasp85 3 hours ago
        I'd say it does, though it is certainly much more low-level than e.g. ggplot2. But the basic premises of the building blocks described be Leland Wilkinson is there
  • jiehong 1 hour ago
    The cli only produces vega-lite[0] json graphics, right?

    It would be nice if it included a rendering engine.

    [0]: https://github.com/vega/vega-lite

    • thomasp85 59 minutes ago
      That is certainly in the pipeline. We chose to start with vegalite so we could focus on the internals of the representation
  • jiehong 1 hour ago
    Outstanding!

    This can replace a lot of Excel in the end.

    It makes so much sense now that it exists!

  • kasperset 3 hours ago
    Looks intriguing. Brings plotting to Sql instead of “transforming” sql for plotting.
  • gh5000 2 hours ago
    It is conceivable that this could become a duckdb extension, such that it can be used from within the duckdb CLI? That would be pretty slick.
    • thomasp85 2 hours ago
      That is conceivable, not a top priority as we want to focus on this being a great experience for every backend, but certainly something we are thinking of
      • rustyconover 1 hour ago
        With the new PEG parser this is just a Claude session away in DuckDB.
  • radarsat1 3 hours ago
    Wow, love this idea.
  • data_ders 2 hours ago
    ok, this is definitely up my alley. color me nerd-sniped and forgive the onslaught of questions.

    my questions are less about the syntax, which i'm largely familiar with knowing both SQL and ggplot.

    i'm more interested in the backend architecture. Looking at the Cargo.toml [1], I was surprised to not see a visualization dependency like D3 or Vega. Is this intentional?

    I'm certainly going to take this for a spin and I think this could be incredible for agentic analytics. I'm mostly curious right now what "deployment" looks like both currently in a utopian future.

    utopia is easier -- what if databases supported it directly?!? but even then I think I'd rather have databases spit out an intermediate representation (IR) that could be handed to a viz engine, similar to how vega works. or perhaps the SQL is the IR?!

    another question that arises from the question of composability: how distinct would a ggplot IR be from a metrics layer spec? could i use ggsql to create an IR that I then use R's ggplot to render (or vise versa maybe?)

    as for the deployment story today, I'll likely learn most by doing (with agents). My experiment will be to kick off an agent to do something like: extract this dataset to S3 using dlt [2], model it using dbt [3], then use ggsql to visualize.

    p.s. @thomasp85, I was a big fan of tidygraph back in the day [4]. love how small our data world is.

    [1]: https://github.com/posit-dev/ggsql/blob/main/Cargo.toml

    [2]: https://github.com/dlt-hub/dlt

    [3]: https://github.com/dbt-labs/dbt-fusion

    [4]: https://stackoverflow.com/questions/46466351/how-to-hide-unc...

    • thomasp85 2 hours ago
      Let me try to not miss any of the questions :-)

      ggsql is modular by design. It consists of various reader modules that takes care of connecting with different data backends (currently we have a DuckDB, an SQLite, and an ODBC reader), a central plot module, and various writer modules that take care of the rendering (currently only Vegalite but I plan to write my own renderer from scratch).

      As for deployment I can only talk about a utopian future since this alpha-release doesn't provide much tangible in that area. The ggsql Jupyter kernel already allows you to execute ggsql queries in Jupyter and Quarto notebooks, so deployment of reports should kinda work already, though we are still looking at making it as easy as possible to move database credentials along with the deployment. I also envision deployment of single .ggsql files that result in embeddable visualisations you can reference on websites etc. Our focus in this area will be Posit Connect in the short term

      I'm afraid I don't know what IR stands for - can you elaborate?

      • stevedh 1 hour ago
        Intermediate Representation
        • thomasp85 11 minutes ago
          Ah - yes, in theory you could create a "ggplot2 writer" which renders the plot object to an R file you can execute. It is not too far away from the current Vega-Lite writer we use. The other direction (ggplot2->ggsql) is not really feasible
  • breakfastduck 1 hour ago
    This is fantastic. Feels like something that should've been in there from the start!
  • rvba 1 hour ago
    1) does this alllw to export to Excel?

    2) how to make manual adjustments?

    • thomasp85 1 hour ago
      My answers will probably disappoint

      1) No (unless you count 'render to image and insert that into your excel document') 2) This is not possible - manual adjustments are not reproducible and we live by that ethos

  • hei-lima 2 hours ago
    Really cool!
  • dartharva 2 hours ago
    Would be awesome if somehow coupled into Evidence.dev