Introduction to Ibis

Explore large datasets in OmniSci with the full power of SQL, with a pandas-like API, and build a wide variety of visualizations by just writing python code

Introduction to Ibis

Tip: The Ibis open source project documentation is the best place to learn more, and stay up to date

Ibis allows data scientists and researchers to explore data in a variety of remote storage systems, with an API that is inspired by the extremely popular pandas toolkit.

In a nutshell, you can use Ibis to directly interact with OmniSciDB and several other supported SQL systems, by writing high level python code rather than lower-level SQL, making it convenient for Data Scientists working in Python to use familiar tools for increased productivity.

Here is a more detailed summary of what it provides users, taken directly from the Ibis website

  • Full coverage of SQL features: You can code in Ibis anything you can implement in a SQL SELECT

  • Transparent to SQL implementation differences: Write standard code that translate to any SQL syntax

  • High performance execution: Execute at the speed of your backend, not your local computer

  • Integration with community data formats and tools (e.g. pandas, Parquet, Avro...)

Ibis backends

A major advantage of Ibis is its support for several execution backends including common SQL databases and processing engines. This is powerful because it allows you to author analytical workflows or operations once, that can be run consistently against these backends. It also allows the possibility of creating interesting workflows that can simultaneously materialize data from multiple backends using and combine the resulting outputs in interesting ways.

Notable backends besides OmniSciDB include PySpark, Google BigQuery and PostgreSQL.

Ibis and OmniSciDB

Quick tour of the workflow

Here is a short example of what Ibis does. Inside a notebook cell, you can first connect to an OmniSciDB database instance:

con = ibis.omniscidb.connect(
    host='omniscidb',
    database='mapd',
    user='admin',
    password='••••••••••'
)

If you're launching JupyterLab from within Immerse, this connection is already set up for you based on your Immerse login and credentials (you don't need to do the above)

Next, let's identify a table and define a simple Ibis expression against it:

tablename = 'github' 
t = con.table(tablename)
count_expr = t.count()
count_expr.compile()

This expression is compiled by Ibis into SQL which runs against OmniSciDB:

SELECT count(*) AS "count"
FROM github

When executed the above SQL, as expected, counts the rows of the table:

print(f'Count of records in table {tablename} is {count_expr.execute()}')
Count of records in table github is 2033049065

Advanced Tip: If you want to write raw SQL within Ibis, you can still do so! A SQL query can be wrapped as an Ibis expression and composed together with other Ibis expressions.

Ibis offers a powerful way to compose sophisticated analytical expressions, and then leverage the processing power and scale of one or supported storage backends to evaluate them and return results. Here is a decidedly more complex Ibis expression and the SQL it generates. Note how you can leverage a full programming language (Python) to assemble complex expressions that cover SQL features like joins, filters and aggregates.

well_list = ['4201334812', '4201335379', '4201335112', '4201334947', '4201334329', '4201334965', '4201334978', '4201334998', '4201335467', '4201334917', '4201334930', '4212331826', '4212331164', '4212300017', '4212331782', '4212332076', '4212331839', '4228532777', '4228533451', '4212332354', '4212332207', '4212330405', '4212332014']

#Find oil well production for the specific well list above
sum_oil=t.oil_monthly.sum().log10().name('oil_monthly')
expr = (t.filter([t.api_no_10.isin(well_list), t.oil_monthly.notnull()])\
        .group_by(['date_prod', 'api_no_10']).having(sum_oil >= 0.)\
        .aggregate(sum_oil))\
.sort_by('date_prod')

# find the max production month per well in the wellset
max_oil = expr.oil_monthly.max().name('oil_monthly')
expr1 = expr.group_by('api_no_10').aggregate(max_oil)

expr2 = (expr.join(expr1, [expr.api_no_10 == expr1.api_no_10,\
                           expr.oil_monthly == expr1.oil_monthly]))\
        [expr.api_no_10.name("api"), expr.oil_monthly.name("om2"), expr.date_prod.name("dp2")]

print(expr2.compile())

which results in the automatically generated SQL below

WITH t0 AS (
  SELECT "date_prod", "api_no_10", log10(sum("oil_monthly")) AS oil_monthly
  FROM upstream_reservoir
  WHERE ("api_no_10" IN ('4201334998', '4212331782', '4212331826', '4201334965', '4212300017', '4201334329', '4228533451', '4212331164', '4201335467', '4201334930', '4212330405', '4201334947', '4212332354', '4201334978', '4201334917', '4201335379', '4201334812', '4212332207', '4228532777', '4201335112', '4212332076', '4212331839', '4212332014')) AND
        ("oil_monthly" IS NOT NULL)
  GROUP BY date_prod, api_no_10
  HAVING log10(sum("oil_monthly")) >= 0.0
  ORDER BY "date_prod"
)
SELECT t0."api_no_10" AS api, t0."oil_monthly" AS om2, t0."date_prod" AS dp2
FROM t0
  JOIN (
    SELECT "api_no_10", max("oil_monthly") AS oil_monthly
    FROM t0
    GROUP BY api_no_10
  ) t1
    ON (t0."api_no_10" = t1."api_no_10") AND
       (t0."oil_monthly" = t1."oil_monthly")

The result of the evaluation is by default, a pandas dataframe. This makes it convenient to use Ibis inside other python data analysis workflows.

expr2.execute()

api

om2

dp2

0

4201335379

3.944927

2016-11-01

1

4201335467

4.125253

2017-12-01

2

4201334998

3.822887

2014-09-01

3

4201334930

3.698709

2014-05-01

4

4201334917

4.353243

2014-06-01

5

4201334812

4.127299

2014-03-01

6

4201334947

3.715502

2014-06-01

7

4201334978

3.822887

2014-09-01

8

4201335112

4.010766

2015-01-01

9

4201334329

3.721068

2015-07-01

Tip: While Ibis can output to pandas, remember that the result is first materialized and may be transferred over the network to the client running Ibis (e.g. a Jupyter notebook in a browser on a laptop). It's a bad idea to use it for expressions that try to move a million rows to a browser.

Apache Arrow support

Ibis depends on pymapd as a low-level connector. This means it can leverage some of the key features of pymapd including the ability to output query results to the Arrow format, via the select_ipcandselect_ipc_gpucalls.

Building on this capability, the OmniSciDB backend for Ibis can output query results using Apache Arrow, to CPU or GPU memory (the latter, using the cudf dataframe library from Nvidia). This makes it convenient to use with the GPU-based RAPIDS Machine Learning methods. See the RAPIDS example for how you can build a seamless workflow integrating OmniSci and RAPIDs.

Geospatial operations

The OmniSciDB backend for Ibis supports geospatial functions building directly on OmniSciDB's SQL support for these functions. Further, these operations produce geopandas dataframes, allowing for convenient usage in python-based geospatial analytics workflows.

User Defined Functions

Using Ibis, you can create and use User Defined Functions (UDFs) in Python that execute inside OmniSciDB. Of particular note, this UDF framework leverages integration with Numba, a Just-in-Time (JIT) compiler backend for python and produces lower level code to provide a far more performant execution path than execution within Python itself.

It also makes it easy to author UDFs in Python, and have these UDFs then be usable within a SQL workflow.

Additional Resources

Refer to the following sections in the Ibis documentation for how to get started

Last updated