DDL - Tables
These functions are used to create and modify data tables in OmniSci.
Table names must use the NAME format, described in regex notation as:
CREATE TABLE

Create a table named <table>, specifying <columns> and table properties. Table and column names cannot include quotes, spaces, or special characters.
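A sketch of the general form, reconstructed from the clauses described on this page (bracketed elements are optional):

```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] <table> (
  <column> <type> [NOT NULL] [DEFAULT <value>] [ENCODING <encodingSpec>],
  [SHARD KEY (<column>),]
  [SHARED DICTIONARY (<column>) REFERENCES <table>(<column>),]
  ...)
[WITH (<property> = <value>, ...)];
```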
* In OmniSci release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING FIXED(16)).
For more information, see Datatypes and Fixed Encoding.
For geospatial datatypes, see Geospatial Primitives.
Create a table named tweets and specify the columns, including type, in the table:
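A sketch of such a statement; the column list here is illustrative rather than the full example from the original page:

```sql
CREATE TABLE tweets (
  tweet_id BIGINT NOT NULL,
  tweet_time TIMESTAMP NOT NULL ENCODING FIXED(32),
  sender_name TEXT NOT NULL ENCODING DICT,
  source TEXT ENCODING DICT,
  lang TEXT ENCODING DICT,
  tweet_text TEXT ENCODING NONE,
  lat FLOAT,
  lon FLOAT);
```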
Create a table named customers that accepts a default value of the current time for the time-created column:
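A minimal sketch, assuming illustrative column names:

```sql
CREATE TABLE customers (
  accountId TEXT,
  name TEXT,
  timeCreated TIMESTAMP DEFAULT NOW());
```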
WITH Clause Properties

Sharding

Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.
Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.
Currently, specifying a shard key is useful only for joins:
If two tables specify a shard key with the same type and the same number of shards, a join on that key only sends a part of the dimension table column data to each GPU.
For multi-node installs, the dimension table does not need to be replicated and the join executes locally on each leaf.
A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.
One shard key can be specified for a table.
Data are partitioned according to the shard key and the number of shards (shard_count).
A value in the column specified as a shard key is always sent to the same partition.
The number of shards should be equal to the number of GPUs in the cluster.
Sharding is allowed on the following column types:
DATE
INT
TEXT ENCODING DICT
TIME
TIMESTAMP
Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and only dropped when the last reference drops.
Set shard_count to the number of GPUs you eventually want to distribute the data table across. Referenced tables must also be shard_count-aligned.
Sharding should be minimized because it can introduce load skew across resources, compared to when sharding is not used.
Examples
Basic sharding:
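A sketch, assuming an illustrative customers table and a four-GPU configuration:

```sql
CREATE TABLE customers (
  accountId TEXT,
  name TEXT,
  SHARD KEY (accountId))
WITH (shard_count = 4);
```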
Sharding with shared dictionary:
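A sketch that shards a second table on the same key and shares its dictionary with the customers table above:

```sql
CREATE TABLE transactions (
  accountId TEXT,
  action TEXT,
  SHARD KEY (accountId),
  SHARED DICTIONARY (accountId) REFERENCES customers(accountId))
WITH (shard_count = 4);
```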
Using the TEMPORARY argument creates a table that persists only while the server is live. Temporary tables are useful for storing intermediate result sets that you access more than once.
Temporary tables do not support updates or deletes.
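A minimal sketch with hypothetical column names:

```sql
CREATE TEMPORARY TABLE staging_results (
  id INTEGER,
  score FLOAT);
```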
CREATE TABLE AS SELECT
Create a table with the specified columns, copying any data that meets the SELECT statement criteria.
WITH Clause Properties

Create the table newTable. Populate it with all data from the table oldTable, effectively creating a duplicate of the original table:
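A sketch of the statement:

```sql
CREATE TABLE newTable AS (SELECT * FROM oldTable);
```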
Create a table named trousers. Populate it with data from the columns name, waist, and inseam from the table wardrobe:
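A sketch:

```sql
CREATE TABLE trousers AS (SELECT name, waist, inseam FROM wardrobe);
```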
Create a table named cosmos. Populate it with data from the columns star and planet from the table universe, where planet has the class M:
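A sketch, assuming the class is stored in a column named class:

```sql
CREATE TABLE cosmos AS (SELECT star, planet FROM universe WHERE class = 'M');
```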
ALTER TABLE
Rename the table tweets to retweets.
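A sketch:

```sql
ALTER TABLE tweets RENAME TO retweets;
```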
Rename the column source to device in the table retweets.
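A sketch:

```sql
ALTER TABLE retweets RENAME COLUMN source TO device;
```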
Add the column lang to the table tweets using a TEXT ENCODING DICTIONARY.
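A sketch:

```sql
ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT;
```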
Add the columns lang and encode to the table tweets using a TEXT ENCODING DICTIONARY for each.
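A sketch:

```sql
ALTER TABLE tweets ADD (lang TEXT ENCODING DICT, encode TEXT ENCODING DICT);
```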
Add the column pt_dropoff to the table tweets with a default value point(0,0).
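A sketch (note the geo-column restriction below; this form assumes a release in which adding POINT columns with defaults is supported):

```sql
ALTER TABLE tweets ADD COLUMN pt_dropoff POINT DEFAULT 'POINT(0 0)';
```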
You cannot add a dictionary-encoded string column with a shared dictionary when using ALTER TABLE ADD COLUMN.
Currently, OmniSci does not support adding a geo column type (POINT, LINESTRING, POLYGON, or MULTIPOLYGON) to a table.
OmniSci supports ALTER TABLE RENAME TABLE and ALTER TABLE RENAME COLUMN for temporary tables. OmniSci does not support ALTER TABLE ADD COLUMN to modify a temporary table.
DROP TABLE
Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary. (See the Note regarding disk space reclamation.)
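A sketch:

```sql
DROP TABLE IF EXISTS tweets;
```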
DUMP TABLE
Archives data and dictionary files of the table <table> to the file <filepath>.

Valid values for <compression_program> include:
gzip (default)
lz4
none
If you do not choose a compression option, the system uses gzip if it is available. If gzip is not installed, the file is not compressed.
The file path must be enclosed in single quotes.
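A sketch with an illustrative archive path:

```sql
DUMP TABLE tweets TO '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION = 'gzip');
```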
Dumping a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being dumped.
The DUMP command is not supported on distributed configurations.

You must have at least GRANT CREATE ON DATABASE privilege level to use the DUMP command.
RESTORE TABLE
Restores data and dictionary files of table <table> from the file at <filepath>. If you specified a compression program when you used the DUMP TABLE command, you must specify the same compression method during RESTORE.
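A sketch, mirroring the DUMP TABLE example above:

```sql
RESTORE TABLE tweets FROM '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION = 'gzip');
```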
Restoring a table decompresses and then reimports the table. You must have enough disk space for both the new table and the archived table, as well as enough scratch space to decompress the archive and reimport it.
The file path must be enclosed in single quotes.
Restoring a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being restored.
The RESTORE command is not supported on distributed configurations.

You must have at least GRANT CREATE ON DATABASE privilege level to use the RESTORE command.
TRUNCATE TABLE
Use the TRUNCATE TABLE statement to remove all rows from a table without deleting the table structure.
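A sketch:

```sql
TRUNCATE TABLE tweets;
```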
This releases the table's on-disk and memory storage and removes dictionary content unless it is a shared dictionary. (See the note regarding disk space reclamation.)
Removing rows with TRUNCATE TABLE is more efficient than dropping and recreating the table: dropping followed by recreating invalidates the table's dependent objects and requires you to regrant object privileges, while truncating has none of these effects.
When you DROP or TRUNCATE, the command returns almost immediately. The directories to be purged are marked with the suffix _DELETE_ME_. The files are removed automatically and asynchronously.

In practical terms, this means that you will not see a reduction in disk usage until the automatic task runs, which might not start for up to five minutes.

You might also see directory names appended with _DELETE_ME_. You can ignore these, with the expectation that they will be deleted automatically over time.
OPTIMIZE TABLE
Use this statement to remove rows from storage that have been marked as deleted via DELETE statements.
When run without the vacuum option, the column-level metadata is recomputed for each column in the specified table. OmniSciDB makes heavy use of metadata to optimize query plans, so optimizing table metadata can increase query performance after metadata widening operations such as updates or deletes. OmniSciDB does not narrow metadata during an update or delete — metadata is only widened to cover a new range.
When run with the vacuum option, it removes any rows marked "deleted" from the data stored on disk. Vacuum is a checkpointing operation, so new copies of any vacuum records are deleted. OmniSciDB currently does not remove old (pre-checkpoint) fragments after vacuuming, so you might notice your data directory growing.
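A sketch of both forms:

```sql
-- Recompute column-level metadata only:
OPTIMIZE TABLE tweets;

-- Also remove rows marked as deleted from the data stored on disk:
OPTIMIZE TABLE tweets WITH (VACUUM = 'true');
```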
VALIDATE CLUSTER
Perform checks and report discovered issues on a running OmniSci cluster. Compare metadata between the aggregator and leaves to verify that the logical components between the processes are identical.
If VALIDATE CLUSTER detects issues, it returns a report similar to the following:
If no issues are detected, it will report as follows:
You can include the WITH(REPAIR_TYPE) argument. (REPAIR_TYPE='NONE') is the same as running the command with no argument. (REPAIR_TYPE='REMOVE') removes any leaf objects that have issues. For example:
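Based on the argument form just described:

```sql
VALIDATE CLUSTER WITH (REPAIR_TYPE = 'REMOVE');
```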
Datatypes

| Datatype | Size (bytes) | Notes |
|---|---|---|
| BIGINT | 8 | Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807. |
| BOOLEAN | 1 | TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive. |
| DATE* | 4 | Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Range in years: +/-5,883,517 around epoch; maximum date January 1, 5885487 (approximately). Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DATE ENCODING FIXED(32) | 4 | Default encoding, equivalent to DATE. Range in years: +/-5,883,517 around epoch; maximum date January 1, 5885487 (approximately). Values range from -185542587187200 to 185542587100800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DATE ENCODING FIXED(16) | 2 | Range: -32,768 to 32,767. Range in years: +/-90 around epoch (April 14, 1880 - September 9, 2059). Values range from -2831155200 to 2831068800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DECIMAL | 2, 4, or 8 | Takes precision and scale parameters: DECIMAL(precision,scale). Size depends on precision: up to 4, 2 bytes; 5 to 9, 4 bytes; 10 to 18 (maximum), 8 bytes. Scale must be less than precision. |
| DOUBLE | 8 | Variable precision. Minimum value: -1.79 x 10^308; maximum value: 1.79 x 10^308. |
| FLOAT | 4 | Variable precision. Minimum value: -3.4 x 10^38; maximum value: 3.4 x 10^38. |
| INTEGER | 4 | Minimum value: -2,147,483,647; maximum value: 2,147,483,647. |
| SMALLINT | 2 | Minimum value: -32,767; maximum value: 32,767. |
| TEXT ENCODING DICT | 4 | Maximum cardinality of 2 billion distinct string values. |
| TEXT ENCODING NONE | Variable | Size of the string + 6 bytes. |
| TIME | 8 | Minimum value: 00:00:00; maximum value: 23:59:59. |
| TIMESTAMP | 8 | Linux timestamp from -30610224000 (1/1/1000 00:00:00.000) through 29379542399 (12/31/2900 23:59:59.999). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (the T is dropped when the field is populated). |
| TINYINT | 1 | Minimum value: -127; maximum value: 127. |
Encodings

| Encoding | Description |
|---|---|
| DICT | Dictionary encoding on string columns (default for TEXT columns). Limit of 2 billion unique string values. |
| FIXED (bits) | Fixed-length encoding of integer or timestamp columns. See Datatypes and Fixed Encoding. |
| NONE | No encoding. Valid only on TEXT columns. No dictionary is created. Aggregate operations are not possible on this column type. |
WITH clause properties for CREATE TABLE:

| Property | Description |
|---|---|
| fragment_size | Number of rows per fragment, which is a unit of the table for query processing. Default: 32 million rows; not expected to be changed. |
| max_rows | Used primarily for streaming datasets to limit the number of rows in a table, to avoid running out of memory or impeding performance. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. In a distributed system, the maximum number of rows is calculated as max_rows * leaf_count. In a sharded distributed system, the maximum number of rows is calculated as max_rows * shard_count. |
| page_size | Number of I/O page bytes. Default: 1MB; does not need to be changed. |
| partitions | Partition strategy option: SHARDED (partition table using sharding) or REPLICATED (partition table using replication). |
| shard_count | Number of shards to create, typically equal to the number of GPUs across which the data table is distributed. |
| sort_column | Name of the column on which to sort during bulk import. |
WITH clause properties for CREATE TABLE AS SELECT:

| Property | Description |
|---|---|
| fragment_size | Number of rows per fragment, which is a unit of the table for query processing. Default: 32 million rows; not expected to be changed. |
| max_chunk_size | Size of a chunk, which is a unit of the table for query processing. Default: 1073741824 bytes (1 GB); not expected to be changed. |
| max_rows | Used primarily for streaming datasets to limit the number of rows in a table. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. |
| page_size | Number of I/O page bytes. Default: 1MB; does not need to be changed. |
| partitions | Partition strategy option: SHARDED (partition table using sharding) or REPLICATED (partition table using replication). |
| vacuum | Formats the table to more efficiently handle DELETE requests. The only parameter available is delayed. Rather than immediately removing deleted rows, vacuum marks items to be deleted, and they are removed at an optimal time. |