Data is crucial to many software systems and applications, and there are of course many ways to store, manage and query it. Different database models and approaches have been developed over the years, including relational and non-relational models, client/server and serverless architectures, and more. Relational database systems following the client/server model and using a standard query language are probably the most popular approach, and one of the most popular such systems is MySQL. It’s open-source, widely used and easy to get started with, but not without criticism.1 Non-relational database models are also becoming increasingly popular. They can address issues with the large amounts of data produced by people and systems, and with data structures that might not be known up front during the design phase.

But relational database systems are probably not going away anytime soon. An interesting alternative to MySQL (or its binary-compatible fork MariaDB) is PostgreSQL (or just “Postgres”).

So, what is Postgres and why is it interesting?

The Postgres project has great manuals that contain everything you will want to know, and a lot of the research for this material is based on them. The most recent version (as of this writing) says:

PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department. POSTGRES pioneered many concepts that only became available in some commercial database systems much later. PostgreSQL is an open-source descendant of this original Berkeley code.

While not necessary, I think it’s often good to know something about the origins and history of the software and products you use. By the way, POSTGRES Version 4.2 is still available.2 So Postgres is object-relational, open-source, was originally developed at Berkeley and pioneered software and concepts in the database space.

My original reasons for becoming interested in learning more about Postgres were twofold:

  1. Great built-in support for full-text search
  2. Great extension for geospatial data

But before we can even get to those things, we have to learn a little more about the Postgres basics.

Terminology and data organization

To store and manage data we know we want a “database” that we can then put things into. But let’s step back and look at the big picture and the concepts.

Database clusters

Postgres follows the client/server model. A Postgres server runs on a machine and listens and responds to client requests over the network. The server instance needs to keep track of all the data it manages, and this collection of data is called a database cluster. This is what the manual says about clusters:

Before you can do anything, you must initialize a database storage area on disk. We call this a database cluster. (SQL uses the term catalog cluster.) A database cluster is a collection of databases that is managed by a single instance of a running database server.

A cluster can be created anywhere on the file system, and when the Postgres server is started you simply point it to that location so that it uses the data stored there. This is straightforward and gives you a feeling of control: create a database cluster somewhere on the file system and point the server to that location when starting it. If you want to serve data from a different cluster, just point the server to that cluster’s location instead.

The practical details are covered below. Let’s stick to the concepts for a little while longer.

Databases, schemas and tables

Tables are of course the relational structures that ultimately contain our data. Tables have columns that define the structure of the data and each row contains a single data record.

What about databases and schemas? What are they really and how do they relate? First, it turns out that other popular systems, such as MySQL, do not differentiate between databases and schemas. There, a database and a schema are the same thing. However, Postgres makes the distinction. Let’s see if we can figure it out. The manual again:

Every instance of a running PostgreSQL server manages one or more databases. Databases are therefore the topmost hierarchical level for organizing SQL objects (“database objects”) […] A database is a named collection of SQL objects (“database objects”). Generally, every database object (tables, functions, etc.) belongs to one and only one database. (However there are a few system catalogs, for example pg_database, that belong to a whole cluster and are accessible from each database within the cluster.) More accurately, a database is a collection of schemas and the schemas contain the tables, functions, etc. So the full hierarchy is: server, database, schema, table (or some other kind of object, such as a function).

Ok, so an easy way to think about the overall structure is: server -> database -> schema -> table.

When a client connects to the database server only one database can be accessed for that session (one database per connection). So, unrelated data that should be kept separate should probably go into separate databases. Regarding related data, the manual says:

If the projects or users are interrelated and should be able to use each other’s resources, they should be put in the same database but possibly into separate schemas. Schemas are a purely logical structure and who can access what is managed by the privilege system.

Schemas are logical structures? We have to search a little more to get a clearer explanation:

A schema is essentially a namespace: it contains named objects (tables, data types, functions, and operators) whose names can duplicate those of other objects existing in other schemas. Named objects are accessed either by “qualifying” their names with the schema name as a prefix, or by setting a search path that includes the desired schema(s). A CREATE command specifying an unqualified object name creates the object in the current schema (the one at the front of the search path, which can be determined with the function current_schema).

So, schemas are ways of organizing named objects (such as tables), and named objects are defined within schemas. To refer to a named object, either use the fully qualified name including the schema ([schema name].[table name]), or use an unqualified name and make sure its schema is on the search path used to resolve names. In addition, if a named object is created without full qualification, the object is created in the “current” schema, which is the first existing schema on the search path.
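
A small sketch may make this concrete (the schema and table names here are just placeholders):

user=# CREATE SCHEMA myschema;
user=# CREATE TABLE myschema.mytable (id INTEGER);
user=# SELECT * FROM myschema.mytable;
user=# SET search_path TO myschema, public;
user=# SELECT * FROM mytable;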

We now have a better understanding of clusters, databases, schemas and tables and how they all relate.

Users, roles and authorization

Most database management systems come with built-in support for authorization configuration. That is, we want to be able to define who can access and modify which parts of our database. This is usually important to understand because chances are you will have to use it sooner or later. Postgres has its own way of handling this, which you may or may not be familiar with.

To handle authorization, Postgres makes use of the concept of roles. Let’s see what the manual says:

PostgreSQL manages database access permissions using the concept of roles. A role can be thought of as either a database user, or a group of database users, depending on how the role is set up. Roles can own database objects (for example, tables) and can assign privileges on those objects to other roles to control who has access to which objects. Furthermore, it is possible to grant membership in a role to another role, thus allowing the member role to use privileges assigned to another role.

The concept of roles subsumes the concepts of “users” and “groups”. In PostgreSQL versions before 8.1, users and groups were distinct kinds of entities, but now there are only roles. Any role can act as a user, a group, or both.

Unlike in some systems, there is no separate concept of a “user” or a “group” of users, just roles. But as mentioned, roles can be seen from different perspectives—they are flexible. You can create a role that represents a real-world user, or you can create a role that represents a real-world group and then have “user roles” become members of a “group role”. This makes it easy to manage a set of real-world users with similar security privileges.
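
As a quick sketch of this idea (the role names are made up; the actual commands are covered in the practical section below):

user=# CREATE ROLE readers;                 -- a "group role"
user=# CREATE ROLE alice LOGIN;             -- a "user role"
user=# CREATE ROLE bob LOGIN;
user=# GRANT readers TO alice, bob;         -- alice and bob can now use the privileges of readers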

Roles can also have attributes that define their privileges. The important ones seem to be:

  • Login privilege – Roles must have this attribute to be used as the initial role for a database connection. Roles with this attribute can hence be seen as “database users”.
  • Superuser status – Roles with this attribute bypass all permission checks and should be used sparingly.
  • Database creation – Needed to create databases.
  • Role creation – Needed to create roles.
  • Password – Associates a password with the role.

The attributes of a role can be changed after the role has been created.
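
For example, with ALTER ROLE (the role name is a placeholder):

user=# ALTER ROLE name CREATEDB;
user=# ALTER ROLE name NOSUPERUSER;
user=# ALTER ROLE name PASSWORD 'new-password';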

Practical basics

The first step of trying things out for yourself is of course to install Postgres on your system. There are probably some great installers and helper applications for your platform that can get you up and running with little or no hassle. For Mac OS X there is for example the excellent Postgres.app application. Please refer to the Postgres manual for detailed instructions for installing from the source code, or search the Web for platform-specific installers. Postgres might even come pre-installed on your system.

We’ll go through some of the practical aspects by touching on all the concepts discussed in the terminology and data organization section above.

Database clusters

If we start from scratch we first need to create a database cluster. If you install Postgres from a platform-specific installer or package, a default database cluster is probably created for you automatically. However, it can be good to see how it’s done (and how easy it is!). A cluster is created by issuing the following command in your terminal:3

> initdb -D /usr/local/pgsql/data

The path given after the -D parameter specifies the location of the cluster data directory.

Once the cluster has been initialized we can start the Postgres server by specifying the location of the cluster data directory:

> postgres -D /usr/local/pgsql/data

You probably want to start the server in a background process. This can easily be accomplished with the postgres command and standard Unix-style shell magic, but Postgres also provides the following simple command:

> pg_ctl start -D /usr/local/pgsql/data -l logfile 

The logfile file will contain any logs generated by the server, and again the value of the -D parameter specifies the path to the cluster data directory.

To stop the server simply execute:

> pg_ctl stop -D /usr/local/pgsql/data 
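
The pg_ctl command can also, for example, report the status of a running server and restart it:

> pg_ctl status -D /usr/local/pgsql/data
> pg_ctl restart -D /usr/local/pgsql/data -l logfile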

If you install Postgres using some existing package you will most likely be provided with simple “start” and “stop” scripts or commands that automatically handle the location of the cluster data directory.4

Connect to the server

To connect to the database from the command line we use the psql command:

 > psql 
 > psql -U user 

To connect as a particular user, simply use the -U flag and provide the user name. I know we said that there are no users, only roles. By a user we here mean a role with the login privilege attribute set. Only roles with this attribute can create connections against the database.
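
You can also specify which database to connect to with the -d flag (user and dbname are placeholders here):

> psql -U user -d dbname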

When this is done we will have a psql prompt into which we can enter commands:

user=# _

A password is not always needed to connect and start the prompt. If a password is required for the role you connect as, the system will prompt for it.

Manage roles and privileges

A fresh Postgres installation always comes with a pre-defined “superuser” role. The role by default has the same name as the system user that initialized the database cluster.

To create a new role simply use the CREATE ROLE command:

user=# CREATE ROLE name;

The role attributes can also be specified when creating the role; the corresponding keywords are LOGIN, SUPERUSER, CREATEDB, CREATEROLE and PASSWORD.

Here are some examples:

user=# CREATE ROLE name LOGIN;
user=# CREATE ROLE name LOGIN SUPERUSER;
user=# CREATE ROLE name LOGIN PASSWORD 'password';

A role with the LOGIN attribute can be seen as a user, so we can also do:

user=# CREATE USER name;
user=# CREATE USER name PASSWORD 'password';

To remove a role just say:

user=# DROP ROLE name;

To list all roles you can query the system-defined pg_roles view:

user=# SELECT rolname FROM pg_roles;

If you want to see more details about the roles, for example their attributes, you can do:

user=# SELECT rolname, rolcanlogin, rolsuper FROM pg_roles;
user=# SELECT * FROM pg_roles;

There is a separate view called pg_user that contains the roles with the LOGIN attribute:

user=# SELECT usename FROM pg_user;
user=# SELECT * FROM pg_user;

To handle memberships and let a role become a member of another role use the GRANT and REVOKE commands:

user=# GRANT group_role TO role1, role2; 
user=# REVOKE group_role FROM role1, role2; 

Privileges on named objects are also handled using the GRANT and REVOKE commands:

user=# GRANT INSERT ON table TO role;
user=# REVOKE UPDATE ON table FROM role; 

Different privileges can be granted and revoked for different kinds of named objects, for example tables and databases. For tables the privileges include SELECT, INSERT, UPDATE, DELETE etc. For databases they include CREATE, CONNECT etc. However, the ALL option can always be used:

user=# GRANT ALL ON DATABASE postgis TO dbuser;

There is also a special name PUBLIC that can be used to refer to every role on the system, e.g.:

user=# GRANT INSERT ON table TO PUBLIC;

Manage databases

Before we can create schemas and tables and populate them with data we need to create a database. To see which databases exist in the cluster you can issue this command:

user=# SELECT datname FROM pg_database;

To create a new database we issue the following SQL statement:

user=# CREATE DATABASE dbname;

The active role that creates the database automatically becomes the owner of the new database. To create a database owned by someone else the following can be done:

user=# CREATE DATABASE dbname OWNER rolename; 

Further information from the manual:

Since you need to be connected to the database server in order to execute the CREATE DATABASE command, the question remains how the first database at any given site can be created. The first database is always created by the initdb command when the data storage area is initialized. […] This database is called postgres. So to create the first “ordinary” database you can connect to postgres.

The following from the manual is interesting regarding the process of creating databases and the use of cloning:

A second database, template1, is also created during database cluster initialization. Whenever a new database is created within the cluster, template1 is essentially cloned. This means that any changes you make in template1 are propagated to all subsequently created databases. Because of this, avoid creating objects in template1 unless you want them propagated to every newly created database.

So, to summarize: databases are created by cloning existing databases. The database template1 created during cluster initialization is used for this purpose. If you actually want to change how databases are created for a given installation, you can (with caution) modify template1. If you list all the databases defined in a cluster you will also see a database named template0. That database should never be modified once the database cluster has been initialized; it can instead be used to create so-called “virgin” databases that contain none of the potential site-specific changes made to template1.
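
For example, to create a “virgin” database based on template0 instead of template1, something like this can be used (dbname is a placeholder):

user=# CREATE DATABASE dbname TEMPLATE template0;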

Postgres also provides a convenience command for creating a database that you can execute from the command line:

> createdb dbname

To create the database from the command line with a specific role as owner:

> createdb -O rolename dbname

To delete a database, use the command:

> dropdb dbname 

or connect to the server and issue:

user=# DROP DATABASE dbname; 

Schemas and the search path

Let’s take a quick look at how unqualified names are resolved, and the search path that controls this. Every named object (e.g. a table) lives within a schema. The schema works as a namespace and is simply a logical way of organizing database objects.

There is a default schema called public that is created during cluster initialization. The current schema can be queried using the current_schema function:

user=# SELECT current_schema;

 current_schema 
----------------
 public
(1 row)

The current schema is determined using a search path, which is a list of schema names. The search path can be queried like this:

user=# SHOW search_path;

   search_path   
-----------------
 "$user", public
(1 row)

The result is a list of schema names (that may or may not exist). Schema names that do not exist are simply skipped. The order is important for two reasons:

  1. To determine the order of searching for named objects.
  2. To determine the default schema to use when an object (e.g. a table) is created and a fully qualified name is not used.

The search path can easily be changed as demonstrated by the following commands and responses:

user=# CREATE SCHEMA myschema;
CREATE SCHEMA
user=# SET search_path TO myschema,public;
SET
user=# SHOW search_path;
   search_path   
-----------------
 myschema, public
(1 row)

The public schema is nothing special and can be removed.
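
For example, dropping it could look like one of the following:

user=# DROP SCHEMA public;          -- fails if the schema still contains objects
user=# DROP SCHEMA public CASCADE;  -- also drops any objects contained in the schema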

Tables and data

After we create a database, and optionally a schema, and connect to that database we can finally create our data tables. As you probably already know, tables consist of rows and columns. The columns define the structure of the data while each row is a data record. Tables are created using standard SQL:

user=# CREATE TABLE book (book_id INTEGER, title TEXT, author TEXT);

We can insert data:

user=# INSERT INTO book VALUES (1, 'Tiger Land', 'John Doe');
user=# INSERT INTO book VALUES (2, 'Coconut Island', 'Mary Lonesome');

and query the data:

user=# SELECT * FROM book; 
book_id |      title      |    author     
--------+-----------------+---------------
      1 | Tiger Land      | John Doe
      2 | Coconut Island  | Mary Lonesome
(2 rows)
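
Rows can be filtered, updated and deleted with standard SQL as well, for example:

user=# SELECT title FROM book WHERE author = 'John Doe';
user=# UPDATE book SET title = 'Tiger Island' WHERE book_id = 1;
user=# DELETE FROM book WHERE book_id = 2;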

Tables are removed using DROP TABLE:

user=# DROP TABLE book; 
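
In practice you would typically also define constraints when creating a table; a small sketch of the book table with a primary key and a NOT NULL constraint might look like this:

user=# CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT);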

Conclusions

We have of course just scratched the surface of Postgres, but hopefully this cleared up some of the very basic concepts. From here on it will be easier to dig into some of the more complex features that Postgres has to offer.

Again, the Postgres manual is really well written and provides everything you will want to know. Go take a look and keep it handy.

PS. Postgres provides some useful meta-commands that can be used at the psql prompt.
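
For example, \l lists the databases in the cluster, \dn lists schemas, \dt lists tables, \du lists roles and \q quits the prompt:

user=# \l
user=# \dn
user=# \dt
user=# \du
user=# \q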

  1. http://grimoire.ca/mysql/choose-something-else

  2. http://db.cs.berkeley.edu/postgres.html

  3. The process which is executing the command must have necessary permissions to create and write to the specified directory.

  4. To find the cluster data directory for an existing cluster you can issue the following SQL query: SELECT setting FROM pg_settings WHERE name = 'data_directory'; or the shorter alternative giving you the same result: SHOW data_directory;