
Postgres as a Queue: When You Don't Need Kafka or RabbitMQ

Build a production-grade task queue entirely in PostgreSQL using SELECT FOR UPDATE SKIP LOCKED, LISTEN/NOTIFY, exponential backoff retries, and dead-letter handling. Covers PGMQ, Graphile Worker, River, and when a dedicated broker actually earns its keep.

Abhishek Patel · 14 min read



You Probably Don't Need a Message Broker

Every time a team adds Kafka or RabbitMQ to process background jobs, they're signing up for another cluster to operate, another protocol to learn, and another failure mode to debug at 3 AM. Meanwhile, PostgreSQL -- the database already running their application -- ships with all the primitives needed to build a reliable task queue: row-level locking with SKIP LOCKED, asynchronous notifications with LISTEN/NOTIFY, and advisory locks for coordination. For the vast majority of workloads, a Postgres queue handles 1,000 to 10,000 jobs per second on modest hardware. That covers background emails, webhook deliveries, report generation, image processing pipelines, and most async workflows you'll encounter below the scale of a large social network.

This guide walks through building a production-grade task queue entirely in PostgreSQL, from the jobs table schema to retry logic with exponential backoff, dead-letter handling, and priority scheduling. We'll also cover when Postgres queues break down and a dedicated broker actually earns its keep.

What Is a PostgreSQL Queue?

Definition: A PostgreSQL queue is a pattern where a database table acts as a job queue, using row-level locking (SELECT FOR UPDATE SKIP LOCKED) to allow multiple workers to dequeue tasks concurrently without conflicts. Unlike dedicated message brokers, the queue lives inside your existing database, sharing its transactional guarantees and operational tooling.

The key insight is that a queue is just a table with a status column and a way to atomically claim rows. PostgreSQL's MVCC model and locking semantics give you exactly-once processing when combined with transactions. You don't need a separate system for this -- you need one SQL statement.
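Since the database only constrains the *set* of status values, one way to keep application code honest is to make the legal transitions explicit. A sketch (the transition table below is an interpretation of the lifecycle this article builds, not something Postgres enforces):

```javascript
// Legal status transitions for the job lifecycle described in this article.
// The schema's CHECK constraint limits which values are allowed; enforcing
// *transitions* is up to application code, which can validate before UPDATE.
const TRANSITIONS = {
  pending:   ['running'],                                 // a worker claims the job
  running:   ['completed', 'failed', 'pending', 'dead'],  // finish, fail, retry, or dead-letter
  failed:    ['pending', 'dead'],                         // retry or give up
  completed: [],                                          // terminal
  dead:      ['pending'],                                 // manual requeue only
};

function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}
```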

The Core Primitive: SELECT FOR UPDATE SKIP LOCKED

This single SQL clause is what makes Postgres queues viable. Introduced in PostgreSQL 9.5, SKIP LOCKED tells the query to skip any rows that are currently locked by another transaction instead of waiting for them. This turns a regular table into a concurrent work queue.

-- Dequeue a single job atomically
BEGIN;

SELECT id, payload, attempt
FROM jobs
WHERE status = 'pending'
  AND scheduled_at <= NOW()
ORDER BY priority DESC, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Process the job in your application...

-- Mark it done (bind the id returned by the SELECT above)
UPDATE jobs
SET status = 'completed', completed_at = NOW()
WHERE id = $1;

COMMIT;

Without SKIP LOCKED, two workers trying to dequeue simultaneously would block each other -- one waits while the other finishes. With SKIP LOCKED, each worker instantly grabs a different unlocked row. No contention, no waiting, no duplicate processing.

Building the Jobs Table

A production jobs table needs more than id and payload. Here's a schema that handles retries, priorities, scheduling, and dead-letter tracking:

CREATE TABLE jobs (
  id          BIGSERIAL PRIMARY KEY,
  queue       TEXT        NOT NULL DEFAULT 'default',
  payload     JSONB       NOT NULL,
  status      TEXT        NOT NULL DEFAULT 'pending'
                          CHECK (status IN ('pending', 'running', 'completed', 'failed', 'dead')),
  priority    SMALLINT    NOT NULL DEFAULT 0,
  attempt     SMALLINT    NOT NULL DEFAULT 0,
  max_attempts SMALLINT   NOT NULL DEFAULT 5,
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  started_at  TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  failed_at   TIMESTAMPTZ,
  last_error  TEXT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- The critical index: only pending jobs, ordered by priority and age
CREATE INDEX idx_jobs_dequeue
  ON jobs (priority DESC, created_at ASC)
  WHERE status = 'pending';

-- Index for finding running jobs (heartbeat checks, stuck job recovery)
CREATE INDEX idx_jobs_running
  ON jobs (started_at)
  WHERE status = 'running';

-- Partition by queue name if you run multiple logical queues
-- CREATE INDEX idx_jobs_queue ON jobs (queue) WHERE status = 'pending';

Pro tip: The partial index on WHERE status = 'pending' is essential. Without it, the dequeue query scans completed and failed rows too -- and those grow without bound. With the partial index, the planner only touches rows that are actually available for processing.

The Complete Worker Loop

Here's a worker implementation that handles claiming, processing, retrying, and dead-lettering in a single transaction:

-- Worker dequeue function
CREATE OR REPLACE FUNCTION dequeue_job(target_queue TEXT DEFAULT 'default')
RETURNS TABLE (
  job_id BIGINT,
  job_payload JSONB,
  job_attempt SMALLINT
) AS $$
BEGIN
  RETURN QUERY
  UPDATE jobs
  SET status = 'running',
      started_at = NOW(),
      attempt = attempt + 1
  WHERE id = (
    SELECT id FROM jobs
    WHERE queue = target_queue
      AND status = 'pending'
      AND scheduled_at <= NOW()
    ORDER BY priority DESC, created_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, payload, attempt;
END;
$$ LANGUAGE plpgsql;

-- Mark job as completed
CREATE OR REPLACE FUNCTION complete_job(job_id BIGINT)
RETURNS VOID AS $$
BEGIN
  UPDATE jobs
  SET status = 'completed', completed_at = NOW()
  WHERE id = job_id;
END;
$$ LANGUAGE plpgsql;

-- Mark job as failed with retry or dead-letter
CREATE OR REPLACE FUNCTION fail_job(job_id BIGINT, error_message TEXT)
RETURNS VOID AS $$
DECLARE
  job RECORD;
BEGIN
  SELECT attempt, max_attempts INTO job FROM jobs WHERE id = job_id;

  IF job.attempt >= job.max_attempts THEN
    -- Dead-letter: exhausted all retries
    UPDATE jobs
    SET status = 'dead',
        failed_at = NOW(),
        last_error = error_message
    WHERE id = job_id;
  ELSE
    -- Retry with exponential backoff: 2^attempt seconds
    UPDATE jobs
    SET status = 'pending',
        failed_at = NOW(),
        last_error = error_message,
        scheduled_at = NOW() + (POWER(2, job.attempt) || ' seconds')::INTERVAL
    WHERE id = job_id;
  END IF;
END;
$$ LANGUAGE plpgsql;

Retry with Exponential Backoff

Naive retries that fire immediately after failure cause thundering herd problems. Exponential backoff spaces retries out progressively:

Attempt   Delay        Notes
1         2 seconds    Near-instant recovery
2         4 seconds    Brief pause
3         8 seconds    Moderate backoff
4         16 seconds   Gives downstream time to recover
5         32 seconds   Final attempt before dead-letter

The scheduled_at column is the trick. When a job fails, we set scheduled_at to a future timestamp. The dequeue query's WHERE scheduled_at <= NOW() clause naturally skips jobs that aren't ready for retry yet. No timers, no polling loops -- just a column and a WHERE clause.
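The same schedule can also be computed application-side before writing scheduled_at. A minimal sketch mirroring the SQL's POWER(2, attempt) expression; the cap and jitter option are extras not in the SQL version (a cap keeps late retries bounded, and jitter de-synchronizes mass failures):

```javascript
// Retry delay in seconds for a job that has now failed `attempt` times,
// mirroring POWER(2, attempt) from fail_job. maxDelaySeconds and jitter
// are additions beyond the article's SQL: the cap bounds the backoff,
// and jitter (0..1) spreads retries out when many jobs fail together.
function retryDelaySeconds(attempt, { maxDelaySeconds = 3600, jitter = 0 } = {}) {
  const base = Math.pow(2, attempt);
  const withJitter = base * (1 + Math.random() * jitter);
  return Math.min(withJitter, maxDelaySeconds);
}
```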

Dead-Letter Handling

After exhausting all retry attempts, jobs move to status = 'dead'. These are jobs that need human attention -- a broken API endpoint, a malformed payload, a downstream service that's permanently unavailable. Don't silently discard them.

-- Query dead-letter jobs for investigation
SELECT id, queue, payload, last_error, attempt, created_at
FROM jobs
WHERE status = 'dead'
ORDER BY failed_at DESC
LIMIT 50;

-- Requeue a dead-letter job after fixing the underlying issue
UPDATE jobs
SET status = 'pending',
    attempt = 0,
    scheduled_at = NOW(),
    last_error = NULL,
    failed_at = NULL
WHERE id = 12345;

In production, set up an alert when dead-letter count exceeds a threshold. A dashboard showing dead-letter trends over time catches systemic failures before they become incidents.
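A per-queue aggregate is usually enough to drive that alert. A sketch (how you wire the threshold into your alerting system is up to you):

```sql
-- Per-queue dead-letter counts for a dashboard or threshold alert.
SELECT queue,
       COUNT(*) AS dead_count,
       MAX(failed_at) AS most_recent_failure
FROM jobs
WHERE status = 'dead'
GROUP BY queue
ORDER BY dead_count DESC;
```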

LISTEN/NOTIFY: Avoiding Polling

Without notifications, workers must poll the jobs table on a loop. Polling every 100ms is wasteful; polling every 5 seconds adds latency. PostgreSQL's LISTEN/NOTIFY mechanism solves this by pushing a notification when new work arrives.

-- Create a trigger that fires on new job insertion
CREATE OR REPLACE FUNCTION notify_new_job()
RETURNS TRIGGER AS $$
BEGIN
  PERFORM pg_notify('new_job', NEW.queue);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER jobs_notify
  AFTER INSERT ON jobs
  FOR EACH ROW
  EXECUTE FUNCTION notify_new_job();

// Node.js worker using LISTEN/NOTIFY
const { Client, Pool } = require('pg');

// One long-lived connection dedicated to LISTEN, plus a pool for queries
const listener = new Client({ connectionString: process.env.DATABASE_URL });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
await listener.connect();
await listener.query('LISTEN new_job');

listener.on('notification', async (msg) => {
  // New job arrived -- dequeue it
  const result = await pool.query('SELECT * FROM dequeue_job($1)', [msg.payload]);
  if (result.rows.length > 0) {
    await processJob(result.rows[0]);
  }
});

// Fallback: poll every 30 seconds in case a notification was missed
setInterval(async () => {
  const result = await pool.query('SELECT * FROM dequeue_job($1)', ['default']);
  if (result.rows.length > 0) {
    await processJob(result.rows[0]);
  }
}, 30000);

Watch out: LISTEN/NOTIFY requires a persistent connection -- it does not work through connection poolers in transaction mode (PgBouncer, Supavisor). Dedicate a single long-lived connection for listening, and use your pooled connections for dequeue queries. Notifications are also not durable -- if no listener is connected when NOTIFY fires, the message is lost. That's why the fallback poll is important.

Advisory Locks for Coordination

Sometimes you need to ensure that only one worker processes a specific type of job at a time -- for example, a daily report generator that should never run concurrently. PostgreSQL advisory locks provide this without touching the jobs table.

-- Try to acquire an advisory lock (non-blocking)
SELECT pg_try_advisory_lock(hashtext('daily_report_generator'));

-- Returns true if acquired, false if another worker holds it
-- The lock is released when the session disconnects or you call:
SELECT pg_advisory_unlock(hashtext('daily_report_generator'));

Advisory locks are lightweight and don't create any table rows or WAL entries. They exist only in shared memory. Use them for singleton workers, rate limiting, or preventing duplicate processing of the same logical entity.
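If you'd rather derive the lock key in application code than call hashtext() in SQL, any stable 32-bit hash works, as long as every worker uses the same one. A sketch using FNV-1a (note this is not Postgres's hashtext -- the two produce different values, so pick one scheme and use it everywhere):

```javascript
// Map a lock name to a stable 32-bit signed integer key suitable for
// pg_try_advisory_lock($1). Deterministic, so every worker computing the
// key for 'daily_report_generator' contends on the same lock.
// NOTE: this is FNV-1a, not the same function as Postgres's hashtext().
function advisoryLockKey(name) {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < name.length; i++) {
    h ^= name.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return h | 0; // coerce to signed 32-bit
}
```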

Throughput: What to Expect

Benchmarks depend heavily on hardware, payload size, and processing time, but here are realistic numbers for a single PostgreSQL instance:

Configuration                   Enqueue/sec   Dequeue/sec     Notes
Single worker, 4-core VM        5,000         1,000-2,000     Bottleneck is single-threaded dequeue
4 workers, 4-core VM            5,000         3,000-5,000     Near-linear scaling with workers
8 workers, 8-core VM            10,000        6,000-10,000    Diminishing returns above core count
Batch dequeue (10 at a time)    10,000        10,000-15,000   Reduces round trips significantly

Pro tip: Dequeue in batches to reduce round trips. Change LIMIT 1 to LIMIT 10 in the dequeue query and process multiple jobs per transaction. This alone can double throughput. Keep batch sizes reasonable (10-50) to avoid holding locks too long.
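A hypothetical batch variant of the earlier dequeue statement -- the only changes are LIMIT 10 and IN instead of an equality (batch size is an assumption; tune it for your workload):

```sql
-- Claim up to 10 pending jobs in a single statement.
UPDATE jobs
SET status = 'running',
    started_at = NOW(),
    attempt = attempt + 1
WHERE id IN (
  SELECT id FROM jobs
  WHERE status = 'pending'
    AND scheduled_at <= NOW()
  ORDER BY priority DESC, created_at ASC
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
RETURNING id, payload, attempt;
```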

Libraries and Extensions

You don't have to build from scratch. Several battle-tested tools implement Postgres queues with more features than a hand-rolled solution.

Tool              Language               Key Feature                                        Maturity
PGMQ              PostgreSQL extension   SQL-native API, visibility timeout, VT extension   Production-ready (Tembo)
Graphile Worker   Node.js / TypeScript   Cron jobs, batch processing, plugin system         Production-ready
River             Go                     Type-safe jobs, periodic scheduling, unique jobs   Production-ready
Oban              Elixir                 Telemetry, pruning, web UI (Oban Pro)              Production-ready
good_job          Ruby                   ActiveJob backend, concurrency control             Production-ready
Procrastinate     Python                 Django integration, async support                  Stable

PGMQ: The Extension Approach

PGMQ is a PostgreSQL extension developed by Tembo that provides a clean SQL API modeled after AWS SQS. It handles visibility timeouts, message archival, and metrics out of the box.

-- Create a queue
SELECT pgmq.create('email_notifications');

-- Enqueue a message
SELECT pgmq.send('email_notifications',
  '{"to": "user@example.com", "template": "welcome"}'::JSONB
);

-- Dequeue with a 30-second visibility timeout
SELECT * FROM pgmq.read('email_notifications', 30, 1);

-- Delete after successful processing
SELECT pgmq.delete('email_notifications', msg_id);

-- Archive instead of delete (for audit trails)
SELECT pgmq.archive('email_notifications', msg_id);

Graphile Worker: Node.js Integration

Graphile Worker is the strongest option for Node.js and TypeScript applications. It handles job scheduling, retries, cron tasks, and batch processing with minimal configuration.

// Define a task
import { run, Task } from 'graphile-worker';

const sendEmail: Task = async (payload, helpers) => {
  const { to, template } = payload as { to: string; template: string };
  await emailService.send(to, template);
  helpers.logger.info(`Email sent to ${to}`);
};

// Start the worker
await run({
  connectionString: process.env.DATABASE_URL,
  taskList: { sendEmail },
  concurrency: 10,
});

// Enqueue from your application
import { makeWorkerUtils } from 'graphile-worker';

const utils = await makeWorkerUtils({
  connectionString: process.env.DATABASE_URL,
});

await utils.addJob('sendEmail', {
  to: 'user@example.com',
  template: 'welcome',
}, {
  maxAttempts: 5,
  runAt: new Date(Date.now() + 60_000), // delay 1 minute
});

Postgres Queue vs. Kafka vs. RabbitMQ

The choice isn't binary -- each tool optimizes for different things. Here's an honest comparison:

Capability              Postgres Queue                  RabbitMQ                    Kafka
Throughput ceiling      1K-10K/sec                      10K-50K/sec                 100K-1M+/sec
Message replay          Manual (keep completed rows)    No (consumed = gone)        Yes (log retention)
Fan-out (pub/sub)       LISTEN/NOTIFY (ephemeral)       Exchanges + bindings        Consumer groups
Ordering guarantees     Per-queue (via ORDER BY)        Per-queue                   Per-partition
Delivery semantics      Exactly-once (transactional)    At-least-once               At-least-once (exactly-once with EOS)
Operational overhead    None (your existing DB)         Moderate (Erlang cluster)   High (ZooKeeper/KRaft + brokers)
Transactional enqueue   Native (same transaction)       Requires outbox pattern     Requires outbox pattern
Best for                Background jobs, task queues    Routing, RPC, work queues   Event streaming, log aggregation

The biggest advantage of a Postgres queue isn't performance -- it's transactional enqueue. When you insert an order and enqueue a "send confirmation email" job in the same transaction, either both happen or neither does. With an external broker, you need the outbox pattern or accept the risk of an order being saved without the email job being enqueued.
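The order example sketched in SQL (the orders table and payload shape are hypothetical, but the pattern is exactly the article's jobs table):

```sql
-- Both writes commit together or not at all -- no outbox pattern needed.
BEGIN;

INSERT INTO orders (customer_id, total)
VALUES (42, 99.00);

INSERT INTO jobs (queue, payload)
VALUES ('default', '{"task": "send_confirmation", "order_id": 42}'::JSONB);

COMMIT;
```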

When Postgres Queues Break Down

Postgres is not the right queue for every scenario. Here's when to reach for a dedicated system:

  1. Fan-out / pub-sub patterns -- If multiple consumers need to independently process the same message, you need a real pub/sub system. LISTEN/NOTIFY is ephemeral (no persistence) and SKIP LOCKED gives each message to exactly one consumer. RabbitMQ fanout exchanges or Kafka consumer groups handle this natively.
  2. Sustained throughput above 10K jobs/sec -- PostgreSQL can burst higher, but sustained high throughput creates WAL pressure, vacuum load, and index bloat on the jobs table. Kafka handles millions of messages per second because it's an append-only log with sequential I/O.
  3. Message replay -- If consumers need to reprocess historical messages (rebuilding a projection, replaying events), Kafka's log retention model is purpose-built for this. Keeping completed rows in a Postgres jobs table works but doesn't scale to billions of messages.
  4. Cross-service communication without a shared database -- If services own their own databases (as they should in a microservice architecture), a shared jobs table creates coupling. An external broker is the right integration point between independently deployed services.
  5. Complex routing logic -- RabbitMQ's exchange types (topic, headers, direct, fanout) provide routing capabilities that would require application-level logic with a Postgres queue.

Production Checklist

If you're deploying a Postgres queue to production, walk through this list:

  1. Partial index on pending jobs -- Without it, dequeue scans the entire table including completed rows.
  2. Heartbeat for running jobs -- Workers crash. Add a heartbeat_at column and a reaper process that requeues jobs with stale heartbeats (e.g., older than 5 minutes).
  3. Archival strategy -- Move completed and dead jobs to a separate archive table or delete them on a schedule. A jobs table with millions of completed rows hurts vacuum performance.
  4. Connection management -- Dedicate a connection for LISTEN. Use pooled connections for dequeue. Don't mix the two.
  5. Monitoring -- Track queue depth (pending count), processing latency (time from creation to completion), failure rate, and dead-letter count.
  6. Graceful shutdown -- On SIGTERM, finish the current job before exiting. Don't abandon running jobs.
  7. Batch dequeue -- Process multiple jobs per transaction to reduce round trips when throughput matters.
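The reaper from item 2 can be a single scheduled statement. A sketch, assuming the hypothetical heartbeat_at column that item 2 suggests adding (workers would update it every ~30 seconds while processing):

```sql
-- Requeue jobs whose worker stopped heartbeating (run on a schedule).
UPDATE jobs
SET status = 'pending',
    scheduled_at = NOW()
WHERE status = 'running'
  AND heartbeat_at < NOW() - INTERVAL '5 minutes';
```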

Frequently Asked Questions

Can PostgreSQL really replace Kafka or RabbitMQ?

For background job processing and task queues, yes. PostgreSQL handles 1,000 to 10,000 jobs per second reliably, which covers the majority of web application workloads. Where PostgreSQL falls short is pub/sub fan-out, message replay, sustained throughput above 10K/sec, and cross-service messaging in microservice architectures. If your use case is "process this thing in the background," Postgres is the right default.

What is SELECT FOR UPDATE SKIP LOCKED?

SELECT FOR UPDATE SKIP LOCKED is a PostgreSQL clause that locks selected rows for the duration of the transaction while silently skipping any rows already locked by other transactions. This prevents multiple workers from claiming the same job. Without SKIP LOCKED, workers would block each other waiting for locks. It was introduced in PostgreSQL 9.5 and is the foundation of every Postgres queue implementation.

How do I handle failed jobs in a Postgres queue?

Track the attempt count and a max_attempts limit on each job. When a job fails, increment the attempt counter and reschedule it with exponential backoff (scheduled_at = NOW() + 2^attempt seconds). Once attempts are exhausted, move the job to a "dead" status for manual investigation. Never silently discard failed jobs -- dead-letter tracking is how you catch systemic failures.

What throughput can I expect from a Postgres queue?

On a single PostgreSQL instance with 4-8 cores, expect 1,000 to 10,000 dequeues per second depending on worker count and batch size. Batch dequeuing (10-50 jobs per transaction) roughly doubles throughput. This is sufficient for the vast majority of applications. If you need sustained throughput above 10K/sec, consider Kafka or a dedicated queue.

Should I use LISTEN/NOTIFY or polling?

Use both. LISTEN/NOTIFY provides near-instant wake-up when new jobs arrive, eliminating the latency of polling intervals. But notifications are ephemeral -- if no listener is connected, they're lost. Add a fallback poll (every 15-30 seconds) to catch any missed notifications. Keep the LISTEN connection separate from your pooled connections since it requires a persistent session.

What is PGMQ and should I use it?

PGMQ is a PostgreSQL extension by Tembo that provides an SQS-like SQL API for queue operations. It handles visibility timeouts, message archival, and queue metrics out of the box. Use it if you want a proven implementation without writing the queue machinery yourself. If you need deep customization (custom retry logic, priority schemes, multi-tenant queues), building on raw SKIP LOCKED gives you more control.

How does Graphile Worker compare to building my own queue?

Graphile Worker provides a complete job processing framework for Node.js: cron scheduling, batch processing, job deduplication, and a plugin system. It uses SKIP LOCKED under the hood but handles all the edge cases -- stuck job recovery, graceful shutdown, migration management. For Node.js applications, Graphile Worker is almost always better than a hand-rolled queue. You get months of battle-tested edge case handling for free.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
