Recently I've been working a bit with Postgres. I wanted to offload as much work as possible to the database. I had never seriously learned SQL before, so this was a first for me.
One of the things I wanted to do in SQL was to take a sample of 1'000 entities from a big table with 70'000 rows.
SQL specifies the TABLESAMPLE clause to do just that (and Postgres has supported it since 9.5).
-- Note: 1000 / 70000 would be integer division (= 0), so keep the arithmetic in floating point.
SELECT * FROM mytable TABLESAMPLE BERNOULLI (100.0 * 1000 / 70000);
The only slightly annoying thing is that it only accepts a percentage, not an absolute row count. But it's still pretty awesome.
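If you want at most a fixed number of rows, one workaround (a sketch, assuming the same hypothetical mytable as above) is to oversample a little and cap the result with LIMIT:

-- Oversample: 2 % of 70'000 ≈ 1'400 rows, then cap at 1'000.
SELECT *
FROM mytable TABLESAMPLE BERNOULLI (2)
LIMIT 1000;

Be aware that LIMIT keeps the first 1'000 sampled rows in scan order, so this slightly favors rows stored earlier in the table.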
In Postgres you can choose between two sampling algorithms: SYSTEM is quick and dirty, while BERNOULLI is more exact. Bernoulli sampling is a method where each row in the table has an equal chance of being included in the sample. SYSTEM samples storage pages instead (a page contains multiple rows): it goes through the table's pages and decides for each whole page whether to include it or not.
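You can see the difference by counting the sampled rows a few times (again against the hypothetical mytable):

-- Row-level sampling: the count stays close to 1 % of the table.
SELECT count(*) FROM mytable TABLESAMPLE BERNOULLI (1);

-- Page-level sampling: the count jumps in page-sized steps,
-- since whole pages are either included or skipped.
SELECT count(*) FROM mytable TABLESAMPLE SYSTEM (1);

Run each query repeatedly and the SYSTEM counts will vary much more than the BERNOULLI ones.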
Your result of a TABLESAMPLE SYSTEM might look like this:

 something_id
--------------
        29766
        29767
        29768
        29769
        29770
        94348
        94349
        94350
        94352
        94353

Note the runs of consecutive IDs: those are the whole storage pages that made it into the sample.
An additional neat feature is the REPEATABLE option, which lets you set a seed value for the pseudo-random function. When you run the query twice with the same seed, you're guaranteed to get the same result (hence the results are "repeatable").
SELECT * FROM mytable TABLESAMPLE BERNOULLI (1) REPEATABLE (42);
This can also be very useful if you want to get an idea of what kind of data is stored in a table. Instead of doing a LIMIT 1 and getting a biased view of the rows you're reading through, you can simply take a quick sample:
SELECT * FROM mytable TABLESAMPLE SYSTEM (0.1);