Don’t Invent Data for Integration Tests, Synthesize It

Ivan Ponomarev

Ivan Ponomarev

  • Staff Engineer @ Synthesized.io

  • Teaching Java @ МФТИ and Mainor

What do we want from tests?

  • Clean & neat

  • Fast running

  • Easy to write and maintain

The tech stack

  • RDBMS (PostgreSQL or others)

  • Backend: Java/Kotlin + Spring Boot

  • RESTful API, frontend/other services: out of scope

A very traditional architecture

diag 40bc0673fd0fbfc5cc25c0c0313827f6

A slippery terminology: E2E test

diag 1749f7880ee2bbb169e419aa8bc2989c

A slippery terminology: integration test

(sometimes called E2E)

diag 75b34a565c166cb1ccc773b9161218ef

Unit Test

diag 2cadb90da195d0862384ade1131b0984

A slippery terminology: component test

(sometimes called integration or even unit test)

diag 93b8cca1768c2217ad8824c0fe4ad779

Testcontainers

testcontainers transparent

Application under test

dbschema

Methods under test

  • TalkDao.getTalksByConference(Conference conference)

diag 58b0fb436ab2ea14d1f59a920e24ec74

Must return a set of Talk objects together with the related
Speaker objects and the Conference object

Methods under test

  • TalkService.changeStatus(int talkId, Status status)

diag 2362998e483f5ed5065a919c146f76b1

No talk should be rejected without feedback!
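
The rule above can be sketched in plain Java. This is only an illustration under assumptions: the in-memory map stands in for the database, and the names TalkServiceSketch, save and get are hypothetical, not the application's real code.

```java
import java.util.HashMap;
import java.util.Map;

enum Status { IN_REVIEW, ACCEPTED, REJECTED }

// Hypothetical, simplified Talk: the real entity has more attributes.
final class Talk {
    final int id;
    Status status = Status.IN_REVIEW;
    String feedback; // null means "no feedback yet"

    Talk(int id, String feedback) {
        this.id = id;
        this.feedback = feedback;
    }
}

// In-memory stand-in for the DAO-backed service (an assumption for this sketch).
final class TalkServiceSketch {
    private final Map<Integer, Talk> talks = new HashMap<>();

    void save(Talk talk) {
        talks.put(talk.id, talk);
    }

    Talk get(int talkId) {
        return talks.get(talkId);
    }

    void changeStatus(int talkId, Status status) {
        Talk talk = talks.get(talkId);
        if (talk == null) {
            throw new IllegalArgumentException("No talk with id " + talkId);
        }
        // The business rule: rejection requires non-empty feedback.
        boolean noFeedback = talk.feedback == null || talk.feedback.isBlank();
        if (status == Status.REJECTED && noFeedback) {
            throw new IllegalStateException(
                    "Cannot reject talk " + talkId + " without feedback");
        }
        talk.status = status;
    }
}
```

A test for this method needs at least two fixtures: a talk with feedback and a talk without it.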

Approach #1: The Golden Dump

goldendump

Spinning up a subset of beans

diag d962923bbf8ff20d6880fd176f30e7c4

The Golden Dump Pros and Cons

  • thumbs up A straightforward approach

  • thumbs down Fragile: init script can significantly diverge from the actual schema

The Golden Dump Pros and Cons

  • thumbs down Obscure: what does this mean?

insert into talkspeakers (talkid, speakerid) values (1001, 1004);
insert into talkspeakers (talkid, speakerid) values (1002, 1002);
insert into talkspeakers (talkid, speakerid) values (1002, 1003);

The Golden Dump Pros and Cons

  • thumbs down Poor cohesion: tests rely on initialization script which is in a separate file far away.

Test setup relying on the script:

// In one test: a talk with feedback
int id = 1001;
// In another test: a talk without feedback
int id = 1002;

Assertions relying on the script:

assertThat(talk.getName()).isEqualTo(
        "Reactive, or not reactive: that is the question");
assertThat(talk.getSpeakers().stream().map(Speaker::getName))
        .containsExactlyInAnyOrder("Evgeny Borisov", "Kirill Tolkachev");
assertThat(talk.getStatus()).isEqualTo(IN_REVIEW);

The Golden Dump Pros and Cons

  • thumbs down Might be easy to write the first time, but time-consuming to maintain.

Let’s count lines of code

              Golden Dump
SQL                    56
DaoTest                58
ServiceTest            40
Total                 154

Approach #2: Object Mother

objectmother

Object Mother Pros and Cons

  • thumbs up No SQL scripts, total decoupling from schema migrations.

  • thumbs up Type safe!

Object Mother Pros and Cons

  • thumbs up Good cohesion: you can give your fixtures meaningful names, so it’s easy to understand what you test.

service.changeStatus(
    talkWithFeedback().getId(),
    Status.REJECTED);
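
A minimal sketch of what such a fixture class might look like. Only talkWithFeedback() and getId() appear in the call above; the class name TalkMother, the talkWithoutFeedback() fixture, the fields and all values are assumptions made for illustration.

```java
import java.util.List;

enum Status { IN_REVIEW, ACCEPTED, REJECTED }

// Hypothetical, simplified domain classes.
final class Speaker {
    private final String name;

    Speaker(String name) { this.name = name; }

    String getName() { return name; }
}

final class Talk {
    private final int id;
    private final String name;
    private final Status status;
    private final String feedback; // null means "no feedback"
    private final List<Speaker> speakers;

    Talk(int id, String name, Status status, String feedback, List<Speaker> speakers) {
        this.id = id;
        this.name = name;
        this.status = status;
        this.feedback = feedback;
        this.speakers = speakers;
    }

    int getId() { return id; }
    String getName() { return name; }
    Status getStatus() { return status; }
    String getFeedback() { return feedback; }
    List<Speaker> getSpeakers() { return speakers; }
}

// Object Mother: every fixture is a named factory method, so the test
// itself says which kind of object it works with.
final class TalkMother {
    private TalkMother() {}

    static Speaker speaker() {
        return new Speaker("Test Speaker");
    }

    static Talk talkWithFeedback() {
        return new Talk(1, "A talk with feedback", Status.IN_REVIEW,
                "Needs more demos", List.of(speaker()));
    }

    static Talk talkWithoutFeedback() {
        return new Talk(2, "A talk without feedback", Status.IN_REVIEW,
                null, List.of(speaker()));
    }
}
```

Every attribute value here had to be invented by hand, and the objects still have to be saved to the database before a test can use them.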

Object Mother Pros and Cons

  • shrug We have to pre-fill the database with objects before the tests

    • index up We use our own DAOs

    • index up We have to care about cleaning up the database after each test
      (which is ok, but can be slow)

Object Mother Pros and Cons

  • thumbs down It’s relatively easy to maintain the ObjectMother class (thanks to type safety),
    but "inventing" objects with all their attributes can be difficult

Let’s count lines of code

              Golden Dump   Object Mother
SQL                    56               -
ObjectMother            -              50
DaoTest                58              64
ServiceTest            40              65
Total                 154             179

Approach #3: synthesized data

tdk

Synthesized Data Pros and Cons

  • thumbs up thumbs up thumbs up We get all the benefits of the Object Mother approach, and

    • thumbs up no need to "invent" test examples

    • thumbs up no need to pre-fill the database using our own DAOs

Synthesized Data Pros and Cons

  • shrug We have to provide a config for TDK, but it is less prone to change.

  • thumbs down Little or no control over what’s actually in the database; we occasionally have to modify values using our DAOs
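
A rough sketch of that last point. The random generator below is only a stand-in for a database filled by TDK, and all names (GeneratedTalk, GeneratedDataSketch, anyTalkWithoutFeedback) are hypothetical. The test setup takes whatever row exists and forces the one attribute the test depends on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical, simplified row of generated data.
final class GeneratedTalk {
    final int id;
    String feedback; // may or may not be present in generated data

    GeneratedTalk(int id, String feedback) {
        this.id = id;
        this.feedback = feedback;
    }
}

final class GeneratedDataSketch {
    private GeneratedDataSketch() {}

    // Stand-in for a generated database: the values are not under our control.
    static List<GeneratedTalk> generate(Random random, int count) {
        List<GeneratedTalk> talks = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String feedback = random.nextBoolean()
                    ? "feedback-" + random.nextInt(1000)
                    : null;
            talks.add(new GeneratedTalk(random.nextInt(1_000_000), feedback));
        }
        return talks;
    }

    // Test setup: take any generated talk and force the precondition we need.
    // In a real test this would be an update through the DAO.
    static GeneratedTalk anyTalkWithoutFeedback(List<GeneratedTalk> talks) {
        GeneratedTalk talk = talks.get(0);
        talk.feedback = null;
        return talk;
    }
}
```

The test then asserts on behaviour (for example, that rejection fails) instead of on specific names or ids, because those differ from run to run.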

Let’s count lines of code

              Golden Dump   Object Mother   TDK
Config                  -               -    15
SQL                    56               -     -
ObjectMother            -              50    36
DaoTest                58              64    47
ServiceTest            40              65    44
Total                 154             179   142

TDK features

No-op

# Just copy everything from the source to the target

default_config:
  mode: KEEP

Data subsetting

# Take 50% of the rows from the source database,
# do not transform the data

default_config:
  mode: KEEP
  target_ratio: 0.5

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

Data subsetting

# Take 50% of the rows and mask them,
# but keep public.productlines as is

default_config:
  mode: MASKING
  target_ratio: 0.5

tables:
  - table_name_with_schema: "public.productlines"
    mode: "KEEP"
    target_ratio: 1

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"

Data masking: Before

CUSTOMERNUMBER | CUSTOMERNAME               | CONTACTLASTNAME | CONTACTFIRSTNAME | PHONE        | ADDRESSLINE1
103            | Atelier graphique          | Schmitt         | Carine           | 40.32.2555   | 54, rue Royale
112            | Signal Gift Stores         | King            | Jean             | 5551838      | 8489 Strong St.
114            | Australian Collectors, Co. | Ferguson        | Peter            | 03 9520 4555 | 636 St Kilda Road
119            | La Rochelle Gifts          | Labrune         | Janine           | 40.67.8555   | 67, rue des Cinquante Otages

Data masking: After

CUSTOMERNUMBER | CUSTOMERNAME                   | CONTACTLASTNAME | CONTACTFIRSTNAME | PHONE         | ADDRESSLINE1
7604416        | Fwomnubri Jdorefqpfjn Vepbwfoe | Iylapd          | Wqvj             | 7010-370953   | Lmmejweaooyha 954
29919216       | Ubkcg & Hhitqfm Wz             | Hjpxftbrdybfxev | Rgykous          | (72) 478-1967 | um. Akseehys 97
65286067       | Pwaibpu Lmypt                  | Llsrsmu         | Grrnbqbi         | 281-942530    | Sqy Teyfbglp qo Cdqo 21
68022786       | Fudqrgjjycmz Sau Gsct Dms.     | Ywsyew          | Uurxn            | 5919048124    | 7833 Xtuvvhu Gb.

Data generation

# Take the source database
# and generate the database twice as big with the same schema

default_config:
  mode: GENERATION
  target_ratio: 2

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS
safety_mode: "RELAXED"

Data generation

# Same as above,
# but use specific generator for `public.products` table
default_config:
  mode: GENERATION
  target_ratio: 2

tables:
  - table_name_with_schema: "public.products"
    mode: "GENERATION"
    target_ratio: 2
    transformations:
      - columns: [ "productname" ]
        params:
          type: "formatted_string_generator"
          pattern: "[A-Z]{16}"
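
To make the pattern concrete: "[A-Z]{16}" describes strings of exactly 16 uppercase Latin letters. The Java sketch below only illustrates what values of that shape look like and how a test could check them; it is not how TDK implements the generator, and the class and method names are hypothetical.

```java
import java.util.Random;
import java.util.regex.Pattern;

final class FormattedStringSketch {
    // Same regular expression as in the config above.
    static final Pattern PRODUCT_NAME = Pattern.compile("[A-Z]{16}");

    private FormattedStringSketch() {}

    // Illustration only: builds one random string of 16 letters from A to Z.
    static String randomProductName(Random random) {
        StringBuilder sb = new StringBuilder(16);
        for (int i = 0; i < 16; i++) {
            sb.append((char) ('A' + random.nextInt(26)));
        }
        return sb.toString();
    }

    // A test can assert on the shape of a value without knowing the value itself.
    static boolean looksLikeProductName(String value) {
        return PRODUCT_NAME.matcher(value).matches();
    }
}
```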

A couple of dozen generators

generators

Pagila Demo

pagila

Thanks for listening!

Give tdk-tc a star!

@inponomarev