Data Import Strategies
with Elasticsearch
Elastic Meetup #15, November 16, 2016
© David Buchmann
You've got a "situation"
- Data lives in enterprise systems
- Building websites on top of them takes a lot of effort
- Data access is slow
- Data is distributed over several systems
Indexing Architecture
Workflow
Loaders
- One loader per external data source
- Lots of formats: SOAP, JSON, CSV and XML files, Doctrine (Oracle)
- Modes:
  - Push notifications
  - Changed since last successful run
  - Full import
- Sanity check: compare item ID lists (sketch below)
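A minimal sketch of the "changed since last successful run" mode and the item-ID sanity check. `source` and `state_store` are hypothetical interfaces wrapping the external system and the local import state, not the actual loader code from the talk.

```python
from datetime import datetime, timezone

def run_delta_import(source, state_store):
    """Load only items changed since the last successful run."""
    last_run = state_store.last_successful_run(source.name)
    for item in source.fetch_changed_since(last_run):  # SOAP/JSON/CSV/XML behind this call
        state_store.persist(item)
    state_store.record_successful_run(source.name, datetime.now(timezone.utc))

def check_id_lists(source, state_store):
    """Sanity check: compare the source's full item ID list with what we have locally."""
    remote_ids = set(source.fetch_all_ids())
    local_ids = set(state_store.known_ids(source.name))
    return remote_ids - local_ids, local_ids - remote_ids  # missing locally, stale locally
```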
Loaders
- Cronjobs: hourly / daily / weekly
- Message queue workers for push notifications
- All loaders persist to MySQL and send a message (sketch below)
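Roughly what every loader does at the end of a run, whether it was triggered by cron or by a push message: persist the raw item to MySQL and notify the indexer. Table, column and queue names are invented for this sketch, which assumes a PyMySQL connection and a pika channel.

```python
import json

def store_and_notify(db, channel, item):
    """Persist the raw item to MySQL, then tell the indexer about it via the queue."""
    with db.cursor() as cursor:
        cursor.execute(
            "REPLACE INTO raw_items (id, source, payload) VALUES (%s, %s, %s)",
            (item["id"], item["source"], json.dumps(item)),
        )
    db.commit()
    channel.basic_publish(exchange="", routing_key="indexer",
                          body=json.dumps({"id": item["id"], "source": item["source"]}))
```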
Indexer
- One indexer per API entity, aggregating and combining data from several sources
- Mostly triggered via message queues
- Cronjobs for synchronising
- Sanity check: don't delete more than 10% of a type in one run (sketch below)
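One way to implement that deletion guard, sketched with the elasticsearch-py client of that era (mapping types still exist). The 10% threshold comes from the slide; index and type names are assumptions.

```python
from elasticsearch import Elasticsearch, helpers

def delete_missing(es, index, doc_type, ids_to_delete, max_ratio=0.10):
    """Refuse to delete more than 10% of a type in a single run."""
    total = es.count(index=index, doc_type=doc_type)["count"]
    if total and len(ids_to_delete) > total * max_ratio:
        raise RuntimeError("refusing to delete %d of %d %s documents"
                           % (len(ids_to_delete), total, doc_type))
    helpers.bulk(es, (
        {"_op_type": "delete", "_index": index, "_type": doc_type, "_id": doc_id}
        for doc_id in ids_to_delete
    ))

es = Elasticsearch()
delete_missing(es, "product_de", "product", ["17", "23"])
```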
Data Mapper
- Determines the values for the API models
- Modular: a separate mapper per group of related fields
- Testable code
- Readable code
- Reusable code
- Dependency modelling for partial updates (sketch below)
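A sketch of the modular mapper idea: one small, testable mapper per group of related fields, each declaring which source fields it depends on, so a partial update only reruns the mappers that are affected. Class and field names are illustrative, not the actual code.

```python
class PriceMapper:
    """Maps the price-related fields; one such class exists per group of fields."""
    depends_on = {"price", "currency"}

    def map(self, source, target):
        target["price"] = {"amount": source["price"], "currency": source["currency"]}


class TitleMapper:
    depends_on = {"title_de", "title_fr"}

    def map(self, source, target):
        target["title"] = source.get("title_de") or source.get("title_fr")


def build_document(source, mappers, changed_fields=None):
    """Run all mappers, or only those whose dependencies intersect the changed fields."""
    target = {}
    for mapper in mappers:
        if changed_fields is None or mapper.depends_on & set(changed_fields):
            mapper.map(source, target)
    return target  # for a partial update, merge this into the existing document


document = build_document(
    {"price": 129.0, "currency": "CHF", "title_de": "Wanderschuh"},
    [PriceMapper(), TitleMapper()],
)
```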
Queuing
- Record the import state of each item
- If an item is already waiting in the queue, don't publish a second message (sketch below)
- Track item state and processing delay
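A sketch of the deduplication rule using the import state table: the message is only published when the item is not already marked as queued. The schema (a UNIQUE key on item_id + state), the queue name and the PyMySQL/pika objects are assumptions; the consumer clears the "queued" row once it is done.

```python
import json

def publish_once(db, channel, item_id):
    """Publish an index message only if the item is not already waiting in the queue."""
    with db.cursor() as cursor:
        # a UNIQUE key on (item_id, state) makes the duplicate check race-safe
        inserted = cursor.execute(
            "INSERT IGNORE INTO import_state (item_id, state, queued_at) "
            "VALUES (%s, 'queued', NOW())",
            (item_id,),
        )
    db.commit()
    if inserted:  # 0 affected rows means a message for this item is already waiting
        channel.basic_publish(exchange="", routing_key="indexer",
                              body=json.dumps({"id": item_id}))
    # the consumer removes the 'queued' row when it finished processing the item
```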
Retry queues
- Everything will go wrong once in a while
- On error, send the message to a retry queue
- Delayed by 4 hours, hoping the problem is resolved by then
- Count retries and abort after 3 attempts
- Transparent from the message consumer's point of view (sketch below)
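The retry handling as a broker-agnostic sketch: a wrapper catches failures, republishes with a retry counter and a 4-hour delay, and gives up after 3 attempts, so the actual consumer never sees any of it. `process`, `publish_retry` and `report_failure` are hypothetical callables; with RabbitMQ the delay would typically be a per-queue message TTL plus a dead-letter exchange back to the work queue.

```python
MAX_RETRIES = 3
RETRY_DELAY = 4 * 3600  # seconds until the retry queue re-delivers the message

def handle_with_retry(message, process, publish_retry, report_failure):
    """Wrap the real consumer so retrying stays invisible to it."""
    try:
        process(message["body"])
    except Exception as error:
        retries = message.get("retries", 0) + 1
        if retries > MAX_RETRIES:
            report_failure(message, error)  # give up after 3 attempts and alert someone
        else:
            publish_retry({"body": message["body"], "retries": retries},
                          delay=RETRY_DELAY)
```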
Notes on Elasticsearch
Indexes
- One index per language
- => one index per type
- De-normalize everything for fast responses (example below)
- No joins or parent-child relations
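What this can look like with the elasticsearch-py client of that time: this sketch assumes one index per type and language, and fully de-normalized documents so a single hit contains everything the page needs. All index names and fields are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# one index per type and language, e.g. product_de, product_fr, event_de, ...
for doc_type in ("product", "event"):
    for language in ("de", "fr", "it", "en"):
        index = "{}_{}".format(doc_type, language)
        if not es.indices.exists(index=index):
            es.indices.create(index=index)

# fully de-normalized document: related data is copied in, never joined at query time
es.index(index="product_de", doc_type="product", id="42", body={
    "name": "Wanderschuh",
    "brand": {"id": 7, "name": "Example Brand"},   # copied from the brand source
    "categories": ["Schuhe", "Outdoor"],           # copied from the category tree
    "price": {"amount": 129.0, "currency": "CHF"},
})
```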
Schema Changes
- ES guesses field configurations
- Not always correctly => define mappings manually
- No way to change the mapping of an existing index
- Deploy new code with a new index, but don't put it online yet
- Copy the data over from the old index (or rebuild it from MySQL) (sketch below)
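A sketch of that reindexing flow with elasticsearch-py: create the new index with the corrected manual mapping, copy the documents over with the reindex helper (or rebuild them from MySQL instead), then switch an alias so the application moves to the new index. Index and alias names, and the Elasticsearch 2.x-style mapping, are assumptions.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# 1. create the new index with the corrected, manually defined mapping
es.indices.create(index="product_de_v2", body={
    "mappings": {"product": {"properties": {
        "name": {"type": "string", "index": "not_analyzed"},
    }}},
})

# 2. copy all documents over from the old index (alternative: rebuild from MySQL)
helpers.reindex(es, source_index="product_de_v1", target_index="product_de_v2")

# 3. flip the alias the application uses, so the new index goes online atomically
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "product_de_v1", "alias": "product_de"}},
    {"add": {"index": "product_de_v2", "alias": "product_de"}},
]})
```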
Thank you!
@dbu
David Buchmann, Liip AG