Data Import Strategies
with Elasticsearch
Elastic Meetup #15, November 16, 2016
© David Buchmann
You've got a "situation"
- Data lives in enterprise systems
- Building websites on top of them takes a lot of effort
- Data access is slow
- Data is distributed over several systems
Indexing Architecture
Workflow
Loaders
- One loader per external data source
- Lots of formats: SOAP, JSON, CSV and XML files, Doctrine (Oracle)
- Modes:
  - Push notifications
  - Changed since last successful run
  - Full import
- Sanity check: compare item ID lists (sketch below)
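A minimal sketch of the "changed since last successful run" mode and the item-ID sanity check. `source` and `state_store` are hypothetical interfaces wrapping the external system and the local import state, not the actual loader code from the talk.

```python
from datetime import datetime, timezone

def run_delta_import(source, state_store):
    """Load only items changed since the last successful run."""
    last_run = state_store.last_successful_run(source.name)
    for item in source.fetch_changed_since(last_run):  # SOAP/JSON/CSV/XML behind this call
        state_store.persist(item)
    state_store.record_successful_run(source.name, datetime.now(timezone.utc))

def check_id_lists(source, state_store):
    """Sanity check: compare the source's full item ID list with what we have locally."""
    remote_ids = set(source.fetch_all_ids())
    local_ids = set(state_store.known_ids(source.name))
    return remote_ids - local_ids, local_ids - remote_ids  # missing locally, stale locally
```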
Loaders
- Cronjobs: hourly / daily / weekly
- Message queue workers for push notifications
- All loaders persist to MySQL and send a message (sketch below)
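Roughly what every loader does at the end of a run, whether it was triggered by cron or by a push message: persist the raw item to MySQL and notify the indexer. Table, column and queue names are invented for this sketch, which assumes a PyMySQL connection and a pika channel.

```python
import json

def store_and_notify(db, channel, item):
    """Persist the raw item to MySQL, then tell the indexer about it via the queue."""
    with db.cursor() as cursor:
        cursor.execute(
            "REPLACE INTO raw_items (id, source, payload) VALUES (%s, %s, %s)",
            (item["id"], item["source"], json.dumps(item)),
        )
    db.commit()
    channel.basic_publish(exchange="", routing_key="indexer",
                          body=json.dumps({"id": item["id"], "source": item["source"]}))
```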
Indexer
- One indexer per API entity, aggregating and combining data from several sources
- Mostly triggered via message queues
- Cronjobs for synchronising
- Sanity check: don't delete more than 10% of a type in one run (sketch below)
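One way to implement that deletion guard, sketched with the elasticsearch-py client of that era (mapping types still exist). The 10% threshold comes from the slide; index and type names are assumptions.

```python
from elasticsearch import Elasticsearch, helpers

def delete_missing(es, index, doc_type, ids_to_delete, max_ratio=0.10):
    """Refuse to delete more than 10% of a type in a single run."""
    total = es.count(index=index, doc_type=doc_type)["count"]
    if total and len(ids_to_delete) > total * max_ratio:
        raise RuntimeError("refusing to delete %d of %d %s documents"
                           % (len(ids_to_delete), total, doc_type))
    helpers.bulk(es, (
        {"_op_type": "delete", "_index": index, "_type": doc_type, "_id": doc_id}
        for doc_id in ids_to_delete
    ))

es = Elasticsearch()
delete_missing(es, "product_de", "product", ["17", "23"])
```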
Data Mapper
- Determines the values for the API models
- Modular: a separate mapper per group of related fields
- Testable code
- Readable code
- Reusable code
- Dependency modelling for partial updates (sketch below)
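A sketch of the modular mapper idea: one small, testable mapper per group of related fields, each declaring which source fields it depends on, so a partial update only reruns the mappers that are affected. Class and field names are illustrative, not the actual code.

```python
class PriceMapper:
    """Maps the price-related fields; one such class exists per group of fields."""
    depends_on = {"price", "currency"}

    def map(self, source, target):
        target["price"] = {"amount": source["price"], "currency": source["currency"]}


class TitleMapper:
    depends_on = {"title_de", "title_fr"}

    def map(self, source, target):
        target["title"] = source.get("title_de") or source.get("title_fr")


def build_document(source, mappers, changed_fields=None):
    """Run all mappers, or only those whose dependencies intersect the changed fields."""
    target = {}
    for mapper in mappers:
        if changed_fields is None or mapper.depends_on & set(changed_fields):
            mapper.map(source, target)
    return target  # for a partial update, merge this into the existing document


document = build_document(
    {"price": 129.0, "currency": "CHF", "title_de": "Wanderschuh"},
    [PriceMapper(), TitleMapper()],
)
```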
Queuing
- Record the import state of each item
- If an item is already waiting in the queue, don't publish a second message (sketch below)
- Track item state and processing delay
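A sketch of the deduplication rule using the import state table: the message is only published when the item is not already marked as queued. The schema (a UNIQUE key on item_id + state), the queue name and the PyMySQL/pika objects are assumptions; the consumer clears the "queued" row once it is done.

```python
import json

def publish_once(db, channel, item_id):
    """Publish an index message only if the item is not already waiting in the queue."""
    with db.cursor() as cursor:
        # a UNIQUE key on (item_id, state) makes the duplicate check race-safe
        inserted = cursor.execute(
            "INSERT IGNORE INTO import_state (item_id, state, queued_at) "
            "VALUES (%s, 'queued', NOW())",
            (item_id,),
        )
    db.commit()
    if inserted:  # 0 affected rows means a message for this item is already waiting
        channel.basic_publish(exchange="", routing_key="indexer",
                              body=json.dumps({"id": item_id}))
    # the consumer removes the 'queued' row when it finished processing the item
```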
Retry queues
- Everything will go wrong once in a while
- On error, send the message to a retry queue
- Delayed by 4 hours, hoping the problem is resolved by then
- Count retries and abort after 3 attempts
- Transparent from the message consumer's point of view (sketch below)
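The retry handling as a broker-agnostic sketch: a wrapper catches failures, republishes with a retry counter and a 4-hour delay, and gives up after 3 attempts, so the actual consumer never sees any of it. `process`, `publish_retry` and `report_failure` are hypothetical callables; with RabbitMQ the delay would typically be a per-queue message TTL plus a dead-letter exchange back to the work queue.

```python
MAX_RETRIES = 3
RETRY_DELAY = 4 * 3600  # seconds until the retry queue re-delivers the message

def handle_with_retry(message, process, publish_retry, report_failure):
    """Wrap the real consumer so retrying stays invisible to it."""
    try:
        process(message["body"])
    except Exception as error:
        retries = message.get("retries", 0) + 1
        if retries > MAX_RETRIES:
            report_failure(message, error)  # give up after 3 attempts and alert someone
        else:
            publish_retry({"body": message["body"], "retries": retries},
                          delay=RETRY_DELAY)
```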
Notes on Elasticsearch
Indexes
- One index per language
- => one index per type
- De-normalize everything for fast responses (example below)
- No joins or parent-child relations
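What this can look like with the elasticsearch-py client of that time: this sketch assumes one index per type and language, and fully de-normalized documents so a single hit contains everything the page needs. All index names and fields are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# one index per type and language, e.g. product_de, product_fr, event_de, ...
for doc_type in ("product", "event"):
    for language in ("de", "fr", "it", "en"):
        index = "{}_{}".format(doc_type, language)
        if not es.indices.exists(index=index):
            es.indices.create(index=index)

# fully de-normalized document: related data is copied in, never joined at query time
es.index(index="product_de", doc_type="product", id="42", body={
    "name": "Wanderschuh",
    "brand": {"id": 7, "name": "Example Brand"},   # copied from the brand source
    "categories": ["Schuhe", "Outdoor"],           # copied from the category tree
    "price": {"amount": 129.0, "currency": "CHF"},
})
```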
Schema Changes
- ES guesses field configurations
- Not always correctly => define mappings manually
- No way to change the mapping of an existing index
- Deploy new code with a new index, but don't put it online yet
- Copy the data over from the old index (or rebuild it from MySQL) (sketch below)
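A sketch of that reindexing flow with elasticsearch-py: create the new index with the corrected manual mapping, copy the documents over with the reindex helper (or rebuild them from MySQL instead), then switch an alias so the application moves to the new index. Index and alias names, and the Elasticsearch 2.x-style mapping, are assumptions.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# 1. create the new index with the corrected, manually defined mapping
es.indices.create(index="product_de_v2", body={
    "mappings": {"product": {"properties": {
        "name": {"type": "string", "index": "not_analyzed"},
    }}},
})

# 2. copy all documents over from the old index (alternative: rebuild from MySQL)
helpers.reindex(es, source_index="product_de_v1", target_index="product_de_v2")

# 3. flip the alias the application uses, so the new index goes online atomically
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "product_de_v1", "alias": "product_de"}},
    {"add": {"index": "product_de_v2", "alias": "product_de"}},
]})
```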
Thank you!
@dbu
David Buchmann, Liip AG