Data Import Strategies
with Elasticsearch
Elastic Meetup #15, 16 November 2016
© David Buchmann
You've got a "situation"
Data in enterprise systems
Websites take a lot of effort
Data access is slow
Data distributed over several systems
Indexing Architecture
Workflow
Loaders
Per external data source
Lots of formats: SOAP, JSON, CSV and XML files, Doctrine (Oracle)
Modes
Push notifications
Changed since last successful run
Full import
Sanity checks: compare item ID lists
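A minimal sketch of the "changed since last successful run" mode with the ID-list sanity check, in Python. fetch_changed_since(), persist(), fetch_all_ids(), load_known_ids() and mark_deleted() are hypothetical stand-ins for the real source and storage access:

from datetime import datetime

def delta_import(last_success: datetime) -> None:
    # Only fetch records touched since the last successful run
    for item in fetch_changed_since(last_success):
        persist(item)

    # Sanity check: compare the source's full ID list with what we
    # already know, so deletions at the source do not go unnoticed
    source_ids = set(fetch_all_ids())
    for missing_id in set(load_known_ids()) - source_ids:
        mark_deleted(missing_id)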
Loaders
Cronjobs hourly / daily / weekly
Message queue workers for push
All persist to MySQL and send message
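A sketch of a single loader run: persist the raw record locally, then notify the indexer. The slides only say "message queue"; RabbitMQ via the pika client is an assumption here, and save_to_mysql() is a hypothetical helper:

import json
import pika

def run_loader(items):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="index", durable=True)

    for item in items:
        save_to_mysql(item)  # keep a raw copy so we can re-index without the source
        channel.basic_publish(
            exchange="",
            routing_key="index",
            body=json.dumps({"type": item["type"], "id": item["id"]}),
        )
    connection.close()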
Indexer
Per API entity, aggregate and combine sources
Mostly message queues
Cronjobs for synchronising
Sanity checks: Don't delete more than 10% of a type
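The deletion guard could look roughly like this; es is an elasticsearch-py client, the index and type names come from the caller, and how stale_ids is computed is left out:

def delete_stale(es, index, doc_type, stale_ids):
    total = es.count(index=index, doc_type=doc_type)["count"]
    # Sanity check: never delete more than 10% of a type in one run
    if total and len(stale_ids) > 0.1 * total:
        raise RuntimeError(
            "Refusing to delete %d of %d %s documents - check the source"
            % (len(stale_ids), total, doc_type)
        )
    for doc_id in stale_ids:
        es.delete(index=index, doc_type=doc_type, id=doc_id)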
Data Mapper
Determine values for API models
Modular: one mapper per group of related fields
Testable code
Readable code
Reusable code
Dependency modelling for partial updates
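A sketch of the modular mapper idea: each small mapper fills one group of API fields and declares which source entities it depends on, so a partial update only re-runs the affected mappers. All class and field names here are illustrative:

class PriceMapper:
    depends_on = {"erp_article"}

    def map(self, sources, doc):
        doc["price"] = sources["erp_article"]["net_price"]

class DescriptionMapper:
    depends_on = {"cms_page"}

    def map(self, sources, doc):
        doc["description"] = sources["cms_page"]["body"]

def build_document(sources, changed, mappers, doc):
    # changed is None for a full rebuild, or the set of changed sources
    for mapper in mappers:
        if changed is None or mapper.depends_on & changed:
            mapper.map(sources, doc)
    return doc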
Queuing
Record import state of each item
When an item is already waiting in the queue, don't publish a second message
Track item state and processing delay
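A sketch of the de-duplication and state tracking; get_state() and set_state() stand in for the MySQL state table, and publish is whatever sends the queue message:

import time

def notify_indexer(item_id, publish):
    state = get_state(item_id)          # e.g. "queued", "indexed", "error"
    if state == "queued":
        return                          # a message is already waiting
    set_state(item_id, "queued", queued_at=time.time())
    publish(item_id)

def on_message_processed(item_id):
    set_state(item_id, "indexed", indexed_at=time.time())
    # queued_at vs. indexed_at gives the processing delay to monitor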
Retry queues
Everything will go wrong once in a while
On error, send message into retry queue
Delayed for 4 hours, hoping the problem is resolved by then
Count retries and abort after 3 attempts
Transparent from the message consumer's point of view
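Roughly how the retry handling can wrap a consumer callback. The 4 hour delay itself is assumed to live in the broker (e.g. a TTL plus dead-letter queue in RabbitMQ); the wrapper only counts attempts and decides whether to retry, so the actual consumer code stays unaware of it:

import json

MAX_ATTEMPTS = 3

def handle(message, process, publish_retry, log_failure):
    payload = json.loads(message)
    try:
        process(payload)                         # normal consumer code
    except Exception as error:
        attempt = payload.get("attempt", 0) + 1
        if attempt >= MAX_ATTEMPTS:
            log_failure(payload, error)          # abort after 3 attempts
        else:
            payload["attempt"] = attempt
            publish_retry(json.dumps(payload))   # broker delays this ~4h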
Notes on Elasticsearch
Indexes
One index per language
=> one index per type
De-normalize everything for fast responses
No joins or parent-child relations
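A sketch of one index per language with fully de-normalized documents, using elasticsearch-py; the index names, analyzers and the example document are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One index per language, each with a language-specific default analyzer
for lang, analyzer in {"de": "german", "fr": "french", "it": "italian"}.items():
    es.indices.create(
        index="catalog_" + lang,
        body={"settings": {"analysis": {"analyzer": {"default": {"type": analyzer}}}}},
    )

# Everything the API needs is copied onto the document:
# no joins, no parent-child lookups at query time
es.index(index="catalog_de", doc_type="product", id=42, body={
    "name": "Wanderschuh",
    "price": 129.0,
    "category": {"id": 7, "name": "Schuhe"},   # embedded, not referenced
})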
Schema Changes
ES guesses field configurations
Not always correctly
=> manual definitions
No way to change the mapping of an existing index
Deploy new code with new index, but not yet online
Copy from old index
(or rebuild from MySQL)
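A sketch of such a schema change with elasticsearch-py: create the new index with an explicit mapping, copy the data over with the reindex API (or rebuild from MySQL instead), then switch over. Using an alias for the switch-over is an assumption; the slides only say the new index is deployed but not yet online:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Explicit field definitions instead of letting ES guess them
es.indices.create(index="catalog_de_v2", body={
    "mappings": {"product": {"properties": {
        "name": {"type": "string", "analyzer": "german"},
        "price": {"type": "float"},
        "category": {"properties": {"id": {"type": "integer"}}},
    }}},
})

# Copy the documents from the old index into the new one
es.reindex(body={"source": {"index": "catalog_de_v1"},
                 "dest": {"index": "catalog_de_v2"}})

# Flip the alias so the application starts reading the new index
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "catalog_de_v1", "alias": "catalog_de"}},
    {"add": {"index": "catalog_de_v2", "alias": "catalog_de"}},
]})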
Thank you!
@dbu
David Buchmann, Liip AG