Marcus Young

Software. Microcontrollers. Beer.

MarkLogic Collections – Performance Issues and Why You Should Denormalize!

| Comments

Background

Recently I’ve been on a new team at my work doing MarkLogic work. In a nutshell we’ve been developing a new HealthCare quality application from the ground up.  This would be our second application, but we learned enough from the first to make our priority creating a robust and scalable RESTful interface for document CRUD. With this we developed a middle tier that is almost completely configuration driven, meaning any new apps would be able to utilize this and not have to re-invent the wheel to get CRUD/Search working out-of-the-box.

The Approach To A Problem Using Lists

The application is revolved around patient CCD’s (XML representing all of a patients history per retrieval). MarkLogic is ideal for this since it’s built on XQuery, a functional language revolved around XML. Let me paint you a picture of what led us to collection issues. Say you want as a doctor, to search “gender:M” on a database, and just keep tacking on until you get to what you want to find (end result maybe: “gender:M AND dob GT 01-10-1987 AND problem:Asthma”). Now as a doctor, I found X patients and want to save that as a NON changing list…meaning that if I search any time after this, I’ll still get those X patients, and no more that match that criteria since.

We implemented this using MarkLogic collections. Collections are handy in that you can take some documents, put them in a collection, and say:

1
2
for $i in fn:collection('/doctors/marc/1')
  return $i

That means whatever is in the collection ‘/doctors/marc/1′ will always be there. Cool, that works great. So we implemented this “static list” as this:

  1. Take the query and execute it
  2. For each document in the result set, add them to a unique list: /lists/{unique id}
  3. change the query to list:{unique id} and add a new xml node {id of user}

The great thing about this, is you can have a collection constraint in our patients so you can just execute “list:{unique id}” and it will return all documents in that list. Awesome! It worked Great! Also note, yes we handled updating the list by re-executing it, and adding the new patients to that list. Deleting a list is also possible.

The Problem With MarkLogic Collections

As I said, this solution worked fantastically…at first. It worked until we hit 1500 fairly large patient documents. The performance crawled to a stop. The problem with collections is that the collection data is stored internally. That means that if you decide to make a list for “gender:M”, you’re modifying and locking potentially half the documents in the database. For us, that meant a 9+ minute list creation. Not acceptable. You can try to speed this up, sure, by creating limits on how many documents you can modify, perhaps other ways involving spawning it in threads or in the background, but that still creates server overhead and gets rid of transaction safety.

Denormalize Your Database/The Solution

The greatest thing about a NoSQL database is that it can be denormalized. This may make some people cringe, why afterall would you want to duplicate data? If used correctly, it works great. For this specifically, it means that instead of modifying all documents in the database by adding/removing them from collections, is to use a separate file, stored as ‘/lists/{user}/{unique id}’ that contains links (uri’s) to the documents. This is fine, especially for this application, since patient records are never deleted. I’ve shown an example further below on how to see it in action!

The Code/TL;DR – Code or GTFO

The first thing you can do if you want to follow along is put a crap ton of delicious bacon documents into your database. You should copy/paste this into the QConsole and hit RUN! (it might take a few – it inserts 1200 documents…if it times out, insert 300 documents 4 times instead).
Click here for the code…too large to paste and the code isn’t too important
Now for the bad code. This will search for *tips* effectively. Any documents that match that, it will put it into a collection called ‘/marc/favorites’.

Then it will decide my favorites should be *ham*, so it will go through the collection, remove the originals, and add my new favorites. The important thing is to look at the number of documents affected and how long it takes! Ridiculous!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
let $docs :=  
  fn:distinct-values(
    for $i in cts:search(//beef,cts:word-query("tip"))
      return fn:base-uri($i)
  )
return
  for $i in $docs return
    xdmp:document-add-collections(
      $i,
      '/marc/favorites'
    )
(: 781 documents affected - Profile 2105 Expressions PT57.822894S :)
 
let $remove-from-collections :=
  for $i in fn:collection('/marc/favorites')
    return xdmp:document-remove-collections(fn:base-uri($i),'/beef/tips')
    
let $docs := 
  fn:distinct-values(
    for $i in cts:search(//pork,cts:word-query("ham"))
      return fn:base-uri($i)
  )
 
return
  for $i in $docs return
    xdmp:document-add-collections(
      $i,
      '/marc/favorites'
    )
 
(: 1051 documents affected - Profile 4245 Expressions PT48.439176S  :)

So what’s the best way to do it? Create a new document of URI’s to the documents, and store that! Don’t believe me? Look at the performance without indexes!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
let $docs := 
  fn:distinct-values(
    for $i in cts:search(//beef,cts:word-query("tip")) return
      fn:base-uri($i)
  )
  
let $favorites :=
  <favorites>{
    for $i in $docs return
      <doc>{$i}</doc>
  }</favorites>
  
return
  xdmp:document-insert(
    '/marc/favorites',
    $favorites
  )
(: 1 document affected - Profile 2643 Expressions PT0.326612S :)
 
let $docs := 
  fn:distinct-values(
    for $i in cts:search(//pork,cts:word-query("ham")) return
      fn:base-uri($i)
  )
  
let $favorites :=
  <favorites>{
    for $i in $docs return
      <doc>{$i}</doc>
  }</favorites>
  
return xdmp:node-replace(fn:doc('/marc/favorites')/favorites,$favorites)
(: 1 document affected - Profile 2144 Expressions PT0.67532S :)

And to improve your search, you could use something like:

1
2
3
4
5
cts:search(/,
  cts:document-query(
    fn:doc('/marc/favorites')/favorites//doc/text()
  )
)

Comments