MongoDB University Courses Completed!

For the last few weeks I’ve been doing two of the free online courses at https://university.mongodb.com/.  If you’re interested in learning about MongoDB they are definitely worth doing.  As long as you follow along with the short videos, the weekly homework and final exams are pretty straightforward.  I’ve just found out my scores and I got the following.

M101N – MongoDB for .Net Developers – 95%

M102 – MongoDB for DBAs – 95%

I’m now thinking about doing M202 and then maybe the certification, but in the meantime it’s back to SQL Server and exam 70-463, which is scheduled for 2nd July 2015.


MongoDB Indexes

Like most database engines, MongoDB allows you to optimise queries by creating indexes on a collection. Indexes are created and used in a similar way to SQL Server. There is no concept of a heap in MongoDB: every document in a collection must have an _id field, and the _id field is indexed automatically. You can create additional indexes on a collection in a similar way to creating a non-clustered index in SQL Server, i.e. a copy of a subset of the fields in each document is written to another location on disk in a specified order. The optimiser then decides whether it is better to use the index or to access the data directly from the collection.

The explain() function

In a similar way to viewing an execution plan in SQL Server, you can use the explain() function to return information on how a query is going to be executed, including whether any indexes were used. When you use the explain() function with an operation, only the execution details are returned, not the results of the query itself.

An example

The following code examples are all run in the Mongo shell.  I’m going to use the following snippet of JavaScript to create some documents in the indexTest collection of the simonBlog database.

> use simonBlog
> db.dropDatabase()
> var someStuff = ""
  someStuff = someStuff.pad(1000)   // pad the string out to 1,000 characters so each document has some bulk
  for (i=0;i<100;i++) {
    for (j=0;j<10;j++) {
      for (k=0;k<32;k++) {
        db.indexTest.insert({a:i, b:j, c:k, d:someStuff})   // 100 * 10 * 32 = 32,000 documents in total
      }
    }
  }

This will write 32,000 documents to the collection.
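
To sanity-check the inserts we can count the documents in the collection. count() is a standard shell helper; the output shown below assumes all of the inserts succeeded.

> db.indexTest.count()
32000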

Now if we run a find() to return all documents and use the explain() method as follows…

> db.indexTest.find().explain()

we get the following result…

{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "simonBlog.indexTest",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "$and" : [ ]
     },
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "$and" : [ ]
      },
      "direction" : "forward"
    },
    "rejectedPlans" : [ ]
  },
  "serverInfo" : {
    "host" : "Simon-Laptop",
    "port" : 27017,
    "version" : "3.0.1",
    "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
  },
  "ok" : 1
}

The namespace is the database and collection we’re accessing.
The parsedQuery shows us the query that was executed – this is empty because we are searching for everything.
The winningPlan and rejectedPlans fields contain details of all the execution plans considered and which one won.
In this case there is only one possible plan and it has a single stage: a collection scan (COLLSCAN) with no filter, in the forward direction.

As mentioned above there is an index created on the _id field automatically. We can prove that by running the following…

> db.indexTest.getIndexes()

which gives the following result…

[
  {
    "v" : 1,
    "key" : {
      "_id" : 1
    },
    "name" : "_id_",
    "ns" : "simonBlog.indexTest"
  }
]

Now suppose we want to find all documents where the “b” field equals 4…

> db.indexTest.find({b:4}).explain()
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "simonBlog.indexTest",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "b" : {
        "$eq" : 4
      }
    },
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "b" : {
          "$eq" : 4
        }
      },
      "direction" : "forward"
    },
    "rejectedPlans" : [ ]
  },
  "serverInfo" : {
    "host" : "Simon-Laptop",
    "port" : 27017,
    "version" : "3.0.1",
    "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
  },
  "ok" : 1
}

Again we’re doing a collection scan but this time we can see the filter details:

 { "b" : { "$eq" : 4 } }

The explain() method accepts an optional “verbosity” parameter. If this is omitted then the default queryPlanner verbosity is used. The parameter dictates how much information is returned by the explain() method. You can see in the previous example that a queryPlanner field is returned.
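
For example, these two calls should return the same thing, since queryPlanner is the default verbosity (the other accepted values are “executionStats” and “allPlansExecution”):

> db.indexTest.find({b:4}).explain()
> db.indexTest.find({b:4}).explain("queryPlanner")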

If we run the same query using the executionStats verbosity we get an additional field in the results containing the execution statistics.

> db.indexTest.find({b:4}).explain("executionStats")
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "simonBlog.indexTest",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "b" : {
        "$eq" : 4
      }
    },
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "b" : {
          "$eq" : 4
       }
     },
     "direction" : "forward"
   },
   "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 3200,
    "executionTimeMillis" : 29,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 32000,
    "executionStages" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "b" : {
          "$eq" : 4
        }
      },
      "nReturned" : 3200,
      "executionTimeMillisEstimate" : 30,
      "works" : 32002,
      "advanced" : 3200,
      "needTime" : 28801,
      "needFetch" : 0,
      "saveState" : 250,
      "restoreState" : 250,
      "isEOF" : 1,
      "invalidates" : 0,
      "direction" : "forward",
      "docsExamined" : 32000
    }
  },
  "serverInfo" : {
    "host" : "Simon-Laptop",
    "port" : 27017,
    "version" : "3.0.1",
    "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
  },
  "ok" : 1
}

This gives some information about how many documents were returned by the query (nReturned) and how many documents and index keys were examined (totalDocsExamined and totalKeysExamined respectively). Here we can see that all 32,000 documents were accessed because a collection scan occurred.

The final verbosity option is allPlansExecution and this also returns statistics on the other plans that were considered for execution.
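
I won’t paste the full output here, but the call looks like this – as far as I can tell the extra statistics appear in an allPlansExecution array inside the executionStats field:

> db.indexTest.find({b:4}).explain("allPlansExecution")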

Now let’s add an index that might be useful to our query. We’re filtering on the “b” field so let’s add an index there.

> db.indexTest.createIndex({"b":1})
{
  "createdCollectionAutomatically" : false,
  "numIndexesBefore" : 1,
  "numIndexesAfter" : 2,
  "ok" : 1
}

I’m using version 3.0 of MongoDB. In previous versions you had to use ensureIndex() instead of createIndex() to create an index; ensureIndex() is still available for backwards compatibility.

As you can see the index was successfully created: we had 1 index before (the one on _id) and we now have 2. Creating the index on {“b”:1} means the index is in ascending order; {“b”:-1} would be descending.
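
For reference, the older syntax and a descending version would look like the following. I haven’t actually run these against the indexTest collection – they’re just to illustrate the syntax – so they don’t appear in the getIndexes() output below.

> db.indexTest.ensureIndex({"b":1})    // older syntax, still works in 3.0
> db.indexTest.createIndex({"b":-1})   // would create a descending index on "b"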

> db.indexTest.getIndexes()
[
  {
    "v" : 1,
    "key" : {
      "_id" : 1
    },
    "name" : "_id_",
    "ns" : "simonBlog.indexTest"
  },
  {
    "v" : 1,
    "key" : {
      "b" : 1
    },
    "name" : "b_1",
    "ns" : "simonBlog.indexTest"
  }
]

Now let’s run our query again using the explain() method with the execution statistics turned on…

> db.indexTest.find({b:4}).explain("executionStats")
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "simonBlog.indexTest",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "b" : {
        "$eq" : 4
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "b" : 1
        },
        "indexName" : "b_1",
        "isMultiKey" : false,
        "direction" : "forward",
        "indexBounds" : {
          "b" : [
            "[4.0, 4.0]"
          ]
        }
      }
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 3200,
    "executionTimeMillis" : 171,
    "totalKeysExamined" : 3200,
    "totalDocsExamined" : 3200,
    "executionStages" : {
      "stage" : "FETCH",
      "nReturned" : 3200,
      "executionTimeMillisEstimate" : 10,
      "works" : 3201,
      "advanced" : 3200,
      "needTime" : 0,
      "needFetch" : 0,
      "saveState" : 26,
      "restoreState" : 26,
      "isEOF" : 1,
      "invalidates" : 0,
      "docsExamined" : 3200,
      "alreadyHasObj" : 0,
      "inputStage" : {
        "stage" : "IXSCAN",
        "nReturned" : 3200,
        "executionTimeMillisEstimate" : 10,
        "works" : 3200,
        "advanced" : 3200,
        "needTime" : 0,
        "needFetch" : 0,
        "saveState" : 26,
        "restoreState" : 26,
        "isEOF" : 1,
        "invalidates" : 0,
        "keyPattern" : {
          "b" : 1
        },
        "indexName" : "b_1",
        "isMultiKey" : false,
        "direction" : "forward",
        "indexBounds" : {
          "b" : [
            "[4.0, 4.0]"
          ]
        },
        "keysExamined" : 3200,
        "dupsTested" : 0,
        "dupsDropped" : 0,
        "seenInvalidated" : 0,
        "matchTested" : 0
      }
    }
  },
  "serverInfo" : {
    "host" : "Simon-Laptop",
    "port" : 27017,
    "version" : "3.0.1",
    "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
  },
  "ok" : 1
}

Now we can see that an index scan occurred (IXSCAN) and 3,200 keys were examined (keysExamined). However, you can see from the docsExamined field that we still had to access 3,200 documents. This is an improvement on the 32,000 documents we accessed without the index, but MongoDB still has to access the actual documents because we are returning all the fields in them. Only the “b” field is included in the index, so MongoDB uses the index to do the filtering and then uses the pointers in the index to access the matching documents directly.

Now suppose we only want to return the “c” field in our query. A more useful index would be on the “b” field and then the “c” field. The “b” field must come first in the index as this is the one we filter on.

> db.indexTest.createIndex({"b":1,"c":-1})
{
  "createdCollectionAutomatically" : false,
  "numIndexesBefore" : 2,
  "numIndexesAfter" : 3,
  "ok" : 1
}

Now let’s use a projection to only return the “c” field in the results.

> db.indexTest.find({b:4},{_id:0,c:1}).explain("executionStats")
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "simonBlog.indexTest",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "b" : {
        "$eq" : 4
      }
    },
    "winningPlan" : {
      "stage" : "PROJECTION",
      "transformBy" : {
        "_id" : 0,
        "c" : 1
      },
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "b" : 1,
          "c" : -1
        },
        "indexName" : "b_1_c_-1",
        "isMultiKey" : false,
        "direction" : "forward",
        "indexBounds" : {
          "b" : [
            "[4.0, 4.0]"
          ],
          "c" : [
            "[MaxKey, MinKey]"
          ]
        }
      }
    },
    "rejectedPlans" : [
      {
        "stage" : "PROJECTION",
        "transformBy" : {
          "_id" : 0,
          "c" : 1
        },
        "inputStage" : {
          "stage" : "FETCH",
          "inputStage" : {
            "stage" : "IXSCAN",
            "keyPattern" : {
              "b" : 1
            },
            "indexName" : "b_1",
            "isMultiKey" : false,
            "direction" : "forward",
            "indexBounds" : {
              "b" : [
                "[4.0, 4.0]"
              ]
            }
          }
        }
      }
    ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 3200,
    "executionTimeMillis" : 4,
    "totalKeysExamined" : 3200,
    "totalDocsExamined" : 0,
    "executionStages" : {
      "stage" : "PROJECTION",
      "nReturned" : 3200,
      "executionTimeMillisEstimate" : 0,
      "works" : 3201,
      "advanced" : 3200,
      "needTime" : 0,
      "needFetch" : 0,
      "saveState" : 26,
      "restoreState" : 26,
      "isEOF" : 1,
      "invalidates" : 0,
      "transformBy" : {
        "_id" : 0,
        "c" : 1
      },
      "inputStage" : {
        "stage" : "IXSCAN",
        "nReturned" : 3200,
        "executionTimeMillisEstimate" : 0,
        "works" : 3201,
        "advanced" : 3200,
        "needTime" : 0,
        "needFetch" : 0,
        "saveState" : 26,
        "restoreState" : 26,
        "isEOF" : 1,
        "invalidates" : 0,
        "keyPattern" : {
          "b" : 1,
          "c" : -1
        },
        "indexName" : "b_1_c_-1",
        "isMultiKey" : false,
        "direction" : "forward",
        "indexBounds" : {
          "b" : [
            "[4.0, 4.0]"
          ],
          "c" : [
            "[MaxKey, MinKey]"
          ]
        },
        "keysExamined" : 3200,
        "dupsTested" : 0,
        "dupsDropped" : 0,
        "seenInvalidated" : 0,
        "matchTested" : 0
      }
    }
  },
  "serverInfo" : {
    "host" : "Simon-Laptop",
    "port" : 27017,
    "version" : "3.0.1",
    "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
  },
  "ok" : 1
}

The executionStats field contains details on the winning plan and we can now see that 3,200 keys are still examined but 0 documents were accessed. This index now covers the query. If you compare the executionTimeMillisEstimate values, we now have 0 (meaning less than 1 millisecond) where it was 10 when we still had to access the documents, and it was 30 milliseconds when there was no index at all.  So the covering index makes this query over 30 times faster.

In this last result from explain() you can see that MongoDB did consider using the first index we created, the one just on the “b” field, but that plan was rejected.
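
If you want to see for yourself what the rejected plan would have cost, you can force MongoDB to use a particular index with hint(). This should show the FETCH stage again, with docsExamined back up at 3,200:

> db.indexTest.find({b:4},{_id:0,c:1}).hint({b:1}).explain("executionStats")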

All the examples above have been on fields with simple data types, either strings or integers. It is possible to index fields that contain arrays and embedded documents.  The creation of the index is the same; it’s just how they are stored that is slightly different.  More on that in a future post.
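
As a taster, the syntax is identical – assuming documents had an embedded field like {e: {f: 1}} or an array field called tags (neither of which exists in the indexTest collection, they’re purely hypothetical), the indexes would be created like this:

> db.indexTest.createIndex({"e.f":1})
> db.indexTest.createIndex({"tags":1})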

The documentation on the MongoDB website is pretty good once you’ve got a basic understanding of how things work and there are more details on the different fields returned by the explain() method.

MongoDB Storage Engines

I’m currently taking the “MongoDB for DBAs” and “MongoDB for .Net Developers” free online courses on the https://university.mongodb.com/ website.  This week in the DBA course they talked about how the storage engine works in MongoDB.  It was quite interesting stuff and here are some of my notes…

MongoDB Storage Engine

The storage engine is the part of the MongoDB server that runs all the CRUD commands and it affects how the data is read from, written to and removed from disk, as well as how it is physically stored on disk.

In MongoDB 3.0 you now get a choice of two storage engines.

In earlier versions you only had the original MMAP v1, which is still the default storage engine in 3.0. But now you also have the option of the Wired Tiger storage engine, a newly supported, open source storage engine that is also used by other databases.

When you start the mongod server you can specify which storage engine to use, e.g.

mongod --storageEngine mmapv1

or

mongod --storageEngine wiredTiger

If you’ve specified a storage engine you’ll see it in the log when mongod starts, but not if you’re just using the default.

You can also see it by running the following from the shell

db.serverStatus()

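In 3.0 the storage engine in use is also reported in a storageEngine section of the serverStatus() output, so you can pull out just that part – the exact fields shown here are from memory, so treat this as a sketch:

> db.serverStatus().storageEngine
{ "name" : "mmapv1" }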

Your MongoDB instance can only run either MMAP v1 or Wired Tiger; you cannot have both MMAP v1 and Wired Tiger data files in your data path directory.

MMAP v1

MMAP v1 stores the data as BSON and maps the raw BSON directly into virtual memory, which allows the operating system to do most of the work for the storage engine. If the files are not already in memory then they get pulled into RAM, and any updates are propagated back to disk.

Collection level locking is available in version 3.0; versions 2.2 to 2.6 had database level locking.

The MongoDB lock is a “multiple reader, single writer” lock, which means you can have multiple readers on the data that will lock out all writers, however a single writer will lock out all readers and all other writers.

There is no document level locking in MMAP v1 because there are other shared resources that can cause contention. For example, if two separate documents appear in the same index, then an update to the first document will result in an update to the index, and an update to the second document will result in another update to the same index. If the updates to both documents occur simultaneously then we may have issues updating the index.

The journal is used a bit like the transaction log in SQL Server. Any update is written to the journal before it is applied to the database. This means that if the file system or the service goes down, the updates in the journal that have not yet been applied can be replayed when everything is back up and running.

As soon as you create a document in a collection in a database, a 64MB data file gets created. When the size of the database increases and the original 64MB file is filled, a new file of double the size is created, i.e. a 128MB file. As the database grows, each new data file is double the size of the last until the 2GB limit is reached. Once it reaches this limit, new 2GB files keep being created.

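You can watch this pre-allocation happen by checking the database statistics as documents are inserted; with MMAP v1 the fileSize field should reflect the total size of the allocated data files (the exact set of fields returned varies by version and storage engine):

> db.stats()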

Prior to version 3.0, MongoDB used to add some padding if it saw documents growing and would try to guess how much padding to add. In version 3.0 a power-of-2 sized allocation was introduced, similar to the scheme used for the data files. If a small document of less than 32 bytes is inserted it will be allocated a 32B slot in the data file. If the document then grows to 33B it will be moved to a 64B slot. The size of the allocation slot is always a power of 2 up to the 2MB limit, and after this more 2MB slots get allocated up to the 16MB document size limit. If a document has to move on disk then all the pointers to it in any indexes need to be updated, which has a performance overhead. This power-of-2 padding doesn’t prevent document movement but it can reduce it, and a document that has moved leaves behind a standardised slot in the data file for a new document to fit into nicely.
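
As a rough illustration of the allocation strategy, here’s a bit of JavaScript you can paste into the shell that rounds a document size up to its power-of-2 slot, capped at the 2MB slot size described above. This is just my sketch of the behaviour, not code from MongoDB itself:

> function allocationSlot(docBytes) {
    var slot = 32;                         // smallest slot described above
    while (slot < docBytes && slot < 2 * 1024 * 1024) {
      slot = slot * 2;                     // keep doubling up to the 2MB slot size
    }
    return slot;
  }
> allocationSlot(33)     // 64
> allocationSlot(1000)   // 1024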

If you have a database where the documents will not grow, or you create placeholders in the documents to give them a certain size, you can disable the padding by explicitly creating the collection with no padding, i.e.

db.createCollection("simonTest", {noPadding:true})

Wired Tiger

There are three main benefits to Wired Tiger:

  1. Performance gains
  2. Document level locking
  3. Compression

This is the first release to include Wired Tiger, so more improvements to MongoDB on Wired Tiger are expected.

Wired Tiger data is stored per collection rather than per database and is stored in B-trees. New documents are added to a separate region of the data files and are merged in with the others in the background. When a document is updated, the new version is written as a new document and the old one is removed. Because of this there is no need for padding, and Wired Tiger can apparently perform this with relative ease.

There are two caches: the Wired Tiger Cache and the File System Cache. All changes to the Wired Tiger Cache are moved into the File System Cache via a checkpoint. The checkpoint runs 60 seconds after the last checkpoint ended, and the previous-but-one checkpoint is not deleted until the next checkpoint has finished. This means that you could turn off the journal but you could still potentially lose the last 60 seconds of data. Once in the File System Cache the changes are flushed to disk.

Wired Tiger has concurrency protocols that are the equivalent of document level locking.

In MMAP v1 the raw BSON is held in the same format in memory as it is stored on disk, so no compression is available. With Wired Tiger the data format in the Wired Tiger Cache is different from the format on the file system, so compression is available. There are two types of compression you can enable in the Wired Tiger storage engine: Snappy or zLib, which are both open source compression libraries. The default is Snappy, which is less processor intensive than zLib but won’t compress the data as much.
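
You can also, I believe, override the compressor for an individual collection when you create it, which from the shell in 3.0 looks something like this (only valid when the server is actually running Wired Tiger, and worth checking against the docs for your version):

> db.createCollection("zlibTest", { storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } } })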

For more information on the MongoDB storage engine see the documentation on their website.