Getting Started with MongoDB

I’ve only ever used SQL Server and am only familiar with relational databases.  You’ve probably heard all the talk about NoSQL and I thought I’d have a quick look at MongoDB which saves data in JSON documents rather than in relational format.

I’ve just managed to install it and set it up as a service running on Windows and it seems to work with out any problems.  Here are the instructions I followed

http://docs.mongodb.org/manual/tutorial/install-mongodb-on-windows/

I extracted the contents of the download to

D:\mongodb

and saved the following in a batch file so I can just double click to connect.

D:\mongodb\bin\mongo.exe

Just type “exit” to exit.

There is a very quick example of how to add and extract data on this page

db.test.save( { a: 1 } )
db.test.find()

When you run the find() command you’ll see that you get an “_id” field back. This is the unique identifier for the document, equivalent to a primary key in SQL. You can set this value in the document you save, for example

db.test.save( {_id: 1, a: 1 } )

That’s as far as I’ve got so far but I’ve also just signed up to “M101J: MongoDB for Java Developers” in https://education.mongodb.com/. It’s been 14 years since I last looked at any Java code so it might be a struggle.

Advertisements

Parameters in Queries

We have some functionality within our application that generates some search strings for an entity before they are compiled into a JSON document and sent up to Elasticsearch.  We have one row in a table per entity and there is a StatusId column used to track the progress of the generation of the search strings.  There are a few different stored procedures that change the values in the StatusId column and recently we’ve been using a standard piece of code in each of these stored procedures that declares some variables with descriptive names to hold the different StatusIds.  These variables then get used within the stored procedure instead of hard coded values.

So instead of something like

UPDATE dbo.Test
SET StatusId = 2
WHERE StatusId = 1;

we use

DECLARE @Queue INT = 1;
DECLARE @Pickup INT = 2;
DECLARE @Process INT = 3;

UPDATE dbo.Test
SET StatusId = @Pickup
WHERE StatusId = @Queue;

I’ve just been looking into whether SQL Server behaves differently when executing these two queries and have found that it does.  There are potential performance issues where the optimiser gets the estimated number of rows very wrong when using the query with the parameters.

To demonstrate this lets create a table so follows:

CREATE TABLE dbo.Test
(
    Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    StatusId INT NOT NULL,
    SomeText VARCHAR(100) NOT NULL
);
GO

Now we can populate this with 100,00 rows as follows:

INSERT INTO dbo.Test (StatusId, SomeText)
VALUES (1, 'Some test data');
GO 100000

So we now have 100,000 rows where the StatusId = 1.  Let’s update 100 rows to have a StatusId = 2.

UPDATE TOP (100) dbo.Test
SET StatusId = 2
WHERE StatusId = 1;

Let’s update the statistics on the table just to make sure we’re all up to date.

UPDATE STATISTICS dbo.Test;

Now we can run the two different types of query that yield the same results.

SELECT *
FROM dbo.Test
WHERE StatusId = 1;
GO
DECLARE @StatusId INT = 1;
SELECT *
FROM dbo.Test
WHERE StatusId = @StatusId;
GO

If we look at the execution plan we can see the first query without the parameter has managed to estimate the number of rows correctly.

ParametersInQueries01

The query with the parameters has estimated the number of rows as 50,000.

ParametersInQueries02

SQL Server has worked out that there are two possible values in the StatusId column and it’s parametrised the query based on either of those values.  If we update another 100 rows to have a third StatusId will we see the estimated number of rows changes.

UPDATE TOP (100) dbo.Test
SET StatusId = 3
WHERE StatusId = 1;
GO
UPDATE STATISTICS dbo.Test;
GO
SELECT *
FROM dbo.Test
WHERE StatusId = 1;
GO
DECLARE @StatusId INT = 1;
SELECT *
FROM dbo.Test
WHERE StatusId = @StatusId;
GO

ParametersInQueries03

As this is a simple SELECT from one table SQL Server has parametrised the query which can be seen in the execution plan.  Notice the SELECT statement near the top has StatusId = @1.

ParametersInQueries04

This means that SQL Server will use the same execution plan if the query remains the same but with a different value for the StatusId.  So if we change the query to return the 100 rows that are at StatusId = 2 the estimated number of rows is bad for each query.

SELECT *
FROM dbo.Test
WHERE StatusId = 2;
GO
DECLARE @StatusId INT = 2;
SELECT *
FROM dbo.Test
WHERE StatusId = @StatusId;
GO

For the first query we have the following:

ParametersInQueries05

This is only because SQL Server parametrised the query and used the same plan for StatusId = 1 and StatusId = 2 but there are lots of restrictions that stop SQL Server doing this by default.  For example we can add another table that contains the status and create a foreign key to our first table as follows.

CREATE TABLE dbo.Status
(
    Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    StatusDescription VARCHAR(100) NOT NULL
);
GO
INSERT INTO dbo.Status (StatusDescription)
VALUES ('Queue'), ('Pickup'), ('Process');
GO
ALTER TABLE dbo.Test ADD CONSTRAINT FK_Test_StatusId FOREIGN KEY (StatusId) REFERENCES dbo.Status (Id);
GO

Using a JOIN prevent SQL Server from parametrising the query so we can run the following queries, one without using a parameter and one using a parameter.

SELECT *
FROM dbo.Test T
INNER JOIN dbo.Status S ON S.Id = T.StatusId
WHERE S.Id = 2;
GO
DECLARE @StatusId INT = 2;
SELECT *
FROM dbo.Test T
INNER JOIN dbo.Status S ON S.Id = T.StatusId
WHERE S.Id = @StatusId;
GO

We can see that both queries return the same 100 rows.  If we look at the execution plans we can see they are very different.  We can also see that the first query has not been parametrised and the plan contains the actual value of 2.

ParametersInQueries06

ParametersInQueries07

If we check the estimated and actual rows in the first execution plan we can see that the index seek and key lookup both have 100 for both values.  The estimated number of rows for the clustered index scan is still 33333.33 which is the number of rows divided by the number of distinct values in those rows.

Conclusion:

Each situation will be different but in most cases it looks like it is not a good idea to use a descriptive static variable in your code instead of using a hard coded value, i.e. one where the value of the parameter is always going to be the same.  SQL Server doesn’t know that the value in the variable is going to be the same every time and generates an execution plan based on all possible values.