Friday, September 18, 2015


MongoDB is a NoSQL document-oriented database which stores data as BSON (Binary JSON) documents.

Salient features of MongoDB are:
  1. Document Oriented - aggregates the data in a minimal number of documents.
  2. Ad hoc queries - searches by field, by range, and by regular expression are supported.
  3. Indexing - any field in the document can be indexed.
  4. Replication - high availability is supported by maintaining copies of the data on more than one replica set member. The data is eventually consistent between the replica set members.
  5. Load Balancing - uses sharding (a shard is a master with one or more slaves) to distribute the data, split into ranges based on the shard key, between multiple shards.
  6. File storage - GridFS, which comes built in, stores a file not as a single document but split into chunks, each stored as a separate document. GridFS is used by plugins for NGINX and lighttpd.
  7. Aggregation - is similar to the SQL GROUP BY clause.
  8. Capped collections - store data in insertion order and, once the specified size is reached, behave like a circular queue. Similar to RRD.
  9. Server side JavaScript execution - is supported.
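
The circular-queue behaviour of a capped collection can be pictured with a small in-memory sketch. This is plain JavaScript, not MongoDB's implementation: the class name CappedCollection is made up for illustration, and it caps only by document count (like the max option), whereas a real capped collection is also bounded by size in bytes.

```javascript
// Illustrative in-memory model of a capped collection; not MongoDB's implementation.
class CappedCollection {
  constructor(max) {
    this.max = max; // maximum number of documents (assumption: count-based cap only)
    this.docs = []; // documents kept in insertion order
  }

  insert(doc) {
    this.docs.push(doc);
    // Once the cap is exceeded, the oldest document makes room for the new one.
    if (this.docs.length > this.max) {
      this.docs.shift();
    }
  }

  find() {
    return this.docs.slice(); // insertion order, oldest first
  }
}

const log = new CappedCollection(3);
for (let i = 1; i <= 5; i++) {
  log.insert({ seq: i });
}
console.log(log.find()); // the two oldest documents (seq 1 and 2) have been dropped
```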

MongoDB is one of the top performing NoSQL databases. Some benchmarks have reported MongoDB performing better than some other NoSQL DBs by as much as 25x.
Eventual Consistency - eventually all nodes in the cluster of a NoSQL DB will have the same data. Data may not be propagated right away if the network breaks down or a node goes down, but once the node is up and the network is working again, the data becomes consistent across all nodes.
1. Collection - a set of documents.
2. Document - a BSON document with a dynamic schema. This means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
Important mongo shell commands:
>show dbs
>show collections
>db.movie.insert({"name":"tutorials point"}) --- will insert a document into the movie collection.
>db.createCollection("mycol", { capped : true, autoIndexId : true, size : 6142800, max : 10000 } )
RDBMS Where Clause Equivalents in MongoDB
To query documents on the basis of some condition, you can use the following operations:

Operation           | Syntax                 | Example                                          | RDBMS Equivalent
Equality            | {<key>:<value>}        | db.mycol.find({"by":"tutorials point"}).pretty() | where by = 'tutorials point'
Less Than           | {<key>:{$lt:<value>}}  | db.mycol.find({"likes":{$lt:50}}).pretty()       | where likes < 50
Less Than Equals    | {<key>:{$lte:<value>}} | db.mycol.find({"likes":{$lte:50}}).pretty()      | where likes <= 50
Greater Than        | {<key>:{$gt:<value>}}  | db.mycol.find({"likes":{$gt:50}}).pretty()       | where likes > 50
Greater Than Equals | {<key>:{$gte:<value>}} | db.mycol.find({"likes":{$gte:50}}).pretty()      | where likes >= 50
Not Equals          | {<key>:{$ne:<value>}}  | db.mycol.find({"likes":{$ne:50}}).pretty()       | where likes != 50
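To make the mapping between the two notations concrete, here is a rough sketch that turns a simple filter document into the matching WHERE clause. It is plain JavaScript and runs without a MongoDB server. The names toWhere, literal and SQL_OPERATORS are made up for this illustration, and only the flat, single-level filters from the operations above are handled.

```javascript
// Maps the comparison operators listed above to their SQL counterparts.
const SQL_OPERATORS = { $lt: "<", $lte: "<=", $gt: ">", $gte: ">=", $ne: "!=" };

// Render a value as a SQL literal (strings quoted, everything else as-is).
function literal(value) {
  return typeof value === "string" ? `'${value}'` : String(value);
}

// Hypothetical helper: converts a single-level filter document to a WHERE clause.
function toWhere(filter) {
  const parts = [];
  for (const [key, cond] of Object.entries(filter)) {
    if (cond !== null && typeof cond === "object") {
      // Operator form, e.g. {likes: {$lt: 50}}
      for (const [op, value] of Object.entries(cond)) {
        if (!(op in SQL_OPERATORS)) throw new Error(`unsupported operator ${op}`);
        parts.push(`${key} ${SQL_OPERATORS[op]} ${literal(value)}`);
      }
    } else {
      // Plain equality, e.g. {by: "tutorials point"}
      parts.push(`${key} = ${literal(cond)}`);
    }
  }
  return "where " + parts.join(" AND ");
}

console.log(toWhere({ by: "tutorials point" }));
console.log(toWhere({ likes: { $lt: 50 } }));
console.log(toWhere({ likes: { $gte: 10, $ne: 50 } }));
```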
AND condition:
>db.mycol.find({key1:value1, key2:value2}).pretty()
OR condition:
>db.mycol.find({$or: [{key1: value1}, {key2: value2}]}).pretty()
AND and OR together, equivalent to 'where likes>10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')':
>db.mycol.find({"likes": {$gt:10}, $or: [{"by": "tutorials point"}, {"title": "MongoDB Overview"}]}).pretty()
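How the implicit AND and the $or array combine can be sketched with a tiny in-memory matcher. This is plain JavaScript covering only a small subset of operators, not the server's query engine. The function matches and the sample documents are assumptions made for illustration.

```javascript
// Illustrative matcher for a small subset of query operators.
const COMPARE = {
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $ne: (a, b) => a !== b,
};

function matches(doc, filter) {
  // Every top-level entry must hold: this is the implicit AND.
  return Object.entries(filter).every(([key, cond]) => {
    if (key === "$or") {
      // At least one of the sub-filters must match.
      return cond.some((sub) => matches(doc, sub));
    }
    if (cond !== null && typeof cond === "object") {
      return Object.entries(cond).every(([op, value]) => COMPARE[op](doc[key], value));
    }
    return doc[key] === cond; // plain equality
  });
}

// Made-up sample documents.
const docs = [
  { title: "MongoDB Overview", by: "someone else", likes: 100 },
  { title: "NoSQL Overview", by: "tutorials point", likes: 5 },
  { title: "Other", by: "someone else", likes: 50 },
];

const filter = {
  likes: { $gt: 10 },
  $or: [{ by: "tutorials point" }, { title: "MongoDB Overview" }],
};

const matchedTitles = docs.filter((d) => matches(d, filter)).map((d) => d.title);
console.log(matchedTitles);
```

Only the first sample document passes both the likes condition and one of the $or branches.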
>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}})

Replace old document with new data:
>db.mycol.save({"_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point New Topic", "by":"Tutorials Point"})
>db.mycol.remove({'title':'MongoDB Overview'})
>db.COLLECTION_NAME.remove(DELETION_CRITERIA,1) -- removes the first match
>db.mycol.remove({}) - removes all documents in the collection.

Projection: selectively display fields and not all fields of a document.
>db.mycol.find({},{"title":1,_id:0}) -- will display title and not _id (which is always displayed otherwise)
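The effect of a projection document can be sketched in plain JavaScript. The function project below is a made-up illustration that handles only inclusion-style specs like the one above. It is not how the server implements projection.

```javascript
// Illustrative projection for inclusion-style specs such as {"title": 1, _id: 0}.
function project(doc, spec) {
  const out = {};
  // _id is included unless it is explicitly set to 0.
  if (spec._id !== 0 && "_id" in doc) {
    out._id = doc._id;
  }
  for (const [field, flag] of Object.entries(spec)) {
    if (field !== "_id" && flag === 1 && field in doc) {
      out[field] = doc[field];
    }
  }
  return out;
}

// Made-up sample document.
const sample = { _id: 1, title: "MongoDB Overview", by: "tutorials point", likes: 100 };
console.log(project(sample, { title: 1, _id: 0 })); // title only
console.log(project(sample, { title: 1 }));         // _id and title
```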
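Sort: 1 is used for ascending order while -1 is used for descending order.
>db.mycol.find().sort({"title":-1}) -- sorts by title in descending order
Index: in the command below, KEY is the name of the field on which you want to create the index and 1 is for ascending order. To create an index in descending order, use -1.
>db.mycol.createIndex({KEY:1})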
> db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
select by_user, count(*) from mycol group by by_user -- how many tutorials are written by each user (grouped by a user)
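What this $group stage computes can be sketched in plain JavaScript over an in-memory array. The function countBy and the sample data are assumptions for illustration. The real aggregation runs on the server.

```javascript
// Illustrative version of {$group: {_id: "$by_user", num_tutorial: {$sum: 1}}}.
function countBy(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const key = doc[field];
    counts.set(key, (counts.get(key) || 0) + 1); // {$sum: 1} adds 1 per document
  }
  return [...counts].map(([key, n]) => ({ _id: key, num_tutorial: n }));
}

// Made-up sample documents.
const tutorials = [
  { title: "MongoDB Overview", by_user: "alice" },
  { title: "NoSQL Overview", by_user: "alice" },
  { title: "Neo4j Overview", by_user: "bob" },
];

const perUser = countBy(tutorials, "by_user");
console.log(perUser);
```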

The following aggregation expressions are available:
$sum - sums up the defined value from all documents in the collection.
    db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
$avg - calculates the average of all given values from all documents in the collection.
    db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
$min - gets the minimum of the corresponding values from all documents in the collection.
    db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
$max - gets the maximum of the corresponding values from all documents in the collection.
    db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
$push - inserts the value into an array in the resulting document.
    db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
$addToSet - inserts the value into an array in the resulting document but does not create duplicates.
    db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
$first - gets the value from the first document in each group. Typically this only makes sense together with a previously applied "$sort" stage.
    db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
$last - gets the value from the last document in each group. Typically this only makes sense together with a previously applied "$sort" stage.
    db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])
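The difference between these expressions is easiest to see on one small group. The sketch below is plain JavaScript with made-up data. The accumulators table and the accumulate helper are illustrative only, and $addToSet here happens to keep first-seen order, which should not be relied on with the real operator.

```javascript
// Illustrative accumulators applied to the values of one field within one group.
const accumulators = {
  $sum: (values) => values.reduce((a, b) => a + b, 0),
  $avg: (values) => values.reduce((a, b) => a + b, 0) / values.length,
  $min: (values) => Math.min(...values),
  $max: (values) => Math.max(...values),
  $push: (values) => values.slice(),
  $addToSet: (values) => [...new Set(values)],
  $first: (values) => values[0],
  $last: (values) => values[values.length - 1],
};

// Hypothetical helper: apply one accumulator to one field of a group.
function accumulate(op, docs, field) {
  return accumulators[op](docs.map((d) => d[field]));
}

// Made-up group: all documents share the same by_user value.
const group = [
  { by_user: "alice", likes: 10, url: "a.example" },
  { by_user: "alice", likes: 30, url: "b.example" },
  { by_user: "alice", likes: 20, url: "a.example" },
];

console.log(accumulate("$sum", group, "likes"));
console.log(accumulate("$push", group, "url"));     // keeps the duplicate
console.log(accumulate("$addToSet", group, "url")); // drops the duplicate
```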
Replication: the client application always interacts with the primary node, and the primary node then replicates the data to the secondary nodes.
  • Replication is keeping multiple copies of the same data for HA and failover. One node is primary and the others are secondary. A minimum of 3 nodes is recommended for a replica set, so that a majority can still elect a primary if one node fails.
  • All write operations go to the primary node of the replica set.
  • Reads can happen from the primary or from any of the secondary nodes.
  • Within the replica set there is some delay before a write to the primary node gets replicated to the secondaries. Your application may want to wait for the write to be replicated. This is controlled by the w setting, which is the number of nodes that must acknowledge the write. It is set to w:1 by default in drivers.
  • MongoDB keeps the data in memory and flushes it to disk periodically. If the application wants to wait for the write to reach the journal on disk, it can set j:1. The default for drivers is w:1 without waiting for the journal.
  • Another setting is w:majority, which means the write should propagate to a majority of the nodes in the replica set. These w and j settings are called write concerns.
  • An application can set its read preference to read from a secondary. A secondary's data may be stale, so this is not recommended.
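
The arithmetic behind w:1 versus w:majority can be sketched as follows. This is plain JavaScript that only models the counting. The function names majority and isSatisfied are made up, and the sketch assumes every member counts towards the majority.

```javascript
// Number of members that make up a majority of a replica set.
function majority(memberCount) {
  return Math.floor(memberCount / 2) + 1;
}

// Hypothetical check: has a write been acknowledged by enough members?
// `w` is a number or the string "majority"; `acks` counts members that have
// the write, the primary included.
function isSatisfied(w, acks, memberCount) {
  const required = w === "majority" ? majority(memberCount) : w;
  return acks >= required;
}

// With 3 members, only the primary has the write so far:
console.log(isSatisfied(1, 1, 3));          // w:1 is already satisfied
console.log(isSatisfied("majority", 1, 3)); // w:majority still waits for one secondary
console.log(isSatisfied("majority", 2, 3)); // satisfied once one secondary has it
```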

  • Following is how a MongoDB replica set is created:

    mkdir -p /data/rs1 /data/rs2 /data/rs3
    mongod --replSet m101 --logpath "1.log" --dbpath /data/rs1 --port 27017 --smallfiles --oplogSize 64 --fork
    mongod --replSet m101 --logpath "2.log" --dbpath /data/rs2 --port 27018 --smallfiles --oplogSize 64 --fork
    mongod --replSet m101 --logpath "3.log" --dbpath /data/rs3 --port 27019 --smallfiles --oplogSize 64 --fork
    mongo --port 27017

    Now you will create the replica set. Type the following commands into the mongo shell:

    config = { _id: "m101", members:[
              { _id : 0, host : "localhost:27017"},
              { _id : 1, host : "localhost:27018"},
              { _id : 2, host : "localhost:27019"} ]
    };
    rs.initiate(config);

    ps -ef | grep mongod -- will show all the mongod processes running on the localhost.
Setting up replica set:
mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet "REPLICA_SET_INSTANCE_NAME"
Backup/Restore MongoDB data: use mongodump to create a backup and mongorestore to restore it.

Sharding - is a solution for horizontal scaling of MongoDB. More shards (which are in turn made up of replica sets) can be added depending on the load on the system. A shard key needs to be an indexed key in the collection and should be present in all documents. The key need not be unique. It is used to identify the right shard to send the data to for persistence. Within the shard, the replica set creates copies of the data on its member nodes. So sharding splits the data based on a shard key in the document. It's like storing data in a hashmap: the better the key selection, the more evenly the data is divided among the shards.
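
Picking the shard for a document from its shard key can be sketched like this. It is plain JavaScript with a made-up chunk table. The shard names, the range boundaries, the user_id key and the function shardFor are all assumptions for illustration, not how a real cluster stores its metadata.

```javascript
// Hypothetical chunk table: each shard owns a range [min, max) of shard key values.
const chunks = [
  { shard: "shard0", min: -Infinity, max: 1000 },
  { shard: "shard1", min: 1000, max: 2000 },
  { shard: "shard2", min: 2000, max: Infinity },
];

// Find the shard whose range contains the document's shard key value.
function shardFor(doc, shardKey) {
  const value = doc[shardKey];
  if (value === undefined) {
    // The shard key should be present in all documents.
    throw new Error(`document is missing shard key "${shardKey}"`);
  }
  const chunk = chunks.find((c) => value >= c.min && value < c.max);
  return chunk.shard;
}

console.log(shardFor({ user_id: 42, name: "a" }, "user_id"));   // falls in the first range
console.log(shardFor({ user_id: 1000, name: "b" }, "user_id")); // lower bound is inclusive
console.log(shardFor({ user_id: 2500, name: "c" }, "user_id")); // falls in the last range
```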

Sharding is controlled via the mongos router. The application connects to the mongos router, which listens on port 27017 for example, and knows from the shard key which shard to insert the data into. For a read/find operation that does not include the shard key, mongos queries the primary node (of the replica set) in each shard and collates the results.
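
The "query each shard and collate" step can be sketched in plain JavaScript. The shards object, its contents and the function scatterGather are made up for illustration. A real mongos talks to replica sets over the network rather than to in-memory arrays.

```javascript
// Hypothetical in-memory shards; in a real cluster each would be a replica set.
const shards = {
  shard0: [{ user_id: 1, likes: 5 }, { user_id: 7, likes: 60 }],
  shard1: [{ user_id: 1200, likes: 80 }],
  shard2: [{ user_id: 2500, likes: 20 }],
};

// Run the predicate against every shard and collate the partial results.
function scatterGather(predicate) {
  const results = [];
  for (const docs of Object.values(shards)) {
    results.push(...docs.filter(predicate));
  }
  return results;
}

// The filter does not use the shard key, so every shard has to be asked.
const popular = scatterGather((doc) => doc.likes > 50);
console.log(popular.map((d) => d.user_id));
```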

Advantages:
  1. Easy to set up and develop against in multiple languages, like Java, Python etc. Good for building POC applications.
  2. Stable enough.
  3. Querying is very powerful.
  4. Pretty performant in querying and inserts. If the schema mostly has embedded documents then it is worth using.
  5. JSON structures can model quite complex objects.
  6. A schema-less DB can be useful (at least it appears so in theory) as it can reduce the pain of migration. You still need to write the migration scripts, but you do not have to worry about altering the schema (there is no DDL or schema definition in MongoDB, all schema gets realized at runtime, and documents within the same collection can be quite dissimilar).


Disadvantages:
  1. By design, no referential integrity is supported:
    1. So no cascade deletion - we need to handle it in the application.
  2. Very hard to design with only embedded documents in the schema. Most of the time we end up having documents with references or links, which is akin to relations in an RDBMS. This comes at a price: building a transaction with rollback is too much work in the application. MongoDB only guarantees atomicity within a document's boundary. So when most of your documents hold references instead of embedded documents, consider using an RDBMS.
  3. Hard to model many-to-many relations.
