If you have any doubts about anything below, contact us by dropping a mail to the Kung Fu Panda.
We will get back to you very soon.
Basics
a cross-platform, document-oriented database written in C++.
stores JSON-like documents with dynamic schemas.
Mongo is open source, released under the GNU AGPL (server) and the Apache License (drivers).
BSON is the binary encoding of JSON-like documents. BSON adds support for data types like Date and Binary which aren't supported in JSON.
no multi-document transactions/locking in mongodb; updates to a single document are atomic, though.
All writes happen on the primary.
Secondaries apply operations from the primary by replaying the replication oplog (local.oplog.rs).
ObjectId is a 12-byte BSON type: a 4-byte timestamp, a 3-byte machine identifier, a 2-byte process id, and a 3-byte counter starting from a random value.
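Because the first 4 bytes are a timestamp, you can recover a document's creation time straight from its ObjectId in the mongo shell:
    var id = ObjectId()   // e.g. ObjectId("507f191e810c19729de860ea")
    id.getTimestamp()     // the 4-byte timestamp, decoded as an ISODate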
mongodb automatically uses all free memory on the machine as its cache.
monitors show mongodb using a lot of memory, but the usage is dynamic, and memory is released if some other process needs it.
mongo uses memory-mapped files; on 32-bit builds the total storage size is limited to 2 GB, so 64-bit builds are recommended for production.
mongo can aggregate data using the aggregation framework, and can run map-reduce jobs using mongo's built-in map-reduce framework.
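As a rough sketch, a minimal aggregation pipeline in the mongo shell (the orders collection and its fields here are hypothetical):
    db.orders.aggregate([
        { $match: { status: "shipped" } },                            // filter early so later stages scan less
        { $group: { _id: "$custId", total: { $sum: "$amount" } } },   // sum order amounts per customer
        { $sort: { total: -1 } }                                      // biggest totals first
    ])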
mongodb has a built-in analyser (the database profiler) which finds slower-than-expected queries/writes.
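A minimal way to switch the profiler on from the mongo shell (the 100 ms threshold is just an example):
    db.setProfilingLevel(1, 100)                         // profile operations slower than 100 ms
    db.system.profile.find().sort({ ts: -1 }).limit(5)   // inspect the most recent slow operations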
mongo has the concept of a "capped collection" (a fixed-size collection), which behaves like a circular buffer: when it fills up, the oldest documents are removed in insertion order.
in mongo, you can also set a TTL (time to live) on a collection's documents; a background thread keeps removing expired documents.
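Both are easy to try from the mongo shell (the collection names and limits here are hypothetical):
    db.createCollection("log", { capped: true, size: 1048576 })            // fixed-size 1 MB collection
    db.events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })  // TTL: expire docs an hour after createdAt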
When to use mongodb
when you need scalability and high availability.
when your servers are highly distributed and you need geospatial indexes, or data should be served from the datacenter closest to the user.
if you need a variable schema.
if your data size will really increase a lot, and you will need to scale (mongo does it by sharding).
if you don't need to do too many joins (mongodb cannot join two collections, but allows embedding one document inside another; see the sketch after this list).
if you don't mind a bit of data inconsistency (since mongodb doesn't have transactions, you cannot update two documents atomically).
particularly useful for storing and querying unstructured data.
when your application has to handle a high insert load.
when you want to get an application up and running very quickly.
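A sketch of the embedding mentioned above: a hypothetical order document carries its line items, where a relational schema would need a join.
    db.orders.insert({
        _id: 1001,
        custName: "Po",
        items: [                                   // embedded documents replace the join
            { sku: "A1", qty: 2, price: 9.99 },
            { sku: "B7", qty: 1, price: 4.50 }
        ]
    })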
When to not use mongodb
When you need transactions in your database (because you cannot update two documents atomically in mongodb).
When you need the kind of data consistency that is achieved by normalizing the data (to third normal form) in a relational DB, which mongodb cannot give you.
If your data size is not going to grow really large (say beyond 50-100 million rows, or the table size stays under 2-3 GB), you are OK with a relational database, like mysql.
When you need strongly typed data, i.e. the data in a column is strictly of a particular type, like varchar, integer, etc.
Mongodb vs MySQL
Mongodb is highly scalable using sharding; mysql is less scalable because it relies on joins, and a join cannot span two tables when the data resides in different databases/datacenters.
In mongodb, the data is prone to becoming inconsistent, but a well-designed mysql schema ensures data consistency.
mongodb does not allow atomic updates to more than one document, but mysql allows it (through transactions).
Regardless of anything, don't take anyone's word for it. You should always run comparison tests for your own use case before deciding whether to use mongodb or mysql.
Mongodb and MySQL
Mongo and MySQL can be used together, because each of them has its own strengths. In a typical shopping application, the catalog and user activity can be stored in mongodb, and the user workflow/checkout can be modelled using MySQL.
Why is mongodb faster
mongodb keeps all the indexes in memory; if the index size is greater than RAM, part of the index stays on disk, and performance will be slower.
mongo inserts can be faster because the drivers allow a "fire and forget" policy for inserts/updates: the client ignores the result of the insert/update operation instead of waiting for an acknowledgement (this corresponds to the Unacknowledged write concern below).
CAP Theorem
Given three properties of distributed computing systems, viz. consistency, availability, and partition tolerance, a distributed system can provide at most two of the three, never all three.
Replication
Replica sets provide redundancy.
Mongo supports two types of replication:
Replica Sets
the preferred way of replication
one primary and one or more secondaries
it is preferred to have an odd number of total nodes, so that the election of a primary is easier (a clear majority always exists)
if you have an even number of nodes, add one arbiter node, which holds no data and just participates in the election
if the primary goes down, one of the secondaries is elected as the new primary
writes only happen on the primary
secondaries stay in sync with the primary using the oplog
replication is quite fast, and normally only a few milliseconds are needed to sync the secondaries with the primary
it is preferred to write to the primary, and read from the secondaries or from all the nodes
sometimes it is recommended to put one of the nodes in a separate datacenter, so that if one datacenter goes down, the db is unaffected
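A minimal sketch of initiating a three-node replica set from the mongo shell (the host names are hypothetical):
    rs.initiate({
        _id: "rs0",
        members: [
            { _id: 0, host: "db1.example.net:27017" },
            { _id: 1, host: "db2.example.net:27017" },
            { _id: 2, host: "db3.example.net:27017" }
        ]
    })
    // rs.addArb("arb.example.net:27017")   // only if you end up with an even number of data-bearing nodes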
Master/Slave replication
one master with one or more slaves (a legacy approach; replica sets are the preferred way)
WriteConcern
The guarantee that mongodb provides when reporting on the success of a write operation.
The strength of the write concern determines the level of guarantee.
With a weak write concern, write operations return quickly, but in some failure scenarios the writes may not persist.
With a stronger write concern, clients wait for confirmation of the write.
Write concerns are of the following types
Acknowledged : the default write concern.
Unacknowledged : essentially ignores errors ("fire and forget").
Journaled : acknowledges the write operation only after committing the data to the journal.
Replica Acknowledged : acknowledges only after the write has propagated to at least one secondary.
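A sketch of how these types map onto insert options in the 2.6+ mongo shell (the events collection is hypothetical):
    var doc = { type: "click", ts: new Date() };
    db.events.insert(doc, { writeConcern: { w: 0 } })           // Unacknowledged: fire and forget
    db.events.insert(doc, { writeConcern: { w: 1 } })           // Acknowledged: the default
    db.events.insert(doc, { writeConcern: { w: 1, j: true } })  // Journaled
    db.events.insert(doc, { writeConcern: { w: 2 } })           // Replica Acknowledged: primary plus one secondary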
Journaling
Journaling provides faster crash recovery.
mongo writes both to memory and to the journal files.
a clean shutdown removes all journal files, and the changes are flushed to disk.
an unclean shutdown keeps the journal, which is replayed on restart.
Journaling is disabled by default on 32-bit systems.
mongodb allows clients to read documents that were inserted or modified before it commits those changes to disk, regardless of write concern or journaling. As a result, applications may see the following behaviour.
for a system with multiple concurrent readers and writers, clients will be able to read the results of a write operation before it returns.
if mongod terminates before the journal commits, queries may have read data that no longer exists after mongod restarts, even if the write was reported successful.
Commands for admins
db.serverStatus(); => overview of the state of the database process.
db.currentOp(); => the operations currently executing.
mongotop => shows how much time a mongod spends reading and writing, per collection.
mongostat => per-second counters (inserts, queries, updates, deletes, etc.) for a running mongod/mongos.
db.killOp(operationId); => kills an operation.
db.collection.ensureIndex() => creates an index on the collection.
db.isMaster(); => returns whether the current node is the master/primary.
rs.status(); => shows the replication status.
db.collection.stats(); => gives the size of the data and the indexes.
db.collection.find({query}).explain(); => gives the explain plan.
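For example, to check that a query is served by an index rather than a collection scan (collection and field are hypothetical; the cursor names apply to pre-3.0 explain output):
    db.users.ensureIndex({ email: 1 })
    db.users.find({ email: "po@example.com" }).explain()   // expect a BtreeCursor here, not a BasicCursor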
Optimization for small docs
use an explicit _id field: store a value you already need there, instead of adding a separate field next to the auto-generated ObjectId.
use shorter field names (see the sketch below).
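For example, since mongo stores the field names inside every single document, shorter names shave bytes off each one (a hypothetical contrast):
    db.products.insert({ productName: "tea", quantityOnHand: 5 })   // long names repeated per document
    db.products.insert({ n: "tea", q: 5 })                          // same data, smaller document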
Sharding
default chunk size is 64 MB, so chunk splitting starts only once the collection grows beyond that.
for sharding, a mongos server fronts the mongod servers.
all requests for data go to the mongos server, which finds the appropriate shard(s) to serve the request, and routes the request to them.
if required, mongos combines data from two or more mongod servers to build the final response.
each mongos instance maintains a pool of connections to the members of the replica sets backing the sharded cluster.
a mongos instance holds a cache of the config database, which stores the metadata for the sharded cluster, including the mappings of chunks to shards.
mongos updates its cache lazily: when it requests a chunk and discovers that its own info is out of date.
we can run the 'flushRouterConfig' command against any mongos to force it to refresh its cache.
sharding on the ObjectId is possible, but not always the best option: ObjectIds begin with an increasing timestamp, so new inserts all land in the highest chunk; the system keeps migrating that chunk to even things out, but at any given moment the inserts hit only one shard, limiting insert throughput.
we should use hashed shard keys instead (see the sketch below).
a hashed shard key should have good cardinality, i.e. a large number of distinct values.
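A sketch of enabling hashed sharding from the mongo shell (database and collection names are hypothetical; hashed shard keys need MongoDB 2.4+):
    sh.enableSharding("shop")
    sh.shardCollection("shop.orders", { _id: "hashed" })   // the hash spreads inserts evenly across shards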
GridFS
the max document size in mongodb is 16 MB.
for bigger files, we need to use GridFS in mongo, which splits a file across many chunk documents.
also useful when you want to read sections of large files without loading them whole.
one problem in GridFS is that an object in GridFS is not updated atomically (its chunks are separate documents).
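The mongofiles tool that ships with mongo is the quickest way to try GridFS (database and file names here are hypothetical):
    mongofiles -d archive put backup.tar.gz   # stored as fs.files metadata plus fs.chunks documents
    mongofiles -d archive list                # list the files held in GridFS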
Concurrency
mongo has a reader-writer lock.
it allows multiple readers, but only one writer.
a write operation locks the database, and no other read or write can happen while it runs.
multiple reads can access the db at the same time (assuming no writes are happening, in which case no reads can happen).
mongo gives preference to a pending write lock over read locks.
mongodb gotchas/catches
mongodb processes queries sequentially, and acquires a database-level lock (this improved in later versions: MongoDB 3.0 takes collection-level locks under the MMAPv1 engine). So, if a query gets stuck (probably because it did not use an index, and did a full collection scan on a huge collection), it will really slow everything down.
There is no real way to correct inconsistent data in mongodb: the data is duplicated (denormalized) across documents, and once the copies diverge there is no way to know which one is correct.
it is best to keep very short field names in mongodb, because mongo stores the field names inside every document, unlike a relational db where column names live once in the schema; the on-disk data size can be really reduced this way.
Misc
Sparse Index
if you have a large number of documents but only some of them contain a certain field, and those are the ones you are interested in, then you need a sparse index.
a normal index contains an entry for every document, treating a missing field as null; a sparse index skips documents that do not have the field at all, so queries using it return documents where the field is explicitly null but not documents where it is missing.
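A sketch, assuming a hypothetical users collection where only some documents carry a twitterHandle field:
    db.users.ensureIndex({ twitterHandle: 1 }, { sparse: true })   // indexes only the documents that have the field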
Working Set
the portion of the data that clients access most often.
you can estimate the working set size from the 'workingSet' document in the output of db.serverStatus({ workingSet: 1 }).
Memory Mapped files
a data file which the OS keeps in memory via the mmap() system call.
Page faults
occur when mongo needs to access data that is not currently in physical memory.
on a hard page fault, mongo must fetch the data from disk.
on a soft page fault, mongo just moves the data from one list to another, e.g. from the operating system's file cache.
in production, there are hardly any soft page faults.
The amount of RAM needed depends on
the relationship between database storage and the working set
the OS's strategy for LRU (Least Recently Used) page eviction
the impact of journaling
the number or rate of page faults
each db connection thread will need up to 1 MB of RAM.
Consistency
Reads from the primary are strictly consistent, because the primary always has the latest data.
Reads from secondaries are eventually consistent, because the data is only 'eventually' going to be in sync.
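Reading from secondaries (and accepting eventual consistency) is opt-in from the mongo shell:
    db.getMongo().setReadPref("secondaryPreferred")   // route reads to secondaries when one is available
    rs.slaveOk()                                      // older shorthand to allow reads on a secondary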
fsync
a system call that syncs all dirty, in-memory pages to disk; mongo calls fsync at least every 60 seconds.
Redo Log
mongo's journal plays the role of a redo log: it is a write-ahead record of operations that is replayed during crash recovery.
In case of issues with sharding, the following should be checked.
data was manually added/removed on a node, which could give it a different distribution with respect to the shard key.
the shard key has low cardinality and mongodb cannot split the chunks any further.
the dataset is growing faster than the balancer can migrate chunks (very unlikely).