MongoDB

Module mongodb

Certified

Important Capabilities

| Capability          | Status             | Notes |
|---------------------|--------------------|-------|
| Table-Level Lineage | Enabled by default |       |

This plugin extracts the following:

  • Databases and associated metadata
  • Collections in each database and schemas for each collection (via schema inference)

By default, schema inference samples 1,000 documents from each collection. Setting schemaSamplingSize: null will scan the entire collection instead. Setting useRandomSampling: False samples the first documents found, in natural order rather than at random, which may be faster for large collections.

Note that schemaSamplingSize has no effect if enableSchemaInference: False is set.

Very large schemas are truncated to a maximum of 300 fields. This limit is configurable via the maxSchemaSize parameter.
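To illustrate the idea behind sampling-based schema inference, here is a hypothetical sketch that derives a flat field-to-type mapping from a sample of documents, with truncation to a maximum schema size. This is not the plugin's actual implementation, which also handles nested documents, arrays, and mixed types.

```python
from collections import Counter

def infer_schema(documents, max_schema_size=300):
    """Infer a flat {field: type-name} mapping from sampled documents.

    For each field, tally the Python type of every observed value and
    keep the most common one. Truncate the result to max_schema_size
    fields, mirroring the maxSchemaSize behavior described above.
    """
    field_types = {}
    for doc in documents:
        for field, value in doc.items():
            field_types.setdefault(field, Counter())[type(value).__name__] += 1
    return {
        field: counts.most_common(1)[0][0]
        for field, counts in list(field_types.items())[:max_schema_size]
    }

# Sampled documents with one mixed-type field ("age").
sample = [
    {"_id": 1, "name": "alice", "age": 34},
    {"_id": 2, "name": "bob", "age": "unknown"},
    {"_id": 3, "name": "carol", "age": 29, "email": "c@example.com"},
]
print(infer_schema(sample))
# → {'_id': 'int', 'name': 'str', 'age': 'int', 'email': 'str'}
```

Note that with a mixed-type field, a sketch like this simply reports the majority type; the trade-off between sample size and fidelity is exactly what schemaSamplingSize controls.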

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[mongodb]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password
    authMechanism: "DEFAULT"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300

sink:
  # sink configs
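Once the recipe is saved to a file (the filename below is illustrative), it can be run with the DataHub CLI. This assumes the plugin is installed and a MongoDB instance is reachable at the configured URI.

```shell
# Run the ingestion recipe saved as mongodb_recipe.yaml
datahub ingest -c mongodb_recipe.yaml
```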

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Required | Type | Description | Default |
|-------|----------|------|-------------|---------|
| authMechanism | ✅ | string | MongoDB authentication mechanism. | None |
| connect_uri | ✅ | string | MongoDB connection URI. | mongodb://localhost |
| enableSchemaInference | ✅ | boolean | Whether to infer schemas. | True |
| maxDocumentSize | ✅ | integer | | 16793600 |
| maxSchemaSize | ✅ | integer | Maximum number of fields to include in the schema. | 300 |
| options | ✅ | object | Additional options to pass to pymongo.MongoClient(). | None |
| password | ✅ | string | MongoDB password. | None |
| schemaSamplingSize | ✅ | integer | Number of documents to sample when inferring the schema. If set to 0, all documents will be scanned. | 1000 |
| useRandomSampling | ✅ | boolean | Whether documents for schema inference should be selected at random. If False, documents are selected from the start of the collection. | True |
| username | ✅ | string | MongoDB username. | None |
| env | ✅ | string | The environment that all assets produced by this connector belong to. | PROD |
| collection_pattern | ✅ | AllowDenyPattern | Regex patterns for collections to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
| collection_pattern.allow | ❓ (required if collection_pattern is set) | array(string) | | None |
| collection_pattern.deny | ❓ (required if collection_pattern is set) | array(string) | | None |
| collection_pattern.ignoreCase | ❓ (required if collection_pattern is set) | boolean | Whether to ignore case during pattern matching. | True |
| database_pattern | ✅ | AllowDenyPattern | Regex patterns for databases to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
| database_pattern.allow | ❓ (required if database_pattern is set) | array(string) | | None |
| database_pattern.deny | ❓ (required if database_pattern is set) | array(string) | | None |
| database_pattern.ignoreCase | ❓ (required if database_pattern is set) | boolean | Whether to ignore case during pattern matching. | True |
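For example, a recipe that ingests only a single database while skipping certain collections might use the patterns like this (the database name and the deny regex are illustrative):

```yaml
source:
  type: "mongodb"
  config:
    connect_uri: "mongodb://localhost"
    database_pattern:
      allow:
        - "sales"          # ingest only the "sales" database
    collection_pattern:
      deny:
        - ".*_tmp$"        # skip temporary collections
      ignoreCase: True
```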

Code Coordinates

  • Class Name: datahub.ingestion.source.mongodb.MongoDBSource
  • Browse on GitHub

Questions

If you have any questions about configuring ingestion for MongoDB, feel free to ping us on our Slack.