Create full representations of schemas using the probe info service.

# Mozilla Schema Generator

A library for generating full representations of Mozilla telemetry pings.

See [Mozilla Pipeline Schemas](https://www.github.com/mozilla-services/mozilla-pipeline-schemas)
for the more generic structure of pings. This library takes those generic structures and fills in
all of the probes we expect to see in the appropriate places.

## Telemetry Integration

There are two pings we are targeting for integration with this library:

1. [The Main Ping](http://gecko-docs.mozilla.org.s3.amazonaws.com/toolkit/components/telemetry/telemetry/data/main-ping.html)
is the historical Firefox Desktop ping, and contains well over ten thousand pieces of data.
2. [The Glean Ping](https://github.com/mozilla/glean_parser) is the new ping type being created for
more generic data collection.

This library takes the information about what should be in those pings from the [Probe Info Service](https://www.github.com/mozilla/probe-scraper).

## Data Store Integration

The primary use of the schemas is for integration with the
[Schema Transpiler](https://www.github.com/mozilla/jsonschema-transpiler).
The schemas that this repository generates can be transpiled into Avro and BigQuery schemas. They define
the structure of the Avro and BigQuery tables that the [BQ Sink](https://www.github.com/mozilla/gcp-ingestion)
writes to.

### BigQuery Limitations and Splitting

BigQuery has a hard limit of ten thousand columns on any single table. This library
can take that limitation into account by splitting schemas into multiple tables. Each
table has some common information that is duplicated in every table, and then a set
of fields that are unique to that table. The join of these tables gives the full
set of fields available from the ping.

To decide on a table split, we include the `table_group` key in the configuration
file. For example, `payload/histograms` has `table_group: histograms`; this indicates that
a table containing just histograms will be output.

Currently, the generator produces tables for:
- Histograms
- Keyed Histograms
- Scalars
- Keyed Scalars
- Everything else

If a single table expands beyond 9000 columns, we move the new fields to the next table,
producing, for example, `main_histograms_1` and `main_histograms_2`.

Note: Tables are only split if the `--split` parameter is provided.
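
The overflow rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the library's implementation: `split_table_group` and the `probe_N` field names are hypothetical, and the sketch ignores the common columns that are duplicated in every table.

```python
from typing import Dict, List

# Threshold quoted above (assumed here to be a per-table cap on unique fields).
MAX_COLUMNS = 9000


def split_table_group(
    prefix: str, fields: List[str], max_columns: int = MAX_COLUMNS
) -> Dict[str, List[str]]:
    """Assign fields, in order, to numbered tables of at most max_columns each."""
    tables: Dict[str, List[str]] = {}
    for start in range(0, len(fields), max_columns):
        index = start // max_columns + 1  # tables are numbered from 1
        tables[f"{prefix}_{index}"] = fields[start:start + max_columns]
    return tables


# Hypothetical input: 9500 histogram fields overflow into a second table.
tables = split_table_group("main_histograms", [f"probe_{i}" for i in range(9500)])
print({name: len(columns) for name, columns in tables.items()})
# {'main_histograms_1': 9000, 'main_histograms_2': 500}
```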

## Validation

A secondary use case of these schemas is validation. The schemas produced are stricter than
the generic ones, since they include explicit definitions of every metric and probe.

## Usage

### Main Ping

Generate the Full Main Ping schema:

```
mozilla-schema-generator generate-main-ping
```

Generate the Main Ping schema divided among tables (for BigQuery):
```
mozilla-schema-generator generate-main-ping --split --out-dir main-ping
```

To see a full list of options, run `mozilla-schema-generator generate-main-ping --help`.


### Glean

Generate all Glean ping schemas - one for each application, for each ping
that application sends:

```
mozilla-schema-generator generate-glean-ping
```

Write schemas to a directory:
```
mozilla-schema-generator generate-glean-ping --out-dir glean-ping
```

To see a full list of options, run `mozilla-schema-generator generate-glean-ping --help`.


## Configuration Files

By default, configuration files are found in `/config`. You can also specify your own when running the generator.

Configuration files match certain parts of a ping to certain types of probes or metrics. The nesting
of the config file matches the ping it is filling in. For example, Glean stores probe types under
the `metrics` key, so the nesting looks like this:
```
{
  "metrics": {
    "string": {
      <METRIC_ID>: {...}
    }
  }
}
```

While the generic schema doesn't include information about the specific `<METRIC_ID>`s being included,
the schema-generator does. To include the correct metrics that we would find in that section of the ping,
we would organize the `config.yaml` file like this:

```
metrics:
  string:
    match:
      type: string
```

The `match` key indicates that we should fill in this section of the ping schema with metrics,
and the `type: string` makes sure we only put string metrics in there. You can do an exact
match on any field available in the probe info from the [probe-info-service](https://probeinfo.telemetry.mozilla.org/glean/glean/metrics),
which also contains the [Desktop probes](https://probeinfo.telemetry.mozilla.org/firefox/all/main/all_probes).
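
To make the fill-in step concrete, here is a minimal Python sketch that mirrors the config nesting and replaces each `match` node with the metrics that satisfy it. It is an assumption-laden illustration rather than the generator's actual code: `fill_section`, the metric IDs, and the metric info dictionaries are all made up, and only exact matching is handled.

```python
from typing import Any, Dict


def fill_section(
    config: Dict[str, Any], metrics: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Mirror the config nesting, replacing each `match` node with matching metrics."""
    if "match" in config:
        criteria = config["match"]
        # Keep a metric only if every criterion equals the metric's value exactly.
        return {
            metric_id: info
            for metric_id, info in metrics.items()
            if all(info.get(key) == wanted for key, wanted in criteria.items())
        }
    # Recurse into nested sections; skip plain values such as `table_group`.
    return {
        key: fill_section(sub, metrics)
        for key, sub in config.items()
        if isinstance(sub, dict)
    }


# Dict equivalent of the YAML config shown above.
config = {"metrics": {"string": {"match": {"type": "string"}}}}

# Hypothetical metric info, keyed by metric ID.
metrics = {
    "app.name": {"type": "string"},
    "app.launches": {"type": "counter"},
}

filled = fill_section(config, metrics)
print(filled)
# {'metrics': {'string': {'app.name': {'type': 'string'}}}}
```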

There are a few additional keywords allowed under any field:
* `contains` - e.g. `process: contains: main`, indicates that the `process` field is an array
and it should only match those that include the entry `main`.
* `not` - e.g. `send_in_pings: not: glean_ping_info`, indicates that we should match
any field for `send_in_pings` _except_ `glean_ping_info`.
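
The matching rules above might be evaluated along these lines. This is a sketch under stated assumptions, not the library's matcher: `field_matches` and `probe_matches` are hypothetical names, the probe dictionary is invented, and `not` is treated as plain inequality against the whole field value (how the real library handles list-valued fields is not covered here).

```python
from typing import Any, Dict


def field_matches(value: Any, rule: Any) -> bool:
    """Check one field value against an exact value or a keyword rule."""
    if isinstance(rule, dict):
        if "contains" in rule:
            # The field must be an array that includes the given entry.
            return isinstance(value, (list, tuple)) and rule["contains"] in value
        if "not" in rule:
            # Assumption: anything other than the given value matches.
            return value != rule["not"]
    return value == rule


def probe_matches(probe: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    """A probe matches only if every criterion is satisfied."""
    return all(field_matches(probe.get(key), rule) for key, rule in criteria.items())


# Hypothetical probe info and criteria, for illustration only.
probe = {"type": "string", "process": ["main", "content"], "send_in_pings": "metrics"}
criteria = {
    "type": "string",
    "process": {"contains": "main"},
    "send_in_pings": {"not": "glean_ping_info"},
}
print(probe_matches(probe, criteria))  # True
```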

### `table_group` Key

This key indicates which table group a section of the ping belongs to when the schema is split.
Currently we split only the Main ping, not the Glean ping. See the section on [BigQuery
Limitations and Splitting](#bigquery-limitations-and-splitting) for more info.

## Development and Testing

Install requirements:
```
make install-requirements
```

Run tests:
```
make test
```

