PUT

ATAC-seq data from putamen

Pipeline version: v1.4.1

Report generated at 2019-06-14 07:55:02

Paired-end: [True, True, True, True]

Pipeline type: ATAC-Seq

Genome: hg38.tsv

Peak caller: MACS2

Alignment

Flagstat (raw BAM)

	rep1 (PE)	rep2 (PE)	rep3 (PE)	rep4 (PE)
Total	95444864	190505688	163936258	146180152
Total(QC-failed)	0	0	0	0
Dupes	0	0	0	0
Dupes(QC-failed)	0	0	0	0
Mapped	95394462	190336491	163871444	146146064
Mapped(QC-failed)	0	0	0	0
% Mapped	99.9500	99.9100	99.9600	99.9800
Paired	54494188	105424180	98055646	76743000
Paired(QC-failed)	0	0	0	0
Read1	27247094	52712090	49027823	38371500
Read1(QC-failed)	0	0	0	0
Read2	27247094	52712090	49027823	38371500
Read2(QC-failed)	0	0	0	0
Properly Paired	54411510	105164530	97958260	76671268
Properly Paired(QC-failed)	0	0	0	0
% Properly Paired	99.8500	99.7500	99.9000	99.9100
With itself	54439110	105210878	97984506	76701606
With itself(QC-failed)	0	0	0	0
Singletons	4676	44105	6326	7306
Singletons(QC-failed)	0	0	0	0
% Singleton	0.0100	0.0400	0.0100	0.0100
Diff. Chroms	1996	2912	2773	1606
Diff. Chroms (QC-failed)	0	0	0	0

Marking duplicates (filtered BAM)

Filtered out (samtools view -F 1804):

read unmapped (0x4)
mate unmapped (0x8, for paired-end)
not primary alignment (0x100)
read fails platform/vendor quality checks (0x200)
read is PCR or optical duplicate (0x400)
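
The mask passed to `-F` is the sum of the five flag bits listed above. A minimal Python sketch that decomposes a mask into those bits; the dictionary and function names are illustrative and not part of the pipeline:

```python
# SAM flag bits named in the list above (bit values as given there).
SAM_FLAGS = {
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
}

def decode_mask(mask):
    """Return the names of the listed flag bits that are set in mask."""
    return [name for bit, name in SAM_FLAGS.items() if mask & bit]

# Summing the dictionary keys gives the mask used with `samtools view -F`.
mask = sum(SAM_FLAGS)
print(mask, decode_mask(mask))
```

A record is dropped if any one of these bits is set, which is why the filtered BAM reports no unmapped reads and no singletons.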

	rep1 (PE)	rep2 (PE)	rep3 (PE)	rep4 (PE)
Unpaired Reads	0	0	0	0
Paired Reads	22663221	43049484	41623478	30404447
Unmapped Reads	0	0	0	0
Unpaired Dupes	0	0	0	0
Paired Dupes	19566	7133	12202	12249
Paired Opt. Dupes	0	0	0	0
% Dupes/100	0.0009	0.0002	0.0003	0.0004

Library complexity (filtered non-mito BAM)

	rep1 (PE)	rep2 (PE)	rep3 (PE)	rep4 (PE)
Total Reads (Pairs)	22660866	43048114	41620633	30403338
Distinct Reads (Pairs)	22639306	43037380	41606968	30389477
One Read (Pair)	22617847	43026900	41593368	30375764
Two Reads (Pairs)	21388	10381	13554	13652
NRF = Distinct/Total	0.9990	0.9998	0.9997	0.9995
PBC1 = OnePair/Distinct	0.9991	0.9998	0.9997	0.9995
PBC2 = OnePair/TwoPair	1057.5017	4144.7741	3068.7154	2225.0047

Mitochondrial reads are filtered out.

NRF (non redundant fraction)
PBC1 (PCR Bottleneck coefficient 1)
PBC2 (PCR Bottleneck coefficient 2)
PBC1 is the primary measure. Provisionally:

0-0.5 is severe bottlenecking
0.5-0.8 is moderate bottlenecking
0.8-0.9 is mild bottlenecking
0.9-1.0 is no bottlenecking
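
As a rough illustration, the three ratios can be recomputed from the counts in the library complexity table. The function names below are hypothetical, and the handling of values that fall exactly on a category boundary is an assumption, since the ranges above overlap at their edges:

```python
def library_complexity(total, distinct, one_read, two_reads):
    """Compute NRF, PBC1 and PBC2 from read-pair position counts."""
    nrf = distinct / total          # NRF = Distinct/Total
    pbc1 = one_read / distinct      # PBC1 = OnePair/Distinct
    pbc2 = one_read / two_reads     # PBC2 = OnePair/TwoPair
    return nrf, pbc1, pbc2

def bottleneck_level(pbc1):
    """Map PBC1 onto the provisional categories listed above."""
    if pbc1 < 0.5:
        return "severe"
    if pbc1 < 0.8:
        return "moderate"
    if pbc1 < 0.9:
        return "mild"
    return "none"

# rep1 counts from the library complexity table.
nrf, pbc1, pbc2 = library_complexity(22660866, 22639306, 22617847, 21388)
print(round(nrf, 4), round(pbc1, 4), round(pbc2, 4), bottleneck_level(pbc1))
```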

Flagstat (filtered/deduped BAM)

Filtered and duplicates removed

	rep1 (PE)	rep2 (PE)	rep3 (PE)	rep4 (PE)
Total	45287310	86084702	83222552	60784396
Total(QC-failed)	0	0	0	0
Dupes	0	0	0	0
Dupes(QC-failed)	0	0	0	0
Mapped	45287310	86084702	83222552	60784396
Mapped(QC-failed)	0	0	0	0
% Mapped	100.0000	100.0000	100.0000	100.0000
Paired	45287310	86084702	83222552	60784396
Paired(QC-failed)	0	0	0	0
Read1	22643655	43042351	41611276	30392198
Read1(QC-failed)	0	0	0	0
Read2	22643655	43042351	41611276	30392198
Read2(QC-failed)	0	0	0	0
Properly Paired	45287310	86084702	83222552	60784396
Properly Paired(QC-failed)	0	0	0	0
% Properly Paired	100.0000	100.0000	100.0000	100.0000
With itself	45287310	86084702	83222552	60784396
With itself(QC-failed)	0	0	0	0
Singletons	0	0	0	0
Singletons(QC-failed)	0	0	0	0
% Singleton	0.0000	0.0000	0.0000	0.0000
Diff. Chroms	0	0	0	0
Diff. Chroms (QC-failed)	0	0	0	0

Peak calling

IDR (Irreproducible Discovery Rate) plots

Reproducibility QC and peak detection statistics

The number of peaks is capped at 300K for peak-caller MACS2

	overlap	IDR
Nt	209467	133096
N1	169347	101398
N2	177842	101555
N3	194335	110142
N4	177427	108763
Np	241767	177004
N optimal	241767	177004
N conservative	209467	133096
Optimal Set	ppr	ppr
Conservative Set	rep1-rep2	rep1-rep2
Rescue Ratio	1.1542	1.3299
Self Consistency Ratio	1.1476	1.0862
Reproducibility	pass	pass

Overlapping peaks

N1: Replicate 1 self-consistent overlapping peaks (comparing two pseudoreplicates generated by subsampling Rep1 reads)
N2: Replicate 2 self-consistent overlapping peaks (comparing two pseudoreplicates generated by subsampling Rep2 reads)
Nt: True Replicate consistent overlapping peaks (comparing true replicates Rep1 vs Rep2)
Np: Pooled-pseudoreplicate consistent overlapping peaks (comparing two pseudoreplicates generated by subsampling pooled reads from Rep1 and Rep2 )
Self-consistency Ratio: max(N1,N2) / min (N1,N2)
Rescue Ratio: max(Np,Nt) / min (Np,Nt)
Reproducibility Test: If Self-consistency Ratio >2 AND Rescue Ratio > 2, then 'Fail' else 'Pass'

IDR (Irreproducible Discovery Rate) peaks

N1: Replicate 1 self-consistent IDR 0.1 peaks (comparing two pseudoreplicates generated by subsampling Rep1 reads)
N2: Replicate 2 self-consistent IDR 0.1 peaks (comparing two pseudoreplicates generated by subsampling Rep2 reads)
Nt: True Replicate consistent IDR 0.1 peaks (comparing true replicates Rep1 vs Rep2 )
Np: Pooled-pseudoreplicate consistent IDR 0.1 peaks (comparing two pseudoreplicates generated by subsampling pooled reads from Rep1 and Rep2 )
Self-consistency Ratio: max(N1,N2) / min (N1,N2)
Rescue Ratio: max(Np,Nt) / min (Np,Nt)
Reproducibility Test: If Self-consistency Ratio >2 AND Rescue Ratio > 2, then 'Fail' else 'Pass'
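
The two ratios and the test can be sketched as follows. The definitions above name only N1 and N2, but this run has four replicates; the ratios reported in the table are consistent with taking the maximum and minimum over N1 to N4, so the sketch assumes that generalisation. The function name and the lowercase verdict strings are illustrative:

```python
def reproducibility(n_self, n_true, n_pooled, threshold=2.0):
    """Return (self-consistency ratio, rescue ratio, verdict).

    n_self   -- self-consistent peak counts, one per replicate (N1, N2, ...)
    n_true   -- peaks consistent between true replicates (Nt)
    n_pooled -- peaks consistent between pooled pseudoreplicates (Np)
    """
    self_consistency = max(n_self) / min(n_self)
    rescue = max(n_pooled, n_true) / min(n_pooled, n_true)
    # Fail only when both ratios exceed the threshold, as stated above.
    failed = self_consistency > threshold and rescue > threshold
    return self_consistency, rescue, "fail" if failed else "pass"

# Counts from the "overlap" and "IDR" columns of the table above.
overlap = reproducibility([169347, 177842, 194335, 177427], 209467, 241767)
idr = reproducibility([101398, 101555, 110142, 108763], 133096, 177004)
print(overlap)
print(idr)
```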

Enrichment

Strand cross-correlation measures

Performed on subsampled reads (25M)

	rep1	rep2	rep3	rep4
Reads	22641318	25000000	25000000	25000000
Est. Fragment Len.	0	0	0	0
Corr. Est. Fragment Len.	0.3525	0.3159	0.3204	0.3265
Phantom Peak	50	50	50	55
Corr. Phantom Peak	0.3021	0.2778	0.2733	0.2941
Argmin. Corr.	1500	1500	1500	1500
Min. Corr.	0.2014	0.2437	0.2472	0.2319
NSC	1.7502	1.2962	1.2962	1.4078
RSC	1.5011	2.1173	2.7987	1.5203

NOTE1: For SE datasets, reads from replicates are randomly subsampled.
NOTE2: For PE datasets, the first end of each read-pair is selected and the reads are then randomly subsampled.

Normalized strand cross-correlation coefficient (NSC) = col9 in outFile
Relative strand cross-correlation coefficient (RSC) = col10 in outFile
Estimated fragment length = col3 in outFile, take the top value
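
The report gives NSC and RSC only as output columns. Under the usual definitions (an assumption here, since the formulas are not stated in this report), NSC is the correlation at the estimated fragment length divided by the minimum correlation, and RSC is the same correlation relative to the phantom-peak correlation after subtracting the minimum from both. A small sketch using the rounded rep1 values from the table, so the results agree with the reported NSC and RSC only to about three decimal places:

```python
def nsc_rsc(corr_fraglen, corr_phantom, corr_min):
    """Compute NSC and RSC from three cross-correlation values."""
    nsc = corr_fraglen / corr_min
    rsc = (corr_fraglen - corr_min) / (corr_phantom - corr_min)
    return nsc, rsc

# rep1: Corr. Est. Fragment Len., Corr. Phantom Peak, Min. Corr.
nsc, rsc = nsc_rsc(0.3525, 0.3021, 0.2014)
print(round(nsc, 3), round(rsc, 3))
```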

Fraction of reads in overlapping peaks

	rep1-rep2	rep1-rep3	rep1-rep4	rep2-rep3	rep2-rep4	rep3-rep4	rep1-pr	rep2-pr	rep3-pr	rep4-pr	ppr
Fraction of Reads in Peak	0.1689	0.1667	0.1694	0.1637	0.1645	0.1626	0.2272	0.1362	0.1354	0.1808	0.1813

ppr: Overlapping peaks comparing pooled pseudo replicates
rep1-pr: Overlapping peaks comparing pseudoreplicates from replicate 1
rep2-pr: Overlapping peaks comparing pseudoreplicates from replicate 2
repi-repj: Overlapping peaks comparing true replicates (rep i vs. rep j)

Fraction of reads in IDR peaks

	rep1-rep2	rep1-rep3	rep1-rep4	rep2-rep3	rep2-rep4	rep3-rep4	rep1-pr	rep2-pr	rep3-pr	rep4-pr	ppr
Fraction of Reads in Peak	0.1351	0.1230	0.1350	0.1210	0.1240	0.1147	0.1861	0.1049	0.0982	0.1458	0.1577

ppr: IDR peaks comparing pooled pseudo replicates
rep1-pr: IDR peaks comparing pseudoreplicates from replicate 1
rep2-pr: IDR peaks comparing pseudoreplicates from replicate 2
repi-repj: IDR peaks comparing true replicates (rep i vs. rep j)

ATAQC

Summary table

	rep1	rep2	rep3	rep4
Genome	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/single-ended	Paired-ended	Paired-ended	Paired-ended	Paired-ended
Read length	50	51	50	51
Read count from sequencer	54494188	105424180	98055646	76743000
Read count successfully aligned	54443786	105254983	97990832	76708912
Read count after filtering for mapping quality	51391743	98086184	93540058	71493582
Read count after removing duplicate reads	51372177	98079051	93527856	71481333
Read count after removing mitochondrial reads (final read count)	45287310	86084702	83222552	60784396
Mapping quality > q30 (out of total)	51391743, 0.943068332351	98086184, 0.93039551268	93540058, 0.953948720097	71493582, 0.931597435597
Duplicates (after filtering)	19566, 0.000863	7133, 0.000166	12202, 0.000293	12249, 0.000403
Mitochondrial reads (out of total)	19077, 0.00019998016237	13372, 7.02545262327e-05	16778, 0.000102385135509	13041, 8.92326460465e-05
Duplicates that are mitochondrial (out of all dups)	36, 0.000919963201472	8, 0.000560773867938	24, 0.00098344533683	4, 0.000163278634991
Final reads (after all filters)	45287310, 0.831048441349	86084702, 0.81655557577	83222552, 0.848727792788	60784396, 0.792051340187
NRF = Distinct/Total	0.999049, OK	0.999751, OK	0.999672, OK	0.999544, OK
PBC1 = OnePair/Distinct	0.999052, OK	0.999756, OK	0.999673, OK	0.999549, OK
PBC2 = OnePair/TwoPair	1057.50173, OK	4144.774107, OK	3068.715361, OK	2225.004688, OK
Picard est library size	24207172875	33139326668	60580360021	26316749690
Fraction of reads in nfr	0.642993808036, OK	0.646909169167, OK	0.378866302083, out of range [0.4, inf]	0.614671225298, OK
Nfr / mono-nuc reads	2.15627959638, out of range [2.5, inf]	2.07666427821, out of range [2.5, inf]	0.931539832479, out of range [2.5, inf]	1.88572038667, out of range [2.5, inf]
Presence of nfr peak	OK	OK	OK	OK
Presence of mono-nuc peak	OK	OK	OK	OK
Presence of di-nuc peak	OK	OK	OK	OK
Naive overlap peaks	241767, OK	241767, OK	241767, OK	241767, OK
Idr peaks	177004, OK	177004, OK	177004, OK	177004, OK
Naive peak stats: min size	73.0000	73.0000	73.0000	73.0000
Naive peak stats: 25 percentile	343.0000	343.0000	343.0000	343.0000
Naive peak stats: 50 percentile (median)	532.0000	532.0000	532.0000	532.0000
Naive peak stats: 75 percentile	754.0000	754.0000	754.0000	754.0000
Naive peak stats: max size	2704.0000	2704.0000	2704.0000	2704.0000
Naive peak stats: mean	579.3118	579.3118	579.3118	579.3118
Idr peak stats: min size	73.0000	73.0000	73.0000	73.0000
Idr peak stats: 25 percentile	438.0000	438.0000	438.0000	438.0000
Idr peak stats: 50 percentile (median)	618.0000	618.0000	618.0000	618.0000
Idr peak stats: 75 percentile	829.0000	829.0000	829.0000	829.0000
Idr peak stats: max size	2704.0000	2704.0000	2704.0000	2704.0000
Idr peak stats: mean	660.5189	660.5189	660.5189	660.5189
Tss enrichment	17.5774	8.2167	8.9285	13.5067
Fraction of reads in universal dhs regions	11673624, 0.257794709654	16374593, 0.190220937091	16816239, 0.202077244275	15375755, 0.252964840913
Fraction of reads in blacklist regions	154, 3.40086208762e-06	401, 4.65835063951e-06	137, 1.64630048762e-06	197, 3.24108140771e-06
Fraction of reads in promoter regions	4531865, 0.100079531589	4352604, 0.0505634803664	5064565, 0.0608598235699	5546244, 0.0912478594467
Fraction of reads in enhancer regions	10857709, 0.239776434393	19100683, 0.221889473487	18755121, 0.225376385749	15744310, 0.259028377757
Fraction of reads in called peak regions	8426262, 0.186081525819	9032504, 0.104929104201	8168805, 0.0981628295969	8861093, 0.145784384641

Replicate 1

Sample Information

Sample
Genome	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/Single-ended	Paired-ended
Read length	50

Summary

Read count from sequencer	54,494,188
Read count successfully aligned	54,443,786
Read count after filtering for mapping quality	51,391,743
Read count after removing duplicate reads	51,372,177
Read count after removing mitochondrial reads (final read count)	45,287,310

Note that all these read counts are determined using 'samtools view' - as such,
these are all reads found in the file, whether one end of a pair or a single
end read. In other words, if your file is paired end, then you should divide
these counts by two. Each step follows the previous step; for example, the
duplicate reads were removed after reads were removed for low mapping quality.
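
A short sketch of the bookkeeping described above: halve each count to get read pairs for a paired-end file, and express each step as a fraction of the starting count. The step labels and function name are illustrative:

```python
# rep1 counts from the Summary table (individual reads, not pairs).
steps = [
    ("from sequencer", 54494188),
    ("aligned", 54443786),
    ("mapping quality filter", 51391743),
    ("duplicates removed", 51372177),
    ("mitochondrial removed", 45287310),
]

def funnel(steps, paired_end=True):
    """Return (step, fragments, fraction of the starting count) per step."""
    start = steps[0][1]
    divisor = 2 if paired_end else 1
    return [(name, count // divisor, count / start) for name, count in steps]

rows = funnel(steps)
for name, fragments, fraction in rows:
    print(f"{name}\t{fragments}\t{fraction:.3f}")
```

The last row gives 22,643,655 pairs, which matches the Read1 count in the filtered flagstat table, and a retained fraction of about 0.831.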

This bar chart also shows the filtering process and where the reads were lost
over the process. Note that each step is sequential - as such, there may
have been more mitochondrial reads which were already filtered because of
high duplication or low mapping quality.

Alignment statistics

Bowtie alignment log

27247094 reads; of these:
  27247094 (100.00%) were paired; of these:
    41339 (0.15%) aligned concordantly 0 times
    20628768 (75.71%) aligned concordantly exactly 1 time
    6576987 (24.14%) aligned concordantly >1 times
    ----
    41339 pairs aligned concordantly 0 times; of these:
      6053 (14.64%) aligned discordantly 1 time
    ----
    35286 pairs aligned 0 times concordantly or discordantly; of these:
      70572 mates make up the pairs; of these:
        50402 (71.42%) aligned 0 times
        3820 (5.41%) aligned exactly 1 time
        16350 (23.17%) aligned >1 times
99.91% overall alignment rate
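
The overall alignment rate on the last line of the log can be recovered from two numbers in it: the number of pairs and the number of mates that aligned 0 times. A minimal sketch, assuming the rate is counted per mate; the function name is illustrative:

```python
def overall_alignment_rate(pairs, mates_unaligned):
    """Fraction of individual mates that aligned at least once."""
    total_mates = 2 * pairs
    return (total_mates - mates_unaligned) / total_mates

# rep1: 27247094 pairs, 50402 mates aligned 0 times.
aligned_mates = 2 * 27247094 - 50402
rate = overall_alignment_rate(27247094, 50402)
print(aligned_mates, f"{100 * rate:.2f}%")
```

The aligned-mate count computed this way equals the "Read count successfully aligned" value in the Summary for this replicate.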

Samtools flagstat

95444864 + 0 in total (QC-passed reads + QC-failed reads)
40950676 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
95394462 + 0 mapped (99.95%:-nan%)
54494188 + 0 paired in sequencing
27247094 + 0 read1
27247094 + 0 read2
54411510 + 0 properly paired (99.85%:-nan%)
54439110 + 0 with itself and mate mapped
4676 + 0 singletons (0.01%:-nan%)
5318 + 0 with mate mapped to a different chr
1996 + 0 with mate mapped to a different chr (mapQ>=5)

Note that the flagstat command counts alignments, not reads. Please
use the read counts table to get accurate counts of reads at each
stage of the pipeline.
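
To see the difference between alignments and reads in the flagstat output above, subtract the secondary and supplementary records from the total. A sketch with a hypothetical helper name:

```python
def primary_records(total, secondary, supplementary=0):
    """Records left after dropping secondary and supplementary alignments."""
    return total - secondary - supplementary

# rep1 values from the flagstat output above.
rep1_primary = primary_records(95444864, 40950676, 0)
print(rep1_primary)
```

For this replicate the result equals the "paired in sequencing" line and the "Read count from sequencer" entry in the Summary.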

Filtering statistics

Mapping quality > q30 (out of total)	51,391,743	0.943
Duplicates (after filtering)	19,566	0.001
Mitochondrial reads (out of total)	19,077	0.000
Duplicates that are mitochondrial (out of all dups)	36	0.001
Final reads (after all filters)	45,287,310	0.831

Mapping quality refers to the quality of the read being aligned to that
particular location in the genome. A standard quality score is > 30.
Duplications are often due to PCR duplication rather than two unique reads
mapping to the same location. High duplication is an indication of poor
libraries. Mitochondrial reads are often high in chromatin accessibility
assays because the mitochondrial genome is very open. A high mitochondrial
fraction is an indication of poor libraries. Based on prior experience, a
final read fraction above 0.70 is a good library.

Library complexity statistics

ENCODE library complexity metrics

Metric	Result
NRF	0.999049 - OK
PBC1	0.999052 - OK
PBC2	1057.50173 - OK

The non-redundant fraction (NRF) is the fraction of non-redundant mapped reads
in a dataset; it is the ratio between the number of positions in the genome
that uniquely mapped reads map to and the total number of uniquely mappable
reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations
with EXACTLY one read pair over the genomic locations with AT LEAST one read
pair. PBC1 is the primary measure, and the PBC1 should be close to 1.
Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking,
0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is
the ratio of genomic locations with EXACTLY one read pair over the genomic
locations with EXACTLY two read pairs. The PBC2 should be significantly
greater than 1.

Picard EstimateLibraryComplexity

24,207,172,875

Yield prediction

Preseq performs a yield prediction by subsampling the reads, calculating the
number of distinct reads, and then extrapolating out to see where the
expected number of distinct reads no longer increases. The confidence interval
gives a gauge as to the validity of the yield predictions.

Fragment length statistics

Metric	Result
Fraction of reads in NFR	0.642993808036 - OK
NFR / mono-nuc reads	2.15627959638 out of range [2.5, inf]
Presence of NFR peak	OK
Presence of Mono-Nuc peak	OK
Presence of Di-Nuc peak	OK

Open chromatin assays show distinct fragment length enrichments, as the cut
sites are only in open chromatin and not in nucleosomes. As such, peaks
representing different n-nucleosomal (e.g. mono-nucleosomal, di-nucleosomal)
fragment lengths will arise. Good libraries will show these peaks in a
fragment length distribution and will show specific peak ratios.
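
The "OK" and "out of range" labels in the table can be reproduced with a simple range check. The ranges are the ones printed in the table; treating the bounds as inclusive is an assumption, and the names are illustrative:

```python
import math

# Acceptance ranges as printed in the fragment length table above.
RANGES = {
    "Fraction of reads in NFR": (0.4, math.inf),
    "NFR / mono-nuc reads": (2.5, math.inf),
}

def check(metric, value):
    """Return 'OK' or an out-of-range message for a metric value."""
    low, high = RANGES[metric]
    if low <= value <= high:
        return "OK"
    return f"out of range [{low}, {high}]"

print(check("Fraction of reads in NFR", 0.642993808036))
print(check("NFR / mono-nuc reads", 2.15627959638))
```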

Peak statistics

Metric	Result
Naive overlap peaks	241767 - OK
IDR peaks	177004 - OK

Naive overlap peak file statistics

Min size	73.0
25 percentile	343.0
50 percentile (median)	532.0
75 percentile	754.0
Max size	2704.0
Mean	579.311800204

IDR peak file statistics

Min size	73.0
25 percentile	438.0
50 percentile (median)	618.0
75 percentile	829.0
Max size	2704.0
Mean	660.518858331

For a good ATAC-seq experiment in human, you expect to get 100k-200k peaks
for a specific cell type.

Sequence quality metrics

GC bias

Open chromatin assays are known to have significant GC bias. Please take this
into consideration as necessary.

Annotation-based quality metrics

Enrichment plots (TSS)

Open chromatin assays should show enrichment in open chromatin sites, such as
TSSs. An average TSS enrichment in human (hg19) is above 6. A strong TSS enrichment is
above 10. For other references please see https://www.encodeproject.org/atac-seq/
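
As a rough guide, the two cut-offs above can be applied to the TSS enrichment row of the summary table. The cut-offs are quoted for hg19 while this run uses GRCh38, so the labels are indicative only; the "below average" label and the function name are illustrative:

```python
def tss_label(score):
    """Label a TSS enrichment score using the cut-offs quoted above."""
    if score > 10:
        return "strong"
    if score > 6:
        return "average"
    return "below average"

# TSS enrichment values from the ATAQC summary table.
tss = {"rep1": 17.5774, "rep2": 8.2167, "rep3": 8.9285, "rep4": 13.5067}
labels = {rep: tss_label(score) for rep, score in tss.items()}
print(labels)
```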

Annotated genomic region enrichments

Fraction of reads in universal DHS regions	11,673,624	0.258
Fraction of reads in blacklist regions	154	0.000
Fraction of reads in promoter regions	4,531,865	0.100
Fraction of reads in enhancer regions	10,857,709	0.240
Fraction of reads in called peak regions	8,426,262	0.186

Signal to noise can be assessed by considering whether reads are falling into
known open regions (such as DHS regions) or not. A high fraction of reads
should fall into the universal (across cell type) DHS set. A small fraction
should fall into the blacklist regions. A high fraction (though not all) should
fall into the promoter regions. A high fraction (though not all) should fall into
the enhancer regions. The promoter regions should not take up all reads, as
it is known that there is a bias for promoters in open chromatin assays.
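
Each fraction in the table is a read count divided by a total. The sketch below uses the rep1 final read count as the denominator, which is an assumption: it reproduces the reported fractions to about three decimal places, but the exact denominator used by the pipeline is not stated in this report.

```python
FINAL_READS = 45287310  # rep1 final read count (assumed denominator)

# rep1 read counts per annotated region set, from the table above.
region_counts = {
    "universal DHS": 11673624,
    "blacklist": 154,
    "promoter": 4531865,
    "enhancer": 10857709,
    "called peaks": 8426262,
}

fractions = {name: count / FINAL_READS for name, count in region_counts.items()}
for name, fraction in fractions.items():
    print(f"{name}\t{fraction:.3f}")
```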

Comparison to Roadmap DNase

This bar chart shows the correlation between each Roadmap DNase sample and
your sample, computed from the signal in the universal DNase peak regions.
The closer a Roadmap sample's signal distribution in these regions is to that
of your sample, the higher the correlation.

Replicate 2

Sample Information

Sample
Genome	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/Single-ended	Paired-ended
Read length	51

Summary

Read count from sequencer	105,424,180
Read count successfully aligned	105,254,983
Read count after filtering for mapping quality	98,086,184
Read count after removing duplicate reads	98,079,051
Read count after removing mitochondrial reads (final read count)	86,084,702

Note that all these read counts are determined using 'samtools view' - as such,
these are all reads found in the file, whether one end of a pair or a single
end read. In other words, if your file is paired end, then you should divide
these counts by two. Each step follows the previous step; for example, the
duplicate reads were removed after reads were removed for low mapping quality.

This bar chart also shows the filtering process and where the reads were lost
over the process. Note that each step is sequential - as such, there may
have been more mitochondrial reads which were already filtered because of
high duplication or low mapping quality.

Alignment statistics

Bowtie alignment log

52712090 reads; of these:
  52712090 (100.00%) were paired; of these:
    129825 (0.25%) aligned concordantly 0 times
    39176936 (74.32%) aligned concordantly exactly 1 time
    13405329 (25.43%) aligned concordantly >1 times
    ----
    129825 pairs aligned concordantly 0 times; of these:
      8677 (6.68%) aligned discordantly 1 time
    ----
    121148 pairs aligned 0 times concordantly or discordantly; of these:
      242296 mates make up the pairs; of these:
        169197 (69.83%) aligned 0 times
        26920 (11.11%) aligned exactly 1 time
        46179 (19.06%) aligned >1 times
99.84% overall alignment rate

Samtools flagstat

190505688 + 0 in total (QC-passed reads + QC-failed reads)
85081508 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
190336491 + 0 mapped (99.91%:-nan%)
105424180 + 0 paired in sequencing
52712090 + 0 read1
52712090 + 0 read2
105164530 + 0 properly paired (99.75%:-nan%)
105210878 + 0 with itself and mate mapped
44105 + 0 singletons (0.04%:-nan%)
8712 + 0 with mate mapped to a different chr
2912 + 0 with mate mapped to a different chr (mapQ>=5)

Note that the flagstat command counts alignments, not reads. Please
use the read counts table to get accurate counts of reads at each
stage of the pipeline.

Filtering statistics

Mapping quality > q30 (out of total)	98,086,184	0.930
Duplicates (after filtering)	7,133	0.000
Mitochondrial reads (out of total)	13,372	0.000
Duplicates that are mitochondrial (out of all dups)	8	0.001
Final reads (after all filters)	86,084,702	0.817

Mapping quality refers to the quality of the read being aligned to that
particular location in the genome. A standard quality score is > 30.
Duplications are often due to PCR duplication rather than two unique reads
mapping to the same location. High duplication is an indication of poor
libraries. Mitochondrial reads are often high in chromatin accessibility
assays because the mitochondrial genome is very open. A high mitochondrial
fraction is an indication of poor libraries. Based on prior experience, a
final read fraction above 0.70 is a good library.

Library complexity statistics

ENCODE library complexity metrics

Metric	Result
NRF	0.999751 - OK
PBC1	0.999756 - OK
PBC2	4144.774107 - OK

The non-redundant fraction (NRF) is the fraction of non-redundant mapped reads
in a dataset; it is the ratio between the number of positions in the genome
that uniquely mapped reads map to and the total number of uniquely mappable
reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations
with EXACTLY one read pair over the genomic locations with AT LEAST one read
pair. PBC1 is the primary measure, and the PBC1 should be close to 1.
Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking,
0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is
the ratio of genomic locations with EXACTLY one read pair over the genomic
locations with EXACTLY two read pairs. The PBC2 should be significantly
greater than 1.

Picard EstimateLibraryComplexity

33,139,326,668

Yield prediction

Preseq performs a yield prediction by subsampling the reads, calculating the
number of distinct reads, and then extrapolating out to see where the
expected number of distinct reads no longer increases. The confidence interval
gives a gauge as to the validity of the yield predictions.

Fragment length statistics

Metric	Result
Fraction of reads in NFR	0.646909169167 - OK
NFR / mono-nuc reads	2.07666427821 out of range [2.5, inf]
Presence of NFR peak	OK
Presence of Mono-Nuc peak	OK
Presence of Di-Nuc peak	OK

Open chromatin assays show distinct fragment length enrichments, as the cut
sites are only in open chromatin and not in nucleosomes. As such, peaks
representing different n-nucleosomal (e.g. mono-nucleosomal, di-nucleosomal)
fragment lengths will arise. Good libraries will show these peaks in a
fragment length distribution and will show specific peak ratios.

Peak statistics

Metric	Result
Naive overlap peaks	241767 - OK
IDR peaks	177004 - OK

Naive overlap peak file statistics

Min size	73.0
25 percentile	343.0
50 percentile (median)	532.0
75 percentile	754.0
Max size	2704.0
Mean	579.311800204

IDR peak file statistics

Min size	73.0
25 percentile	438.0
50 percentile (median)	618.0
75 percentile	829.0
Max size	2704.0
Mean	660.518858331

For a good ATAC-seq experiment in human, you expect to get 100k-200k peaks
for a specific cell type.

Sequence quality metrics

GC bias

Open chromatin assays are known to have significant GC bias. Please take this
into consideration as necessary.

Annotation-based quality metrics

Enrichment plots (TSS)

Open chromatin assays should show enrichment in open chromatin sites, such as
TSSs. An average TSS enrichment in human (hg19) is above 6. A strong TSS enrichment is
above 10. For other references please see https://www.encodeproject.org/atac-seq/

Annotated genomic region enrichments

Fraction of reads in universal DHS regions	16,374,593	0.190
Fraction of reads in blacklist regions	401	0.000
Fraction of reads in promoter regions	4,352,604	0.051
Fraction of reads in enhancer regions	19,100,683	0.222
Fraction of reads in called peak regions	9,032,504	0.105

Signal to noise can be assessed by considering whether reads are falling into
known open regions (such as DHS regions) or not. A high fraction of reads
should fall into the universal (across cell type) DHS set. A small fraction
should fall into the blacklist regions. A high fraction (though not all) should
fall into the promoter regions. A high fraction (though not all) should fall into
the enhancer regions. The promoter regions should not take up all reads, as
it is known that there is a bias for promoters in open chromatin assays.

Comparison to Roadmap DNase

This bar chart shows the correlation between each Roadmap DNase sample and
your sample, computed from the signal in the universal DNase peak regions.
The closer a Roadmap sample's signal distribution in these regions is to that
of your sample, the higher the correlation.

Replicate 3

Sample Information

Sample
Genome	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/Single-ended	Paired-ended
Read length	50

Summary

Read count from sequencer	98,055,646
Read count successfully aligned	97,990,832
Read count after filtering for mapping quality	93,540,058
Read count after removing duplicate reads	93,527,856
Read count after removing mitochondrial reads (final read count)	83,222,552

Note that all these read counts are determined using 'samtools view' - as such,
these are all reads found in the file, whether one end of a pair or a single
end read. In other words, if your file is paired end, then you should divide
these counts by two. Each step follows the previous step; for example, the
duplicate reads were removed after reads were removed for low mapping quality.

This bar chart also shows the filtering process and where the reads were lost
over the process. Note that each step is sequential - as such, there may
have been more mitochondrial reads which were already filtered because of
high duplication or low mapping quality.

Alignment statistics

Bowtie alignment log

49027823 reads; of these:
  49027823 (100.00%) were paired; of these:
    48693 (0.10%) aligned concordantly 0 times
    38514404 (78.56%) aligned concordantly exactly 1 time
    10464726 (21.34%) aligned concordantly >1 times
    ----
    48693 pairs aligned concordantly 0 times; of these:
      4789 (9.84%) aligned discordantly 1 time
    ----
    43904 pairs aligned 0 times concordantly or discordantly; of these:
      87808 mates make up the pairs; of these:
        64814 (73.81%) aligned 0 times
        4520 (5.15%) aligned exactly 1 time
        18474 (21.04%) aligned >1 times
99.93% overall alignment rate

Samtools flagstat

163936258 + 0 in total (QC-passed reads + QC-failed reads)
65880612 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
163871444 + 0 mapped (99.96%:-nan%)
98055646 + 0 paired in sequencing
49027823 + 0 read1
49027823 + 0 read2
97958260 + 0 properly paired (99.90%:-nan%)
97984506 + 0 with itself and mate mapped
6326 + 0 singletons (0.01%:-nan%)
6528 + 0 with mate mapped to a different chr
2773 + 0 with mate mapped to a different chr (mapQ>=5)

Note that the flagstat command counts alignments, not reads. Please
use the read counts table to get accurate counts of reads at each
stage of the pipeline.

Filtering statistics

Mapping quality > q30 (out of total)	93,540,058	0.954
Duplicates (after filtering)	12,202	0.000
Mitochondrial reads (out of total)	16,778	0.000
Duplicates that are mitochondrial (out of all dups)	24	0.001
Final reads (after all filters)	83,222,552	0.849

Mapping quality refers to the quality of the read being aligned to that
particular location in the genome. A standard quality score is > 30.
Duplications are often due to PCR duplication rather than two unique reads
mapping to the same location. High duplication is an indication of poor
libraries. Mitochondrial reads are often high in chromatin accessibility
assays because the mitochondrial genome is very open. A high mitochondrial
fraction is an indication of poor libraries. Based on prior experience, a
final read fraction above 0.70 is a good library.

Library complexity statistics

ENCODE library complexity metrics

Metric	Result
NRF	0.999672 - OK
PBC1	0.999673 - OK
PBC2	3068.715361 - OK

The non-redundant fraction (NRF) is the fraction of non-redundant mapped reads
in a dataset; it is the ratio between the number of positions in the genome
that uniquely mapped reads map to and the total number of uniquely mappable
reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations
with EXACTLY one read pair over the genomic locations with AT LEAST one read
pair. PBC1 is the primary measure, and the PBC1 should be close to 1.
Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking,
0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is
the ratio of genomic locations with EXACTLY one read pair over the genomic
locations with EXACTLY two read pairs. The PBC2 should be significantly
greater than 1.

Picard EstimateLibraryComplexity

60,580,360,021

Yield prediction

Preseq performs a yield prediction by subsampling the reads, calculating the
number of distinct reads, and then extrapolating out to see where the
expected number of distinct reads no longer increases. The confidence interval
gives a gauge as to the validity of the yield predictions.

Fragment length statistics

Metric	Result
Fraction of reads in NFR	0.378866302083 out of range [0.4, inf]
NFR / mono-nuc reads	0.931539832479 out of range [2.5, inf]
Presence of NFR peak	OK
Presence of Mono-Nuc peak	OK
Presence of Di-Nuc peak	OK

Open chromatin assays show distinct fragment length enrichments, as the cut
sites are only in open chromatin and not in nucleosomes. As such, peaks
representing different n-nucleosomal (e.g. mono-nucleosomal, di-nucleosomal)
fragment lengths will arise. Good libraries will show these peaks in a
fragment length distribution and will show specific peak ratios.

Peak statistics

Metric	Result
Naive overlap peaks	241767 - OK
IDR peaks	177004 - OK

Naive overlap peak file statistics

Min size	73.0
25 percentile	343.0
50 percentile (median)	532.0
75 percentile	754.0
Max size	2704.0
Mean	579.311800204

IDR peak file statistics

Min size	73.0
25 percentile	438.0
50 percentile (median)	618.0
75 percentile	829.0
Max size	2704.0
Mean	660.518858331

For a good ATAC-seq experiment in human, you expect to get 100k-200k peaks
for a specific cell type.

Sequence quality metrics

GC bias

Open chromatin assays are known to have significant GC bias. Please take this
into consideration as necessary.

Annotation-based quality metrics

Enrichment plots (TSS)

Open chromatin assays should show enrichment in open chromatin sites, such as
TSSs. An average TSS enrichment in human (hg19) is above 6. A strong TSS enrichment is
above 10. For other references please see https://www.encodeproject.org/atac-seq/

Annotated genomic region enrichments

Fraction of reads in universal DHS regions	16,816,239	0.202
Fraction of reads in blacklist regions	137	0.000
Fraction of reads in promoter regions	5,064,565	0.061
Fraction of reads in enhancer regions	18,755,121	0.225
Fraction of reads in called peak regions	8,168,805	0.098

Signal to noise can be assessed by considering whether reads fall into
known open regions (such as DHS regions) or not. A high fraction of reads
should fall into the universal (across cell type) DHS set. A small fraction
should fall into the blacklist regions. A high fraction (though not all) should
fall into the promoter regions, and a high fraction (though not all) should fall
into the enhancer regions. The promoter regions should not take up all reads, as
open chromatin assays are known to be biased toward promoters.

Comparison to Roadmap DNase

This bar chart shows the correlation between the Roadmap DNase samples and
your sample when the signal in the universal DNase peak region set is
compared. The closer a Roadmap sample's signal distribution across these
regions is to that of your sample, the higher the correlation.

Replicate 4

Sample Information

Sample
Genome	GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/Single-ended	Paired-ended
Read length	51

Summary

Read count from sequencer	76,743,000
Read count successfully aligned	76,708,912
Read count after filtering for mapping quality	71,493,582
Read count after removing duplicate reads	71,481,333
Read count after removing mitochondrial reads (final read count)	60,784,396

Note that all these read counts are determined using 'samtools view'. As such,
they count every read found in the file, whether one end of a pair or a
single-end read. In other words, if your file is paired-end, divide these counts
by two to get the number of read pairs. Each step follows the previous step; for
example, duplicate reads were removed after reads were removed for low mapping
quality.
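As a worked example of the note above, the sketch below halves each count to
get read pairs and reports how many reads were lost at each sequential step.
The counts are copied from the Summary table; the function name summarize is
hypothetical, and integer division rounds down where a count is odd.

```python
# Read counts copied from the Summary table for this replicate (paired-end).
steps = [
    ("from sequencer", 76_743_000),
    ("successfully aligned", 76_708_912),
    ("after mapping quality filter", 71_493_582),
    ("after duplicate removal", 71_481_333),
    ("after mitochondrial removal (final)", 60_784_396),
]

def summarize(steps, paired_end=True):
    """Return (label, pairs_or_reads, reads_lost_at_this_step) per step."""
    rows = []
    previous = steps[0][1]
    for label, reads in steps:
        units = reads // 2 if paired_end else reads  # odd counts round down
        rows.append((label, units, previous - reads))
        previous = reads
    return rows

rows = summarize(steps)
for label, units, lost in rows:
    print(f"{label}\t{units}\t{lost}")
```

The first row gives 38,371,500 pairs, which matches the number of reads in the
alignment log, and the 12,249 reads lost at the duplicate step match the
duplicate count in the filtering statistics.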

This bar chart also shows the filtering process and where the reads were lost
over the process. Note that each step is sequential. As such, there may
have been more mitochondrial reads which were already filtered because of
high duplication or low mapping quality.

Alignment statistics

Bowtie alignment log

38371500 reads; of these:
  38371500 (100.00%) were paired; of these:
    35866 (0.09%) aligned concordantly 0 times
    27760554 (72.35%) aligned concordantly exactly 1 time
    10575080 (27.56%) aligned concordantly >1 times
    ----
    35866 pairs aligned concordantly 0 times; of these:
      5296 (14.77%) aligned discordantly 1 time
    ----
    30570 pairs aligned 0 times concordantly or discordantly; of these:
      61140 mates make up the pairs; of these:
        34088 (55.75%) aligned 0 times
        4660 (7.62%) aligned exactly 1 time
        22392 (36.62%) aligned >1 times
99.96% overall alignment rate
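
The overall alignment rate in the log above can be recomputed from the other
lines. This is a small sketch using only numbers copied from the log; the
variable names are chosen for illustration.

```python
# Counts copied from the alignment log above.
pairs = 38_371_500                 # read pairs given to the aligner
concordant_zero = 35_866           # pairs aligned concordantly 0 times
discordant_once = 5_296            # of those, pairs aligned discordantly 1 time
mates_unaligned = 34_088           # mates that aligned 0 times
mates_once = 4_660                 # mates that aligned exactly 1 time
mates_multi = 22_392               # mates that aligned >1 times

# Internal consistency of the log: the remaining pairs and their mates.
unpaired_pairs = concordant_zero - discordant_once
assert unpaired_pairs == 30_570
assert 2 * unpaired_pairs == mates_unaligned + mates_once + mates_multi

# Overall rate: every mate counts, and only mates that never aligned are lost.
total_mates = 2 * pairs
overall_rate = 1 - mates_unaligned / total_mates
print(f"{overall_rate:.2%} overall alignment rate")
```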

Samtools flagstat

146180152 + 0 in total (QC-passed reads + QC-failed reads)
69437152 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
146146064 + 0 mapped (99.98%:-nan%)
76743000 + 0 paired in sequencing
38371500 + 0 read1
38371500 + 0 read2
76671268 + 0 properly paired (99.91%:-nan%)
76701606 + 0 with itself and mate mapped
7306 + 0 singletons (0.01%:-nan%)
5342 + 0 with mate mapped to a different chr
1606 + 0 with mate mapped to a different chr (mapQ>=5)

Note that the flagstat command counts alignments, not reads. Please
use the read counts table to get accurate counts of reads at each
stage of the pipeline.
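
The difference between alignments and reads can be seen in the flagstat output
above. In this sketch, which uses only numbers copied from that output,
subtracting the secondary and supplementary records from the total recovers
the number of reads reported as paired in sequencing.

```python
# Counts copied from the flagstat output above (QC-passed column).
total_records = 146_180_152
secondary = 69_437_152
supplementary = 0
paired_in_sequencing = 76_743_000
mapped_records = 146_146_064

# Records left once secondary and supplementary alignments are set aside.
primary_records = total_records - secondary - supplementary
print(primary_records)

# The mapped percentage is computed over all records, not over reads.
mapped_pct = 100 * mapped_records / total_records
print(f"{mapped_pct:.2f}%")
```

Here primary_records equals the paired-in-sequencing count, so roughly half of
the records in this BAM are secondary alignments rather than distinct reads.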

Filtering statistics

Mapping quality > q30 (out of total)	71,493,582	0.932
Duplicates (after filtering)	12,249	0.000
Mitochondrial reads (out of total)	13,041	0.000
Duplicates that are mitochondrial (out of all dups)	4	0.000
Final reads (after all filters)	60,784,396	0.792

Mapping quality refers to the confidence that a read is aligned to the correct
location in the genome. A standard mapping quality threshold is > 30.
Duplications are often due to PCR duplication rather than two unique reads
mapping to the same location. High duplication is an indication of poor
libraries. Mitochondrial reads are often abundant in chromatin accessibility
assays because the mitochondrial genome is very open. A high mitochondrial
fraction is an indication of poor libraries. Based on prior experience, a
final read fraction above 0.70 indicates a good library.
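
The fractions in the filtering table can be recomputed from the read counts,
and the final fraction compared with the 0.70 guideline quoted above. A minimal
sketch; the constant name GOOD_FINAL_FRACTION is hypothetical, and using the
sequencer read count as the denominator is an assumption that reproduces the
printed values.

```python
# Counts copied from the Summary and filtering tables for this replicate.
total_reads = 76_743_000      # read count from sequencer
mapq_pass = 71_493_582        # reads passing the mapping quality filter
final_reads = 60_784_396      # reads after all filters

GOOD_FINAL_FRACTION = 0.70    # guideline quoted in the text above

mapq_fraction = mapq_pass / total_reads
final_fraction = final_reads / total_reads
is_good_library = final_fraction > GOOD_FINAL_FRACTION

print(f"Mapping quality fraction\t{mapq_fraction:.3f}")
print(f"Final read fraction\t{final_fraction:.3f}")
print(f"Above {GOOD_FINAL_FRACTION} guideline\t{is_good_library}")
```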

Library complexity statistics

ENCODE library complexity metrics

Metric	Result
NRF	0.999544 - OK
PBC1	0.999549 - OK
PBC2	2225.004688 - OK

The non-redundant fraction (NRF) is the fraction of non-redundant mapped reads
in a dataset; it is the ratio between the number of positions in the genome
that uniquely mapped reads map to and the total number of uniquely mappable
reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations
with EXACTLY one read pair over the genomic locations with AT LEAST one read
pair. PBC1 is the primary measure, and the PBC1 should be close to 1.
Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking,
0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is
the ratio of genomic locations with EXACTLY one read pair over the genomic
locations with EXACTLY two read pairs. The PBC2 should be significantly
greater than 1.
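
The three metrics can be recomputed from the library complexity counts for
this replicate, as listed in the "Library complexity (filtered non-mito BAM)"
table at the top of this report. This is a sketch under stated assumptions:
the function names are hypothetical, and assigning each boundary value (0.5,
0.8, 0.9) to the higher category is a choice made here, since the text does not
say which side the boundaries belong to.

```python
def library_complexity(total, distinct, one_pair, two_pairs):
    """Return (NRF, PBC1, PBC2) from location-level read pair counts."""
    nrf = distinct / total          # Distinct / Total
    pbc1 = one_pair / distinct      # OnePair / Distinct
    pbc2 = one_pair / two_pairs     # OnePair / TwoPair
    return nrf, pbc1, pbc2

def bottleneck_level(pbc1):
    """Map PBC1 to the provisional bottlenecking categories in the text."""
    if pbc1 < 0.5:
        return "severe"
    if pbc1 < 0.8:
        return "moderate"
    if pbc1 < 0.9:
        return "mild"
    return "none"

# rep4 counts copied from the library complexity table in this report.
nrf, pbc1, pbc2 = library_complexity(
    total=30_403_338, distinct=30_389_477, one_pair=30_375_764, two_pairs=13_652
)
print(f"NRF\t{nrf:.6f}")
print(f"PBC1\t{pbc1:.6f}")
print(f"PBC2\t{pbc2:.6f}")
print(f"Bottlenecking\t{bottleneck_level(pbc1)}")
```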

Picard EstimateLibraryComplexity

26,316,749,690

Yield prediction

Preseq performs a yield prediction by subsampling the reads, calculating the
number of distinct reads, and then extrapolating out to see where the
expected number of distinct reads no longer increases. The confidence interval
gives a gauge as to the validity of the yield predictions.
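
The subsampling step described above can be illustrated as follows. This is
only a sketch of counting distinct reads at increasing depths; it does not
reproduce Preseq's extrapolation or its confidence intervals. The function
name complexity_curve and the toy library are hypothetical.

```python
import random

def complexity_curve(fragments, steps=5, seed=0):
    """Shuffle the fragments, then record (depth, distinct count) at even steps."""
    rng = random.Random(seed)
    shuffled = list(fragments)
    rng.shuffle(shuffled)
    step_size = max(1, len(shuffled) // steps)
    seen = set()
    curve = []
    for depth, fragment in enumerate(shuffled, start=1):
        seen.add(fragment)
        if depth % step_size == 0 or depth == len(shuffled):
            curve.append((depth, len(seen)))
    return curve

# Hypothetical toy library: 100 sequenced fragments, 80 of them distinct.
toy_library = [f"frag{i % 80}" for i in range(100)]
curve = complexity_curve(toy_library)
for depth, distinct in curve:
    print(f"{depth}\t{distinct}")
```

A curve that flattens as depth grows suggests that further sequencing would
mostly return fragments that have already been seen.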

Fragment length statistics

Metric	Result
Fraction of reads in NFR	0.614671225298 - OK
NFR / mono-nuc reads	1.88572038667 out of range [2.5, inf]
Presence of NFR peak	OK
Presence of Mono-Nuc peak	OK
Presence of Di-Nuc peak	OK

Open chromatin assays show distinct fragment length enrichments, because the cut
sites fall only in open chromatin and not in nucleosomes. As a result, peaks
representing different n-nucleosomal (e.g. mono-nucleosomal, di-nucleosomal)
fragment lengths arise. Good libraries show these peaks in the fragment
length distribution, with specific ratios between them.

Peak statistics

Metric	Result
Naive overlap peaks	241767 - OK
IDR peaks	177004 - OK

Naive overlap peak file statistics

Min size	73.0
25 percentile	343.0
50 percentile (median)	532.0
75 percentile	754.0
Max size	2704.0
Mean	579.311800204

IDR peak file statistics

Min size	73.0
25 percentile	438.0
50 percentile (median)	618.0
75 percentile	829.0
Max size	2704.0
Mean	660.518858331

For a good ATAC-seq experiment in human, you expect to get 100k-200k peaks
for a specific cell type.

Sequence quality metrics

GC bias

Open chromatin assays are known to have significant GC bias. Please take this
into consideration as necessary.

Annotation-based quality metrics

Enrichment plots (TSS)

Open chromatin assays should show enrichment in open chromatin sites, such as
TSSs. An average TSS enrichment in human (hg19) is above 6. A strong TSS
enrichment is above 10. For other references, please see
https://www.encodeproject.org/atac-seq/

Annotated genomic region enrichments

Fraction of reads in universal DHS regions	15,375,755	0.253
Fraction of reads in blacklist regions	197	0.000
Fraction of reads in promoter regions	5,546,244	0.091
Fraction of reads in enhancer regions	15,744,310	0.259
Fraction of reads in called peak regions	8,861,093	0.146

Signal to noise can be assessed by considering whether reads fall into
known open regions (such as DHS regions) or not. A high fraction of reads
should fall into the universal (across cell type) DHS set. A small fraction
should fall into the blacklist regions. A high fraction (though not all) should
fall into the promoter regions, and a high fraction (though not all) should fall
into the enhancer regions. The promoter regions should not take up all reads, as
open chromatin assays are known to be biased toward promoters.
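
The fractions in the table above can be recomputed from the read counts. In
this sketch the denominator is taken to be the final read count from the
Summary table for this replicate. That is an assumption, but it reproduces
every printed fraction to three decimal places.

```python
# Final read count from the Summary table (assumed denominator).
final_reads = 60_784_396

# Read counts copied from the annotated genomic region table above.
region_reads = {
    "universal DHS": 15_375_755,
    "blacklist": 197,
    "promoter": 5_546_244,
    "enhancer": 15_744_310,
    "called peak": 8_861_093,
}

fractions = {name: count / final_reads for name, count in region_reads.items()}
for name, fraction in fractions.items():
    print(f"Fraction of reads in {name} regions\t{fraction:.3f}")
```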

Comparison to Roadmap DNase

This bar chart shows the correlation between the Roadmap DNase samples and
your sample when the signal in the universal DNase peak region set is
compared. The closer a Roadmap sample's signal distribution across these
regions is to that of your sample, the higher the correlation.