# seqkit

**Repository Path**: all_create_code/seqkit

## Basic Information

- **Project Name**: seqkit
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-10-30
- **Last Updated**: 2022-06-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

- **Documents:** [http://bioinf.shenwei.me/seqkit](http://bioinf.shenwei.me/seqkit)
  ([**Usage**](http://bioinf.shenwei.me/seqkit/usage/),
  [**FAQ**](http://bioinf.shenwei.me/seqkit/faq/),
  [**Tutorial**](http://bioinf.shenwei.me/seqkit/tutorial/),
  and [**Benchmark**](http://bioinf.shenwei.me/seqkit/benchmark/))
- **Source code:** [https://github.com/shenwei356/seqkit](https://github.com/shenwei356/seqkit)
  [![GitHub stars](https://img.shields.io/github/stars/shenwei356/seqkit.svg?style=social&label=Star&?maxAge=2592000)](https://github.com/shenwei356/seqkit)
  [![license](https://img.shields.io/github/license/shenwei356/seqkit.svg?maxAge=2592000)](https://github.com/shenwei356/seqkit/blob/master/LICENSE)
  [![Build Status](https://travis-ci.org/shenwei356/seqkit.svg?branch=master)](https://travis-ci.org/shenwei356/seqkit)
- **Latest version:**
  [![Latest Version](https://img.shields.io/github/release/shenwei356/seqkit.svg?style=flat?maxAge=86400)](https://github.com/shenwei356/seqkit/releases)
  [![Github Releases](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/total.svg?maxAge=3600)](http://bioinf.shenwei.me/seqkit/download/)
  [![Cross-platform](https://img.shields.io/badge/platform-any-ec2eb4.svg?style=flat)](http://bioinf.shenwei.me/seqkit/download/)
  [![Anaconda Cloud](https://anaconda.org/bioconda/seqkit/badges/version.svg)](https://anaconda.org/bioconda/seqkit)
- **[Please cite](#citation):**
  [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pone.0163962-blue.svg?style=flat)](https://doi.org/10.1371/journal.pone.0163962)
- **Others**: [![check in Biotreasury](https://img.shields.io/badge/Biotreasury-collected-brightgreen)](https://biotreasury.rjmart.cn/#/tool?id=10081)

## Features

- **Easy to install** ([download](http://bioinf.shenwei.me/seqkit/download/))
    - Provides statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
    - Lightweight and out-of-the-box: no dependencies, no compilation, no configuration
- **Easy to use**
    - Ultrafast (see [technical-details](http://bioinf.shenwei.me/seqkit/usage/#technical-details-and-guides-for-use) and [benchmark](http://bioinf.shenwei.me/seqkit/benchmark))
    - Seamlessly parses both FASTA and FASTQ formats
    - Supports (`gzip`/`xz`/`zstd` compressed) STDIN/STDOUT and input/output files, so it integrates easily into pipelines
    - Reproducible results (configurable random seed in `sample` and `shuffle`)
    - Supports custom sequence IDs via regular expression
    - Supports [Bash/Zsh completion](http://bioinf.shenwei.me/seqkit/download/#shell-completion)
- **Versatile commands** ([usages and examples](http://bioinf.shenwei.me/seqkit/usage/))
    - Practical functions supported by [37 subcommands](#subcommands)

## Installation

Go to the [Download Page](http://bioinf.shenwei.me/seqkit/download) for more download options and changelogs, or install via conda:

    conda install -c bioconda seqkit

## Subcommands

|category|command|function|input|strand-sensitivity|multi-threads|popularity|
|:---|:---|:---|:---|:---|:---|:---|
|basic|[seq](https://bioinf.shenwei.me/seqkit/usage/#seq)|transform sequences: extract ID/seq, filter by length/quality, remove gaps…|FASTA/Q| | |★★★★★|
| |[stats](https://bioinf.shenwei.me/seqkit/usage/#stats)|simple statistics: #seqs, min/max_len, N50, Q20%, Q30%…|FASTA/Q| |✓|★★★★★|
| |[sum](https://bioinf.shenwei.me/seqkit/usage/#sum)|compute message digest for all sequences in FASTA/Q files|FASTA/Q|+ or both|✓| |
| |[subseq](https://bioinf.shenwei.me/seqkit/usage/#subseq)|extract subsequences or flanking sequences by region/gtf/bed|FASTA/Q|+ or/and -| |★★★|
| |[sliding](https://bioinf.shenwei.me/seqkit/usage/#sliding)|extract subsequences in sliding windows|FASTA/Q|+ only| |★★|
| |[faidx](https://bioinf.shenwei.me/seqkit/usage/#faidx)|create FASTA index file and extract subsequences (with more features than samtools faidx)|FASTA|+ or/and -| | |
| |[watch](https://bioinf.shenwei.me/seqkit/usage/#watch)|monitoring and online histograms of sequence features|FASTA/Q| | | |
| |[sana](https://bioinf.shenwei.me/seqkit/usage/#sana)|sanitize broken single line FASTQ files|FASTQ| | | |
| |[scat](https://bioinf.shenwei.me/seqkit/usage/#scat)|real time concatenation and streaming of fastx files|FASTA/Q| |✓| |
|format conversion|[fq2fa](https://bioinf.shenwei.me/seqkit/usage/#fq2fa)|convert FASTQ to FASTA|FASTQ| | |★★|
| |[fa2fq](https://bioinf.shenwei.me/seqkit/usage/#fa2fq)|retrieve corresponding FASTQ records by a FASTA file|FASTA/Q| | | |
| |[fx2tab](https://bioinf.shenwei.me/seqkit/usage/#fx2tab-tab2fx)|convert FASTA/Q to tabular format|FASTA/Q| | |★★|
| |[tab2fx](https://bioinf.shenwei.me/seqkit/usage/#fx2tab-tab2fx)|convert tabular format to FASTA/Q format|FASTA/Q| | | |
| |[convert](https://bioinf.shenwei.me/seqkit/usage/#convert)|convert FASTQ quality encoding between Sanger, Solexa and Illumina|FASTA/Q| | | |
| |[translate](https://bioinf.shenwei.me/seqkit/usage/#translate)|translate DNA/RNA to protein sequence|FASTA/Q|+ or/and -| |★★|
|searching|[grep](https://bioinf.shenwei.me/seqkit/usage/#grep)|search sequences by ID/name/sequence/sequence motifs, mismatch allowed|FASTA/Q|+ and -|partly, -m|★★★★★|
| |[locate](https://bioinf.shenwei.me/seqkit/usage/#locate)|locate subsequences/motifs, mismatch allowed|FASTA/Q|+ and -|partly, -m|★★★★★|
| |[amplicon](https://bioinf.shenwei.me/seqkit/usage/#amplicon)|extract amplicon (or specific region around it), mismatch allowed|FASTA/Q|+ and -|partly, -m|★|
| |[fish](https://bioinf.shenwei.me/seqkit/usage/#fish)|look for short sequences in larger sequences|FASTA/Q|+ and -| | |
|set operation|[sample](https://bioinf.shenwei.me/seqkit/usage/#sample)|sample sequences by number or proportion|FASTA/Q| | |★★★★|
| |[rmdup](https://bioinf.shenwei.me/seqkit/usage/#rmdup)|remove duplicated sequences by ID/name/sequence|FASTA/Q|+ and -| |★★★|
| |[common](https://bioinf.shenwei.me/seqkit/usage/#common)|find common sequences of multiple files by id/name/sequence|FASTA/Q|+ and -| | |
| |[duplicate](https://bioinf.shenwei.me/seqkit/usage/#duplicate)|duplicate sequences N times|FASTA/Q| | |★|
| |[split](https://bioinf.shenwei.me/seqkit/usage/#split)|split sequences into files by id/seq region/size/parts (mainly for FASTA)|FASTA preferred| | |★|
| |[split2](https://bioinf.shenwei.me/seqkit/usage/#split2)|split sequences into files by size/parts (FASTA, PE/SE FASTQ)|FASTA/Q| | |★★|
| |[head](https://bioinf.shenwei.me/seqkit/usage/#head)|print first N FASTA/Q records|FASTA/Q| | | |
| |[head-genome](https://bioinf.shenwei.me/seqkit/usage/#head-genome)|print sequences of the first genome with common prefixes in name|FASTA/Q| | | |
| |[range](https://bioinf.shenwei.me/seqkit/usage/#range)|print FASTA/Q records in a range (start:end)|FASTA/Q| | | |
| |[pair](https://bioinf.shenwei.me/seqkit/usage/#pair)|match up paired-end reads from two fastq files|FASTA/Q| | | |
|edit|[concat](https://bioinf.shenwei.me/seqkit/usage/#concat)|concatenate sequences with the same ID from multiple files|FASTA/Q|+ only| |★★★|
| |[replace](https://bioinf.shenwei.me/seqkit/usage/#replace)|replace name/sequence by regular expression|FASTA/Q|+ only| |★★|
| |[restart](https://bioinf.shenwei.me/seqkit/usage/#restart)|reset start position for circular genome|FASTA/Q|+ only| |★|
| |[mutate](https://bioinf.shenwei.me/seqkit/usage/#mutate)|edit sequence (point mutation, insertion, deletion)|FASTA/Q|+ only| | |
| |[rename](https://bioinf.shenwei.me/seqkit/usage/#rename)|rename duplicated IDs|FASTA/Q| | |★|
|ordering|[sort](https://bioinf.shenwei.me/seqkit/usage/#sort)|sort sequences by id/name/sequence/length|FASTA preferred| | |★★|
| |[shuffle](https://bioinf.shenwei.me/seqkit/usage/#shuffle)|shuffle sequences|FASTA preferred| | | |
|BAM processing|[bam](https://bioinf.shenwei.me/seqkit/usage/#bam)|monitoring and online histograms of BAM record features|BAM| | | |

Notes:

- Strand-sensitivity:
    - `+ only`: only processing on the positive/forward strand.
    - `+ and -`: searching on both strands.
    - `+ or/and -`: depends on users' flags/options/arguments.
- Multiple-threads: Using the default 4 threads is fast enough for most commands; some commands can benefit from extra threads.
- Popularity: Based on statistics of 227 publications citing seqkit since 2020.

## Citation

**W Shen**, S Le, Y Li\*, F Hu\*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. ***PLOS ONE***. [doi:10.1371/journal.pone.0163962](https://doi.org/10.1371/journal.pone.0163962).

## Contributors

- [Wei Shen](https://github.com/shenwei356)
- [Botond Sipos](https://github.com/bsipos): `bam`, `scat`, `fish`, `sana`, `watch`.
- [others](https://github.com/shenwei356/seqkit/graphs/contributors)

## Acknowledgements

We thank [Lei Zhang](https://github.com/jameslz) for testing SeqKit. We also thank [Jim Hester](https://github.com/jimhester/), author of [fasta_utilities](https://github.com/jimhester/fasta_utilities), for advice on early performance improvements of FASTA parsing, and [Brian Bushnell](https://twitter.com/BBToolsBio), author of [BBMap](https://sourceforge.net/projects/bbmap/), for advice on naming SeqKit and adding accuracy evaluation in benchmarks.

We also thank Nicholas C. Wu from the Scripps Research Institute, USA, for commenting on the manuscript, and [Guangchuang Yu](http://guangchuangyu.github.io/) from the State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, HK, for advice on the manuscript.

We thank [Li Peng](https://github.com/penglbio) for reporting many bugs.

We appreciate [Klaus Post](https://github.com/klauspost) for his fantastic packages ([compress](https://github.com/klauspost/compress) and [pgzip](https://github.com/klauspost/pgzip)), which accelerate gzip file reading and writing.

## Contact

[Create an issue](https://github.com/shenwei356/seqkit/issues) to report bugs, propose new functions or ask for help.

## License

[MIT License](https://github.com/shenwei356/seqkit/blob/master/LICENSE)

## Starchart

Stargazers over time
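
## Example: what searching both strands means

The strand-sensitivity notes under Subcommands distinguish `+ only` from `+ and -`. As a rough illustration of the difference, the Python sketch below looks for a motif on the forward strand and, optionally, for its reverse complement, which corresponds to a hit on the reverse strand. This is not SeqKit's implementation and does not call SeqKit; the function names `reverse_complement` and `find_motif`, the exact-match search, and the 1-based inclusive coordinates are all assumptions made for this example.

```python
# Illustrative sketch only; NOT how SeqKit is implemented.
# Assumes plain DNA (A/C/G/T) and exact matching.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq: str, motif: str, both_strands: bool = True):
    """Return sorted (start, end, strand) hits.

    Coordinates are 1-based, inclusive, and always given on the
    forward strand (a convention chosen for this example).
    """
    s = seq.upper()
    targets = [("+", motif.upper())]
    if both_strands:
        # A hit on the reverse strand looks like the motif's reverse
        # complement when read along the forward strand.
        targets.append(("-", reverse_complement(motif.upper())))

    hits = []
    for strand, pattern in targets:
        pos = s.find(pattern)
        while pos != -1:
            hits.append((pos + 1, pos + len(pattern), strand))
            pos = s.find(pattern, pos + 1)  # allow overlapping hits
    return sorted(hits)


if __name__ == "__main__":
    example = "ATGCCCAAAGGGCAT"
    print(find_motif(example, "ATG"))                      # both strands
    print(find_motif(example, "ATG", both_strands=False))  # forward only
```

With `both_strands=False` the sketch behaves like a `+ only` operation; with the default it reports hits on either strand, as in the `+ and -` rows of the table.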