twarc2

twarc2 is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that was returned from the Twitter API. Since Twitter's introduction of their v2 API the JSON representation of a tweet is conditional on the types of fields and expansions that are requested. twarc2 does the work of requesting the highest fidelity representation of a tweet by requesting all the available data for tweets.

Tweets are streamed or stored as line-oriented JSON. twarc2 handles the Twitter API's rate limits for you. In addition to collecting tweets, twarc2 can also help you collect users and hydrate tweet ids. It also has a collection of plugins you can use to do things with the collected JSON data (such as converting it to CSV).
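
Line-oriented JSON means one JSON document per line, so a file can be processed as a stream without loading it all into memory. Here is a minimal sketch in Python using only the standard library. The helper name `iter_jsonl` and the sample records are illustrative and are not part of twarc:

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny two-line sample file (made-up records) and read it back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"id": "1", "text": "first"}\n')
    tmp.write('{"id": "2", "text": "second"}\n')

records = list(iter_jsonl(tmp.name))
os.remove(tmp.name)
print(len(records))
```

Because the helper is a generator, the same loop works on files that are too large to fit in memory.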

twarc2 was developed as part of the Documenting the Now project which was funded by the Mellon Foundation.

Install

Before using twarc you will need to register an application at apps.twitter.com. Once you've created your application, note down the consumer key, consumer secret and then click to generate an access token and access token secret. With these four variables in hand you are ready to start using twarc.

  1. install Python 3
  2. pip install twarc:
    pip install --upgrade twarc

Homebrew (macOS only)

For macOS users, you can also install twarc via Homebrew:

$ brew install twarc

Windows

If you installed with pip and see a "failed to create process" when running twarc try reinstalling like this:

python -m pip install --upgrade --force-reinstall twarc

Quickstart

First you're going to need to tell twarc about your application API keys and grant access to one or more Twitter accounts:

twarc2 configure

Then try out a search:

twarc2 search blacklivesmatter search.jsonl

Or maybe you'd like to collect tweets as they happen?

twarc2 stream-rules add blacklivesmatter
twarc2 stream stream.jsonl

See below for the details about these commands and more.

Configure

Once you've got your Twitter developer access set up, you can give twarc your credentials with the configure command.

twarc2 configure

This will store your credentials in your home directory so you don't have to keep entering them. You can use most of twarc's functionality by configuring only the bearer token, but for complete functionality you can also enter the API key and API secret.

You can also set the keys in the system environment (CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET) or pass them as command line options (--consumer-key, --consumer-secret, --access-token, --access-token-secret).
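
If you drive twarc from a script, it can help to check that the environment variables are present before starting a long collection. The following is a small sketch, not twarc's own code. The function name `missing_credentials` is hypothetical, and only the four variable names come from the text above:

```python
import os

# Variable names as listed in the paragraph above.
CREDENTIAL_VARS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
)


def missing_credentials(environ=os.environ):
    """Return the names of credential variables that are unset or empty."""
    return [name for name in CREDENTIAL_VARS if not environ.get(name)]


# A plain dict stands in for the real environment in this example.
example_env = {"CONSUMER_KEY": "abc", "CONSUMER_SECRET": "def"}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` with no argument checks the real process environment.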

Search

This uses Twitter's tweets/search/recent and tweets/search/all endpoints to download pre-existing tweets matching a given query. This command will search for any tweets mentioning blacklivesmatter from the past 7 days.

twarc2 search blacklivesmatter tweets.jsonl

If you have access to the Academic Research Product Track you can search the full archive of tweets by using the --archive option.

twarc2 search --archive blacklivesmatter tweets.jsonl

The queries can be a lot more expressive than matching a single term. For example this query will search for tweets containing either blacklivesmatter or blm that were sent to the user @deray.

twarc2 search 'blacklivesmatter OR blm to:deray' tweets.jsonl

The best way to get familiar with Twitter's search syntax is to consult Twitter's Building queries for Search Tweets documentation.

You also should definitely check out Igor Brigadir's excellent reference guide to the Twitter Search syntax: Advanced Search on Twitter. There are lots of hidden gems in there that the advanced search form doesn't make readily apparent.

Limit

Because there is a 500,000 tweet limit (5 million for Academic Research Track) you may want to limit the number of tweets you retrieve by using --limit:

twarc2 search --limit 5000 blacklivesmatter tweets.jsonl

Time

You can also limit to a particular time range using --start-time and/or --end-time, which can be especially useful in conjunction with --archive when you are searching for historical tweets.

twarc2 search --start-time 2014-07-17 --end-time 2014-07-24 '"eric garner"' tweets.jsonl

If you leave off --start-time or --end-time it will be open on that side. So for example to get all "eric garner" tweets before 2014-07-24 you would just leave off the --start-time:

twarc2 search --end-time 2014-07-24 '"eric garner"' tweets.jsonl
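
The open-ended behaviour can be pictured with a short Python sketch. It parses the two time formats that the twarc2 help text lists for these options and checks a timestamp against an optional start and end. The helper names are hypothetical, and treating the start as inclusive and the end as exclusive is an assumption made for this illustration:

```python
from datetime import datetime, timezone

# The two formats shown in the help text for --start-time and --end-time.
TIME_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")


def parse_utc(value):
    """Parse a date or date-time string, treating it as UTC."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unsupported time format: {value!r}")


def in_range(moment, start=None, end=None):
    """True if moment is in [start, end); a missing bound is left open."""
    if start is not None and moment < start:
        return False
    if end is not None and moment >= end:
        return False
    return True


# Only an end time is given, so the range is open at the start.
end = parse_utc("2014-07-24")
print(in_range(parse_utc("2014-07-20T12:00:00"), end=end))
print(in_range(parse_utc("2014-07-25"), end=end))
```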

Stream

The stream command will use the Twitter API's tweets/search/stream endpoint to collect tweets as they happen. In order to use it you first need to create one or more rules. For example:

twarc2 stream-rules add blacklivesmatter

You can list your active stream rules:

twarc2 stream-rules list

And you can collect the data from the stream, which will bring down any tweets that match your rules:

twarc2 stream stream.jsonl

To stop collecting, press ctrl-c. This stops the stream but doesn't delete your stream rule. To remove a rule you can:

twarc2 stream-rules delete blacklivesmatter

Sample

Use the sample command to listen to Twitter's tweets/sample/stream API for a "random" sample of recent public statuses.

twarc2 sample sample.jsonl

Users

If you have a file of user ids you can fetch the user metadata for them with the users command:

twarc2 users users.txt users.jsonl

If the file contains usernames instead of user ids you can use the --usernames option:

twarc2 users --usernames users.txt users.jsonl

Followers

You can fetch the followers of an account using the followers command:

twarc2 followers deray users.jsonl

Following

To get the users that a user is following you can use following:

twarc2 following deray users.jsonl

The result will include exactly one user id per line. The response order is reverse chronological, or most recent followers first.

Timeline

The timeline command will use Twitter's user timeline API to collect the most recent tweets posted by the user indicated by screen_name.

twarc2 timeline deray tweets.jsonl

Conversation

You can retrieve a conversation thread using the tweet ID at the head of the conversation:

twarc2 conversation 266031293945503744 > conversation.jsonl

Dehydrate

The dehydrate command generates an id list from a file of tweets:

twarc2 dehydrate tweets.jsonl tweet-ids.txt
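
Conceptually, dehydrating reduces each stored tweet to its id. The sketch below illustrates the idea in plain Python and is not twarc's implementation. It assumes each input line is either a single tweet object with an "id" key or an API response whose "data" key holds one tweet or a list of tweets. The function name `tweet_ids` and the sample lines are made up:

```python
import json


def tweet_ids(lines):
    """Yield tweet ids from an iterable of JSON lines.

    Assumption: a line is either a tweet object with an "id" key, or a
    response object whose "data" key holds a tweet or a list of tweets.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        data = record.get("data", record)
        tweets = data if isinstance(data, list) else [data]
        for tweet in tweets:
            yield tweet["id"]


# Made-up sample: one response-style line and one single-tweet line.
sample = [
    '{"data": [{"id": "101", "text": "a"}, {"id": "102", "text": "b"}]}',
    '{"id": "103", "text": "c"}',
]
print("\n".join(tweet_ids(sample)))
```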

Hydrate

twarc's hydrate command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's tweets API endpoint:

twarc2 hydrate ids.txt tweets.jsonl

Twitter API's Terms of Service discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to hydrate the data, or to retrieve the full JSON for each identifier. This is particularly important for verification of social media research.
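
Hydrating a shared id file comes down to reading the ids and requesting them from the API in batches. The sketch below shows only the batching step. The batch size of 100 is an assumption about the per-request limit of the tweet lookup endpoint, and the helper name `batches` is hypothetical:

```python
def batches(ids, size=100):
    """Yield successive lists of at most `size` non-empty ids."""
    batch = []
    for tweet_id in ids:
        tweet_id = tweet_id.strip()
        if not tweet_id:
            continue  # skip blank lines in the id file
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# 250 made-up ids split into groups for lookup.
ids = [str(n) for n in range(1, 251)]
sizes = [len(batch) for batch in batches(ids)]
print(sizes)
```

twarc2 hydrate takes care of this for you, along with rate limiting. The sketch is only meant to show why a plain list of ids is enough to rebuild a dataset.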

Command Line Usage

Below is what you see when you run twarc2 --help.

twarc2

Collect data from the Twitter V2 API.

Usage:

twarc2 [OPTIONS] COMMAND [ARGS]...

Options:

  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  --config FILE               Read configuration from FILE.
  --help                      Show this message and exit.

compliance-job

Create, retrieve and list batch compliance jobs for Tweets and Users.

Usage:

twarc2 compliance-job [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

create

Create a new compliance job and upload tweet IDs.

Usage:

twarc2 compliance-job create [OPTIONS] {tweets|users} INFILE [OUTFILE]

Options:

  --job-name TEXT     A name or tag to help identify the job.
  --wait / --no-wait  Wait for the job to finish and download the results.
                      Wait by default.
  --hide-progress     Hide the Progress bar. Default: show progress.
  --help              Show this message and exit.

download

Download the compliance job with the specified ID.

Usage:

twarc2 compliance-job download [OPTIONS] JOB [OUTFILE]

Options:

  --wait / --no-wait  Wait for the job to finish and download the results.
                      Wait by default.
  --hide-progress     Hide the Progress bar. Default: show progress.
  --help              Show this message and exit.

get

Returns status and download information about the job ID.

Usage:

twarc2 compliance-job get [OPTIONS] JOB

Options:

  --verbose      Show all URLs and metadata.
  --json-output  Return the raw json content from the API.
  --help         Show this message and exit.

list

Returns a list of compliance jobs by job type and status.

Usage:

twarc2 compliance-job list [OPTIONS] [[tweets|users]]

Options:

  --status [created|in_progress|complete|failed]
                                  Filter by job status. Only one of 'created',
                                  'in_progress', 'complete', 'failed' can be
                                  specified. If not set, returns all.
  --verbose                       Show all URLs and metadata.
  --json-output                   Return the raw json content from the API.
  --help                          Show this message and exit.

configure

Set up your Twitter app keys.

Usage:

twarc2 configure [OPTIONS]

Options:

  --help  Show this message and exit.

conversation

Retrieve a conversation thread using the tweet id.

Usage:

twarc2 conversation [OPTIONS] TWEET_ID [OUTFILE]

Options:

  --since-id INTEGER              Match tweets sent after tweet id
  --until-id INTEGER              Match tweets sent prior to tweet id
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets created after UTC time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets sent before UTC time (ISO
                                  8601/RFC 3339)
  --archive                       Search the full archive (requires Academic
                                  Research track)
  --limit INTEGER                 Maximum number of tweets to save
  --max-results INTEGER           Maximum number of tweets per API response
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress, unless using pipes.
  --help                          Show this message and exit.

conversations

Fetch the full conversation threads that the input tweets are a part of. Alternatively the input can be a line oriented file of conversation ids.

Usage:

twarc2 conversations [OPTIONS] [INFILE] [OUTFILE]

Options:

  --limit INTEGER               Maximum number of tweets to return
  --conversation-limit INTEGER  Maximum number of tweets to return per-
                                conversation
  --archive                     Use the Academic Research project track access
                                to the full archive
  --hide-progress               Hide the Progress bar. Default: show progress,
                                unless using pipes.
  --help                        Show this message and exit.

counts

Return counts of tweets matching a query.

Usage:

twarc2 counts [OPTIONS] QUERY [OUTFILE]

Options:

  --since-id INTEGER              Count tweets sent after tweet id
  --until-id INTEGER              Count tweets sent prior to tweet id
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Count tweets created after UTC time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Count tweets sent before UTC time (ISO
                                  8601/RFC 3339)
  --archive                       Count using the full archive (requires
                                  Academic Research track)
  --granularity [day|hour|minute]
                                  Aggregation level for counts. Can be one of:
                                  day, hour, minute. Default is hour.
  --limit INTEGER                 Maximum number of days of results to save
                                  (minimum is 30 days)
  --text                          Output the counts as human readable text
  --csv                           Output counts as CSV
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress, unless using pipes.
  --help                          Show this message and exit.

dehydrate

Extract tweet or user IDs from a dataset.

Usage:

twarc2 dehydrate [OPTIONS] [INFILE] [OUTFILE]

Options:

  --id-type [tweets|users]  IDs to extract - either 'tweets' or 'users'.
  --hide-progress           Hide the Progress bar. Default: show progress,
                            unless using pipes.
  --help                    Show this message and exit.

flatten

"Flatten" tweets, or move expansions inline with tweet objects and ensure that each line of output is a single tweet.

Usage:

twarc2 flatten [OPTIONS] [INFILE] [OUTFILE]

Options:

  --hide-progress  Hide the Progress bar. Default: show progress, unless using
                   pipes.
  --help           Show this message and exit.
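
To illustrate what moving expansions inline means, here is a deliberately simplified Python sketch that handles only the author expansion. It assumes a v2-style response with tweets under "data" and user objects under "includes.users". The function name `flatten_page` and the output key "author" are choices made for this example and are not a description of twarc's internals:

```python
import copy


def flatten_page(page):
    """Return one tweet dict per tweet, with the author's user object inlined.

    Simplified illustration: only the author expansion is handled.
    """
    users = {user["id"]: user for user in page.get("includes", {}).get("users", [])}
    flat = []
    for tweet in page.get("data", []):
        tweet = copy.deepcopy(tweet)  # leave the input page untouched
        author = users.get(tweet.get("author_id"))
        if author is not None:
            tweet["author"] = author
        flat.append(tweet)
    return flat


# Made-up response page with one tweet and its expanded author.
page = {
    "data": [{"id": "1", "author_id": "9", "text": "hello"}],
    "includes": {"users": [{"id": "9", "username": "example"}]},
}
flat = flatten_page(page)
print(flat[0]["author"]["username"])
```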

followers

Get the followers for a given user.

Usage:

twarc2 followers [OPTIONS] USER [OUTFILE]

Options:

  --limit INTEGER  Maximum number of followers to save. Increments of 1000.
  --hide-progress  Hide the Progress bar. Default: show progress
  --help           Show this message and exit.

following

Get the users that a given user is following.

Usage:

twarc2 following [OPTIONS] USER [OUTFILE]

Options:

  --limit INTEGER  Maximum number of friends to save. Increments of 1000.
  --hide-progress  Hide the Progress bar. Default: show progress
  --help           Show this message and exit.

hydrate

Hydrate tweet ids.

Usage:

twarc2 hydrate [OPTIONS] [INFILE] [OUTFILE]

Options:

  --hide-progress  Hide the Progress bar. Default: show progress, unless using
                   pipes.
  --help           Show this message and exit.

mentions

Retrieve max of 800 of the most recent tweets mentioning the given user.

Usage:

twarc2 mentions [OPTIONS] USER_ID [OUTFILE]

Options:

  --since-id INTEGER              Match tweets sent after tweet id
  --until-id INTEGER              Match tweets sent prior to tweet id
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets created after time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets sent before time (ISO 8601/RFC
                                  3339)
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress
  --help                          Show this message and exit.

sample

Fetch tweets from the sample stream.

Usage:

twarc2 sample [OPTIONS] [OUTFILE]

Options:

  --limit INTEGER  Maximum number of tweets to save
  --help           Show this message and exit.

search

Search for tweets.

Usage:

twarc2 search [OPTIONS] QUERY [OUTFILE]

Options:

  --since-id INTEGER              Match tweets sent after tweet id
  --until-id INTEGER              Match tweets sent prior to tweet id
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets created after UTC time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets sent before UTC time (ISO
                                  8601/RFC 3339)
  --archive                       Search the full archive (requires Academic
                                  Research track). Defaults to searching the
                                  entire twitter archive if --start-time is
                                  not specified.
  --limit INTEGER                 Maximum number of tweets to save
  --max-results INTEGER           Maximum number of tweets per API response
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress, unless using pipes.
  --help                          Show this message and exit.

searches

Execute each search in the input file, one at a time.

The infile must be a file containing one query per line. Each line will be passed through directly to the Twitter API - unlike the timelines command quotes will not be removed.

Input queries will be deduplicated - if the same literal query is present in the file, it will still only be run once.

It is recommended that this command first be run with --counts-only, to check that each of the queries is retrieving the volume of tweets expected, and to avoid consuming quota unnecessarily.

Usage:

twarc2 searches [OPTIONS] [INFILE] [OUTFILE]

Options:

  --since-id INTEGER              Match tweets sent after tweet id
  --until-id INTEGER              Match tweets sent prior to tweet id
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets created after UTC time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets sent before UTC time (ISO
                                  8601/RFC 3339)
  --archive                       Search the full archive (requires Academic
                                  Research track). Defaults to searching the
                                  entire twitter archive if --start-time is
                                  not specified.
  --limit INTEGER                 Maximum number of tweets to save *per
                                  search*, ignored if --counts-only is
                                  specified.
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress, unless using pipes.
  --counts-only                   Only retrieve counts of tweets matching the
                                  search, not the tweets themselves. outfile
                                  will be a CSV containing the counts for all
                                  of the queries in the input file.
  --combine-queries               Merge consecutive queries into a single OR
                                  query. For example, if the three rows in
                                  your file are: banana, apple, pear then a
                                  single query ((banana) OR (apple) OR (pear))
                                  will be issued.
  --granularity [day|hour|minute]
                                  Aggregation level for counts (only used when
                                  --count-only is used). Can be one of: day,
                                  hour, minute. Default is day.
  --help                          Show this message and exit.
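
The deduplication and --combine-queries behaviour described above can be sketched in a few lines of Python. This is an illustration of the idea only: the function names are hypothetical, and a real implementation also has to respect the API's query length limits, which this sketch ignores:

```python
def dedupe(queries):
    """Drop blank lines and repeated queries, keeping first-seen order."""
    seen = set()
    unique = []
    for query in queries:
        query = query.strip()
        if query and query not in seen:
            seen.add(query)
            unique.append(query)
    return unique


def combine(queries):
    """Wrap each query in parentheses and join them into one OR query."""
    return "(" + " OR ".join(f"({query})" for query in queries) + ")"


# The repeated "banana" line is only used once.
lines = ["banana", "apple", "banana", "pear"]
combined = combine(dedupe(lines))
print(combined)
```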

stream

Fetch tweets from the live stream.

Usage:

twarc2 stream [OPTIONS] [OUTFILE]

Options:

  --limit INTEGER  Maximum number of tweets to return
  --help           Show this message and exit.

stream-rules

List, add and delete rules for your stream.

Usage:

twarc2 stream-rules [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

add

Create a new stream rule to match a value. Rules can be grouped with optional tags.

Usage:

twarc2 stream-rules add [OPTIONS] VALUE

Options:

  --tag TEXT  a tag to help identify the rule
  --help      Show this message and exit.

delete

Delete the stream rule that matches a given value.

Usage:

twarc2 stream-rules delete [OPTIONS] VALUE

Options:

  --help  Show this message and exit.

delete-all

Delete all stream rules!

Usage:

twarc2 stream-rules delete-all [OPTIONS]

Options:

  --help  Show this message and exit.

list

List all the active stream rules.

Usage:

twarc2 stream-rules list [OPTIONS]

Options:

  --help  Show this message and exit.

timeline

Retrieve recent tweets for the given user.

Usage:

twarc2 timeline [OPTIONS] USER_ID [OUTFILE]

Options:

  --limit INTEGER                 Maximum number of tweets to return
  --since-id INTEGER              Match tweets sent after tweet id
  --until-id INTEGER              Match tweets sent prior to tweet id
  --exclude-retweets              Exclude retweets from timeline
  --exclude-replies               Exclude replies from timeline
  --start-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets created after time (ISO
                                  8601/RFC 3339), e.g.  2021-01-01T12:31:04
  --end-time [%Y-%m-%d|%Y-%m-%dT%H:%M:%S]
                                  Match tweets sent before time (ISO 8601/RFC
                                  3339)
  --use-search                    Use the search/all API endpoint which is not
                                  limited to the last 3200 tweets, but
                                  requires Academic Product Track access.
  --hide-progress                 Hide the Progress bar. Default: show
                                  progress, unless using pipes.
  --help                          Show this message and exit.

timelines

Fetch the timelines of every user in an input source of tweets. If the input is a line oriented text file of user ids or usernames that will be used instead.

The infile can be:

- A file containing one user id per line (either quoted or unquoted)
- A JSONL file containing tweets collected in the Twitter API V2 format

Usage:

twarc2 timelines [OPTIONS] [INFILE] [OUTFILE]

Options:

  --limit INTEGER           Maximum number of tweets to return
  --timeline-limit INTEGER  Maximum number of tweets to return per-timeline
  --use-search              Use the search/all API endpoint which is not
                            limited to the last 3200 tweets, but requires
                            Academic Product Track access.
  --exclude-retweets        Exclude retweets from timeline
  --exclude-replies         Exclude replies from timeline
  --hide-progress           Hide the Progress bar. Default: show progress,
                            unless using pipes.
  --help                    Show this message and exit.

tweet

Look up a tweet using its tweet id or URL.

Usage:

twarc2 tweet [OPTIONS] TWEET_ID [OUTFILE]

Options:

  --pretty  Pretty print the JSON
  --help    Show this message and exit.

users

Get data for user ids or usernames.

Usage:

twarc2 users [OPTIONS] [INFILE] [OUTFILE]

Options:

  --usernames
  --hide-progress  Hide the Progress bar. Default: show progress, unless using
                   pipes.
  --help           Show this message and exit.

version

Return the version of twarc that is installed.

Usage:

twarc2 version [OPTIONS]

Options:

  --help  Show this message and exit.