How to Improve your Twitter Followers Count (Part I)

So far I've shared several articles about my work with the Twitter API and my direct personal experience with the platform itself. In 2022, my goal is to significantly increase my Twitter follower count. It still amazes me to see Twitter accounts with tens of thousands of followers; people who may or may not be known on other platforms can have a significant following on Twitter.

Based purely on anecdotal experience, here are the things I believe will increase your Twitter follower count:

  1. Have a high follower count on other platforms (e.g., Instagram, TikTok, Facebook).
  2. Have someone who is very well known, like the President or Oprah, find one of your tweets and retweet it.
  3. Be a celebrity, politician or influencer who is in the public eye.

Okay, let's say you have none of the things above. I will give you some hints on how you can improve your Twitter following (caveat: all of this advice comes from someone, me, who has very low follower numbers).

You can take a look at my numbers and ask yourself, "Why would I take advice from someone who has low numbers themselves?" I can certainly understand that sentiment. But the reason you should follow this article series is that I have been doing significant research into this and putting a plan in place to increase my follower count. So, if you're not putting a strategy in place yourself, what does it hurt to take a few pointers? I do recommend reading more articles on the topic, however.

Another reason to follow this advice is that I have used machine learning and data science techniques to study this subject. I will be adding more of my research on Twitter followership to these articles soon. For now, I recommend reading my current data science research, including "Creating Twitter Sentiment Association Analysis using the Association Rules and Recommender System Methods" and "Apriori Association Analysis using R." These articles show how to use the Twitter API and R to do Twitter data analysis, in particular how association rules and association analysis can uncover patterns and relationships between Twitter followers, followership, and follower retweets. I will soon be presenting visualizations that expand on this concept of Twitter follower counts.
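As a rough illustration of what that kind of analysis looks like, here is a minimal sketch using the arules package. The input data frame is made up for the example; in practice the hashtag "baskets" would come from tweets pulled via the Twitter API (for instance with the rtweet package).

library(arules)

# Hypothetical input: one row per tweet, with the hashtags used in that tweet
tweets <- data.frame(screen_name = c("user1", "user1", "user2", "user3"))
tweets$hashtags <- list(c("rstats", "datascience"),
                        c("rstats", "machinelearning"),
                        c("datascience", "machinelearning"),
                        c("rstats", "datascience"))

# Treat each tweet's hashtags as a "basket" and mine co-occurrence rules with Apriori
baskets <- as(tweets$hashtags, "transactions")
rules <- apriori(baskets, parameter = list(supp = 0.25, conf = 0.5))
inspect(sort(rules, by = "lift"))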

Another article of mine to read is more anecdotal but very important when it comes to taking control of your Twitter feed. It's very easy to get completely absorbed by Twitter conversations and comments. To be quite honest with you, Twitter is a very depressing place at times. People who have high follower counts will tweet out misinformation, insults, slurs, racism, sexism, homophobia, etc., and as someone who doesn't like to perpetuate such things on the internet, you can feel powerless and helpless against the slew of followers co-signing someone's tweets with comments, retweets, and likes. Please read my article Twitter is a Social Media Engagement Multiplier for more on this.

So finally, here are the ways an average user can increase their follower count.

  1. TWEET A LOT! I mean a whole lot. Try to get at least two thousand tweets in about six months. Always use hashtags.
  2. If you only want to tweet about specific niche topics such as business, engineering or technology, make sure to follow the most popular accounts in those niches and add to the conversation. Retweet popular posts. Most of these topics are relatively non-toxic, and you will find people who are of the same mindset and are objective.
  3. Politics is truly the third rail on Twitter. If you care about this, my recommendation is to be prepared for depressing Twitter battles, and to stand your ground and go all in, if you can take it. People are passionate about politics on Twitter, and tweeting a lot about it on a daily basis will grow your audience. Follow both people you align with and people you are diametrically opposed to.
  4. Only follow these types of accounts: A) people or enterprises that are popular and newsworthy on Twitter; B) accounts whose follower-to-following ratio is roughly 1:1 to 4:5. In other words, follow accounts that have about as many followers as they are following, meaning that if you follow them, they are likely to follow you back (see number 5 for the exception to this rule, and the short sketch after this list for one way to screen candidates).
  5. The exception to rule 4 is that if it's a popular or newsworthy account, or an account that represents a business or think tank, it's important to follow them because it will give you insight into what topics are important on the platform.
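To make rule 4 concrete, here is a minimal sketch of how that ratio screen could be automated in R. The candidates data frame is hypothetical; in practice the follower and following counts would come from the Twitter API.

# Hypothetical candidate accounts with counts pulled from the Twitter API
candidates <- data.frame(
  screen_name     = c("acct_a", "acct_b", "acct_c"),
  followers_count = c(950, 12000, 480),
  following_count = c(1000, 300, 500)
)

# Keep accounts whose follower-to-following ratio falls in the 4:5 to 1:1 range,
# i.e. accounts that follow back at roughly the rate they are followed
candidates$ratio <- candidates$followers_count / candidates$following_count
candidates[candidates$ratio >= 0.8 & candidates$ratio <= 1.0, "screen_name"]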

Please stay tuned for more insight into these topics in Part 2.

Programming R for Human Genome Variation Data

Human genome variation analysis is a popular biomedical and biological technique used for finding disease, developing treatments, and discovering the wide array of human genetic variation that shapes the impact of disease and medical treatments.

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files and operates on an individual chromosome at a time. VCF is a standard file format for storing genomic variation data and is used by organizations to map human genome variations; it is used for large-scale variant mapping. One example is the International Genome Sample Resource (IGSR).

A VCF file contains the following header columns:

  1. CHROM: the name of the chromosome.
  2. POS: the starting position of the variant.
  3. ID: an identifier for the variant.
  4. REF: the reference allele. (An allele is one of two or more alternative forms of a gene that arise by mutation and are found at the same position on a chromosome.)
  5. ALT: the alternate allele.
  6. QUAL: a quality score for the variant call.
  7. FILTER: pass/fail, indicating whether the variant passed the quality filters.
  8. INFO: additional information about the variant.
  9. FORMAT: the format of the genotype columns that follow.

The following libraries are needed to load and process VCF files:

install.packages("vcfR")
library(vcfR)
library("ShortRead")
install.packages("microseq")
library(microseq)
library(vcfR)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)

To load a VCF file (because of their size, VCF files are typically compressed):

vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)
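Once the file is loaded, the header columns described above can be inspected directly. Here is a small sketch using two standard vcfR accessors: getFIX() returns the CHROM through INFO columns, and queryMETA() summarizes the header metadata.

# Show the fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) for the first variants
head(getFIX(vcf_file, getINFO = TRUE))

# Summarize the meta/header section of the VCF
queryMETA(vcf_file)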

Techniques for processing VCF files include combining the variants with a reference sequence and annotation into a chromR object:

# 'dna' is the reference sequence and 'gff' is the annotation for the same region;
# both must be read in separately before this call
chrom <- create.chromR(name='Supercontig', vcf=vcf_file, seq=dna, ann=gff)

#Extract genotypes into a tidy data frame

vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)

#Write the genotype table to CSV and time how long it takes
system.time(write.csv(vcf_file_1, 'out_combine.indel.vcf.gz.csv', row.names=FALSE))

#Extract the INFO column fields into a tidy data frame
extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")

A great way to time how long it takes to load or write genome data is to use system.time. For example:

system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))

FASTQ file format

FASTQ files contain the raw sequencing reads for an entire genome and can be very large.

install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)

These files hold the raw reads that are later aligned to a reference genome; in R they can be read in and converted to CSV files:

# Read a compressed FASTQ file into a table of reads
fq.file <- file.path("D:/temp","fastq_file.fq.gz")
fdta1 <- readFastq(fq.file)
head(fdta1, 100)
summary(fdta1)
str(fdta1)

# Take a small sample of reads and write it out as CSV, timing the write
fdta_sample_a <- fdta1[1:10,]
summary(fdta_sample_a)
system.time(write.csv(fdta_sample_a, 'out_2.csv', row.names=FALSE))

BAM Files

BAM files contain aligned sequencing reads in a compressed binary format and are typically very large. Along with a wide array of external tools that can read BAM files, R has many functions that can process BAM data. BAM files also come with an index file that makes it easier to find information within the larger BAM file.

To load the BAM file libraries, you can install them directly in R or download the Bioconductor packages from https://www.bioconductor.org/.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))

seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf)  # commented out: "Realloc could not re-allocate memory" on large files
(gr <- GRanges("chr4", IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))

reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read

gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))

hist(reads[[1]]$pos)

readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))
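The same read-position distribution can also be drawn with ggplot2 (see the geom_histogram link in the references). This is a minimal sketch assuming ggplot2 is installed; it colours the histogram by strand.

library(ggplot2)

# Put the scanned positions and strands into a data frame for plotting
read_df <- data.frame(pos = reads[[1]]$pos, strand = reads[[1]]$strand)
ggplot(read_df, aes(x = pos, fill = strand)) +
  geom_histogram(bins = 50, position = "dodge")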

(ga <- readGAlignments(bf)) # can hit a memory-allocation issue if the BAM file is too large

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

#install if necessary (the old biocLite script has been replaced by BiocManager)
if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install("Rsamtools")
#load library
library(Rsamtools)
#read in entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")
#names of the BAM fields
names(bam[[1]])
# [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar"
# [9] "mrnm" "mpos" "isize" "seq" "qual"
#distribution of BAM flags
table(bam[[1]]$flag)
#      0       4      16
#1472261  775200 1652949
#function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function(x) {
  ## do.call(c, ...) coerces factor to integer, which is undesired
  x1 <- x[[1L]]
  if (is.factor(x1)) {
    structure(unlist(x), class = "factor", levels = levels(x1))
  } else {
    do.call(c, x)
  }
}
#store names of BAM fields
bam_field <- names(bam[[1]])
#go through each BAM field and unlist into flat vectors
bam_list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))
#store as data frame
bam_df <- do.call("DataFrame", bam_list)
names(bam_df) <- bam_field
dim(bam_df)
#[1] 3900410 13

Twitter is a Social Media Engagement Multiplier

With the resignation of CEO Jack Dorsey as the executive leader of Twitter, I began to reflect upon the platform and what exactly the brand stands for. Twitter has been widely criticized for being a megaphone for extremism, hatred and anti-democratic ideology. My personal experience with Twitter has been one of desperate persuasiveness as I try to engage multiple people at once on issues that I care about. It’s very easy to get emotionally addicted to Twitter and invest a ton of emotional capital into it.

Microblogging, when it is healthy, can be a platform that quickly engages multiple people with varying points of view or advocacy, letting you watch as a post gets retweeted, liked or shared across social media. But people also use it to spread misinformation and harmful caricatures in real time, and watch as they go viral. My personal experience with Twitter has been like walking into a room with dozens of people arguing, trying to ask a question or bring a different point of view, and then quickly being dismissed or insulted, and at times being pushed out of the room with the door shut behind me.

Ironically, this is exactly what happened to me once in real life. At a university I tried to insert myself into a conversation or topic that the vast majority of the participants didn’t think I should be involved in. And quite literally, the door was shut in front of me. It was humiliatingly painful; but I was very young and didn’t understand that I didn’t belong. Growing up, I was always taught that one of the greatest things about our country was diversity. Diversity of ideas, diversity of people, etc.

As I grew older, I realized quickly that the reality is far less ideal or utopian. Although we say we want diversity of ideas, really we only want our own ideas to be accepted. And people who are different in race, culture, language, gender, or identity are not always welcomed in the same spaces. That is a lot like Twitter today. This became even more painfully evident when Twitter Spaces was launched. It quickly became a land mine as people battled it out in such racist hosting rooms as "Are there too many Black women in public?", "Should White People Exist" and "Should Black People Exist?".

As a data scientist, I have studied extensively the nature of associations on Twitter and how people influence others based on who they follow and their own followership. For more information on this, read my article on association analysis in Twitter (and my article on Apriori association analysis as a supplement to it). What it taught me is that Twitter at its most beneficial is a "multiplier". By multiplier, I am referring to Twitter's ability to take information presented by someone on the platform, be it a blog, image, tweet, etc., and multiply that content to tens, hundreds, and even thousands of people near-instantaneously, better than any other platform.

So say, for instance, you write a blog post on your website. You may have hundreds or even thousands of subscribers. But in terms of engagement, that post will likely not grow at the rate at which a tweet referencing it would, keeping all other variables constant. For instance, if you have one hundred subscribers on your blog and one hundred followers on Twitter, the Twitter reference will multiply your blog's engagement. The same can be said for other platforms such as LinkedIn and Facebook (Meta).

My rule of thumb for Twitter now is to use it as a catalyst to bring more people to my site. Twitter is a multiplier and should not be used to have conversations. Express your ideas, then leave them there for people to engage with. Remember that as a content creator, any engagement, even if it's negative, is a win! I also recommend creating a developer account and downloading Twitter feed data via the Twitter API. Twitter is really a great platform for understanding this "multiplier" effect of social media.

I'd love to hear people's comments on this. I'm open to having a conversation anytime on the topic. By the way, this article will be tweeted as well.

How to Transition from a Database Administrator Job to a Data Science or Data Engineering Job

I've been a Database Administrator for over 20 years. Throughout the 1990s and 2000s, database administration became a somewhat lucrative, in-demand job for many people working in Information Technology. Even today, the role of Database Administrators (DBAs) is critical for daily operational goals and maintaining customer applications. Recently, there has been a major shift in what employers are looking for in candidates for IT positions. Fewer companies are hosting their own databases, and the need for big data systems in the cloud has created more opportunities for people with skills in cloud architecture, data pipeline architecture, and data science tools.

That being said, I feel this shift has put a lot of DBAs in a precarious position. Being a dedicated DBA is challenging, very time consuming, and requires a broad set of skills. Being a DBA is a full-time job in and of itself, and database administration does not easily translate to data science or data engineering, so if you want to work toward a job as a data engineer or data scientist, you will probably have to take that initiative on your own and do off-hours work to acquire those skills. Data science is the ability to create meaningful business actions from sometimes messy, uncoordinated data. Data engineering is the ability to take very large volumes of data and make it readily available to business stakeholders regardless of the type of data, where it is stored, or how it is stored. Most DBAs spend their time making sure that the bare-metal (local or NAS) storage or provisioned storage of data is consistent, available, and secured, with an "engine" that can easily query or perform transactions on this data. Doing all of this quickly, reliably, and efficiently with no data loss is the challenge most DBAs face on a daily basis.

This is the very high-level comparison between the fields, but there are some important nuances to take into consideration if you want to change roles. For one, being a DBA doesn't necessarily mean that you understand how to work with data. Data is messy, and one of the strengths of a data scientist is his or her ability to take data and clean it, transform it, remove duplicates, remove anomalies, etc. You then need the ability to sample and partition data, create models, and score your models. Many data scientists possess knowledge of mathematics and statistics that allows them to perform deep learning or complex machine learning and data analysis tasks.

One common bridge from database administration to data science and data engineering is SQL. SQL is a very powerful language for querying data in relational databases, and it is considered one of the most popular languages for data science. There are many functions available in SQL for performing data science work inside the database, and SQL is by far the most popular way to extract data from a database and deliver it to the business.

Most DBAs have had some exposure to SQL, and many have also had training in procedural SQL languages like T-SQL, PL/SQL, and PL/pgSQL, among others. Therefore, transitioning to languages such as Python and R, which are typically used in data science, is less of a journey than starting with little programming experience at all. Both languages have libraries that work with SQL and database commands, as the sketch below illustrates.
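As a small illustration of that bridge, here is a minimal R sketch using the DBI and RSQLite packages: familiar SQL does the aggregation, and the result lands directly in an R data frame for further analysis. The table and column names are made up for the example.

library(DBI)

# An in-memory SQLite database stands in for any relational source a DBA already knows
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders", data.frame(region = c("east", "east", "west"),
                                       amount = c(120, 80, 200)))

# Familiar SQL aggregation...
totals <- dbGetQuery(con, "SELECT region, SUM(amount) AS total FROM orders GROUP BY region")

# ...followed by ordinary R analysis on the result
summary(totals$total)
dbDisconnect(con)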

Along with learning Python and R, learning many of the popular data science and mathematics libraries such as scikit-learn and NumPy is also helpful. R is a great language for practicing data science techniques as well. Look for the many online resources for learning data science; visit my articles on data science conferences and data science resources. Take online classes on LinkedIn Learning, Udemy, DataCamp, and Coursera, which all have starting tracks for data science. A lot of success in moving into a new role involves self-learning, particularly if you are in a position that doesn't offer data science work with which to build skills.

For data engineering, I strongly recommend starting a cloud account with Google Cloud, AWS, or Azure. They offer "pay-as-you-go" options and subscription-based services priced on the amount of compute time you accumulate. And with the many open data sets available free to the public, you can easily build test data pipelines in the cloud in your free time. You can also build pipelines in the cloud to help with your current DBA role; most companies are transitioning to the cloud and offer their employees cloud access.

Post-graduate education is another path DBAs can take. There are many post-graduate and certificate programs in data science and big data engineering, with many more coming online. And these programs are flexible enough where you can learn outside your normal work hours.

Tools and Methods to Analyze Variants and Genotypes in Human Genome Data 12/1/2021

Recently I've been tasked with analyzing biological and genomic data, and I've learned a lot about tools and libraries for R and Python. As part of this analysis, I'm helping scientists analyze genome variations and genotypes. The human genome has 23 chromosomes, which is about 3 billion base pairs containing around 30,000 genes. Each base pair can be coded with 2 bits, which equates to around 750 megabytes of data. The data that I have been analyzing is several terabytes of genome sequences for around eight humans. Because the data is so massive, there are several high-throughput tools available to perform genotyping and variation discovery, which I will cover in a series of articles over the next few months.
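As a quick sanity check of that 750 megabyte figure, the arithmetic is simply the following (it ignores metadata and quality scores, which is why real FASTQ and BAM files are so much larger):

base_pairs    <- 3e9    # roughly 3 billion base pairs in the human genome
bits_per_base <- 2      # A, C, G and T can each be coded with 2 bits
bytes_total   <- base_pairs * bits_per_base / 8
bytes_total / 1e6       # ~750 megabytes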

One characteristic is that repetitive DNA sequences comprise approximately 50% of the human genome. The genome size is about 3,100 Mbp (mega-base pairs) per haploid genome. A base pair is two chemical bases bonded to one another, forming one "rung" of the DNA; DNA strands look somewhat like a ladder twisted around itself.

Variations

Variations include differences in the number of copies of a particular gene that individuals carry (copy number variation, CNV), as well as deletions, translocations and inversions. Another common variation is the single-nucleotide polymorphism (SNP). Variations in DNA are actually a normal part of human genetics and can sometimes be a sign of the body adapting to changes within the sequences, or protecting itself against them.

A SNP can be either of two kinds of nucleic acid substitution (a small R sketch after this list classifies a substitution as one or the other):

  1. Transition
    1. Interchange of purines (Adenine/Guanine)
    2. Interchange of pyrimidines (Cytosine/Thymine)
  2. Transversion
    1. Interchange of a purine and a pyrimidine nucleic acid
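The sketch referenced above is a small helper function, written just for this article, that labels a single-base substitution as a transition or a transversion using the purine/pyrimidine groupings listed here.

# Classify a single-nucleotide substitution as a transition or a transversion.
# ref and alt are single bases ("A", "C", "G", "T"); anything else returns NA.
classify_snp <- function(ref, alt) {
  purines <- c("A", "G")
  pyrimidines <- c("C", "T")
  bases <- c(purines, pyrimidines)
  if (!(ref %in% bases) || !(alt %in% bases) || ref == alt) return(NA_character_)
  same_group <- (ref %in% purines) == (alt %in% purines)
  if (same_group) "transition" else "transversion"
}

classify_snp("A", "G")  # "transition"
classify_snp("A", "C")  # "transversion"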

Since variation discovery is very important in the biological sciences, many tools have been developed to assist in creating medicines and treatments for all types of mutations within cells.

The International HapMap Project was developed to describe the variation patterns in the human genome and to find variations that impact health, responses to drugs, and an individual's response to the environment. Variations include small-scale and large-scale variations.

Copy number variation (CNV) is where the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known, though it has long been recognized that some cancers are associated with elevated copy numbers of particular genes. CNVs are categorized as long repeats or short repeats.

Insertions and deletions (InDels) are a type of CNV: insertion-deletion mutations refer to the insertion and/or deletion of nucleotides into genomic DNA and include events less than 1 kb in length.

Other Definitions

Length is measured in base pairs (bp). One bp corresponds to approximately 3.4 Å (340 pm) of length along the strand, and to roughly 618 or 643 daltons for DNA and RNA respectively.

A kilobase (kb) is a unit of measurement in molecular biology equal to 1,000 base pairs of DNA or RNA.
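A quick back-of-the-envelope conversion using these definitions (the 3.4 Å spacing and the roughly 3,100 Mbp haploid genome size quoted above):

bp_length_m <- 0.34e-9    # one base pair spans ~3.4 Angstroms (0.34 nm) along the strand
genome_bp   <- 3.1e9      # ~3,100 Mbp per haploid genome
genome_bp * bp_length_m   # ~1 metre of DNA per haploid genome
kb <- 1000                # a kilobase is 1,000 base pairs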

Data Analysis

Most of the analysis is performed in R. Here is some of the analysis done using genome libraries:

install.packages("vcfR")
library(vcfR)
library(Rsamtools)
library(pasillaBamSubset)
# prepare for transaction data
install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)
library(vcfR)
library(GenomicAlignments)

library(pasillaBamSubset)
library(Rsamtools)
library(ggrepel)
install.packages("factoextra")
library(factoextra)

The above libraries are standard R libraries for analyzing Genomic data. Later in this document, I will discuss the multiple tools that produce the files necessary for these libraries.

The core genomics libraries are distributed through Bioconductor (https://www.bioconductor.org/) and installed with the BiocManager package.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")

Visualization

(Figure: visualization of a function for the Patient1A sample.)

Using R with Human Genome Variation Data

Reading VCF files into R with the vcfR package, along with the standard VCF header columns, is covered in the Programming R for Human Genome Variation Data article above; only a brief summary of the other file formats follows here.

FASTQ file format

FASTQ files contain the raw sequencing reads for an entire genome and can be very large.

BAM or CRAM file formats

These files store sequencing reads aligned to a reference genome.

Genomics Tools

Genome data is very large and contains millions of base pairs per chromosome. Although this data can be loaded into R, the complexity of looking at individual genes and chromosomes can be daunting. One tool that makes reading genomic data more visual is the Integrative Genomics Viewer (IGV). IGV is a visualization tool that can zoom in to the gene and chromosome level, down to individual bases.

IGV uses BAM index files to efficiently locate reads within large BAM files.
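IGV expects a .bai index alongside each BAM file. If one is missing, it can be created from R with Rsamtools; this minimal sketch reuses the example path from the earlier article and assumes the BAM is already coordinate-sorted (sortBam() can sort it first if not).

library(Rsamtools)

# Create the .bai index that IGV (and ranged queries like scanBam/countBam) use to jump to regions
indexBam("D:/temp/raw1.bam")   # writes D:/temp/raw1.bam.bai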

Other tools for genomics, structural biology, and molecular biology are the DNASTAR Lasergene Structural Biology Suite and Spartan.

Another tool is the Variant Effect Predictor (VEP), which determines the effect of variants on genes.

Other tools include SnpEff and SnpSift. SnpEff provides genetic variant annotation and functional effect prediction: it annotates and predicts the effect of genetic variants on genes and proteins.

SnpSift annotates genomic variants using databases and filters annotated variants. Once you have annotated your files using SnpEff, you can use SnpSift to filter large genomic datasets in order to find the most significant variants for your experiment. All SnpEff and SnpSift genomic databases are kindly hosted by Microsoft Genomics and Azure.

The Microsoft Genomics service provides a cloud-hosted solution that makes it easy to variant-call your genomic samples. The service takes in genomic samples as two paired-end-read FASTQ (.fq.gz) files and produces .bam, .bai, and .vcf files, along with the associated log files.

The process uses a BWA/GATK data pipeline, where Microsoft has improved the efficiency of both BWA and GATK to produce results faster and with less overhead. There is also a secondary analysis.

GATK, the Genome Analysis Toolkit, is also used for variant discovery, using a data pipeline that can be scaled in the Azure or Google cloud. GATK is a framework for variant discovery with high-throughput sequencing data.

Another tool is CNVnator, which performs CNV discovery and genotyping from the depth of coverage of mapped reads.

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

https://www.microsoft.com/en-us/genomics/